The MASM Forum Archive 2004 to 2012

General Forums => The Campus => Topic started by: ecube on July 21, 2010, 01:05:42 PM

Title: what's an effective way to differentiate an ansi string from a unicode?
Post by: ecube on July 21, 2010, 01:05:42 PM
unicode strings read as ansi usually show just 1 character, and i'm aware unicode strings use 2 bytes per character instead of one, for example:

unicode  dw "h","i"," ","T","H","E","R","E",0

so what's an effective way to identify an unknown string as unicode or ansi?
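
To make the symptom concrete, here is a minimal sketch in Python (used only to look at the bytes, not as a replacement for the assembly). It assumes the dw string above is laid out as UTF-16LE, two bytes per character with the low byte first. The helper name ansi_view is made up for this illustration; it mimics any routine that stops at the first zero byte.

```python
def ansi_view(raw: bytes) -> bytes:
    """Return what a NUL-terminated ANSI reader would see in raw."""
    end = raw.find(b"\x00")
    return raw if end == -1 else raw[:end]

# Same memory layout as the dw string above: 2 bytes per character, low byte first.
wide = "hi THERE".encode("utf-16-le")
narrow = b"hi THERE"

print(wide)                # each ASCII letter is followed by a 00 byte
print(ansi_view(wide))     # the reader stops at the first 00 byte
print(ansi_view(narrow))   # a real ANSI string is seen in full
```

The zero high byte after the first letter is what makes the wide string look like a one-character ANSI string.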
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: redskull on July 21, 2010, 02:20:35 PM
I would wager the best way would be to check that it's more than one byte long when read as ANSI. That doesn't really help if you actually have a one-character string, but "close" is probably the best you can get.

Also, I make it a point to clear up the myth that unicode is just "two byte ASCII".  Unicode is just arbitrary numbers ("code points"), which can be encoded multiple ways.  Windows uses the UTF-16 method of encoding, which is actually *variable length*.  It just so happens that the vast majority of "normal" letters are encoded as "2-byte ASCII", which perpetuates the myth.

-r
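
A quick way to see the variable-length point is to count 16-bit units per code point. This is a small Python sketch; the function name utf16_code_units and the sample characters are chosen here for illustration only.

```python
def utf16_code_units(ch: str) -> int:
    """Number of 16-bit units UTF-16 needs for one code point."""
    return len(ch.encode("utf-16-le")) // 2

samples = [
    ("A", "A"),                   # U+0041, high byte is zero
    ("S-cedilla", "\u015e"),      # U+015E, high byte is NOT zero
    ("CJK", "\u4e2d"),            # U+4E2D
    ("emoji", "\U0001F600"),      # U+1F600, outside the 16-bit range
]

for name, ch in samples:
    raw = ch.encode("utf-16-le")
    print(name, raw.hex(), utf16_code_units(ch))
```

Code points above U+FFFF take a surrogate pair (two units, four bytes), and even single-unit characters outside the Latin range have a non-zero high byte.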
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: ecube on July 21, 2010, 02:36:19 PM
here's what I did, and it seems to work perfectly. it uses the 1 byte length as a factor, but mostly checks the output of WideCharToMultiByte, which produces a string with a bunch of ???'s in it when given an invalid unicode string.
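
The "count the question marks" idea can be sketched outside assembly as well. The Python below is only an analogy: it uses Python's codecs with errors="replace" in place of WideCharToMultiByte, and the function name count_question_marks and the cp1252 default are assumptions made for this example.

```python
def count_question_marks(raw: bytes, codepage: str = "cp1252") -> int:
    """Treat raw as UTF-16LE, convert to an ANSI code page, count '?' bytes."""
    usable = raw[: len(raw) - (len(raw) % 2)]   # drop a trailing odd byte
    text = usable.decode("utf-16-le", errors="replace")
    return text.encode(codepage, errors="replace").count(b"?")

genuine_latin = "hello".encode("utf-16-le")            # real UTF-16, Latin letters
misread_ansi = b"plain ansi text"                      # ANSI bytes taken as UTF-16
genuine_turkish = "\u015e\u011f".encode("utf-16-le")   # real UTF-16, Turkish letters

for raw in (genuine_latin, misread_ansi, genuine_turkish):
    print(count_question_marks(raw))
```

Note that the count also goes up for genuine UTF-16 text whose characters are missing from the target code page, so the number of question marks is a hint, not proof.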



;returns 1 for ansi string, 2 for unicode, and also auto converts uni to ansi strings for you

data section
mybuff db 256 dup ?
unistr  dw "A",0

CODE SECTION
start:
invoke IsUniOrAscii,'A',addr mybuff
;eax= 1
invoke IsUniOrAscii,addr unistr,addr mybuff
;eax= 2
invoke ExitProcess,0

IsUniOrAscii FRAME a1,obuff
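LOCAL notascii:D
mov D[notascii],0 ;clear the flag first: locals are not zero-initialised and it is tested at @unicheck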
invoke lstrcpy,[obuff],[a1]
invoke CharAbove,[obuff],256,127
cmp eax,0
jz >
mov D[notascii],1
jmp > @unicheck
:
invoke lstrlen,[obuff]
cmp eax,1
jle > @unicheck
mov eax,1
ret
@unicheck:
invoke UniToAscii,[a1],[obuff],256
cmp D[notascii],1
jne >
mov eax,2
ret
:
invoke lstrlen,[obuff]
test eax,eax
jz >
invoke FindChar,[obuff],256,63;?
cmp eax,3
jge >
mov eax,2
ret
:
invoke lstrcpy,[obuff],[a1]
mov eax,1
RET
ENDF

UniToAscii FRAME szAcii,szUnicodeBuffx,bufsizex
invoke WideCharToMultiByte,CP_ACP, 0, [szAcii], -1,[szUnicodeBuffx],[bufsizex],NULL,FALSE
ret
ENDF

FindChar FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done
cmp al,[iChar]
jne > @notchar
add edx,1
@notchar:
dec ecx
test ecx,ecx
jz > @done
inc edi
jmp <

@done:
mov eax,edx
RET
ENDF


CharAbove FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done2
cmp al,[iChar]
jbe > @notchar2
mov eax,1
ret
@notchar2:
dec ecx
test ecx,ecx
jz > @done2
inc edi
jmp <

@done2:
mov eax,edx
RET
ENDF


UPDATE and here's a complete masm version, in these tests it doesn't convert the strings correctly, because I guess they don't fall under UTF-16, but it does detect em correctly which is my main goal, you can play with the WideCharToMultiByte fields below to do different conversions anyway.

include \masm32\include\masm32rt.inc
FindChar proto :DWORD,:DWORD,:BYTE
CharAbove proto :DWORD,:DWORD,:BYTE
IsUniOrAscii proto :DWORD,:DWORD
UniToAscii proto :DWORD,:DWORD,:DWORD
.data
ascii db "ascii",0
uni db "unicode",0

thestring db 0C4h, 0B0h, 0A7h, 0E2h, 0C3h, 96h, 94h, 9Ch, 0C3h, 9Ch, 00h, 00h
thestring2 db 0C5h, 9Eh, 0C4h, 9Eh, 00h, 00h

.data?
mybuff db 256 dup (?)
.code
start:
invoke IsUniOrAscii,addr thestring,addr mybuff
.if eax==1
invoke MessageBox,0,addr mybuff,addr ascii,MB_ICONINFORMATION
.else
invoke MessageBox,0,addr mybuff,addr uni,MB_ICONINFORMATION
.endif

invoke IsUniOrAscii,addr thestring2,addr mybuff
.if eax==1
invoke MessageBox,0,addr mybuff,addr ascii,MB_ICONINFORMATION
.else
invoke MessageBox,0,addr mybuff,addr uni,MB_ICONINFORMATION
.endif

invoke IsUniOrAscii,addr ascii,addr mybuff
.if eax==1
invoke MessageBox,0,addr mybuff,addr ascii,MB_ICONINFORMATION
.else
invoke MessageBox,0,addr mybuff,addr uni,MB_ICONINFORMATION
.endif

invoke ExitProcess,0

FindChar proc uses edi iBuf:DWORD,iLen:DWORD,iChar:BYTE ;edi is callee-saved, so preserve it
mov edi,iBuf
mov ecx,iLen
xor eax,eax
xor edx,edx
@@:
mov al,byte ptr [edi]
cmp al,0
je @done
cmp al,iChar
jne @notchar
add edx,1
@notchar:
dec ecx
test ecx,ecx
jz @done
inc edi
jmp @B

@done:
mov eax,edx
RET
FindChar endp

CharAbove proc uses edi iBuf:DWORD,iLen:DWORD,iChar:BYTE ;edi is callee-saved, so preserve it
mov edi,iBuf
mov ecx,iLen
xor eax,eax
xor edx,edx
@@:
mov al,byte ptr [edi]
cmp al,0
je @done2
cmp al,iChar
jbe @notchar2
mov eax,1
ret
@notchar2:
dec ecx
test ecx,ecx
jz @done2
inc edi
jmp @B

@done2:
mov eax,edx
RET
CharAbove endp

IsUniOrAscii proc a1,obuff
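LOCAL notascii:dword
mov notascii,0 ;clear the flag first: locals are not zero-initialised and it is tested at @unicheck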
invoke lstrcpy,obuff,a1
invoke CharAbove,obuff,256,127
cmp eax,0
jz @F
mov notascii,1
jmp @unicheck
@@:
invoke lstrlen,obuff
cmp eax,1
jle @unicheck
mov eax,1
ret
@unicheck:
invoke UniToAscii,a1,obuff,256
cmp notascii,1
jne @F
mov eax,2
ret
@@:
invoke lstrlen,obuff
test eax,eax
jz @F
invoke FindChar,obuff,256,63;?
cmp eax,3
jge @F
mov eax,2
ret
@@:
invoke lstrcpy,obuff,a1
mov eax,1
RET
IsUniOrAscii endp

UniToAscii proc szAcii,szUnicodeBuffx,bufsizex
invoke WideCharToMultiByte,CP_ACP, 0, szAcii, -1,szUnicodeBuffx,bufsizex,NULL,FALSE
ret
UniToAscii endp
End start
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: Yuri on July 21, 2010, 02:38:25 PM
There is an API that might help — IsTextUnicode.
Quote
The IsTextUnicode function determines whether a buffer probably contains a form of Unicode text. The function uses various statistical and deterministic methods to make its determination, under the control of flags passed via lpi. When the function returns, the results of such tests are reported via lpi. If all specified tests are passed, the function returns TRUE; otherwise, it returns FALSE.
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: ecube on July 21, 2010, 02:45:34 PM
Thanks Yuri, but that function appears to be very unreliable. I'm not saying mine is that great without more testing, but I did just test it on

teststr db 41h, 0Ah, 0Dh, 1Dh,0

and it returned the right result, whereas according to MSDN IsTextUnicode wouldn't.
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: Geryon on July 21, 2010, 04:30:17 PM
Actually, your code doesn't work at all.
Why? Because the high bytes of a unicode string aren't always zero (for instance, in any language other than English).
However, IsTextUnicode uses a more advanced technique, even though it has a few bugs.
For example, try typing this string into Notepad, save it, and open it again:
"Bush hid the facts"
IsTextUnicode will misinterpret it.
There are a few APIs which may help you:
http://msdn.microsoft.com/en-us/library/ff563884%28VS.85%29.aspx

And as I said, IsTextUnicode is more advanced and CORRECT than your code.
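
The Notepad example can be reproduced at the byte level. The Python sketch below does not call IsTextUnicode and makes no claim about what that API checks internally; it only shows that these particular ASCII bytes also form a well-formed UTF-16LE string, which is why a statistical guess can go wrong.

```python
raw = b"Bush hid the facts"          # 18 plain ASCII bytes
misread = raw.decode("utf-16-le")    # the same bytes taken as nine 16-bit units

print(len(raw), len(misread))
print([hex(ord(c)) for c in misread])

# Every resulting unit lands in the CJK Unified Ideographs block (U+4E00..U+9FFF).
all_cjk = all(0x4E00 <= ord(c) <= 0x9FFF for c in misread)
print(all_cjk)
```

Nothing in the bytes themselves says which reading was intended.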
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: ecube on July 21, 2010, 04:40:10 PM
yet the function doesn't CORRECTLY identify even the simplest strings? and the only thing you can find wrong with my function is 0 length strings? yeah, go away.

and no, the function list you posted doesn't help me... they're used extensively with NT functions; in fact I wouldn't be surprised if WideCharToMultiByte is just a wrapper for a couple of them.
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: Geryon on July 21, 2010, 04:58:22 PM
Quote from: E^cube on July 21, 2010, 04:40:10 PM
and the only thing you can find wrong with my function is 0 length strings?
No !!!

This is a unicode string
'A', 00h, 'B', 00h, 00h, 00h

BUT THEY ARE ALSO UNICODE STRINGS

C5h, 9Eh, C4h, 9Eh, 00h, 00h
and this one
C4h, B0h, A7h, E2h, C3h, 96h, 94h, 9Ch, C3h, 9Ch, 00h, 00h

Your code can NOT identify those strings above.
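
For reference, the two byte strings above can be run through a few decoders to see which readings are even well-formed. This is a Python sketch; try_decode is a helper invented for the example, and latin-1 stands in for "some 8-bit code page" because it accepts every byte value.

```python
def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

short = bytes([0xC5, 0x9E, 0xC4, 0x9E])
long_ = bytes([0xC4, 0xB0, 0xA7, 0xE2, 0xC3, 0x96, 0x94, 0x9C, 0xC3, 0x9C])

for raw in (short, long_):
    for enc in ("utf-16-le", "utf-8", "latin-1"):
        # ascii() keeps the output printable on any console
        print(enc, ascii(try_decode(raw, enc)))
```

Both strings are well-formed UTF-16LE, the short one is also well-formed UTF-8, and both are acceptable as 8-bit text, so the bytes alone do not settle the question.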

Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: dedndave on July 21, 2010, 05:07:27 PM
i dunno what you guys are arguing about - lol
there is no good way to identify string types that will work for all valid strings
notice - this applies to strings - not files - files are identified

for strings, i think the key is to know whether it is unicode or not ahead of time
if you are trying to write a function that handles both types of strings, MAKE THE CALLER IDENTIFY THE TYPE
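
One way to read "make the caller identify the type" is to carry the encoding next to the bytes instead of guessing it. A minimal Python sketch, where the TaggedString type and its field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedString:
    raw: bytes
    encoding: str   # supplied by the caller, e.g. "cp1252" or "utf-16-le"

    def text(self) -> str:
        """Decode using the encoding the caller declared; no guessing."""
        return self.raw.decode(self.encoding)

a = TaggedString(b"hi THERE", "cp1252")
w = TaggedString("hi THERE".encode("utf-16-le"), "utf-16-le")

print(a.raw != w.raw)          # different bytes in memory
print(a.text() == w.text())    # same text once the type is known
```

In assembly the same idea is just an extra flag parameter, or separate A and W entry points as the Win32 API does.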
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: ecube on July 21, 2010, 05:08:23 PM
why are they unicode strings? because you say so? regardless, that's an easy fix, can just filter out anything above 127 as unicode using your ridiculous logic.


invoke CharAbove,buff,len,127 ;if eax > 0, then unicode
CharAbove FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done
cmp al,[iChar]
jle > @notchar
add edx,1
@notchar:
dec ecx
test ecx,ecx
jz > @done
inc edi
jmp <

@done:
mov eax,edx
RET
ENDF



here you go champ... this correctly identifies all your strings and the ones i've tried, and so far IsTextUnicode has failed on countless ones. how is my code so incorrect again? Dave's right, it's not possible for this to cater to everything unicode, but in terms of general use it seems to be fine. I don't know what it is with people that have no post count coming on here attacking me, but it's getting old.

data section
mybuff db 256 dup ?
unistr  dw "A",0

CODE SECTION
start:
invoke IsUniOrAscii,'A',addr mybuff
;eax= 1
invoke IsUniOrAscii,addr unistr,addr mybuff
;eax= 2
invoke ExitProcess,0

IsUniOrAscii FRAME a1,obuff
invoke lstrcpy,[obuff],[a1]
invoke CharAbove,[obuff],256,127
test eax,eax
jnz  >
invoke lstrlen,[obuff]
cmp eax,1
jle >
mov eax,1
ret
:
invoke UniToAscii,[a1],[obuff],256
invoke lstrlen,[obuff]
test eax,eax
jz >
invoke FindChar,[obuff],256,63;?
cmp eax,3
jge >
mov eax,2
ret
:
invoke lstrcpy,[obuff],[a1]
mov eax,1
RET
ENDF

UniToAscii FRAME szAcii,szUnicodeBuffx,bufsizex
invoke WideCharToMultiByte,CP_ACP, 0, [szAcii], -1,[szUnicodeBuffx],[bufsizex],NULL,FALSE
ret
ENDF

FindChar FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done
cmp al,[iChar]
jne > @notchar
add edx,1
@notchar:
dec ecx
test ecx,ecx
jz > @done
inc edi
jmp <

@done:
mov eax,edx
RET
ENDF


CharAbove FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done2
cmp al,[iChar]
jle > @notchar2
mov eax,1
ret
@notchar2:
dec ecx
test ecx,ecx
jz > @done2
inc edi
jmp <

@done2:
mov eax,edx
RET
ENDF
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: cork on July 21, 2010, 05:47:20 PM
Quote from: E^cube on July 21, 2010, 01:05:42 PM
unicode strings read as ansi usually show just 1 character, and i'm aware unicode strings use 2 bytes per character instead of one, for example:

unicode  dw "h","i"," ","T","H","E","R","E",0

so what's an effective way to identify an unknown string as unicode or ansi?

Which encoding of Unicode are you trying to identify? UTF-16, UTF-8, UCS-2, UCS-4?
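
The question matters because the same text has a different byte layout in each encoding. A short Python sketch, with a sample string chosen for illustration (UCS-2 is the fixed 16-bit form without surrogate pairs, and UCS-4 uses the same 32-bit units as UTF-32):

```python
text = "hi \u015e"   # two ASCII letters, a space, and S-cedilla (U+015E)

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    raw = text.encode(enc)
    print(f"{enc:10} {len(raw):2} bytes  {raw.hex()}")
```

A detector written for one of these layouts says nothing useful about the others.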
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: Geryon on July 21, 2010, 05:47:59 PM
Quote from: E^cube on July 21, 2010, 05:08:23 PM
why are they unicode strings? because you say so? regardless, that's an easy fix, can just filter out anything above 127 as unicode using your ridiculous logic.
No, because the Unicode standard says so:
http://www.unicode.org/versions/Unicode5.2.0/
Quote from: E^cube on July 21, 2010, 05:08:23 PM
I don't know what it is with people that have no post count coming on here attacking me, but it's getting old.
I have been on this forum since it was founded.
By the way, I realize what I missed in 2.5 years on the forum.
Jerk Population...
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: ecube on July 21, 2010, 06:11:23 PM
Quote from: Geryon on July 21, 2010, 05:47:59 PM
Quote from: E^cube on July 21, 2010, 05:08:23 PM
why are they unicode strings? because you say so? regardless, that's an easy fix, can just filter out anything above 127 as unicode using your ridiculous logic.
No, because the Unicode standard says so:
http://www.unicode.org/versions/Unicode5.2.0/
Quote from: E^cube on July 21, 2010, 05:08:23 PM
I don't know what it is with people that have no post count coming on here attacking me, but it's getting old.
I have been on this forum since it was founded.
By the way, I realize what I missed in 2.5 years on the forum.
Jerk Population...

whatever guy, you come on here posting in all caps, for no reason, insulting my code and making invalid claims, none of which you've backed up, and now you're name calling and insulting the forum.

By the way, you might have registered here a few years ago, but you've posted a total of what, 5 or 6 threads/comments in all these years?  How about you post some useful code instead of google links?
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: redskull on July 21, 2010, 06:28:36 PM
Come on guys, must we do this?  The bottom line is that there IS NO DEFINITIVE WAY to tell what format ANY data is in, ever, just by looking.  It's just bytes in the memory.  The post was for an *effective* way to *try* and tell the difference, and as long as we are talking about average-length, english language, UTF-16 Unicode strings, and if it works, what's the problem?

-r
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: Geryon on July 21, 2010, 06:47:54 PM
Quote from: E^cube on July 21, 2010, 06:11:23 PM
whatever guy, you come on here posting in all caps, for no reason, insulting my code and making invalid claims, none of which you've backed up, and now you're name calling and insulting the forum.
I strongly recommend reading the messages.
Quote from: E^cube on July 21, 2010, 06:11:23 PM
By the way, you might have registered here a few years ago, but you've posted a total of what, 5 or 6 threads/comments in all these years?  How about you post some useful code instead of google links?
I was registered here when the win32asm-board was still alive. That was around 5-10 years ago.
Everybody who has been around long enough remembers me. I don't have to prove myself to you.

On the other hand, there is no logical connection between when I registered or how long I have been using asm and the validity of my claims. But it's obvious that trying to help you is futile.
If you say 2 + 2 = 5, I completely agree, no matter what.
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: ecube on July 21, 2010, 06:59:07 PM
there was a slight bug in the char above that I put on purpose to see if this guy was even testing the functions, clearly he was not, anyway here's the fixed version.

data section
mybuff db 256 dup ?
unistr  dw "A",0

CODE SECTION
start:
invoke IsUniOrAscii,'A',addr mybuff
;eax= 1
invoke IsUniOrAscii,addr unistr,addr mybuff
;eax= 2
invoke ExitProcess,0

IsUniOrAscii FRAME a1,obuff
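LOCAL notascii:D
mov D[notascii],0 ;clear the flag first: locals are not zero-initialised and it is tested at @unicheck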
invoke lstrcpy,[obuff],[a1]
invoke CharAbove,[obuff],256,127
cmp eax,0
jz >
mov D[notascii],1
jmp > @unicheck
:
invoke lstrlen,[obuff]
cmp eax,1
jle > @unicheck
mov eax,1
ret
@unicheck:
invoke UniToAscii,[a1],[obuff],256
cmp D[notascii],1
jne >
mov eax,2
ret
:
invoke lstrlen,[obuff]
test eax,eax
jz >
invoke FindChar,[obuff],256,63;?
cmp eax,3
jge >
mov eax,2
ret
:
invoke lstrcpy,[obuff],[a1]
mov eax,1
RET
ENDF

UniToAscii FRAME szAcii,szUnicodeBuffx,bufsizex
invoke WideCharToMultiByte,CP_ACP, 0, [szAcii], -1,[szUnicodeBuffx],[bufsizex],NULL,FALSE
ret
ENDF

FindChar FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done
cmp al,[iChar]
jne > @notchar
add edx,1
@notchar:
dec ecx
test ecx,ecx
jz > @done
inc edi
jmp <

@done:
mov eax,edx
RET
ENDF


CharAbove FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done2
cmp al,[iChar]
jbe > @notchar2
mov eax,1
ret
@notchar2:
dec ecx
test ecx,ecx
jz > @done2
inc edi
jmp <

@done2:
mov eax,edx
RET
ENDF



and here's a complete masm version,in these tests it doesn't convert the strings correctly, because I guess they don't fall under UTF-16, but it does detect em correctly which is my main goal, you can play with the WideCharToMultiByte fields below to do different conversions anyway.

include \masm32\include\masm32rt.inc
FindChar proto :DWORD,:DWORD,:BYTE
CharAbove proto :DWORD,:DWORD,:BYTE
IsUniOrAscii proto :DWORD,:DWORD
UniToAscii proto :DWORD,:DWORD,:DWORD
.data
ascii db "ascii",0
uni db "unicode",0

thestring db 0C4h, 0B0h, 0A7h, 0E2h, 0C3h, 96h, 94h, 9Ch, 0C3h, 9Ch, 00h, 00h
thestring2 db 0C5h, 9Eh, 0C4h, 9Eh, 00h, 00h

.data?
mybuff db 256 dup (?)
.code
start:
invoke IsUniOrAscii,addr thestring,addr mybuff
.if eax==1
invoke MessageBox,0,addr mybuff,addr ascii,MB_ICONINFORMATION
.else
invoke MessageBox,0,addr mybuff,addr uni,MB_ICONINFORMATION
.endif

invoke IsUniOrAscii,addr thestring2,addr mybuff
.if eax==1
invoke MessageBox,0,addr mybuff,addr ascii,MB_ICONINFORMATION
.else
invoke MessageBox,0,addr mybuff,addr uni,MB_ICONINFORMATION
.endif

invoke IsUniOrAscii,addr ascii,addr mybuff
.if eax==1
invoke MessageBox,0,addr mybuff,addr ascii,MB_ICONINFORMATION
.else
invoke MessageBox,0,addr mybuff,addr uni,MB_ICONINFORMATION
.endif

invoke ExitProcess,0

FindChar proc uses edi iBuf:DWORD,iLen:DWORD,iChar:BYTE ;edi is callee-saved, so preserve it
mov edi,iBuf
mov ecx,iLen
xor eax,eax
xor edx,edx
@@:
mov al,byte ptr [edi]
cmp al,0
je @done
cmp al,iChar
jne @notchar
add edx,1
@notchar:
dec ecx
test ecx,ecx
jz @done
inc edi
jmp @B

@done:
mov eax,edx
RET
FindChar endp

CharAbove proc uses edi iBuf:DWORD,iLen:DWORD,iChar:BYTE ;edi is callee-saved, so preserve it
mov edi,iBuf
mov ecx,iLen
xor eax,eax
xor edx,edx
@@:
mov al,byte ptr [edi]
cmp al,0
je @done2
cmp al,iChar
jbe @notchar2
mov eax,1
ret
@notchar2:
dec ecx
test ecx,ecx
jz @done2
inc edi
jmp @B

@done2:
mov eax,edx
RET
CharAbove endp

IsUniOrAscii proc a1,obuff
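LOCAL notascii:dword
mov notascii,0 ;clear the flag first: locals are not zero-initialised and it is tested at @unicheck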
invoke lstrcpy,obuff,a1
invoke CharAbove,obuff,256,127
cmp eax,0
jz @F
mov notascii,1
jmp @unicheck
@@:
invoke lstrlen,obuff
cmp eax,1
jle @unicheck
mov eax,1
ret
@unicheck:
invoke UniToAscii,a1,obuff,256
cmp notascii,1
jne @F
mov eax,2
ret
@@:
invoke lstrlen,obuff
test eax,eax
jz @F
invoke FindChar,obuff,256,63;?
cmp eax,3
jge @F
mov eax,2
ret
@@:
invoke lstrcpy,obuff,a1
mov eax,1
RET
IsUniOrAscii endp

UniToAscii proc szAcii,szUnicodeBuffx,bufsizex
invoke WideCharToMultiByte,CP_ACP, 0, szAcii, -1,szUnicodeBuffx,bufsizex,NULL,FALSE
ret
UniToAscii endp
End start
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: ecube on July 21, 2010, 07:28:45 PM
Quote from: Geryon on July 21, 2010, 06:47:54 PM
Quote from: E^cube on July 21, 2010, 06:11:23 PM
whatever guy, you come on here posting in all caps, for no reason, insulting my code and making invalid claims, none of which you've backed up, and now you're name calling and insulting the forum.
I strongly recommend reading the messages.
Quote from: E^cube on July 21, 2010, 06:11:23 PM
By the way, you might have registered here a few years ago, but you've posted a total of what, 5 or 6 threads/comments in all these years?  How about you post some useful code instead of google links?
I was registered here when the win32asm-board was still alive. That was around 5-10 years ago.
Everybody who has been around long enough remembers me. I don't have to prove myself to you.

On the other hand, there is no logical connection between when I registered or how long I have been using asm and the validity of my claims. But it's obvious that trying to help you is futile.
If you say 2 + 2 = 5, I completely agree, no matter what.

here are the facts
1)IsTextUnicode has failed to identify unicode strings in over 6 easy examples
2)my code so far has identified them ALL correctly
3)you have a total of < 15 posts in the time you registered here and now(2004 is earliest I see your nick)
4)I (not surprisingly) can't even find you on the other forum...
5)you've failed to provide any code whatsoever, or any proof of your ridiculous claims

anyway i'm done wasting my time on you, and IMO just because you registered a nick here, that doesn't make you part of the community. Only when you contribute, will that change.
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: BogdanOntanu on July 21, 2010, 08:36:07 PM
Fair warning to both of you: the attitude on Campus is supposed to be friendly... behave and stop calling each other names.
Start using logical arguments.
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: ecube on July 21, 2010, 08:46:08 PM
I don't need your warning BogdanOntanu...what you need to do is read his initial post and his use of caps and combative language.
Title: Re: what's an effective way to differentiate an ansi string from a unicode?
Post by: BogdanOntanu on July 21, 2010, 09:34:06 PM
Quote from: E^cube on July 21, 2010, 08:46:08 PM
I don't need your warning BogdanOntanu...what you need to do is read his initial post and his use of caps and combative language.

BOTH of you have some valid points and some mistakes in your claims and statements from a logical point of view. You could learn by understanding the other's "point of view" on this issue.

More exactly, Unicode can be encoded in so many ways (as was pointed out to you here) that you can NOT reliably deduce whether some data stream is unicode or not without other external information or hints.

However, you could write some empirical functions based on a partial understanding of Unicode that would appear to work for some common cases. This is hardly correct or exact given the complexity of Unicode and its various encodings, but it might work in acceptable ways "for you".



As a last hint... if I may, I would point you to the fact that some byte combinations are invalid in UTF-8 encoding (the same goes for UTF-16, etc.), and because of this a string can NOT be considered "unicode" simply because it has some byte values above 127 in it.  Empirically this might be a hint, and yes, it might work in many examples, but it is incorrect by logic and by the standards.
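
A small sketch of the "invalid combinations" point, in Python and relying on its strict decoders to apply the well-formedness rules. The helper name is_valid and the sample byte sequences are chosen for this illustration:

```python
def is_valid(raw: bytes, encoding: str) -> bool:
    """True if raw decodes without error under a strict decoder."""
    try:
        raw.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False

checks = [
    (b"\xc5\x9e", "utf-8"),              # lead byte + continuation byte
    (b"\xc5\x41", "utf-8"),              # lead byte followed by plain 'A'
    (b"\x9e\x41", "utf-8"),              # continuation byte with no lead byte
    (b"\x3d\xd8\x00\xde", "utf-16-le"),  # high + low surrogate pair
    (b"\x3d\xd8\x41\x00", "utf-16-le"),  # high surrogate followed by 'A'
]

for raw, enc in checks:
    print(enc, raw.hex(), is_valid(raw, enc))
```

Three of the five samples contain bytes above 127 and are still not well-formed in the encoding tested, so "has a byte above 127" and "is unicode" are different questions.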

Also, some extended ASCII code pages use the upper 128-255 byte values to encode specific Eastern European special characters in text modes (Romanian for example... but also Hungarian, Slovak, etc.), and because of this, finding a character above 127 inside a string is not a correct or certain way to detect a unicode string, when in fact it could be an extended ASCII string with special characters or ASCII art included.
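
As an illustration of the code page case, the Python sketch below encodes a Romanian word in Windows-1250. The sample word and the choice of cp1250 are assumptions made for the example:

```python
word = "mul\u0163umesc"        # Romanian, with t-cedilla (U+0163)
raw = word.encode("cp1250")    # Windows Central/Eastern European code page

high = [b for b in raw if b > 127]
print(len(raw), [hex(b) for b in high])   # one byte per character, one of them above 127

try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```

This is an ordinary one-byte-per-character ANSI string that contains a byte above 127, and it is not well-formed UTF-8.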

The best way to decide is to study the Unicode standard and "code points" and specific code point encodings like UTF-8 and UTF-16, as was hinted to you in this thread. Another way is to check for a Unicode BOM (when present).
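
The BOM check is easy to sketch. The Python below uses the BOM constants from the standard codecs module; the function name sniff_bom is made up here, and a missing BOM simply means "unknown", not "ANSI".

```python
import codecs

# Longest signatures first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(raw: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
print(sniff_bom(codecs.BOM_UTF8 + b"hi"))
print(sniff_bom(b"hi"))
```

One known ambiguity remains: UTF-16LE data that starts with a BOM followed by a U+0000 character has the same first four bytes as the UTF-32LE BOM.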

Yes, your interlocutor was slightly offensive at the start, but he also provided correct hints on such unicode matters. AND a single word in CAPS does not justify your later reactions, and to be honest your reactions do not justify his later reactions...

Even if he is correct in his logical statements... still, this is the Campus, and beginners posting here are expected to have trouble understanding or to hold incorrect personal points of view.

Also, asking for code or rejecting the truth based on post count is not exactly nice either.

You could have maintained a correct attitude and left it to the moderators... and at the same time learned from his hints without having to agree with his attitude.

Then I would have warned only him to "behave" and respect the Campus "standard behaviour"...

Unfortunately you have chosen to escalate the conflict... and I was forced by the rules to warn you both... and even more unfortunately you have also chosen to argue against a moderator acting as a moderator... and this is not acceptable.

Under these circumstances I "have to" close your thread. Sorry...