What's an effective way to differentiate an ANSI string from a Unicode one?

Started by ecube, July 21, 2010, 01:05:42 PM

ecube

Unicode strings, when read as ANSI, usually show just 1 character, and I'm aware Unicode strings use 2 bytes instead of one. Example:

unicode  dw "h","i"," ","T","H","E","R","E",0

So what's an effective way to identify an unknown string as Unicode or ANSI?

redskull

I would wager the best way would be to check that it's longer than one byte. That doesn't really help if you actually have a one-byte string, but "close" is probably the best you can do.

Also, I make it a point to clear up the myth that Unicode is just "two-byte ASCII".  Unicode is just arbitrary numbers ("code points"), which can be encoded multiple ways.  Windows uses the UTF-16 method of encoding, which is actually *variable length*.  It just so happens that the vast majority of "normal" letters are encoded as "2-byte ASCII", which perpetuates the myth.
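
To illustrate the variable-length point, here is a minimal Python sketch (not part of the original post; the helper name utf16_len is made up for this example). It encodes a few characters as UTF-16 little-endian, the byte order Windows uses, and reports how many bytes each one takes.

```python
# Sketch: bytes needed per character in UTF-16LE (no byte order mark).
# utf16_len is a hypothetical helper, not a Windows or library API.

def utf16_len(ch):
    """Return the number of bytes one character occupies in UTF-16LE."""
    return len(ch.encode("utf-16-le"))

samples = {
    "A": "Latin capital A (U+0041)",
    "\u015e": "Latin capital S with cedilla (U+015E)",
    "\U0001F600": "emoji outside the Basic Multilingual Plane (U+1F600)",
}

for ch, name in samples.items():
    print(f"{name}: {utf16_len(ch)} bytes")
```

Characters outside the Basic Multilingual Plane are stored as a surrogate pair, which is why the last sample needs two 16-bit units.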

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

ecube

Here's what I did, and it seems to work perfectly. It uses the 1-byte length as a factor, but more so it checks the output of WideCharToMultiByte, which produces a string with a bunch of ?'s in it for invalid Unicode strings.
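
The idea of counting ?'s after a wide-to-ANSI conversion can be sketched in Python as follows. This is only an approximation under stated assumptions: Python's cp1252 codec with errors="replace" stands in for WideCharToMultiByte with CP_ACP, and the function name is hypothetical. The real API also applies best-fit mappings, so its counts may differ.

```python
# Sketch: treat raw bytes as UTF-16LE, convert to an ANSI code page,
# and count the '?' replacement characters that result.
# Assumption: cp1252 stands in for the system ANSI code page (CP_ACP).

def question_marks_after_conversion(raw, codepage="cp1252"):
    """Count '?' bytes produced when raw is read as UTF-16LE and narrowed."""
    text = raw.decode("utf-16-le", errors="replace")
    ansi = text.encode(codepage, errors="replace")
    return ansi.count(b"?")

real_utf16 = "hi there".encode("utf-16-le")   # genuine UTF-16 text
plain_ansi = b"hello world!"                  # ANSI bytes misread as UTF-16

print(question_marks_after_conversion(real_utf16))
print(question_marks_after_conversion(plain_ansi))
```

Note that a genuine question mark in the text is counted too, so the count is a hint rather than proof.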



;returns 1 for ansi string, 2 for unicode, and also auto converts uni to ansi strings for you

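data section
mybuff db 256 dup ?
ansistr db "ascii",0 ;ANSI test string; the function takes an address, not a character literal
unistr  dw "A",0

CODE SECTION
start:
invoke IsUniOrAscii,addr ansistr,addr mybuff
;eax= 1
invoke IsUniOrAscii,addr unistr,addr mybuff
;eax= 2
invoke ExitProcess,0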

IsUniOrAscii FRAME a1,obuff
LOCAL notascii:D
mov D[notascii],0 ;clear the flag first; locals are not zero-initialized
invoke lstrcpy,[obuff],[a1]
invoke CharAbove,[obuff],256,127
cmp eax,0
jz >
mov D[notascii],1
jmp > @unicheck
:
invoke lstrlen,[obuff]
cmp eax,1
jle > @unicheck
mov eax,1
ret
@unicheck:
invoke UniToAscii,[a1],[obuff],256
cmp D[notascii],1
jne >
mov eax,2
ret
:
invoke lstrlen,[obuff]
test eax,eax
jz >
invoke FindChar,[obuff],256,63;?
cmp eax,3
jge >
mov eax,2
ret
:
invoke lstrcpy,[obuff],[a1]
mov eax,1
RET
ENDF

UniToAscii FRAME szAcii,szUnicodeBuffx,bufsizex
invoke WideCharToMultiByte,CP_ACP, 0, [szAcii], -1,[szUnicodeBuffx],[bufsizex],NULL,FALSE
ret
ENDF

FindChar FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done
cmp al,[iChar]
jne > @notchar
add edx,1
@notchar:
dec ecx
test ecx,ecx
jz > @done
inc edi
jmp <

@done:
mov eax,edx
RET
ENDF


CharAbove FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done2
cmp al,[iChar]
jbe > @notchar2
mov eax,1
ret
@notchar2:
dec ecx
test ecx,ecx
jz > @done2
inc edi
jmp <

@done2:
mov eax,edx
RET
ENDF


UPDATE: here's a complete MASM version. In these tests it doesn't convert the strings correctly, I guess because they don't fall under UTF-16, but it does detect them correctly, which is my main goal. You can play with the WideCharToMultiByte arguments below to do different conversions anyway.

include \masm32\include\masm32rt.inc
FindChar proto :DWORD,:DWORD,:BYTE
CharAbove proto :DWORD,:DWORD,:BYTE
IsUniOrAscii proto :DWORD,:DWORD
UniToAscii proto :DWORD,:DWORD,:DWORD
.data
ascii db "ascii",0
uni db "unicode",0

thestring db 0C4h, 0B0h, 0A7h, 0E2h, 0C3h, 96h, 94h, 9Ch, 0C3h, 9Ch, 00h, 00h
thestring2 db 0C5h, 9Eh, 0C4h, 9Eh, 00h, 00h

.data?
mybuff db 256 dup (?)
.code
start:
invoke IsUniOrAscii,addr thestring,addr mybuff
.if eax==1
invoke MessageBox,0,addr mybuff,addr ascii,MB_ICONINFORMATION
.else
invoke MessageBox,0,addr mybuff,addr uni,MB_ICONINFORMATION
.endif

invoke IsUniOrAscii,addr thestring2,addr mybuff
.if eax==1
invoke MessageBox,0,addr mybuff,addr ascii,MB_ICONINFORMATION
.else
invoke MessageBox,0,addr mybuff,addr uni,MB_ICONINFORMATION
.endif

invoke IsUniOrAscii,addr ascii,addr mybuff
.if eax==1
invoke MessageBox,0,addr mybuff,addr ascii,MB_ICONINFORMATION
.else
invoke MessageBox,0,addr mybuff,addr uni,MB_ICONINFORMATION
.endif

invoke ExitProcess,0

FindChar proc iBuf:DWORD,iLen:DWORD,iChar:BYTE
mov edi,iBuf
mov ecx,iLen
xor eax,eax
xor edx,edx
@@:
mov al,byte ptr [edi]
cmp al,0
je @done
cmp al,iChar
jne @notchar
add edx,1
@notchar:
dec ecx
test ecx,ecx
jz @done
inc edi
jmp @B

@done:
mov eax,edx
RET
FindChar endp

CharAbove proc iBuf:DWORD,iLen:DWORD,iChar:BYTE
mov edi,iBuf
mov ecx,iLen
xor eax,eax
xor edx,edx
@@:
mov al,byte ptr [edi]
cmp al,0
je @done2
cmp al,iChar
jbe @notchar2
mov eax,1
ret
@notchar2:
dec ecx
test ecx,ecx
jz @done2
inc edi
jmp @B

@done2:
mov eax,edx
RET
CharAbove endp

IsUniOrAscii proc a1,obuff
LOCAL notascii:dword
mov notascii,0 ;clear the flag first; locals are not zero-initialized
invoke lstrcpy,obuff,a1
invoke CharAbove,obuff,256,127
cmp eax,0
jz @F
mov notascii,1
jmp @unicheck
@@:
invoke lstrlen,obuff
cmp eax,1
jle @unicheck
mov eax,1
ret
@unicheck:
invoke UniToAscii,a1,obuff,256
cmp notascii,1
jne @F
mov eax,2
ret
@@:
invoke lstrlen,obuff
test eax,eax
jz @F
invoke FindChar,obuff,256,63;?
cmp eax,3
jge @F
mov eax,2
ret
@@:
invoke lstrcpy,obuff,a1
mov eax,1
RET
IsUniOrAscii endp

UniToAscii proc szAcii,szUnicodeBuffx,bufsizex
invoke WideCharToMultiByte,CP_ACP, 0, szAcii, -1,szUnicodeBuffx,bufsizex,NULL,FALSE
ret
UniToAscii endp
End start

Yuri

There is an API that might help: IsTextUnicode.
Quote
The IsTextUnicode function determines whether a buffer probably contains a form of Unicode text. The function uses various statistical and deterministic methods to make its determination, under the control of flags passed via lpi. When the function returns, the results of such tests are reported via lpi. If all specified tests are passed, the function returns TRUE; otherwise, it returns FALSE.

ecube

Thanks Yuri, but that function appears to be very unreliable. I'm not saying mine is that great without more testing, but I did just test it on

teststr db 41h, 0Ah, 0Dh, 1Dh,0

and it returned the right results, whereas according to MSDN IsTextUnicode wouldn't.

Geryon

Actually, your code doesn't work at all.
Why? Because the high bytes of a Unicode string aren't always zero (for instance, in any language other than English).
However, IsTextUnicode uses more advanced techniques, even though it has a few bugs.
For example, try typing this string into Notepad, then save the file and open it again:
"Bush hid the facts"
IsTextUnicode will misinterpret it.
There are a few APIs which may help you:
http://msdn.microsoft.com/en-us/library/ff563884%28VS.85%29.aspx

And as I said, IsTextUnicode is more advanced and CORRECT than your code.
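
The "Bush hid the facts" case mentioned above can be reproduced without Notepad. The Python sketch below (an illustration added here, not code from the thread) shows that the same 18 ASCII bytes also decode cleanly as UTF-16LE, which is why a statistical guess can go wrong on them.

```python
# Sketch: the same bytes are valid ASCII and valid UTF-16LE.
raw = b"Bush hid the facts"            # 18 bytes, an even count

as_ansi = raw.decode("ascii")          # the intended reading
as_utf16 = raw.decode("utf-16-le")     # also decodes without error

print(as_ansi)
print([f"U+{ord(c):04X}" for c in as_utf16])
```

Every one of the nine resulting code units falls in the CJK Unified Ideographs range (U+4E00 to U+9FFF), so the UTF-16 reading looks like plausible text to a heuristic.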
"Some people have got a mental horizon of radius zero and call it their point of view." --D.Hilbert

ecube

yet the function doesn't CORRECTLY identify even the simplest strings? and the only thing you can find wrong with my function is 0 length strings? yeah, go away.

and no, the function list you posted doesn't help me... they're used extensively with NT functions, in fact I wouldn't be surprised if WideCharToMultiByte is just a wrapper for a couple of them.

Geryon

Quote from: E^cube on July 21, 2010, 04:40:10 PM
and the only thing you can find wrong with my function is 0 length strings?
No !!!

This is a unicode string
'A', 00h, 'B', 00h, 00h, 00h

BUT THEY ARE ALSO UNICODE STRINGS

C5h, 9Eh, C4h, 9Eh, 00h, 00h
and this one
C4h, B0h, A7h, E2h, C3h, 96h, 94h, 9Ch, C3h, 9Ch, 00h, 00h

Your code can NOT identify those strings above.
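
As a check on the byte sequences above, this Python sketch (added for illustration; the describe helper is hypothetical) decodes them strictly as UTF-16LE. Both decode without error and neither contains a zero byte before the terminator, so a test that looks for zero high bytes cannot flag them.

```python
# Sketch: the two byte strings from the post, without their 00h 00h terminator.
s1 = bytes([0xC5, 0x9E, 0xC4, 0x9E])
s2 = bytes([0xC4, 0xB0, 0xA7, 0xE2, 0xC3, 0x96, 0x94, 0x9C, 0xC3, 0x9C])

def describe(raw):
    """Strictly decode raw as UTF-16LE and list the code points."""
    text = raw.decode("utf-16-le")     # raises UnicodeDecodeError if invalid
    return [f"U+{ord(c):04X}" for c in text]

print(describe(s1), "zero bytes:", s1.count(0))
print(describe(s2), "zero bytes:", s2.count(0))
print(s1.decode("utf-8"))              # the first string is also valid UTF-8
```

The first string happens to be valid UTF-8 as well (it reads as two Turkish capital letters), which shows how the same bytes can fit more than one encoding.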

"Some people have got a mental horizon of radius zero and call it their point of view." --D.Hilbert

dedndave

i dunno what you guys are arguing about - lol
there is no good way to identify string types that will work for all valid strings
notice - this applies to strings - not files - files are identified

for strings, i think the key is to know whether it is unicode or not ahead of time
if you are trying to write a function that handles both types of strings, MAKE THE CALLER IDENTIFY THE TYPE
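
A minimal Python sketch of that design, where the caller states the type instead of the function guessing. The function name, the two labels and the choice of cp1252 as the ANSI code page are all assumptions made for this example.

```python
# Sketch: the caller declares the encoding; nothing is guessed.
# "ansi" is mapped to cp1252 here purely as an assumption.

def to_text(raw, encoding):
    """Decode raw bytes using the encoding the caller declares."""
    if encoding not in ("ansi", "utf-16"):
        raise ValueError("encoding must be 'ansi' or 'utf-16'")
    codec = "cp1252" if encoding == "ansi" else "utf-16-le"
    return raw.decode(codec)

print(to_text(b"hi", "ansi"))
print(to_text("hi".encode("utf-16-le"), "utf-16"))
```

This is the same split the Windows API makes with its paired A and W entry points.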

ecube

why are they unicode strings? because you say so? regardless that's an easy fix, can just filter out anything above 127 as unicode using your ridiculous logic.


invoke CharAbove,buff,len,127 ;if eax > 0, then unicode
CharAbove FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done
cmp al,[iChar]
jbe > @notchar ;unsigned compare, so bytes 80h-FFh count as above 127
add edx,1
@notchar:
dec ecx
test ecx,ecx
jz > @done
inc edi
jmp <

@done:
mov eax,edx
RET
ENDF



here you go champ... this correctly identifies all your strings and the ones I've tried, and so far IsTextUnicode has failed on countless ones. How is my code so incorrect again? Dave's right that it's not possible for this to cater to everything Unicode, but in terms of general use it seems to be fine. I don't know what it is with people that have no post count coming on here attacking me, but it's getting old.

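data section
mybuff db 256 dup ?
ansistr db "ascii",0 ;ANSI test string; the function takes an address, not a character literal
unistr  dw "A",0

CODE SECTION
start:
invoke IsUniOrAscii,addr ansistr,addr mybuff
;eax= 1
invoke IsUniOrAscii,addr unistr,addr mybuff
;eax= 2
invoke ExitProcess,0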

IsUniOrAscii FRAME a1,obuff
invoke lstrcpy,[obuff],[a1]
invoke CharAbove,[obuff],256,127
test eax,eax
jnz  >
invoke lstrlen,[obuff]
cmp eax,1
jle >
mov eax,1
ret
:
invoke UniToAscii,[a1],[obuff],256
invoke lstrlen,[obuff]
test eax,eax
jz >
invoke FindChar,[obuff],256,63;?
cmp eax,3
jge >
mov eax,2
ret
:
invoke lstrcpy,[obuff],[a1]
mov eax,1
RET
ENDF

UniToAscii FRAME szAcii,szUnicodeBuffx,bufsizex
invoke WideCharToMultiByte,CP_ACP, 0, [szAcii], -1,[szUnicodeBuffx],[bufsizex],NULL,FALSE
ret
ENDF

FindChar FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done
cmp al,[iChar]
jne > @notchar
add edx,1
@notchar:
dec ecx
test ecx,ecx
jz > @done
inc edi
jmp <

@done:
mov eax,edx
RET
ENDF


CharAbove FRAME iBuf,iLen,iChar
mov edi,[iBuf]
mov ecx,[iLen]
xor eax,eax
xor edx,edx
:
mov al,B[edi]
cmp al,0
je > @done2
cmp al,[iChar]
jbe > @notchar2 ;unsigned compare, so bytes 80h-FFh count as above 127
mov eax,1
ret
@notchar2:
dec ecx
test ecx,ecx
jz > @done2
inc edi
jmp <

@done2:
mov eax,edx
RET
ENDF

cork

Quote from: E^cube on July 21, 2010, 01:05:42 PM
Unicode strings, when read as ANSI, usually show just 1 character, and I'm aware Unicode strings use 2 bytes instead of one. Example:

unicode  dw "h","i"," ","T","H","E","R","E",0

So what's an effective way to identify an unknown string as Unicode or ANSI?

Which encoding of Unicode are you trying to identify? UTF-16, UTF-8, UCS-2, UCS-4?
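
To make the question concrete, here is a small Python sketch (not from the thread; encoded_hex is a made-up helper) showing the bytes the same two-character text produces under three encoding forms. Python has no separate UCS-2 or UCS-4 codec, so UTF-16 and UTF-32 stand in for them; for characters in the Basic Multilingual Plane the bytes are the same.

```python
# Sketch: one text, three byte layouts (little-endian, no byte order mark).

def encoded_hex(text, codec):
    """Return the encoded bytes as space-separated hex pairs."""
    return text.encode(codec).hex(" ")

text = "hi"
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    print(f"{codec:10} {encoded_hex(text, codec)}")
```

A detection routine written for one of these layouts says nothing about the others, so the target encoding has to be pinned down first.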

Geryon

Quote from: E^cube on July 21, 2010, 05:08:23 PM
why are they unicode strings? because you say so? regardless that's an easy fix, can just filter out anything above 127 as unicode using your ridiculous logic.
No, because the Unicode standard says so:
http://www.unicode.org/versions/Unicode5.2.0/
Quote from: E^cube on July 21, 2010, 05:08:23 PM
I don't know what it is with people that have no post count coming on here attacking me, but it's getting old.
I have been on this forum since it was founded.
By the way, now I realize what I missed in 2.5 years away from the forum.
Jerk Population...
"Some people have got a mental horizon of radius zero and call it their point of view." --D.Hilbert

ecube

whatever guy, you come on here posting in all caps, for no reason, insulting my code and making invalid claims, none of which you've backed up, and now you're name calling and insulting the forum.

By the way, you might have registered here a few years ago, but you've posted a total of what, 5 or 6 threads/comments in all these years?  How about you post some useful code instead of Google links?

redskull

Come on guys, must we do this?  The bottom line is that there IS NO DEFINITIVE WAY to tell what format ANY data is in, ever, just by looking.  It's just bytes in memory.  The post asked for an *effective* way to *try* to tell the difference, and as long as we are talking about average-length, English-language, UTF-16 Unicode strings, if it works, what's the problem?
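
For that narrow case, a best-effort check could look something like the Python sketch below. It is only a heuristic under stated assumptions: the function name and the 0.9 threshold are invented for this example, and it deliberately covers nothing beyond mostly-English UTF-16LE text.

```python
# Sketch: guess "English UTF-16LE" by counting zero bytes in the high
# (odd-indexed) positions. Threshold and name are hypothetical.

def looks_like_english_utf16le(raw, threshold=0.9):
    """Return True if most high bytes are zero, as in English UTF-16LE."""
    if len(raw) < 2 or len(raw) % 2:
        return False                    # too short, or not whole 16-bit units
    high_bytes = raw[1::2]              # every second byte, starting at index 1
    zero_ratio = high_bytes.count(0) / len(high_bytes)
    return zero_ratio >= threshold

print(looks_like_english_utf16le("hi THERE".encode("utf-16-le")))
print(looks_like_english_utf16le(b"hi THERE"))
print(looks_like_english_utf16le(bytes([0xC5, 0x9E, 0xC4, 0x9E])))
```

The last call returns False even though those bytes are valid UTF-16, which is exactly the limitation being argued about in this thread.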

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

Geryon

Quote from: E^cube on July 21, 2010, 06:11:23 PM
whatever guy, you come on here posting in all caps, for no reason, insulting my code and making invalid claims, none of which you've backed up, and now you're name calling and insulting the forum.
I strongly recommend reading the messages.
Quote from: E^cube on July 21, 2010, 06:11:23 PM
By the way, you might have registered here a few years ago, but you've posted a total of what, 5 or 6 threads/comments in all these years?  How about you post some useful code instead of Google links?
I was registered here when the win32asm-board was still alive. That was around 5-10 years ago.
Everybody who has been around long enough remembers me. I don't have to prove myself to you.

On the other hand, there is no logical connection between when I registered, or how long I have been using asm, and the validity of my claims. But it's obvious that trying to help you is futile.
If you say 2 + 2 = 5, I completely agree, no matter what.
"Some people have got a mental horizon of radius zero and call it their point of view." --D.Hilbert