Hi all,
I am looking for a function which can compare strings (it should not be case sensitive) and return a value which represents if the input strings were equal or the first one has had higher order in dictionary order or vice versa. In MASM32 library there is a function which szCmpi but it just compares two strings (it is not case sensitive either) and returned value can be used to know if they were equal or not. Actually, it is easy to find out if two strings are equal or one has higher order, but the problem is that I do not want to convert both to uppercase or lowercase to make the functions insensitive. I also know that szCmpi uses a table to be insensitive but the source code was quite confusing. Can anybody help?
D.
Danesh,
lstrcmpi win32 api is doing that and it have the advantage to take care of the language of the string which I don't think szCmpi care. But lstrcmpi is slow.
Danesh,
The algo is not that complicated, it has a table that has the "a" to "z" range and the "A" the "Z" ranges of the same value so that case comparisons are no longer a problem. By using a table "A" and "a" match because their position in the table is different but the value is the same. There should be no problem with just using it if you don't want to analyse the code.
In psuedo code:
func cisStrCmp ( string a, string b )
len = min(length(a), length(b))
for i = 1 to len
ca = tolower(a)
cb = tolower(b)
if ca > cb then return a comes after b
if cb > ca then return b comes after a
next
if length(a) < length(b) return a comes before b
if length(b) < length(a) return b comes before a
return a and b are equal
end func
I havn't done asm in a while, so I didn't write it as such, but it shouldn't be hard too... and this algorithm is O(n), I don't think it get's much faster, although I could be mistaken... (2nd thought) some optimizations when writing could be:
- store length of a and length of b for reuse at start
- make lookup-table for tolower, which would be a 0-255 array of bytes, where a = i, but [A-Z] = a-z
hope this helps.
- Chris
Danesh,
Here is an over commented version of the algo. The two character ranges for upper and lower case have the same values so that case insensitive comparisons work. The algo code reads the two byte values by zero extending them into 32 bit registers. Gets the first byte value from the table and zero extends it into a 32 bit register then does a direct BYTE comparison to the other value in the table.
Its done this way for two reasons, the table can be modified for different chaacter sets and the table method is very fast.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.486 ; force 32 bit code
.model flat, stdcall ; memory model & calling convention
option casemap :none ; case sensitive
comment * -----------------------------------------------------------
This algorithm was developed in conjunction with Jim Giordano
----------------------------------------------------------- *
.data
align 16
szCmpi_tbl \
db 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
db 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
db 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
db 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
db 64, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111 ; < 1st 26 characters
db 112,113,114,115,116,117,118,119,120,121,122, 91, 92, 93, 94, 95
db 96, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111 ; < 2nd 26 characters
db 112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127
db 128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143
db 144,145,146,147,148,149,150,151,152,153,154,155,156,156,158,159
db 160,161,162,163,164,165,166,167,168,169,170,171,172,173,173,175
db 176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191
db 192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207
db 208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223
db 224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239
db 240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE ; remove stack frame
OPTION EPILOGUE:NONE
align 16
szCmpi proc src:DWORD,dst:DWORD,ln:DWORD
; -------------------------------------------------------------------------
; entry point of procedure is the address of the following PUSH instruction
; -------------------------------------------------------------------------
push ebx ; preserve used registers
push esi
push edi
mov esi, [esp+4+12] ; load source adress
mov edi, [esp+8+12] ; load destination address
sub eax, eax ; zero eax as index
align 4
@@:
movzx edx, BYTE PTR [esi+eax] ; zero extend source byte into EDX
movzx ebx, BYTE PTR [edi+eax] ; zero extend dest byte into EBX
movzx ecx, BYTE PTR [edx+szCmpi_tbl] ; zero extend 1st byte table value into ECX
add eax, 1
cmp cl, [ebx+szCmpi_tbl] ; compare it to second byte in table
jne quit ; exit on mismatch with count in EAX
cmp eax, [esp+12+12] ; test index against byte count
jb @B
sub eax, eax ; set EAX to zero on match
quit:
pop edi ; restore used registers
pop esi
pop ebx
ret 12 ; balance the stack on exit
szCmpi endp
OPTION PROLOGUE:PrologueDef ; restore stack frame
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end
I've worked on bytes table to make it worked with french characters also (for szCmpi from masm32.lib).
Table
szCmpi_tbl BYTE 000h,001h,002h,003h,004h,005h,006h,007h,008h,009h,00Ah,00Bh,00Ch,00Dh,00Eh,00Fh
BYTE 010h,011h,012h,013h,014h,015h,016h,017h,018h,019h,01Ah,01Bh,01Ch,01Dh,01Eh,01Fh
BYTE 020h,021h,022h,023h,024h,025h,026h,027h,028h,029h,02Ah,02Bh,02Ch,02Dh,02Eh,02Fh
BYTE 030h,031h,032h,033h,034h,035h,036h,037h,038h,039h,03Ah,03Bh,03Ch,03Dh,03Eh,03Fh
BYTE 040h,061h,062h,063h,064h,065h,066h,067h,068h,069h,06Ah,06Bh,06Ch,06Dh,06Eh,06Fh
BYTE 070h,071h,072h,073h,074h,075h,076h,077h,078h,079h,07Ah,05Bh,05Ch,05Dh,05Eh,05Fh
BYTE 060h,061h,062h,063h,064h,065h,066h,067h,068h,069h,06Ah,06Bh,06Ch,06Dh,06Eh,06Fh
BYTE 070h,071h,072h,073h,074h,075h,076h,077h,078h,079h,07Ah,07Bh,07Ch,07Dh,07Eh,07Fh
BYTE 080h,081h,082h,083h,084h,085h,086h,087h,088h,089h,08Ah,08Bh,08Ch,08Dh,08Eh,08Fh
BYTE 090h,091h,092h,093h,094h,095h,096h,097h,098h,099h,09Ah,09Bh,09Ch,09Dh,09Eh,09Fh
BYTE 0A0h,0A1h,0A2h,0A3h,0A4h,0A5h,0A6h,0A7h,0A8h,0A9h,0AAh,0ABh,0ACh,0ADh,0AEh,0AFh
BYTE 0B0h,0B1h,0B2h,0B3h,0B4h,0B5h,0B6h,0B7h,0B8h,0B9h,0BAh,0BBh,0BCh,0BDh,0BEh,0BFh
BYTE 0E0h,0C1h,0E2h,0C3h,0E4h,0C5h,0E6h,0E7h,0E8h,0E9h,0EAh,0EBh,0CCh,0CDh,0EEh,0EFh
BYTE 0D0h,0D1h,0F2h,0D3h,0F4h,0D5h,0F6h,0D7h,0D8h,0D9h,0DAh,0FBh,0FCh,0DDh,0DEh,0DFh
BYTE 0E0h,0E1h,0E2h,0E3h,0E4h,0E5h,0E6h,0E7h,0E8h,0E9h,0EAh,0EBh,0ECh,0EDh,0EEh,0EFh
BYTE 0F0h,0F1h,0F2h,0F3h,0F4h,0F5h,0F6h,0F7h,0F8h,0F9h,0FAh,0FBh,0FCh,0FDh,0FEh,0FFh
I don't think I forgot any but if I forgot a character send me a mail (I'm not sure you'll see these in your browser if not french Windows).
Dites moi si j'ai oublié un accent quelconque pour que j'ajuste la table de caractères. Merci
szTest1 BYTE "À Ä Â Ç É È Ë Ê Î Ï Ô Ö Ò Û Ü Æ",0
szTest2 BYTE "à ä â ç é è ë ê î ï ô ö ò û ü æ",0
jdoe,
szTest1 BYTE "À Ä Â Ç É È Ë Ê Î Ï Ô Ö Ò Û Ü Æ",0
szTest2 BYTE "à ä â ç é è ë ê î ï ô ö ò û ü æ",0
Displays fine on my Gecko engined K-MELEON.
Quote from: hutch-- on January 05, 2007, 02:45:18 AM
Displays fine on my Gecko engined K-MELEON.
I think it must be because the french characters fits in the ASCII table but those who use unicode language could have problems to see it. I can't be sure of that and that's why I'm never sure if it will be displayed right.
:wink
This updated table should take care of ALL lower and uppercase characters, even if they are not used.
szCmpi_tbl BYTE 000h,001h,002h,003h,004h,005h,006h,007h,008h,009h,00Ah,00Bh,00Ch,00Dh,00Eh,00Fh
BYTE 010h,011h,012h,013h,014h,015h,016h,017h,018h,019h,01Ah,01Bh,01Ch,01Dh,01Eh,01Fh
BYTE 020h,021h,022h,023h,024h,025h,026h,027h,028h,029h,02Ah,02Bh,02Ch,02Dh,02Eh,02Fh
BYTE 030h,031h,032h,033h,034h,035h,036h,037h,038h,039h,03Ah,03Bh,03Ch,03Dh,03Eh,03Fh
BYTE 040h,061h,062h,063h,064h,065h,066h,067h,068h,069h,06Ah,06Bh,06Ch,06Dh,06Eh,06Fh
BYTE 070h,071h,072h,073h,074h,075h,076h,077h,078h,079h,07Ah,05Bh,05Ch,05Dh,05Eh,05Fh
BYTE 060h,061h,062h,063h,064h,065h,066h,067h,068h,069h,06Ah,06Bh,06Ch,06Dh,06Eh,06Fh
BYTE 070h,071h,072h,073h,074h,075h,076h,077h,078h,079h,07Ah,07Bh,07Ch,07Dh,07Eh,07Fh
BYTE 080h,081h,082h,083h,084h,085h,086h,087h,088h,089h,09Ah,08Bh,09Ch,08Dh,09Eh,08Fh
BYTE 090h,091h,092h,093h,094h,095h,096h,097h,098h,099h,09Ah,09Bh,09Ch,09Dh,09Eh,09Fh
BYTE 0A0h,0A1h,0A2h,0A3h,0A4h,0A5h,0A6h,0A7h,0A8h,0A9h,0AAh,0ABh,0ACh,0ADh,0AEh,0AFh
BYTE 0B0h,0B1h,0B2h,0B3h,0B4h,0B5h,0B6h,0B7h,0B8h,0B9h,0BAh,0BBh,0BCh,0BDh,0BEh,0BFh
BYTE 0E0h,0E1h,0E2h,0E3h,0E4h,0E5h,0E6h,0E7h,0E8h,0E9h,0EAh,0EBh,0ECh,0EDh,0EEh,0EFh
BYTE 0D0h,0F1h,0F2h,0F3h,0F4h,0F5h,0F6h,0D7h,0F8h,0F9h,0FAh,0FBh,0FCh,0FDh,0DEh,0DFh
BYTE 0E0h,0E1h,0E2h,0E3h,0E4h,0E5h,0E6h,0E7h,0E8h,0E9h,0EAh,0EBh,0ECh,0EDh,0EEh,0EFh
BYTE 0F0h,0F1h,0F2h,0F3h,0F4h,0F5h,0F6h,0F7h,0F8h,0F9h,0FAh,0FBh,0FCh,0FDh,0FEh,0FFh
szTest1 BYTE "À Ä Â Ç É È Ë Ê Î Ï Ô Ö Ò Û Ü Æ Ú Œ Á Ì Í Ó Õ Ù Ú Ø Š Ž Ý Å Ã Ñ",0
szTest2 BYTE "à ä â ç é è ë ê î ï ô ö ò û ü æ ú œ á ì í ó õ ù ú ø š ž ý å ã ñ",0
There is nothing more to do with this table :P
Hello Danesh,
that's my way of comparing strings. The main idea is to compare DWORDs instead of bytes. A possible char translation is done on the fly. Its a bit faster than the algo above. :-)
There are 2 procs with and without char translation.
Beside this I've attached my CharTranslate-Routines with some predefined tables e.g. AsciiToAnsi, AnsiUpper, EbcdicToAscii. Significant faster than the MS APIs.
Ulli
[attachment deleted by admin]
WOW !!! Thanks everybody! I have modified the szCmpi table and of course the code. My goal was to support English and Swedish characters. (Swedish has three chars which are "äåö").
Thanks again,
Danesh
Quote from: Danesh on January 05, 2007, 06:40:29 PM
My goal was to support English and Swedish characters. (Swedish has three chars which are "äåö").
I had the feeling that if these few more characters were there that it could be useful somehow.
:8)
Here my strcmp algo, it's a modest one but it does the job :
.386
.model flat,stdcall
option casemap:none
.code
strcmp PROC USES esi edi dest:DWORD,src:DWORD
mov esi,src
mov edi,dest
xor eax,eax
xor ecx,ecx
xor edx,edx
dec edx
@@:
add edx,1
mov al,BYTE PTR [edi+edx]
mov cl,BYTE PTR [esi+edx]
cmp al,cl
jne no_match
test al,al
jnz @b
xor eax,eax
ret
no_match:
sub eax,ecx
ret
strcmp ENDP
END