News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

String comparision

Started by Danesh, January 04, 2007, 11:08:58 PM

Previous topic - Next topic

Danesh

Hi all,

I am looking for a function which can compare strings (it should not be case sensitive) and return a value which represents if the input strings were equal or the first one has had higher order in dictionary order or vice versa. In MASM32 library there is a function which szCmpi but it just compares two strings (it is not case sensitive either) and returned value can be used to know if they were equal or not. Actually, it is easy to find out if two strings are equal or one has higher order, but the problem is that I do not want to convert both to uppercase or lowercase to make the functions insensitive. I also know that szCmpi uses a table to be insensitive but the source code was quite confusing. Can anybody help?

D.

jdoe

Danesh,

lstrcmpi win32 api is doing that and it have the advantage to take care of the language of the string which I don't think szCmpi care. But lstrcmpi is slow.

hutch--

Danesh,

The algo is not that complicated, it has a table that has the "a" to "z" range and the "A" the "Z" ranges of the same value so that case comparisons are no longer a problem. By using a table "A" and "a" match because their position in the table is different but the value is the same. There should be no problem with just using it if you don't want to analyse the code.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

LithiumDex

In psuedo code:

func cisStrCmp ( string a, string b )
len = min(length(a), length(b))
for i = 1 to len
   ca = tolower(a)
   cb = tolower(b)
   if ca > cb then return a comes after b
   if cb > ca then return b comes after a
next
if length(a) < length(b) return a comes before b
if length(b) < length(a) return b comes before a
return a and b are equal
end func

I havn't done asm in a while, so I didn't write it as such, but it shouldn't be hard too... and this algorithm is O(n), I don't think it get's much faster, although I could be mistaken... (2nd thought) some optimizations when writing could be:
- store length of a and length of b for reuse at start
- make lookup-table for tolower, which would be a 0-255 array of bytes, where a = i, but [A-Z] = a-z

hope this helps.

- Chris

hutch--

Danesh,

Here is an over commented version of the algo. The two character ranges for upper and lower case have the same values so that case insensitive comparisons work. The algo code reads the two byte values by zero extending them into 32 bit registers. Gets the first byte value from the table and zero extends it into a 32 bit register then does a direct BYTE comparison to the other value in the table.

Its done this way for two reasons, the table can be modified for different chaacter sets and the table method is very fast.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    .486                      ; force 32 bit code
    .model flat, stdcall      ; memory model & calling convention
    option casemap :none      ; case sensitive

comment * -----------------------------------------------------------
        This algorithm was developed in conjunction with Jim Giordano
        ----------------------------------------------------------- *

    .data
    align 16
      szCmpi_tbl \
      db   0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15
      db  16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
      db  32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
      db  48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
      db  64, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111    ; < 1st 26 characters
      db 112,113,114,115,116,117,118,119,120,121,122, 91, 92, 93, 94, 95
      db  96, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111    ; < 2nd 26 characters
      db 112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127
      db 128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143
      db 144,145,146,147,148,149,150,151,152,153,154,155,156,156,158,159
      db 160,161,162,163,164,165,166,167,168,169,170,171,172,173,173,175
      db 176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191
      db 192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207
      db 208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223
      db 224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239
      db 240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255

    .code

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE                        ; remove stack frame
OPTION EPILOGUE:NONE

align 16

szCmpi proc src:DWORD,dst:DWORD,ln:DWORD

  ; -------------------------------------------------------------------------
  ; entry point of procedure is the address of the following PUSH instruction
  ; -------------------------------------------------------------------------
    push ebx                                ; preserve used registers
    push esi
    push edi

    mov esi, [esp+4+12]                     ; load source adress
    mov edi, [esp+8+12]                     ; load destination address
    sub eax, eax                            ; zero eax as index

  align 4
  @@:
    movzx edx, BYTE PTR [esi+eax]           ; zero extend source byte into EDX
    movzx ebx, BYTE PTR [edi+eax]           ; zero extend dest byte into EBX
    movzx ecx, BYTE PTR [edx+szCmpi_tbl]    ; zero extend 1st byte table value into ECX
    add eax, 1
    cmp cl, [ebx+szCmpi_tbl]                ; compare it to second byte in table
    jne quit                                ; exit on mismatch with count in EAX
    cmp eax, [esp+12+12]                    ; test index against byte count
    jb @B

    sub eax, eax                            ; set EAX to zero on match

  quit:
    pop edi                                 ; restore used registers
    pop esi
    pop ebx

    ret 12                                  ; balance the stack on exit

szCmpi endp

OPTION PROLOGUE:PrologueDef                 ; restore stack frame
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    end
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jdoe

I've worked on bytes table to make it worked with french characters also (for szCmpi from masm32.lib).

Table

szCmpi_tbl BYTE 000h,001h,002h,003h,004h,005h,006h,007h,008h,009h,00Ah,00Bh,00Ch,00Dh,00Eh,00Fh
           BYTE 010h,011h,012h,013h,014h,015h,016h,017h,018h,019h,01Ah,01Bh,01Ch,01Dh,01Eh,01Fh
           BYTE 020h,021h,022h,023h,024h,025h,026h,027h,028h,029h,02Ah,02Bh,02Ch,02Dh,02Eh,02Fh
           BYTE 030h,031h,032h,033h,034h,035h,036h,037h,038h,039h,03Ah,03Bh,03Ch,03Dh,03Eh,03Fh
           BYTE 040h,061h,062h,063h,064h,065h,066h,067h,068h,069h,06Ah,06Bh,06Ch,06Dh,06Eh,06Fh
           BYTE 070h,071h,072h,073h,074h,075h,076h,077h,078h,079h,07Ah,05Bh,05Ch,05Dh,05Eh,05Fh
           BYTE 060h,061h,062h,063h,064h,065h,066h,067h,068h,069h,06Ah,06Bh,06Ch,06Dh,06Eh,06Fh
           BYTE 070h,071h,072h,073h,074h,075h,076h,077h,078h,079h,07Ah,07Bh,07Ch,07Dh,07Eh,07Fh
           BYTE 080h,081h,082h,083h,084h,085h,086h,087h,088h,089h,08Ah,08Bh,08Ch,08Dh,08Eh,08Fh
           BYTE 090h,091h,092h,093h,094h,095h,096h,097h,098h,099h,09Ah,09Bh,09Ch,09Dh,09Eh,09Fh
           BYTE 0A0h,0A1h,0A2h,0A3h,0A4h,0A5h,0A6h,0A7h,0A8h,0A9h,0AAh,0ABh,0ACh,0ADh,0AEh,0AFh
           BYTE 0B0h,0B1h,0B2h,0B3h,0B4h,0B5h,0B6h,0B7h,0B8h,0B9h,0BAh,0BBh,0BCh,0BDh,0BEh,0BFh
           BYTE 0E0h,0C1h,0E2h,0C3h,0E4h,0C5h,0E6h,0E7h,0E8h,0E9h,0EAh,0EBh,0CCh,0CDh,0EEh,0EFh
           BYTE 0D0h,0D1h,0F2h,0D3h,0F4h,0D5h,0F6h,0D7h,0D8h,0D9h,0DAh,0FBh,0FCh,0DDh,0DEh,0DFh
           BYTE 0E0h,0E1h,0E2h,0E3h,0E4h,0E5h,0E6h,0E7h,0E8h,0E9h,0EAh,0EBh,0ECh,0EDh,0EEh,0EFh
           BYTE 0F0h,0F1h,0F2h,0F3h,0F4h,0F5h,0F6h,0F7h,0F8h,0F9h,0FAh,0FBh,0FCh,0FDh,0FEh,0FFh


I don't think I forgot any but if I forgot a character send me a mail (I'm not sure you'll see these in your browser if not french Windows).

Dites moi si j'ai oublié un accent quelconque pour que j'ajuste la table de caractères. Merci


szTest1 BYTE "À Ä Â Ç É È Ë Ê Î Ï Ô Ö Ò Û Ü Æ",0
szTest2 BYTE "à ä â ç é è ë ê î ï ô ö ò û ü æ",0



hutch--

jdoe,


szTest1 BYTE "À Ä Â Ç É È Ë Ê Î Ï Ô Ö Ò Û Ü Æ",0
szTest2 BYTE "à ä â ç é è ë ê î ï ô ö ò û ü æ",0


Displays fine on my Gecko engined K-MELEON.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jdoe

Quote from: hutch-- on January 05, 2007, 02:45:18 AM
Displays fine on my Gecko engined K-MELEON.

I think it must be because the french characters fits in the ASCII table but those who use unicode language could have problems to see it. I can't be sure of that and that's why I'm never sure if it will be displayed right.

:wink


jdoe


This updated table should take care of ALL lower and uppercase characters, even if they are not used.


szCmpi_tbl BYTE 000h,001h,002h,003h,004h,005h,006h,007h,008h,009h,00Ah,00Bh,00Ch,00Dh,00Eh,00Fh
           BYTE 010h,011h,012h,013h,014h,015h,016h,017h,018h,019h,01Ah,01Bh,01Ch,01Dh,01Eh,01Fh
           BYTE 020h,021h,022h,023h,024h,025h,026h,027h,028h,029h,02Ah,02Bh,02Ch,02Dh,02Eh,02Fh
           BYTE 030h,031h,032h,033h,034h,035h,036h,037h,038h,039h,03Ah,03Bh,03Ch,03Dh,03Eh,03Fh
           BYTE 040h,061h,062h,063h,064h,065h,066h,067h,068h,069h,06Ah,06Bh,06Ch,06Dh,06Eh,06Fh
           BYTE 070h,071h,072h,073h,074h,075h,076h,077h,078h,079h,07Ah,05Bh,05Ch,05Dh,05Eh,05Fh
           BYTE 060h,061h,062h,063h,064h,065h,066h,067h,068h,069h,06Ah,06Bh,06Ch,06Dh,06Eh,06Fh
           BYTE 070h,071h,072h,073h,074h,075h,076h,077h,078h,079h,07Ah,07Bh,07Ch,07Dh,07Eh,07Fh
           BYTE 080h,081h,082h,083h,084h,085h,086h,087h,088h,089h,09Ah,08Bh,09Ch,08Dh,09Eh,08Fh
           BYTE 090h,091h,092h,093h,094h,095h,096h,097h,098h,099h,09Ah,09Bh,09Ch,09Dh,09Eh,09Fh
           BYTE 0A0h,0A1h,0A2h,0A3h,0A4h,0A5h,0A6h,0A7h,0A8h,0A9h,0AAh,0ABh,0ACh,0ADh,0AEh,0AFh
           BYTE 0B0h,0B1h,0B2h,0B3h,0B4h,0B5h,0B6h,0B7h,0B8h,0B9h,0BAh,0BBh,0BCh,0BDh,0BEh,0BFh
           BYTE 0E0h,0E1h,0E2h,0E3h,0E4h,0E5h,0E6h,0E7h,0E8h,0E9h,0EAh,0EBh,0ECh,0EDh,0EEh,0EFh
           BYTE 0D0h,0F1h,0F2h,0F3h,0F4h,0F5h,0F6h,0D7h,0F8h,0F9h,0FAh,0FBh,0FCh,0FDh,0DEh,0DFh
           BYTE 0E0h,0E1h,0E2h,0E3h,0E4h,0E5h,0E6h,0E7h,0E8h,0E9h,0EAh,0EBh,0ECh,0EDh,0EEh,0EFh
           BYTE 0F0h,0F1h,0F2h,0F3h,0F4h,0F5h,0F6h,0F7h,0F8h,0F9h,0FAh,0FBh,0FCh,0FDh,0FEh,0FFh




szTest1 BYTE "À Ä Â Ç É È Ë Ê Î Ï Ô Ö Ò Û Ü Æ Ú Œ Á Ì Í Ó Õ Ù Ú Ø Š Ž Ý Å Ã Ñ",0
szTest2 BYTE "à ä â ç é è ë ê î ï ô ö ò û ü æ ú œ á ì í ó õ ù ú ø š ž ý å ã ñ",0


There is nothing more to do with this table   :P


UlliN

Hello Danesh,

that's my way of comparing strings. The main idea is to compare DWORDs instead of bytes. A possible char translation is done on the fly. Its a bit faster than the algo above. :-)
There are 2 procs with and without char translation.

Beside this I've attached my CharTranslate-Routines with some predefined tables e.g. AsciiToAnsi, AnsiUpper, EbcdicToAscii. Significant faster than the MS APIs.


Ulli


[attachment deleted by admin]

Danesh

WOW !!! Thanks everybody! I have modified the szCmpi table and of course the code. My goal was to support English and Swedish characters. (Swedish has three chars which are "äåö").

Thanks again,

Danesh


jdoe

Quote from: Danesh on January 05, 2007, 06:40:29 PM
My goal was to support English and Swedish characters. (Swedish has three chars which are "äåö").

I had the feeling that if these few more characters were there that it could be useful somehow.

:8)


Vortex

Here my strcmp algo, it's a modest one but it does the job :

.386
.model flat,stdcall
option casemap:none

.code

strcmp PROC USES esi edi dest:DWORD,src:DWORD

mov esi,src
mov edi,dest
xor eax,eax
xor ecx,ecx
xor edx,edx
dec edx
@@:
add edx,1
mov al,BYTE PTR [edi+edx]
mov cl,BYTE PTR [esi+edx]
cmp al,cl
jne no_match
test al,al
jnz @b
xor eax,eax
ret

no_match:

sub eax,ecx
ret

strcmp ENDP
END