News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

CmpI

Started by Jimg, June 27, 2005, 04:32:07 PM

Previous topic - Next topic

Jimg

Over in the masm32 section, they are talking about the cmpi routine.  I was curious to see if the old xlat instruction was neglected by the newer cpu's as so many other instructions have been.  The first routine is Hutch's stock routine, the second, I replace xlat with mov eax,[ebx+eax], and the third, I didn't use the table at all, but use the old fashioned method of test for 'A' to 'Z' and mask to lower case.  Sure enough, on an Athlon anyway, the old xlat is slower-

Test routines for correctness:
szCmpix     35
szCmpix2    35
szCmpix3    35

Proc cycl
szCmpix    495
szCmpix2   423
szCmpix3   424


[attachment deleted by admin]

Phil

Results on a 996 MHz P3:
Test routines for correctness:
szCmpix     35
szCmpix2    35
szCmpix3    35

Proc cycl
szCmpix    234
szCmpix2   204
szCmpix3   363


hutch--

You can expect that range of timing wander between different hardware as XLATB is an old instruction but it does have good averages across a wide range of hardware from very old to reasonably current.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Codewarp

My sense of xlatb is that it is another 8-bit oriented instruction that may benefit from a 32-bit implementation, like:

    movzx   eax, byte ptr [ebx]      ; read a byte
    movzx   eax, table [eax]          ; translate it through table

Jimg

uh..  yeah, but the table would have to be billions of bytes long...

Jimg

But you're right about 32 bit in general.  I converted the 8-bit instructions to 32-bit in test number 4.  It's now about 20% faster than stock using simple non-mmx, at least on an athlon, no idea if pentiums have a similar dislike to 8-bit instructions.
Test routines for correctness:
szCmpix     35
szCmpix2    35
szCmpix3    35
szCmpix4    35

Proc cycl
szCmpix    494
szCmpix2   423
szCmpix3   422
szCmpix4   396


hutch--

Give this one a blast, I have not set up a test piece to time it but it should be a bit faster on late model hardware. Note the MOVZX can be slow on some of the older stuff.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    .data
    align 16
      tbl \
      db   0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15
      db  16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
      db  32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
      db  48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
      db  64, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111
      db 112,113,114,115,116,117,118,119,120,121,122, 91, 92, 93, 94, 95
      db  96, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111
      db 112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127
      db 128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143
      db 144,145,146,147,148,149,150,151,152,153,154,155,156,156,158,159
      db 160,161,162,163,164,165,166,167,168,169,170,171,172,173,173,175
      db 176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191
      db 192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207
      db 208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223
      db 224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239
      db 240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
    .code

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

szCmpixx proc src:DWORD,dst:DWORD,ln:DWORD

    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst
    mov eax, -1

  align 4
  @@:
    add eax, 1
    movzx edx, BYTE PTR [esi+eax]
    mov cl, [edx+tbl]
    movzx ebx, BYTE PTR [edi+eax]
    cmp cl, [ebx+tbl]
    jne miss
    cmp eax, ln
    jne @B

    sub eax, eax
    jmp quit

  miss:
    add eax, 1

  quit:
    pop edi
    pop esi
    pop ebx

    ret

szCmpixx endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

#7
Very nice, an incredible difference.  I had to unroll the loop once and make some other minor changes to beat it using compares A-Z.  And it's probably only faster on an athlon.  The only saving grace is my code is only 160 byte long vs. a 256 byte table.  Meaningless, really.
Test routines for correctness:
szCmpix     35
szCmpixx    35
szCmpix6    35

Proc cycl
szCmpix    494
szCmpixx   362
szCmpix6   331




Jimg

Hutch-

I had to add some more tests to be sure I wasn't screwing up the routine.  In the process, I detected a small problem with your latest (szCmpixx).  Returned 105 but was called with a length of 104 (went one too far?).

Test routines for correctness:
Matching Strings at various lengths, answer should be zero except last=105:
test size    0    1    2    3    5    8   13   22   39   55   89  104  105
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix      0    0    0    0    0    0    0    0    0    0    0    0  105
szCmpixx     0    0    0    0    0    0    0    0    0    0    0  105  105
szCmpix6     0    0    0    0    0    0    0    0    0    0    0    0  105

Execution Cycles:
test size    0    1    2    3    5    8   13   22   39   55   89  104  105
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix     27   39   52   65   91  131  211  331  553  765 1209 1410 1408
szCmpixx    15   27   36   45   63   91  158  243  399  550  875 1020 1022
szCmpix6     5   20   51   35  104   99  115  211  338  468  742  874  923


Test routines for correctness:
Mis-matching Strings, all test size=110, mismatch at character:
             1    2    3    4    5   13   22   39   55   89  103  104
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix      1    2    3    4    5   13   22   39   55   89  103  104
szCmpixx     1    2    3    4    5   13   22   39   55   89  103  104
szCmpix6     1    2    3    4    5   13   22   39   55   89  103  104

Execution Cycles:
miss at      1    2    3    4    5   13   22   39   55   89  103  104
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix     26   37   50   64   76  203  325  542  753 1199 1387 1396
szCmpixx    16   25   34   43   52  147  232  387  540  864 1000 1012
szCmpix6    20   27   34   42   51  114  211  336  465  740  854  914

[attachment deleted by admin]

hutch--

#9
Jim,

I will have a look at the new version when I get in front but it should also be tested on a single long source as well. What I would be inclined to do is load a known file of say about 1 meg then copy it to another buffer, convert one to upper case and the other to lower case, then run any insensitive compare algos on the two buffers.

The short repeat tests are good for testing fast takeoff but comparing files is a valid use for an algo of this type and a file of a meg or so would make a good starting point.

LATER :

Here is a test piece that test both on windows.inc. On longer runs the later version is much faster.

Theses are the timings on my PIV.


500 MS szCmpix
187 MS szCmpixx
485 MS szCmpix
203 MS szCmpixx
484 MS szCmpix
188 MS szCmpixx
500 MS szCmpix
187 MS szCmpixx
500 MS szCmpix
188 MS szCmpixx
484 MS szCmpix
188 MS szCmpixx
500 MS szCmpix
187 MS szCmpixx
500 MS szCmpix
188 MS szCmpixx

494 MS average for szCmpi
189 MS average for szCmpixx

Press ENTER to exit


[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

While you were doing that, I did one that generated strings of 1039999 charactes in heap space.

first 225 characters of test strings

Str1 = AbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStU
vWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvW
xYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOp

Str2 = aBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTu
VwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVw
XyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoP

Test routines for correctness:
Matching Strings at various lengths, answer should be zero:
test size         0         1         2       100   1039999
========= ========= ========= ========= ========= =========
szCmpix           0         0         0         0         0
szCmpixx          0         0         0         0         0
szCmpix6          0         0         0         0         0

Execution Cycles:
test size         0         1         2       100   1039999
========= ========= ========= ========= ========= =========
szCmpix          26        39        53      1452  15368647
szCmpixx         20        28        37       990  10457571
szCmpix6          6        21        51       845   9005966



[attachment deleted by admin]

hutch--

Here is a small optimisation on the loop code of the later version.


  align 4
  @@:
    add eax, 1
    movzx edx, BYTE PTR [esi+eax]
    movzx ebx, BYTE PTR [edi+eax]
    mov cl, [edx+tbl]
    cmp cl, [ebx+tbl]
    jne miss
    cmp eax, ln
    jne @B


Dropped the timing to the following.


500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
484 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
171 MS szCmpixx

498 MS average for szCmpi
171 MS average for szCmpixx

Press ENTER to exit


I just added the szCmpix6 algo to the benchmark and got these timings.


500 MS szCmpix
172 MS szCmpixx
296 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
485 MS szCmpix
171 MS szCmpixx
297 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
484 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
281 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
281 MS szCmpix6

496 MS average for szCmpix
171 MS average for szCmpixx
292 MS average for szCmpix6

Press ENTER to exit


LATER STILL :

Here are the timings on my Sempron 2.4


313 MS szCmpix
172 MS szCmpixx
265 MS szCmpix6
313 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
266 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
266 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6
313 MS szCmpix
172 MS szCmpixx
265 MS szCmpix6
313 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6

312 MS average for szCmpix
172 MS average for szCmpixx
257 MS average for szCmpix6

Press ENTER to exit


[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

Very good.  The new one is faster on my machine also.  Still need to fix the earlier bug, but looks like a winner :U

Jimg

Here's a fix for the problem (szCmpixx4), and it even runs a hair faster for some reason I don't understand :wink
218 MS szCmpix
125 MS szCmpixx
203 MS szCmpix6
110 MS szCmpixx4
250 MS szCmpix
109 MS szCmpixx
203 MS szCmpix6
125 MS szCmpixx4
219 MS szCmpix
141 MS szCmpixx
187 MS szCmpix6
109 MS szCmpixx4
219 MS szCmpix
125 MS szCmpixx
203 MS szCmpix6
109 MS szCmpixx4
235 MS szCmpix
109 MS szCmpixx
203 MS szCmpix6
125 MS szCmpixx4
219 MS szCmpix
125 MS szCmpixx
187 MS szCmpix6
125 MS szCmpixx4
219 MS szCmpix
125 MS szCmpixx
203 MS szCmpix6
109 MS szCmpixx4
219 MS szCmpix
109 MS szCmpixx
219 MS szCmpix6
109 MS szCmpixx4

224 MS average for szCmpix
121 MS average for szCmpixx
201 MS average for szCmpix6
115 MS average for szCmpixx4

[attachment deleted by admin]

Jimg

And here it is with a few small changes, a hair faster again, and a little cleaner-
align 16
szCmpixx5 proc src:DWORD,dst:DWORD,ln:DWORD

    sub eax, eax
    cmp eax,ln ; just quit on zero length
    je done
    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst

  align 4
  @@:
    movzx edx, BYTE PTR [esi+eax]
    movzx ebx, BYTE PTR [edi+eax]
    mov cl, [edx+tbl]
    add eax, 1 ; setup eax for next, or miss number if nomatch
    cmp cl, [ebx+tbl]
    jne quit
    cmp eax,ln
    jb @b

    sub eax, eax

  quit:
    pop edi
    pop esi
    pop ebx
  done:
    ret

szCmpixx5 endp