The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: Jimg on June 27, 2005, 04:32:07 PM

Title: CmpI
Post by: Jimg on June 27, 2005, 04:32:07 PM
Over in the masm32 section, they are talking about the cmpi routine.  I was curious to see if the old xlat instruction was neglected by the newer CPUs as so many other instructions have been.  The first routine is Hutch's stock routine; in the second, I replaced xlat with mov eax,[ebx+eax]; and in the third, I didn't use the table at all, but used the old-fashioned method of testing for 'A' to 'Z' and masking to lower case.  Sure enough, on an Athlon anyway, the old xlat is slower:

Test routines for correctness:
szCmpix     35
szCmpix2    35
szCmpix3    35

Proc cycl
szCmpix    495
szCmpix2   423
szCmpix3   424
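
Since the attachment is gone, here is a rough C model of the two lowercasing strategies being compared: a 256-byte translation table versus a range test for 'A' to 'Z' followed by a mask. It is only a sketch of the idea, not the posted assembly; the names fold_tbl, init_table, fold_table, fold_range and methods_agree are invented for illustration, and ASCII is assumed.

```c
#include <assert.h>

/* Hypothetical 256-entry translation table, filled by init_table(). */
unsigned char fold_tbl[256];

/* Identity everywhere except 'A'..'Z', which map to 'a'..'z'. */
void init_table(void)
{
    for (int i = 0; i < 256; i++)
        fold_tbl[i] = (unsigned char)i;
    for (int c = 'A'; c <= 'Z'; c++)
        fold_tbl[c] = (unsigned char)(c | 0x20);
}

/* Table method: one indexed load per character. */
unsigned char fold_table(unsigned char c)
{
    return fold_tbl[c];
}

/* Range method: test for 'A'..'Z', then set bit 5 to get lower case. */
unsigned char fold_range(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        c |= 0x20;
    return c;
}

/* Returns 1 if both methods agree on every byte value (asserts otherwise). */
int methods_agree(void)
{
    init_table();
    for (int i = 0; i < 256; i++)
        assert(fold_table((unsigned char)i) == fold_range((unsigned char)i));
    return 1;
}
```

Both functions give the same answer for every byte, so the timing differences in this thread come down to how each one is executed, not to what it computes.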


[attachment deleted by admin]
Title: Re: CmpI
Post by: Phil on June 27, 2005, 04:39:34 PM
Results on a 996 MHz P3:
Test routines for correctness:
szCmpix     35
szCmpix2    35
szCmpix3    35

Proc cycl
szCmpix    234
szCmpix2   204
szCmpix3   363

Title: Re: CmpI
Post by: hutch-- on June 28, 2005, 01:18:32 AM
You can expect that range of timing wander between different hardware as XLATB is an old instruction but it does have good averages across a wide range of hardware from very old to reasonably current.
Title: Re: CmpI
Post by: Codewarp on June 28, 2005, 06:23:07 AM
My sense of xlatb is that it is another 8-bit oriented instruction that may benefit from a 32-bit implementation, like:

    movzx   eax, byte ptr [ebx]      ; read a byte
    movzx   eax, table [eax]          ; translate it through table
Title: Re: CmpI
Post by: Jimg on June 28, 2005, 01:09:35 PM
uh..  yeah, but the table would have to be billions of bytes long...
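
On the table size: movzx zero-extends the byte it loads, so the index left in eax can only be 0 to 255 and a 256-byte table should be enough. A small C model of that point follows; demo_tbl and lookup are made-up names, not anything from the posted code.

```c
#include <assert.h>

unsigned char demo_tbl[256];   /* hypothetical table, one entry per byte value */

/* Model of: movzx eax, byte ptr [ebx]  /  movzx eax, table[eax] */
unsigned int lookup(const unsigned char *p)
{
    unsigned int index = *p;          /* zero-extended byte: always 0..255 */
    assert(index < sizeof demo_tbl);  /* can never fire for a byte-sized load */
    return demo_tbl[index];           /* the result is zero-extended as well */
}
```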
Title: Re: CmpI
Post by: Jimg on June 28, 2005, 02:44:18 PM
But you're right about 32 bit in general.  I converted the 8-bit instructions to 32-bit in test number 4.  It's now about 20% faster than stock using simple non-MMX code, at least on an Athlon; no idea if Pentiums have a similar dislike of 8-bit instructions.
Test routines for correctness:
szCmpix     35
szCmpix2    35
szCmpix3    35
szCmpix4    35

Proc cycl
szCmpix    494
szCmpix2   423
szCmpix3   422
szCmpix4   396

Title: Re: CmpI
Post by: hutch-- on June 28, 2005, 03:01:06 PM
Give this one a blast, I have not set up a test piece to time it but it should be a bit faster on late model hardware. Note the MOVZX can be slow on some of the older stuff.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    .data
    align 16
      tbl \
      db   0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15
      db  16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
      db  32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
      db  48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
      db  64, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111
      db 112,113,114,115,116,117,118,119,120,121,122, 91, 92, 93, 94, 95
      db  96, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111
      db 112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127
      db 128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143
      db 144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159
      db 160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175
      db 176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191
      db 192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207
      db 208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223
      db 224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239
      db 240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
    .code

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

szCmpixx proc src:DWORD,dst:DWORD,ln:DWORD

    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst
    mov eax, -1

  align 4
  @@:
    add eax, 1
    movzx edx, BYTE PTR [esi+eax]
    mov cl, [edx+tbl]
    movzx ebx, BYTE PTR [edi+eax]
    cmp cl, [ebx+tbl]
    jne miss
    cmp eax, ln
    jne @B

    sub eax, eax
    jmp quit

  miss:
    add eax, 1

  quit:
    pop edi
    pop esi
    pop ebx

    ret

szCmpixx endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Title: Re: CmpI
Post by: Jimg on June 28, 2005, 04:00:51 PM
Very nice, an incredible difference.  I had to unroll the loop once and make some other minor changes to beat it using compares A-Z.  And it's probably only faster on an Athlon.  The only saving grace is my code is only 160 bytes long vs. a 256 byte table.  Meaningless, really.
Test routines for correctness:
szCmpix     35
szCmpixx    35
szCmpix6    35

Proc cycl
szCmpix    494
szCmpixx   362
szCmpix6   331



Title: Re: CmpI
Post by: Jimg on June 29, 2005, 12:06:26 AM
Hutch-

I had to add some more tests to be sure I wasn't screwing up the routine.  In the process, I detected a small problem with your latest (szCmpixx).  Returned 105 but was called with a length of 104 (went one too far?).

Test routines for correctness:
Matching Strings at various lengths, answer should be zero except last=105:
test size    0    1    2    3    5    8   13   22   39   55   89  104  105
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix      0    0    0    0    0    0    0    0    0    0    0    0  105
szCmpixx     0    0    0    0    0    0    0    0    0    0    0  105  105
szCmpix6     0    0    0    0    0    0    0    0    0    0    0    0  105

Execution Cycles:
test size    0    1    2    3    5    8   13   22   39   55   89  104  105
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix     27   39   52   65   91  131  211  331  553  765 1209 1410 1408
szCmpixx    15   27   36   45   63   91  158  243  399  550  875 1020 1022
szCmpix6     5   20   51   35  104   99  115  211  338  468  742  874  923


Test routines for correctness:
Mis-matching Strings, all test size=110, mismatch at character:
             1    2    3    4    5   13   22   39   55   89  103  104
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix      1    2    3    4    5   13   22   39   55   89  103  104
szCmpixx     1    2    3    4    5   13   22   39   55   89  103  104
szCmpix6     1    2    3    4    5   13   22   39   55   89  103  104

Execution Cycles:
miss at      1    2    3    4    5   13   22   39   55   89  103  104
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix     26   37   50   64   76  203  325  542  753 1199 1387 1396
szCmpixx    16   25   34   43   52  147  232  387  540  864 1000 1012
szCmpix6    20   27   34   42   51  114  211  336  465  740  854  914
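
The "one too far" result can be reproduced with a C model of the szCmpixx loop. This is a sketch under assumptions: tolower() from the C library stands in for the lookup table, and the function names are invented. The first function tests the index against ln only after the byte has been compared, as the posted loop does; the second tests the length before reading.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Model of the szCmpixx loop: the index starts at -1, is bumped, the byte
   is compared, and only then is the index tested against ln. */
size_t cmpi_late_check(const char *src, const char *dst, size_t ln)
{
    size_t i = (size_t)-1;
    assert(src && dst);
    for (;;) {
        i++;
        if (tolower((unsigned char)src[i]) != tolower((unsigned char)dst[i]))
            return i + 1;      /* 1-based position of the mismatch */
        if (i == ln)
            return 0;          /* reached only after byte ln was compared */
    }
}

/* Same compare with the length tested before the next byte is read. */
size_t cmpi_early_check(const char *src, const char *dst, size_t ln)
{
    assert(src && dst);
    for (size_t i = 0; i < ln; i++) {
        if (tolower((unsigned char)src[i]) != tolower((unsigned char)dst[i]))
            return i + 1;
    }
    return 0;
}
```

With two buffers that first differ at character 105 and a length of 104, the late-check model returns 105 while the early-check model returns 0, which matches the table above.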

[attachment deleted by admin]
Title: Re: CmpI
Post by: hutch-- on June 29, 2005, 12:27:04 AM
Jim,

I will have a look at the new version when I get in front but it should also be tested on a single long source as well. What I would be inclined to do is load a known file of say about 1 meg then copy it to another buffer, convert one to upper case and the other to lower case, then run any insensitive compare algos on the two buffers.

The short repeat tests are good for testing fast takeoff but comparing files is a valid use for an algo of this type and a file of a meg or so would make a good starting point.
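
The buffer setup described above might look roughly like this in C. It is a sketch only: cmpi_model is a stand-in for whichever routine is under test, worst_case_compare is an invented name, and the file loading and timing are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Stand-in for the compare routine under test: returns 0 on a match,
   else the 1-based position of the first difference. */
size_t cmpi_model(const char *src, const char *dst, size_t ln)
{
    for (size_t i = 0; i < ln; i++)
        if (tolower((unsigned char)src[i]) != tolower((unsigned char)dst[i]))
            return i + 1;
    return 0;
}

/* Make two copies of one buffer, force one to upper case and the other to
   lower case, then compare them. Returns the routine's result, or
   (size_t)-1 if memory could not be allocated. */
size_t worst_case_compare(const char *data, size_t len)
{
    char *upper = malloc(len ? len : 1);
    char *lower = malloc(len ? len : 1);
    size_t result = (size_t)-1;

    assert(data != NULL || len == 0);
    if (upper && lower) {
        for (size_t i = 0; i < len; i++) {
            upper[i] = (char)toupper((unsigned char)data[i]);
            lower[i] = (char)tolower((unsigned char)data[i]);
        }
        result = cmpi_model(upper, lower, len);
    }
    free(upper);
    free(lower);
    return result;
}
```

Every letter then differs in case between the two buffers, so a correct case-insensitive compare should still report a match.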

LATER :

Here is a test piece that tests both on windows.inc. On longer runs the later version is much faster.

These are the timings on my PIV.


500 MS szCmpix
187 MS szCmpixx
485 MS szCmpix
203 MS szCmpixx
484 MS szCmpix
188 MS szCmpixx
500 MS szCmpix
187 MS szCmpixx
500 MS szCmpix
188 MS szCmpixx
484 MS szCmpix
188 MS szCmpixx
500 MS szCmpix
187 MS szCmpixx
500 MS szCmpix
188 MS szCmpixx

494 MS average for szCmpi
189 MS average for szCmpixx

Press ENTER to exit


[attachment deleted by admin]
Title: Re: CmpI
Post by: Jimg on June 29, 2005, 02:57:51 AM
While you were doing that, I did one that generated strings of 1039999 characters in heap space.

first 225 characters of test strings

Str1 = AbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStU
vWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvW
xYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOp

Str2 = aBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTu
VwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVw
XyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoP

Test routines for correctness:
Matching Strings at various lengths, answer should be zero:
test size         0         1         2       100   1039999
========= ========= ========= ========= ========= =========
szCmpix           0         0         0         0         0
szCmpixx          0         0         0         0         0
szCmpix6          0         0         0         0         0

Execution Cycles:
test size         0         1         2       100   1039999
========= ========= ========= ========= ========= =========
szCmpix          26        39        53      1452  15368647
szCmpixx         20        28        37       990  10457571
szCmpix6          6        21        51       845   9005966



[attachment deleted by admin]
Title: Re: CmpI
Post by: hutch-- on June 29, 2005, 03:04:49 AM
Here is a small optimisation on the loop code of the later version.


  align 4
  @@:
    add eax, 1
    movzx edx, BYTE PTR [esi+eax]
    movzx ebx, BYTE PTR [edi+eax]
    mov cl, [edx+tbl]
    cmp cl, [ebx+tbl]
    jne miss
    cmp eax, ln
    jne @B


Dropped the timing to the following.


500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
484 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
171 MS szCmpixx

498 MS average for szCmpi
171 MS average for szCmpixx

Press ENTER to exit


I just added the szCmpix6 algo to the benchmark and got these timings.


500 MS szCmpix
172 MS szCmpixx
296 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
485 MS szCmpix
171 MS szCmpixx
297 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
484 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
281 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
281 MS szCmpix6

496 MS average for szCmpix
171 MS average for szCmpixx
292 MS average for szCmpix6

Press ENTER to exit


LATER STILL :

Here are the timings on my Sempron 2.4


313 MS szCmpix
172 MS szCmpixx
265 MS szCmpix6
313 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
266 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
266 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6
313 MS szCmpix
172 MS szCmpixx
265 MS szCmpix6
313 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6

312 MS average for szCmpix
172 MS average for szCmpixx
257 MS average for szCmpix6

Press ENTER to exit


[attachment deleted by admin]
Title: Re: CmpI
Post by: Jimg on June 29, 2005, 02:00:12 PM
Very good.  The new one is faster on my machine also.  Still need to fix the earlier bug, but looks like a winner :U
Title: Re: CmpI
Post by: Jimg on June 29, 2005, 02:59:48 PM
Here's a fix for the problem (szCmpixx4), and it even runs a hair faster for some reason I don't understand :wink
218 MS szCmpix
125 MS szCmpixx
203 MS szCmpix6
110 MS szCmpixx4
250 MS szCmpix
109 MS szCmpixx
203 MS szCmpix6
125 MS szCmpixx4
219 MS szCmpix
141 MS szCmpixx
187 MS szCmpix6
109 MS szCmpixx4
219 MS szCmpix
125 MS szCmpixx
203 MS szCmpix6
109 MS szCmpixx4
235 MS szCmpix
109 MS szCmpixx
203 MS szCmpix6
125 MS szCmpixx4
219 MS szCmpix
125 MS szCmpixx
187 MS szCmpix6
125 MS szCmpixx4
219 MS szCmpix
125 MS szCmpixx
203 MS szCmpix6
109 MS szCmpixx4
219 MS szCmpix
109 MS szCmpixx
219 MS szCmpix6
109 MS szCmpixx4

224 MS average for szCmpix
121 MS average for szCmpixx
201 MS average for szCmpix6
115 MS average for szCmpixx4

[attachment deleted by admin]
Title: Re: CmpI
Post by: Jimg on June 29, 2005, 03:30:15 PM
And here it is with a few small changes, a hair faster again, and a little cleaner-
align 16
szCmpixx5 proc src:DWORD,dst:DWORD,ln:DWORD

    sub eax, eax
    cmp eax,ln ; just quit on zero length
    je done
    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst

  align 4
  @@:
    movzx edx, BYTE PTR [esi+eax]
    movzx ebx, BYTE PTR [edi+eax]
    mov cl, [edx+tbl]
    add eax, 1 ; setup eax for next, or miss number if nomatch
    cmp cl, [ebx+tbl]
    jne quit
    cmp eax,ln
    jb @b

    sub eax, eax

  quit:
    pop edi
    pop esi
    pop ebx
  done:
    ret

szCmpixx5 endp
Title: Re: CmpI
Post by: hutch-- on June 29, 2005, 05:14:05 PM
Jim,

I did a reordering with the two CL operations and the algo clocked up about 10% or so faster than the latest one you posted but I ran both on the Sempron 2.4 and the timings were reversed so the code design now is reducing down to the differences between AMD and Intel hardware. I spaced each benchmark by 1 second to stabilise them and it seems to be more reliable. For whatever reason, the benchmark will not run on an old Celeron I am setting up at the moment.

Here are the two timings.


PIV timing

156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
157 MS szCmpixx
171 MS szCmpi5
157 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
157 MS szCmpixx
171 MS szCmpi5
157 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5

156 MS average for szCmpixx
171 MS average for szCmpi5

Press ENTER to exit

Sempron 2.4 timing

171 MS szCmpixx
157 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
157 MS szCmpi5
171 MS szCmpixx
157 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
157 MS szCmpi5

171 MS average for szCmpixx
156 MS average for szCmpi5

Press ENTER to exit


LATER :

I got it to build on the old celeron and your version is clearly faster.

790 versus 730.

[attachment deleted by admin]
Title: Re: CmpI
Post by: Jimg on June 29, 2005, 06:20:54 PM
My results:
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
141 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5

126 MS average for szCmpixx
109 MS average for szCmpi5


Pretty darn good compared to the original.  Obviously I can't compete with the old compare to A-Z method any longer, which is as it should be, a table should certainly be faster :bg 
Title: Re: CmpI
Post by: hutch-- on June 30, 2005, 01:04:45 PM
Jim,

I think we got a good result, thanks for your help here.  :U
Title: Re: CmpI
Post by: Jimg on June 30, 2005, 02:19:06 PM
Quote: I think we got a good result, thanks for your help here.
Yes, thank you for letting me play :bg

Not to beat a dead horse here, but I had one other thought last night.  The normal use of this procedure would be to test two strings that are normally identical. The worst case I have been testing, where every alpha was a different case between the files, would rarely happen.  Therefore, we should check for the equal condition first, and then see if it is a case problem.  I tried it with your test code by commenting out the code to convert the file to lower case, so that all the letters that were already in upper case would match.  I changed the code to check for equality first.  The results were another 10% savings in execution time!
    ;mov hmem, lcase$(hmem) ;************************  commented this out for test
    mov hbuf, ucase$(hbuf)


The new routine is szCmpi6....

93 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
110 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
110 MS szCmpi5
109 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
109 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
110 MS szCmpi5

96 MS average for szCmpi6
109 MS average for szCmpi5
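
The szCmpi6 source went with the attachment, so the following is only a guess at the shape of the equality-first idea, written as a C model. The names fold and cmpi_equal_first are invented, and a range test stands in for the table lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fold: 'A'..'Z' to lower case, everything else unchanged. */
unsigned char fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c | 0x20) : c;
}

/* Compare the raw bytes first; fold only when they differ. */
size_t cmpi_equal_first(const unsigned char *src, const unsigned char *dst,
                        size_t ln)
{
    assert(src && dst);
    for (size_t i = 0; i < ln; i++) {
        if (src[i] == dst[i])
            continue;               /* common case: no lookup needed */
        if (fold(src[i]) != fold(dst[i]))
            return i + 1;           /* real mismatch, 1-based position */
    }
    return 0;
}
```

The saving depends on how often the raw bytes already match; when every letter differs in case, the extra branch is pure overhead.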



[attachment deleted by admin]
Title: Re: CmpI
Post by: hutch-- on June 30, 2005, 02:40:25 PM
Jim,

It's a good idea as it saves two table accesses if they are the same. It's clocking up about 10% faster on the PIV as well. I will have a play with the idea.
Title: Re: CmpI
Post by: Jimg on June 30, 2005, 03:40:48 PM
Sorry Hutch, just when I thought I was done, my fevered brain came up with this--
110 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
93 MS szCmpi6
79 MS szCmpi7
93 MS szCmpi6
78 MS szCmpi7
110 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
109 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
109 MS szCmpi6
78 MS szCmpi7
110 MS szCmpi6
78 MS szCmpi7
93 MS szCmpi6
79 MS szCmpi7
93 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7

98 MS average for szCmpi6
78 MS average for szCmpi7


The code is not as pretty to look at, but dang, it's another 20% faster!!!!

[attachment deleted by admin]
Title: Re: CmpI
Post by: Jimg on June 30, 2005, 09:03:36 PM
My earlier jubilation has died as I now realize we are entering that same can of worms as the topics optimizing szCopy.  Compare 8 bytes at a time with mmx until a mismatch then drop to one byte at a time, worry about page alignment, mmx availability, sse2, etc, etc, etc... :dazzled:

YUCK!
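
For a feel of what the wide-compare approach involves, here is a C model that uses plain 64-bit integers in place of MMX registers. That substitution is an assumption made to keep the sketch portable, and the names fold_byte and cmpi_chunked are invented. It compares 8 bytes at a time while the chunks are identical and falls back to a folded single-byte compare when a chunk differs or fewer than 8 bytes remain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fold: 'A'..'Z' to lower case, everything else unchanged. */
unsigned char fold_byte(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c | 0x20) : c;
}

size_t cmpi_chunked(const unsigned char *src, const unsigned char *dst,
                    size_t ln)
{
    size_t i = 0;
    assert(src && dst);
    while (i < ln) {
        if (ln - i >= 8) {
            uint64_t a, b;
            memcpy(&a, src + i, 8);   /* memcpy sidesteps alignment issues */
            memcpy(&b, dst + i, 8);
            if (a == b) {             /* 8 identical bytes: skip them all */
                i += 8;
                continue;
            }
        }
        /* Chunk differs or the tail is short: compare one folded byte. */
        if (fold_byte(src[i]) != fold_byte(dst[i]))
            return i + 1;             /* 1-based position of the mismatch */
        i++;
    }
    return 0;
}
```

Even this simplified form never reads past ln, which is one of the page-boundary worries mentioned above; a real MMX or SSE2 version would also have to deal with alignment and feature detection.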
Title: Re: CmpI
Post by: Phil on June 30, 2005, 10:06:31 PM
Hi JimG and Hutch. I ran the latest test on the P3 that I am using and I didn't see the improvement that your tests showed with AMD. The last routine was faster but only by about 5%. I think you are right to stick with something that both looks as good as it performs!
Title: Re: CmpI
Post by: hutch-- on July 01, 2005, 03:25:35 AM
I ran out of puff at about 2am this morning but my general impression with Jim's latest idea was that while it should be faster with matches, with testing it pays a big performance penalty on mismatches which is not outweighed by the gains on matches. With low instruction count loop code the switching code for the mismatch makes the main loop a lot slower when it's not taken.

We do have a couple that are now far faster than the original using XLATB and I found another mod on the two CL operations that tested up faster on a PIV. Replace one of them with a MOVZX and do the compare with an XOR and test for zero. I have a few things to do today but I will get it coded up soon and then impose on a few people to test it on different processors.

I temporarily have a 480 Celeron to test on but I will need to impose on Phil for testing on a PIII as I don't have one set up and running at the moment.
Title: Re: CmpI
Post by: hutch-- on July 01, 2005, 08:20:32 AM
Here are the two versions that I have been playing with, another change and a modified version of Jim's using the same mod. The performance is identical on my PIV but Jim's is noticeably faster on my Sempron 2.4, ratio of 200 to 156 MS, so if these test up OK on a PIII, I am inclined to go with the modified version of Jim's.

[attachment deleted by admin]
Title: Re: CmpI
Post by: Phil on July 01, 2005, 09:03:24 AM
Here are the results on a 996 MHz P3. The exe in your zip only tested two functions.


C:\ASM\test\BMARK4>bmark
360 MS szCmpi6
343 MS szCmpi5
375 MS szCmpi6
344 MS szCmpi5
375 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
360 MS szCmpi6
343 MS szCmpi5
360 MS szCmpi6
343 MS szCmpi5
360 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
375 MS szCmpi6
359 MS szCmpi5
375 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
360 MS szCmpi6
343 MS szCmpi5

363 MS average for szCmpi6
344 MS average for szCmpi5

Press ENTER to exit
Title: Re: CmpI
Post by: hutch-- on July 01, 2005, 09:13:04 AM
Phil,

Did you rename one of the output names, the zip I posted had "szCmpixx" and "szCmpi5".
Title: Re: CmpI
Post by: Jimg on July 01, 2005, 01:00:13 PM
Hi Hutch-
Perhaps you posted an old one, it had the ones enabled that Phil showed.  Here's one with all three enabled, plus my Cmpi8 which has one less compare per loop and should run the fastest except in the rare case where every character has the opposite case between the two files being compared.

Phil-  If you get a chance, try this one and let me know your results.
Edited--   Phil, ignore this and try the bmark7.zip I posted last.  Thanks.

  The last portion of my results are-
125 MS szCmpixx
110 MS szCmpi5
93 MS szCmpi6
78 MS szCmpi8
125 MS szCmpixx
110 MS szCmpi5
93 MS szCmpi6
78 MS szCmpi8

125 MS average for szCmpiix
111 MS average for szCmpi5
95 MS average for szCmpi6
79 MS average for szCmpi8

Press ENTER to exit

[attachment deleted by admin]
Title: Re: CmpI
Post by: roticv on July 01, 2005, 02:06:26 PM
I think you guys should give better function names. I thought I was bad.  :bdg
Title: Re: CmpI
Post by: hutch-- on July 01, 2005, 02:39:05 PM
Victor,

The procedure name is szCmpi, the rest are variations for testing.  :toothy

Jim,

I put the two later versions I had in the last post into your testbed then ran the benchmark. On my PIV, I get these results.


comment * ------------------------------

        mismatch compares

        140 MS average for szCmpiix
        140 MS average for szCmpi5
        265 MS average for szCmpi6
        261 MS average for szCmpi8
       
        matched compares

        140 MS average for szCmpiix
        140 MS average for szCmpi5
        93 MS average for szCmpi6
        78 MS average for szCmpi8
       
        ------------------------------ *


On the old Celeron, the timings for all only vary about 5% but on the Sempron, the szCmpi5 beats the rest with a mismatched test by a long way.

The problem with the additional branching code is it only works with exact matches and in the worst case it is much slower. Wild swings in timing make the procedure unpredictable performance-wise, where the two versions that always check the table run the same time under any condition. The version you modified, which I have also modified, runs at the same speed on the PIV and faster on the Sempron, so it is a better all-round algorithm.

This is the modified version of your previous modification.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 16

szCmpi5 proc src:DWORD,dst:DWORD,ln:DWORD

    sub eax, eax
    cmp eax, ln                     ; just quit on zero length
    je done
    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst

  align 4
  @@:
    movzx edx, BYTE PTR [esi+eax]
    movzx ebx, BYTE PTR [edi+eax]
    movzx ecx, BYTE PTR [edx+tbl]   ; <<<-- modified here
    add eax, 1
    cmp cl, [ebx+tbl]
    jne quit
    cmp eax,ln
    jb @b

    sub eax, eax

  quit:
    pop edi
    pop esi
    pop ebx
  done:
    ret

szCmpi5 endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Title: Re: CmpI
Post by: Jimg on July 01, 2005, 03:41:43 PM
Hutch-

Ok, we're getting close.  I took szCmpi5 assuming this is your favorite, and added the code to remove one compare from each loop (szCmpi5x).
I compiled three times, once using your code to change the case of every letter in both buffers for the worst possible match, which I called Bmark1.exe
Once changing the case on only one buffer so there was about a 50% mismatch, which I called Bmark2.exe
And once with no changes so buffers are identical, called Bmark3.exe.  The results are remarkably consistent, and 15 to 20 percent faster on AMD.
How do they test on fast Pentiums?

Bmark1.exe
Total mismatch:
    mov hmem, lcase$(hmem)  ; changed every letter to a mismatch
    mov hbuf, ucase$(hbuf)

109 MS average for szCmpi5
93 MS average for szCmpi5x

------------------------------------------------------------------ 

Bmark2.exe
Half mismatch:

    mov hmem, lcase$(hmem)  ;only changed one copy of file
;   mov hbuf, ucase$(hbuf)                                       

110 MS average for szCmpi5
94 MS average for szCmpi5x

Press ENTER to exit
-----------------------------------------------------------

Bmark3.exe
No mismatch:

    ;mov hmem, lcase$(hmem)  ; identical buffers, no changes
    ;mov hbuf, ucase$(hbuf)

109 MS average for szCmpi5
95 MS average for szCmpi5x


[attachment deleted by admin]
Title: Re: CmpI
Post by: hutch-- on July 02, 2005, 12:30:39 AM
Jim,

I get the same results on all 3.


140 MS average for szCmpi5
156 MS average for szCmpi5x


Title: Re: CmpI
Post by: Jimg on July 02, 2005, 01:27:52 AM
Interesting.  I dusted off an old celeron and this is what I got--

Celeron 2.00 GHz, 496 MB RAM

188 MS average for szCmpi5
187 MS average for szCmpi5x
Title: Re: CmpI
Post by: hutch-- on July 02, 2005, 01:42:22 AM
Jim,

Will you answer the PM I sent you, I need the info.
Title: Re: CmpI
Post by: Jimg on July 02, 2005, 02:14:28 AM
Gee, I never think to check for those, shouldn't it ring a bell or something??  :wink

I just went and looked.  I had "send me an email" checked but I didn't see any email.  I changed it so now I have "do a popup" checked.  Maybe that will work....