Over in the masm32 section they are discussing the cmpi routine, and I was curious whether the old xlat instruction has been neglected by newer CPUs, as so many other instructions have been. I timed three routines: the first is Hutch's stock routine; in the second I replaced xlat with mov eax,[ebx+eax]; the third doesn't use the table at all but the old-fashioned method of testing for 'A' to 'Z' and masking to lower case. Sure enough, on an Athlon anyway, the old xlat is slower:
Test routines for correctness:
szCmpix 35
szCmpix2 35
szCmpix3 35
Proc cycl
szCmpix 495
szCmpix2 423
szCmpix3 424
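For anyone who has not seen the third (table-free) method, here is a minimal sketch in C rather than MASM. The names fold_ascii and cmpi_range are made up for illustration, and the return convention (0 for a match, otherwise the 1-based mismatch position) is assumed from the test output later in the thread.

```c
#include <stddef.h>

/* Fold an ASCII byte to lower case with a range test and a mask,
   with no lookup table. Only 'A'..'Z' are changed. */
unsigned char fold_ascii(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        c |= 0x20;              /* set bit 5: 'A' (0x41) -> 'a' (0x61) */
    return c;
}

/* Hypothetical C stand-in for the table-free routine: returns 0 when the
   first len bytes match ignoring case, else the 1-based mismatch position. */
size_t cmpi_range(const unsigned char *a, const unsigned char *b, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (fold_ascii(a[i]) != fold_ascii(b[i]))
            return i + 1;
    }
    return 0;
}
```

The range test costs two compares per byte where the table costs one memory read, which is roughly the trade-off being timed here.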
Results on a 996 MHz P3:
Test routines for correctness:
szCmpix 35
szCmpix2 35
szCmpix3 35
Proc cycl
szCmpix 234
szCmpix2 204
szCmpix3 363
You can expect that range of timing wander between different hardware as XLATB is an old instruction but it does have good averages across a wide range of hardware from very old to reasonably current.
My sense of xlatb is that it is another 8-bit oriented instruction that may benefit from a 32-bit implementation, like:
movzx eax, byte ptr [ebx] ; read a byte
movzx eax, table [eax] ; translate it through table
uh.. yeah, but the table would have to be billions of bytes long...
But you're right about 32-bit in general. I converted the 8-bit instructions to 32-bit in test number 4. It's now about 20% faster than stock using simple non-MMX code, at least on an Athlon; I have no idea whether Pentiums have a similar dislike of 8-bit instructions.
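A rough C picture of the two MOVZX instructions suggested above, in case it helps. The names fold_table, init_fold_table and translate are hypothetical, and filling the table at run time is just a convenience for the sketch. The point it shows is that the byte is zero-extended before it is used as an index, so the index stays in 0..255 and a 256-entry table is enough.

```c
#include <stddef.h>

/* 256-entry translation table, filled by init_fold_table().
   Name and run-time initialisation are assumptions for this sketch. */
unsigned char fold_table[256];

void init_fold_table(void)
{
    for (int i = 0; i < 256; i++)
        fold_table[i] = (unsigned char)i;           /* identity by default */
    for (int c = 'A'; c <= 'Z'; c++)
        fold_table[c] = (unsigned char)(c + 32);    /* upper -> lower */
}

/* Equivalent of the two MOVZX instructions: the byte read is zero
   extended, so the index is always in the range 0..255. */
unsigned int translate(const unsigned char *p)
{
    unsigned int index = *p;        /* movzx eax, byte ptr [ebx] */
    return fold_table[index];       /* movzx eax, table[eax]     */
}
```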
Test routines for correctness:
szCmpix 35
szCmpix2 35
szCmpix3 35
szCmpix4 35
Proc cycl
szCmpix 494
szCmpix2 423
szCmpix3 422
szCmpix4 396
Give this one a blast, I have not set up a test piece to time it but it should be a bit faster on late model hardware. Note the MOVZX can be slow on some of the older stuff.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
align 16
tbl \
db 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
db 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
db 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
db 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
db 64, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111
db 112,113,114,115,116,117,118,119,120,121,122, 91, 92, 93, 94, 95
db 96, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111
db 112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127
db 128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143
db 144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159
db 160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175
db 176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191
db 192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207
db 208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223
db 224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239
db 240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
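szCmpixx proc src:DWORD,dst:DWORD,ln:DWORD
    push ebx
    push esi
    push edi
    mov esi, src
    mov edi, dst
    mov eax, -1                     ; index, becomes 0 on the first pass
  align 4
  @@:
    add eax, 1
    movzx edx, BYTE PTR [esi+eax]   ; next byte of src, zero extended
    mov cl, [edx+tbl]               ; fold it to lower case through the table
    movzx ebx, BYTE PTR [edi+eax]   ; next byte of dst, zero extended
    cmp cl, [ebx+tbl]               ; compare the two folded bytes
    jne miss
    cmp eax, ln                     ; index is tested after the byte compare
    jne @B
    sub eax, eax                    ; every byte matched, return 0
    jmp quit
  miss:
    add eax, 1                      ; return the 1-based mismatch position
  quit:
    pop edi
    pop esi
    pop ebx
    ret
szCmpixx endp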
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Very nice, an incredible difference. I had to unroll the loop once and make some other minor changes to beat it using the A-Z compares, and it's probably only faster on an Athlon. The only saving grace is that my code is only 160 bytes long vs. a 256-byte table. Meaningless, really.
Test routines for correctness:
szCmpix 35
szCmpixx 35
szCmpix6 35
Proc cycl
szCmpix 494
szCmpixx 362
szCmpix6 331
Hutch-
I had to add some more tests to be sure I wasn't screwing up the routine. In the process, I detected a small problem with your latest (szCmpixx). Returned 105 but was called with a length of 104 (went one too far?).
Test routines for correctness:
Matching Strings at various lengths, answer should be zero except last=105:
test size    0    1    2    3    5    8   13   22   39   55   89  104  105
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix      0    0    0    0    0    0    0    0    0    0    0    0  105
szCmpixx     0    0    0    0    0    0    0    0    0    0    0  105  105
szCmpix6     0    0    0    0    0    0    0    0    0    0    0    0  105
Execution Cycles:
test size    0    1    2    3    5    8   13   22   39   55   89  104  105
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix     27   39   52   65   91  131  211  331  553  765 1209 1410 1408
szCmpixx    15   27   36   45   63   91  158  243  399  550  875 1020 1022
szCmpix6     5   20   51   35  104   99  115  211  338  468  742  874  923
Test routines for correctness:
Mis-matching Strings, all test size=110, mismatch at character:
             1    2    3    4    5   13   22   39   55   89  103  104
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix      1    2    3    4    5   13   22   39   55   89  103  104
szCmpixx     1    2    3    4    5   13   22   39   55   89  103  104
szCmpix6     1    2    3    4    5   13   22   39   55   89  103  104
Execution Cycles:
miss at      1    2    3    4    5   13   22   39   55   89  103  104
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
szCmpix     26   37   50   64   76  203  325  542  753 1199 1387 1396
szCmpixx    16   25   34   43   52  147  232  387  540  864 1000 1012
szCmpix6    20   27   34   42   51  114  211  336  465  740  854  914
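The "one too far" result reported above for szCmpixx comes down to where the index is tested against the length. Here is a small C model of the two orderings, a sketch only: the function names are invented, and tolower from <ctype.h> in the default C locale stands in for the lookup table.

```c
#include <ctype.h>
#include <stddef.h>

/* Model of the szCmpixx loop: the index is tested against len only after
   the byte at that index has been compared, so index len itself is read,
   one byte past the requested length. */
size_t cmpi_checks_after(const unsigned char *a, const unsigned char *b, size_t len)
{
    size_t i = (size_t)-1;              /* mov eax, -1 */
    for (;;) {
        i += 1;                         /* add eax, 1 (wraps to 0 first time) */
        if (tolower(a[i]) != tolower(b[i]))
            return i + 1;               /* miss: 1-based position */
        if (i == len)                   /* cmp eax, ln / jne @B */
            return 0;
    }
}

/* Model of the alternative ordering: zero length exits at once, and the
   index is advanced before it is tested, so only 0..len-1 are read. */
size_t cmpi_checks_before(const unsigned char *a, const unsigned char *b, size_t len)
{
    size_t i = 0;
    if (len == 0)
        return 0;
    do {
        int x = tolower(a[i]);
        int y = tolower(b[i]);
        i += 1;                         /* now also the 1-based position */
        if (x != y)
            return i;
    } while (i < len);
    return 0;
}
```

With "abcdX" against "ABCDy" and a length of 4, the first model reports a miss at 5 while the second returns 0, which is the same pattern as the 104/105 column in the table.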
Jim,
I will have a look at the new version when I get in front but it should also be tested on a single long source as well. What I would be inclined to do is load a known file of say about 1 meg then copy it to another buffer, convert one to upper case and the other to lower case, then run any insensitive compare algos on the two buffers.
The short repeat tests are good for testing fast takeoff but comparing files is a valid use for an algo of this type and a file of a meg or so would make a good starting point.
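The buffer setup suggested above could look something like this in C. It is only a sketch of the idea, not the actual test piece: make_case_pair and cmpi_ref are hypothetical names, the caller supplies the text (for example a file already read into memory), and cmpi_ref is a plain reference compare standing in for whichever routine is being timed.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Build the worst-case pair: two copies of the same text, one forced to
   upper case and the other to lower case. Returns 0 on success, -1 if an
   allocation fails. The caller frees *upper and *lower. */
int make_case_pair(const char *text, size_t len,
                   unsigned char **upper, unsigned char **lower)
{
    *upper = malloc(len ? len : 1);
    *lower = malloc(len ? len : 1);
    if (*upper == NULL || *lower == NULL) {
        free(*upper);
        free(*lower);
        *upper = *lower = NULL;
        return -1;
    }
    for (size_t i = 0; i < len; i++) {
        (*upper)[i] = (unsigned char)toupper((unsigned char)text[i]);
        (*lower)[i] = (unsigned char)tolower((unsigned char)text[i]);
    }
    return 0;
}

/* Reference compare used to check the pair: 0 on a case-insensitive
   match, else the 1-based mismatch position. */
size_t cmpi_ref(const unsigned char *a, const unsigned char *b, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (tolower(a[i]) != tolower(b[i]))
            return i + 1;
    return 0;
}
```

Every letter differs in case between the two buffers, so a routine that only looks things up on a mismatch gets no free rides on this input.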
LATER :
Here is a test piece that tests both on windows.inc. On longer runs the later version is much faster.
These are the timings on my PIV.
500 MS szCmpix
187 MS szCmpixx
485 MS szCmpix
203 MS szCmpixx
484 MS szCmpix
188 MS szCmpixx
500 MS szCmpix
187 MS szCmpixx
500 MS szCmpix
188 MS szCmpixx
484 MS szCmpix
188 MS szCmpixx
500 MS szCmpix
187 MS szCmpixx
500 MS szCmpix
188 MS szCmpixx
494 MS average for szCmpi
189 MS average for szCmpixx
Press ENTER to exit
While you were doing that, I did one that generated strings of 1039999 characters in heap space.
first 225 characters of test strings
Str1 = AbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStU
vWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvW
xYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOpQrStUvWxYzAbCdEfGhIjKlMnOp
Str2 = aBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTu
VwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVw
XyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoPqRsTuVwXyZaBcDeFgHiJkLmNoP
Test routines for correctness:
Matching Strings at various lengths, answer should be zero:
test size         0         1         2       100   1039999
========= ========= ========= ========= ========= =========
szCmpix           0         0         0         0         0
szCmpixx          0         0         0         0         0
szCmpix6          0         0         0         0         0
Execution Cycles:
test size         0         1         2       100   1039999
========= ========= ========= ========= ========= =========
szCmpix          26        39        53      1452  15368647
szCmpixx         20        28        37       990  10457571
szCmpix6          6        21        51       845   9005966
Here is a small optimisation on the loop code of the later version.
  align 4
  @@:
    add eax, 1
    movzx edx, BYTE PTR [esi+eax]
    movzx ebx, BYTE PTR [edi+eax]
    mov cl, [edx+tbl]
    cmp cl, [ebx+tbl]
    jne miss
    cmp eax, ln
    jne @B
Dropped the timing to the following.
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
484 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
172 MS szCmpixx
500 MS szCmpix
171 MS szCmpixx
498 MS average for szCmpi
171 MS average for szCmpixx
Press ENTER to exit
I just added the szCmpix6 algo to the benchmark and got these timings.
500 MS szCmpix
172 MS szCmpixx
296 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
485 MS szCmpix
171 MS szCmpixx
297 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
484 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
281 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
297 MS szCmpix6
500 MS szCmpix
172 MS szCmpixx
281 MS szCmpix6
496 MS average for szCmpix
171 MS average for szCmpixx
292 MS average for szCmpix6
Press ENTER to exit
LATER STILL :
Here are the timings on my Sempron 2.4
313 MS szCmpix
172 MS szCmpixx
265 MS szCmpix6
313 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
266 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
266 MS szCmpix6
312 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6
313 MS szCmpix
172 MS szCmpixx
265 MS szCmpix6
313 MS szCmpix
172 MS szCmpixx
250 MS szCmpix6
312 MS average for szCmpix
172 MS average for szCmpixx
257 MS average for szCmpix6
Press ENTER to exit
Very good. The new one is faster on my machine also. Still need to fix the earlier bug, but looks like a winner :U
Here's a fix for the problem (szCmpixx4), and it even runs a hair faster for some reason I don't understand :wink
218 MS szCmpix
125 MS szCmpixx
203 MS szCmpix6
110 MS szCmpixx4
250 MS szCmpix
109 MS szCmpixx
203 MS szCmpix6
125 MS szCmpixx4
219 MS szCmpix
141 MS szCmpixx
187 MS szCmpix6
109 MS szCmpixx4
219 MS szCmpix
125 MS szCmpixx
203 MS szCmpix6
109 MS szCmpixx4
235 MS szCmpix
109 MS szCmpixx
203 MS szCmpix6
125 MS szCmpixx4
219 MS szCmpix
125 MS szCmpixx
187 MS szCmpix6
125 MS szCmpixx4
219 MS szCmpix
125 MS szCmpixx
203 MS szCmpix6
109 MS szCmpixx4
219 MS szCmpix
109 MS szCmpixx
219 MS szCmpix6
109 MS szCmpixx4
224 MS average for szCmpix
121 MS average for szCmpixx
201 MS average for szCmpix6
115 MS average for szCmpixx4
And here it is with a few small changes, a hair faster again, and a little cleaner-
align 16
szCmpixx5 proc src:DWORD,dst:DWORD,ln:DWORD
    sub eax, eax
    cmp eax,ln                      ; just quit on zero length
    je done
    push ebx
    push esi
    push edi
    mov esi, src
    mov edi, dst
  align 4
  @@:
    movzx edx, BYTE PTR [esi+eax]
    movzx ebx, BYTE PTR [edi+eax]
    mov cl, [edx+tbl]
    add eax, 1                      ; setup eax for next, or miss number if nomatch
    cmp cl, [ebx+tbl]
    jne quit
    cmp eax,ln
    jb @b
    sub eax, eax
  quit:
    pop edi
    pop esi
    pop ebx
  done:
    ret
szCmpixx5 endp
Jim,
I did a reordering with the two CL operations and the algo clocked up about 10% or so faster than the latest one you posted but I ran both on the Sempron 2.4 and the timings were reversed so the code design now is reducing down to the differences between AMD and Intel hardware. I spaced each benchmark by 1 second to stabilise them and it seems to be more reliable. For whatever reason, the benchmark will not run on an old Celeron I am setting up at the moment.
Here are the two timings.
PIV timing
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
157 MS szCmpixx
171 MS szCmpi5
157 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
157 MS szCmpixx
171 MS szCmpi5
157 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS average for szCmpixx
171 MS average for szCmpi5
Press ENTER to exit
Sempron 2.4 timing
171 MS szCmpixx
157 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
157 MS szCmpi5
171 MS szCmpixx
157 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
157 MS szCmpi5
171 MS average for szCmpixx
156 MS average for szCmpi5
Press ENTER to exit
LATER :
I got it to build on the old Celeron and your version is clearly faster.
790 versus 730.
My results:
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
141 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
126 MS average for szCmpixx
109 MS average for szCmpi5
Pretty darn good compared to the original. Obviously I can't compete with the old compare to A-Z method any longer, which is as it should be, a table should certainly be faster :bg
Jim,
I think we got a good result, thanks for your help here. :U
Quote: I think we got a good result, thanks for your help here.
Yes, thank you for letting me play :bg
Not to beat a dead horse here, but I had one other thought last night. The normal use of this procedure would be to test two strings that are usually identical. The worst case I have been testing, where every alphabetic character has a different case in the two files, would rarely happen. Therefore, we should check for the equal condition first, and only then see whether it is a case difference. I tried it with your test code by commenting out the code that converts the file to lower case, so that all the letters that were already in upper case would match, and changed the code to check for equality first. The result was another 10% saving in execution time!
;mov hmem, lcase$(hmem) ;************************ commented this out for test
mov hbuf, ucase$(hbuf)
The new routine is szCmpi6....
93 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
110 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
110 MS szCmpi5
109 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
109 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
110 MS szCmpi5
96 MS average for szCmpi6
109 MS average for szCmpi5
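In C terms, the equal-first idea described above amounts to something like the following. This is a sketch of the technique, not the szCmpi6 source; lc_table, init_lc_table and cmpi_equal_first are names made up for the example.

```c
#include <stddef.h>

/* Lower-case folding table, filled by init_lc_table(). Hypothetical name. */
unsigned char lc_table[256];

void init_lc_table(void)
{
    for (int i = 0; i < 256; i++)
        lc_table[i] = (unsigned char)i;
    for (int c = 'A'; c <= 'Z'; c++)
        lc_table[c] = (unsigned char)(c + 32);
}

/* Compare the raw bytes first and only go to the table when they differ.
   Returns 0 on a match, else the 1-based mismatch position. */
size_t cmpi_equal_first(const unsigned char *a, const unsigned char *b, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char x = a[i];
        unsigned char y = b[i];
        if (x == y)
            continue;                   /* identical bytes: no table access */
        if (lc_table[x] != lc_table[y])
            return i + 1;               /* differs even after folding */
    }
    return 0;
}
```

The saving depends on the data: identical bytes skip both table reads, but bytes that differ only by case pay for an extra compare and branch.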
Jim,
It's a good idea, as it saves two table accesses if they are the same. It's clocking about 10% faster on the PIV as well. I will have a play with the idea.
Sorry Hutch, just when I thought I was done, my fevered brain came up with this--
110 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
93 MS szCmpi6
79 MS szCmpi7
93 MS szCmpi6
78 MS szCmpi7
110 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
109 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
109 MS szCmpi6
78 MS szCmpi7
110 MS szCmpi6
78 MS szCmpi7
93 MS szCmpi6
79 MS szCmpi7
93 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
98 MS average for szCmpi6
78 MS average for szCmpi7
The code is not as pretty to look at, but dang, it's another 20% faster!!!!
My earlier jubilation has died as I now realize we are opening the same can of worms as the topics on optimizing szCopy: compare 8 bytes at a time with MMX until a mismatch and then drop to one byte at a time, worry about page alignment, MMX availability, SSE2, etc, etc, etc... :dazzled:
YUCK!
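For what the 8-bytes-at-a-time approach looks like in outline, here is a rough C sketch. It uses plain 64-bit integer loads through memcpy instead of MMX, so it is a stand-in for the technique rather than the MMX code itself; cmpi_chunked is a made-up name and tolower (default C locale) replaces the table.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Skip ahead 8 bytes whenever two 8-byte chunks are bit-for-bit equal,
   otherwise check a single byte with case folding.
   Returns 0 on a match, else the 1-based mismatch position. */
size_t cmpi_chunked(const unsigned char *a, const unsigned char *b, size_t len)
{
    size_t i = 0;
    while (i < len) {
        if (len - i >= 8) {
            uint64_t wa, wb;
            memcpy(&wa, a + i, 8);      /* memcpy sidesteps alignment issues */
            memcpy(&wb, b + i, 8);
            if (wa == wb) {
                i += 8;                 /* whole chunk identical */
                continue;
            }
        }
        /* Chunk differs, or fewer than 8 bytes remain: one byte at a time. */
        if (tolower(a[i]) != tolower(b[i]))
            return i + 1;
        i += 1;
    }
    return 0;
}
```

Even this simplified form already has to special-case the tail, and it never reads past len, which is the page-boundary worry mentioned above.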
Hi JimG and Hutch. I ran the latest test on the P3 that I am using and I didn't see the improvement that your tests showed with AMD. The last routine was faster, but only by about 5%. I think you are right to stick with something that looks as good as it performs!
I ran out of puff at about 2am this morning, but my general impression of Jim's latest idea was that while it should be faster with matches, in testing it pays a big performance penalty on mismatches that is not outweighed by the gains on matches. With low instruction count loop code, the switching code for the mismatch makes the main loop a lot slower when it's not taken.
We do have a couple that are now far faster than the original using XLATB, and I found another mod on the two CL operations that tested faster on a PIV: replace one of them with a MOVZX, do the compare with an XOR and test for zero. I have a few things to do today but I will get it coded up soon and then impose on a few people to test it on different processors.
I temporarily have a 480 Celeron to test on but I will need to impose on Phil for testing on a PIII as I don't have one set up and running at the moment.
Here are the two versions that I have been playing with: another change of my own and a modified version of Jim's using the same mod. The performance is identical on my PIV but Jim's is noticeably faster on my Sempron 2.4, a ratio of 200 to 156 MS, so if these test up OK on a PIII, I am inclined to go with the modified version of Jim's.
Here are the results on a 996 MHz P3. The exe in your zip only tested two functions.
C:\ASM\test\BMARK4>bmark
360 MS szCmpi6
343 MS szCmpi5
375 MS szCmpi6
344 MS szCmpi5
375 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
360 MS szCmpi6
343 MS szCmpi5
360 MS szCmpi6
343 MS szCmpi5
360 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
375 MS szCmpi6
359 MS szCmpi5
375 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
360 MS szCmpi6
343 MS szCmpi5
363 MS average for szCmpi6
344 MS average for szCmpi5
Press ENTER to exit
Phil,
Did you rename one of the output names? The zip I posted had "szCmpixx" and "szCmpi5".
Hi Hutch-
Perhaps you posted an old one, it had the ones enabled that Phil showed. Here's one with all three enabled, plus my Cmpi8 which has one less compare per loop and should run the fastest except in the rare case where every character has the opposite case between the two files being compared.
Phil- If you get a chance, try this one and let me know your results.
Edited-- Phil, ignore this and try the bmark7.zip I posted last. Thanks.
The last portion of my results are-
125 MS szCmpixx
110 MS szCmpi5
93 MS szCmpi6
78 MS szCmpi8
125 MS szCmpixx
110 MS szCmpi5
93 MS szCmpi6
78 MS szCmpi8
125 MS average for szCmpiix
111 MS average for szCmpi5
95 MS average for szCmpi6
79 MS average for szCmpi8
Press ENTER to exit
I think you guys should give better function names. I thought I was bad. :bdg
Victor,
The procedure name is szCmpi, the rest are variations for testing. :toothy
Jim,
I put the two later versions I had in the last post into your testbed then ran the benchmark. On my PIV, I get these results.
comment * ------------------------------
mismatch compares
140 MS average for szCmpiix
140 MS average for szCmpi5
265 MS average for szCmpi6
261 MS average for szCmpi8
matched compares
140 MS average for szCmpiix
140 MS average for szCmpi5
93 MS average for szCmpi6
78 MS average for szCmpi8
------------------------------ *
On the old Celeron, the timings for all only vary about 5% but on the Sempron, the szCmpi5 beats the rest with a mismatched test by a long way.
The problem with the additional branching code is that it only pays off with exact matches, and in the worst case it is much slower. Wild swings in timing make the procedure unpredictable performance-wise, whereas the two versions that always check the table run in the same time under any condition. The version you modified, which I have also modified, runs at the same speed on the PIV and faster on the Sempron, so it is a better all-round algorithm.
This is the modified version of your previous modification.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 16
szCmpi5 proc src:DWORD,dst:DWORD,ln:DWORD
    sub eax, eax
    cmp eax, ln                     ; just quit on zero length
    je done
    push ebx
    push esi
    push edi
    mov esi, src
    mov edi, dst
  align 4
  @@:
    movzx edx, BYTE PTR [esi+eax]
    movzx ebx, BYTE PTR [edi+eax]
    movzx ecx, BYTE PTR [edx+tbl]   ; <<<-- modified here
    add eax, 1
    cmp cl, [ebx+tbl]
    jne quit
    cmp eax,ln
    jb @b
    sub eax, eax
  quit:
    pop edi
    pop esi
    pop ebx
  done:
    ret
szCmpi5 endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Hutch-
Ok, we're getting close. I took szCmpi5 assuming this is your favorite, and added the code to remove one compare from each loop (szCmpi5x).
I compiled three times, once using your code to change the case of every letter in both buffers for the worst possible match, which I called Bmark1.exe
Once changing the case on only one buffer so there was about a 50% mismatch, which I called Bmark2.exe
And once with no changes so the buffers are identical, called Bmark3.exe. The results are remarkably consistent, and 15 to 20 percent faster on AMD.
How do they test on fast Pentiums?
Bmark1.exe
Total mismatch:
mov hmem, lcase$(hmem) ; changed every letter to a mismatch
mov hbuf, ucase$(hbuf)
109 MS average for szCmpi5
93 MS average for szCmpi5x
------------------------------------------------------------------
Bmark2.exe
Half mismatch:
mov hmem, lcase$(hmem) ;only changed one copy of file
; mov hbuf, ucase$(hbuf)
110 MS average for szCmpi5
94 MS average for szCmpi5x
Press ENTER to exit
-----------------------------------------------------------
Bmark3.exe
No mismatch:
;mov hmem, lcase$(hmem) ; identical buffers, no changes
;mov hbuf, ucase$(hbuf)
109 MS average for szCmpi5
95 MS average for szCmpi5x
Jim,
I get the same results on all 3.
140 MS average for szCmpi5
156 MS average for szCmpi5x
Interesting. I dusted off an old celeron and this is what I got--
Celeron 2.00Ghz 496 MB ram
188 MS average for szCmpi5
187 MS average for szCmpi5x
Jim,
Will you answer the PM I sent you, I need the info.
Gee, I never think to check for those, shouldn't it ring a bell or something?? :wink
I just went and looked. I had "send me an email" checked but I didn't see any email. I changed it so now I have "do a popup" checked. Maybe that will work....