CmpI

hutch-- · June 29, 2005, 05:14:05 PM

Jim,

I reordered the two CL operations and the algo clocked up about 10% or so faster than the latest one you posted, but when I ran both on the Sempron 2.4 the timings were reversed. The code design is now reducing down to the differences between AMD and Intel hardware. I spaced each benchmark by 1 second to stabilise them and it seems to be more reliable. For whatever reason, the benchmark will not run on an old Celeron I am setting up at the moment.

Here are the two timings.

Code Select


PIV timing

156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
157 MS szCmpixx
171 MS szCmpi5
157 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
157 MS szCmpixx
171 MS szCmpi5
157 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5
156 MS szCmpixx
172 MS szCmpi5

156 MS average for szCmpixx
171 MS average for szCmpi5

Press ENTER to exit

Sempron 2.4 timing

171 MS szCmpixx
157 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
157 MS szCmpi5
171 MS szCmpixx
157 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
156 MS szCmpi5
172 MS szCmpixx
157 MS szCmpi5

171 MS average for szCmpixx
156 MS average for szCmpi5

Press ENTER to exit

LATER :

I got it to build on the old Celeron and your version is clearly faster.

790 versus 730.

[attachment deleted by admin]

Jimg · June 29, 2005, 06:20:54 PM

My results:

Code Select

125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
141 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
109 MS szCmpi5
125 MS szCmpixx
110 MS szCmpi5

126 MS average for szCmpixx
109 MS average for szCmpi5

Pretty darn good compared to the original. Obviously I can't compete with the old compare-to-A-Z method any longer, which is as it should be. A table should certainly be faster :bg

hutch-- · June 30, 2005, 01:04:45 PM

Jim,

I think we got a good result, thanks for your help here. :U

Jimg · June 30, 2005, 02:19:06 PM

Quote: I think we got a good result, thanks for your help here.

Yes, thank you for letting me play :bg

Not to beat a dead horse here, but I had one other thought last night. The normal use of this procedure would be to test two strings that are normally identical. The worst case I have been testing, where every alpha character was a different case between the two files, would rarely happen. Therefore, we should check for the equal condition first, and only then see if it is a case problem. I tried it with your test code by commenting out the code that converts the file to lower case, so that all the letters that were already in upper case would match. I changed the code to check for equality first. The result was another 10% savings in execution time!

Code Select

    ;mov hmem, lcase$(hmem) ;************************  commented this out for test
    mov hbuf, ucase$(hbuf)

The new routine is szCmpi6....

Code Select

93 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
110 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
110 MS szCmpi5
109 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
109 MS szCmpi6
110 MS szCmpi5
93 MS szCmpi6
110 MS szCmpi5
94 MS szCmpi6
109 MS szCmpi5
94 MS szCmpi6
110 MS szCmpi5

96 MS average for szCmpi6
109 MS average for szCmpi5

[attachment deleted by admin]

hutch-- · June 30, 2005, 02:40:25 PM

Jim,

It's a good idea as it saves two table accesses if they are the same. It's clocking up about 10% faster on the PIV as well. I will have a play with the idea.
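For readers who want the equal-first idea in a form that is easy to run, here is a minimal C sketch. It is not the szCmpi6 source (the attachment is gone). The names `fold_tbl`, `init_fold_tbl` and `cmpi_equal_first` are made up for illustration, and lower-casing A-Z in the table is an assumption about what the thread's `tbl` does.

```c
#include <stddef.h>

/* Hypothetical stand-in for the thread's tbl: maps each byte to a
   case-folded value. Folding A-Z to lower case is an assumption. */
static unsigned char fold_tbl[256];

static void init_fold_tbl(void)
{
    for (int i = 0; i < 256; i++) {
        fold_tbl[i] = (unsigned char)((i >= 'A' && i <= 'Z') ? i + ('a' - 'A') : i);
    }
}

/* Returns 0 if the first ln bytes match ignoring ASCII case,
   otherwise the 1-based position of the first mismatch. */
static size_t cmpi_equal_first(const unsigned char *src,
                               const unsigned char *dst, size_t ln)
{
    for (size_t i = 0; i < ln; i++) {
        unsigned char a = src[i];
        unsigned char b = dst[i];
        if (a == b) {
            continue;                   /* identical bytes: no table access */
        }
        if (fold_tbl[a] != fold_tbl[b]) {
            return i + 1;               /* differs even after case folding */
        }
    }
    return 0;
}
```

Call `init_fold_tbl()` once before the first compare. The extra branch only saves work when most bytes are already identical, which is the case the test above was set up to measure.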

Jimg · June 30, 2005, 03:40:48 PM

Sorry Hutch, just when I thought I was done, my fevered brain came up with this--

Code Select

110 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
93 MS szCmpi6
79 MS szCmpi7
93 MS szCmpi6
78 MS szCmpi7
110 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
109 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
109 MS szCmpi6
78 MS szCmpi7
110 MS szCmpi6
78 MS szCmpi7
93 MS szCmpi6
79 MS szCmpi7
93 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7
94 MS szCmpi6
78 MS szCmpi7

98 MS average for szCmpi6
78 MS average for szCmpi7

The code is not as pretty to look at, but dang, it's another 20% faster!!!!

[attachment deleted by admin]

Jimg · June 30, 2005, 09:03:36 PM

My earlier jubilation has died as I now realize we are entering the same can of worms as the topics on optimizing szCopy: compare 8 bytes at a time with MMX until a mismatch, then drop to one byte at a time, worry about page alignment, MMX availability, SSE2, etc, etc, etc... :dazzled:

YUCK!
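To show what that can of worms looks like in its simplest form, here is a rough C sketch of "compare 8 bytes at a time until a mismatch, then drop to one byte at a time". It uses a plain 64-bit integer instead of MMX, and `memcpy` loads to sidestep the alignment worry. Nothing here was posted in the thread: `fold_tbl`, `init_fold_tbl` and `cmpi_wide` are hypothetical names, and the lower-casing table is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical case-folding table (A-Z folded to lower case, an assumption). */
static unsigned char fold_tbl[256];

static void init_fold_tbl(void)
{
    for (int i = 0; i < 256; i++) {
        fold_tbl[i] = (unsigned char)((i >= 'A' && i <= 'Z') ? i + ('a' - 'A') : i);
    }
}

/* Returns 0 if the first ln bytes match ignoring ASCII case,
   otherwise the 1-based position of the first mismatch. */
static size_t cmpi_wide(const unsigned char *src,
                        const unsigned char *dst, size_t ln)
{
    size_t i = 0;

    /* Wide phase: skip 8-byte chunks that are bit-for-bit identical. */
    while (i + 8 <= ln) {
        uint64_t a, b;
        memcpy(&a, src + i, 8);     /* memcpy load: no alignment requirement */
        memcpy(&b, dst + i, 8);
        if (a != b) {
            break;                  /* something differs: go bytewise */
        }
        i += 8;
    }

    /* Byte phase: table compare for everything that is left. */
    for (; i < ln; i++) {
        if (fold_tbl[src[i]] != fold_tbl[dst[i]]) {
            return i + 1;
        }
    }
    return 0;
}
```

Once the wide phase breaks it never resumes, so a single case difference near the start sends the rest of the string through the byte loop. That is the same best-case versus worst-case swing as the equal-first check, only bigger.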

Phil · June 30, 2005, 10:06:31 PM

Hi JimG and Hutch. I ran the latest test on the P3 that I am using and I didn't see the improvement that your tests showed with AMD. The last routine was faster but only by about 5%. I think you are right to stick with something that looks as good as it performs!

hutch-- · July 01, 2005, 03:25:35 AM

I ran out of puff at about 2am this morning but my general impression with Jim's latest idea was that while it should be faster with matches, in testing it pays a big performance penalty on mismatches which is not outweighed by the gains on matches. With low instruction count loop code, the switching code for the mismatch makes the main loop a lot slower when it's not taken.

We do have a couple that are now far faster than the original using XLATB and I found another mod on the two CL operations that tested up faster on a PIV. Replace one of them with a MOVZX and do the compare with an XOR and test for zero. I have a few things to do today but I will get it coded up soon and then impose on a few people to test it on different processors.

I temporarily have a 480 Celeron to test on but I will need to impose on Phil for testing on a PIII as I don't have one set up and running at the moment.

hutch-- · July 01, 2005, 08:20:32 AM

Here are the two versions that I have been playing with: another change of mine, and a modified version of Jim's using the same mod. The performance is identical on my PIV but Jim's is noticeably faster on my Sempron 2.4, a ratio of 200 to 156 MS, so if these test up OK on a PIII, I am inclined to go with the modified version of Jim's.

[attachment deleted by admin]

Phil · July 01, 2005, 09:03:24 AM

Here are the results on a 996 MHz P3. The exe in your zip only tested two functions.

Code Select


C:\ASM\test\BMARK4>bmark
360 MS szCmpi6
343 MS szCmpi5
375 MS szCmpi6
344 MS szCmpi5
375 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
360 MS szCmpi6
343 MS szCmpi5
360 MS szCmpi6
343 MS szCmpi5
360 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
375 MS szCmpi6
359 MS szCmpi5
375 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
359 MS szCmpi6
344 MS szCmpi5
360 MS szCmpi6
343 MS szCmpi5

363 MS average for szCmpi6
344 MS average for szCmpi5

Press ENTER to exit

hutch-- · July 01, 2005, 09:13:04 AM

Phil,

Did you rename one of the output names? The zip I posted had "szCmpixx" and "szCmpi5".

Jimg · July 01, 2005, 01:00:13 PM

Hi Hutch-
Perhaps you posted an old one; it had the ones enabled that Phil showed. Here's one with all three enabled, plus my Cmpi8, which has one less compare per loop and should run the fastest except in the rare case where every character has the opposite case between the two files being compared.

Phil- If you get a chance, try this one and let me know your results.
Edited-- Phil, ignore this and try the bmark7.zip I posted last. Thanks.

The last portion of my results are-

Code Select

125 MS szCmpixx
110 MS szCmpi5
93 MS szCmpi6
78 MS szCmpi8
125 MS szCmpixx
110 MS szCmpi5
93 MS szCmpi6
78 MS szCmpi8

125 MS average for szCmpiix
111 MS average for szCmpi5
95 MS average for szCmpi6
79 MS average for szCmpi8

Press ENTER to exit

[attachment deleted by admin]

roticv · July 01, 2005, 02:06:26 PM

I think you guys should give better function names. I thought I was bad. :bdg

hutch-- · July 01, 2005, 02:39:05 PM

Victor,

The procedure name is szCmpi, the rest are variations for testing. :toothy

Jim,

I put the two later versions I had in the last post into your testbed then ran the benchmark. On my PIV, I get these results.

Code Select


comment * ------------------------------

        mismatch compares

        140 MS average for szCmpiix
        140 MS average for szCmpi5
        265 MS average for szCmpi6
        261 MS average for szCmpi8
        
        matched compares

        140 MS average for szCmpiix
        140 MS average for szCmpi5
        93 MS average for szCmpi6
        78 MS average for szCmpi8
        
        ------------------------------ *

On the old Celeron, the timings for all only vary about 5% but on the Sempron, the szCmpi5 beats the rest with a mismatched test by a long way.

The problem with the additional branching code is that it only pays off with exact matches, and in the worst case it is much slower. Wild swings in timing make the procedure unpredictable performance-wise, whereas the two versions that always check the table run in the same time under any condition. The version you modified, which I have also modified, runs at the same speed on the PIV and faster on the Sempron, so it is a better all-round algorithm.

This is the modified version of your previous modification.

Code Select


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 16

szCmpi5 proc src:DWORD,dst:DWORD,ln:DWORD

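    sub eax, eax                    ; eax = 0: byte index, and the "match" return value
    cmp eax, ln                     ; just quit on zero length
    je done
    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst

  align 4
  @@:
    movzx edx, BYTE PTR [esi+eax]   ; next byte of src
    movzx ebx, BYTE PTR [edi+eax]   ; next byte of dst
    movzx ecx, BYTE PTR [edx+tbl]   ; <<<-- modified here
    add eax, 1                      ; advance the index before the compare
    cmp cl, [ebx+tbl]               ; compare the two table entries
    jne quit                        ; mismatch: eax = 1-based position of the byte
    cmp eax,ln
    jb @b                           ; loop until ln bytes have been compared

    sub eax, eax                    ; all ln bytes matched: return 0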

  quit:
    pop edi
    pop esi
    pop ebx
  done:
    ret

szCmpi5 endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
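For anyone who wants to check results against something outside MASM, the loop above can be restated in C roughly as follows. This is only a sketch: `tbl` itself is not shown in this thread, so the `fold_tbl` here is a hypothetical stand-in that folds A-Z to lower case, and `init_fold_tbl` and `cmpi_table` are illustrative names.

```c
#include <stddef.h>

/* Hypothetical stand-in for tbl: one entry per byte value, with
   A-Z folded to a-z (the folding direction is an assumption). */
static unsigned char fold_tbl[256];

static void init_fold_tbl(void)
{
    for (int i = 0; i < 256; i++) {
        fold_tbl[i] = (unsigned char)((i >= 'A' && i <= 'Z') ? i + ('a' - 'A') : i);
    }
}

/* Mirrors the szCmpi5 loop: always two table lookups per byte pair.
   Returns 0 on a match over ln bytes, otherwise the 1-based position
   of the first mismatch (the value szCmpi5 leaves in eax). */
static size_t cmpi_table(const unsigned char *src,
                         const unsigned char *dst, size_t ln)
{
    for (size_t i = 0; i < ln; i++) {
        if (fold_tbl[src[i]] != fold_tbl[dst[i]]) {
            return i + 1;
        }
    }
    return 0;
}
```

Call `init_fold_tbl()` once, then `cmpi_table(a, b, len)`. Every byte pair costs the same two lookups whether or not the bytes already match, which is why this form runs in the same time for matched and mismatched input.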
