News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Timing String Compare

Started by UlliN, January 16, 2007, 11:51:41 AM

Previous topic - Next topic

UlliN

Hi,

I've played around with some kind of string comparison including "szcmp" ,a DWORD-algo, REPE CMPSD and REPE CMPSB.
The result shows, that the DWORD-Algo seems to be the fastest, if mismatching doesn't occur at first byte. CMPSx obviously needs some extra cycles at startup but runs with constant additional amount of time depending on mismatch position.
Its remarkable, that each algo has its own characteristic timing.  szcmp has a "time gap" between i=8 and i=9. Same for DWORD-algo between i=35 and i=36. Timing of CMPSD changes after QWORD.  Any explanations?

I've attached a test scenario with all sources . The szcmp-routine is sligthly modified because of comparison reasons. Also added is a "write"-version of szMid. Run makeit.bat , test.exe.
Any improvement would be appreciated ;-)

Regards,
Ulli



Running on AMD64 FX-57

Comparing strings
Mism DWORD   CMPSD       CMPSB      szCmp
i=0    7 cyc       35 cyc      34 cyc      6 cyc
i=1   11 cyc      35 cyc      36 cyc      10 cyc
i=2   11 cyc      35 cyc      38 cyc      18 cyc
i=3   11 cyc      35 cyc      40 cyc      20 cyc
i=4   14 cyc      40 cyc      42 cyc      25 cyc
i=5   14 cyc      40 cyc      45 cyc      30 cyc
i=6   14 cyc      40 cyc      48 cyc      35 cyc
i=7   14 cyc      40 cyc      49 cyc      40 cyc
i=8   16 cyc      40 cyc      51 cyc      45 cyc
i=9   16 cyc      40 cyc      53 cyc      66 cyc
i=10   16 cyc      40 cyc      56 cyc      72 cyc
i=11   16 cyc      40 cyc      58 cyc      75 cyc
i=12   19 cyc      45 cyc      61 cyc      83 cyc
i=13   19 cyc      45 cyc      62 cyc      85 cyc
i=14   19 cyc      45 cyc      64 cyc      92 cyc
i=15   19 cyc      45 cyc      67 cyc      96 cyc
i=16   23 cyc      45 cyc      69 cyc      102 cyc
i=17   23 cyc      45 cyc      72 cyc      106 cyc
i=18   23 cyc      45 cyc      73 cyc      113 cyc
i=19   23 cyc      45 cyc      75 cyc      116 cyc
i=20   27 cyc      51 cyc      78 cyc      123 cyc
i=21   27 cyc      51 cyc      81 cyc      126 cyc
i=22   27 cyc      51 cyc      82 cyc      133 cyc
i=23   27 cyc      51 cyc      85 cyc      136 cyc
i=24   29 cyc      51 cyc      87 cyc      143 cyc
i=25   28 cyc      51 cyc      90 cyc      146 cyc
i=26   29 cyc      51 cyc      91 cyc      153 cyc
i=27   29 cyc      51 cyc      93 cyc      156 cyc
i=28   31 cyc      56 cyc      95 cyc      163 cyc
i=29   31 cyc      56 cyc      97 cyc      166 cyc
i=30   31 cyc      56 cyc      100 cyc      173 cyc
i=31   31 cyc      56 cyc      102 cyc      176 cyc
i=32   34 cyc      57 cyc      104 cyc      183 cyc
i=33   34 cyc      56 cyc      107 cyc      186 cyc
i=34   34 cyc      56 cyc      109 cyc      193 cyc
i=35   34 cyc      56 cyc      112 cyc      196 cyc
i=36   54 cyc      61 cyc      114 cyc      203 cyc
i=37   54 cyc      61 cyc      116 cyc      206 cyc
i=38   55 cyc      62 cyc      118 cyc      213 cyc
i=39   54 cyc      61 cyc      120 cyc      216 cyc
i=40   57 cyc      61 cyc      123 cyc      223 cyc

Press enter to exit...

[attachment deleted by admin]

six_L

QuoteRunning on Intel(R) pentium(R) 1.4GHz

Comparing strings
Mismatch DWORD-wise  CMPSD       CMPSB      szCmp
i=0   9 cyc      91 cyc      75 cyc      8 cyc   
i=1   42 cyc      91 cyc      80 cyc      10 cyc   
i=2   33 cyc      91 cyc      86 cyc      13 cyc   
i=3   33 cyc      91 cyc      91 cyc      16 cyc   
i=4   26 cyc      93 cyc      95 cyc      20 cyc   
i=5   26 cyc      93 cyc      99 cyc      23 cyc   
i=6   44 cyc      93 cyc      103 cyc      26 cyc   
i=7   26 cyc      93 cyc      107 cyc      28 cyc   
i=8   42 cyc      98 cyc      111 cyc      31 cyc   
i=9   29 cyc      98 cyc      115 cyc      41 cyc   
i=10   42 cyc      98 cyc      119 cyc      41 cyc   
i=11   28 cyc      98 cyc      123 cyc      46 cyc   
i=12   48 cyc      102 cyc      127 cyc      47 cyc   
i=13   48 cyc      102 cyc      132 cyc      53 cyc   
i=14   48 cyc      102 cyc      137 cyc      50 cyc   
i=15   48 cyc      102 cyc      143 cyc      54 cyc   
i=16   51 cyc      108 cyc      148 cyc      57 cyc   
i=17   51 cyc      108 cyc      153 cyc      65 cyc   
i=18   51 cyc      108 cyc      157 cyc      67 cyc   
i=19   51 cyc      108 cyc      161 cyc      64 cyc   
i=20   63 cyc      112 cyc      166 cyc      69 cyc   
i=21   63 cyc      112 cyc      171 cyc      71 cyc   
i=22   63 cyc      112 cyc      175 cyc      86 cyc   
i=23   63 cyc      112 cyc      180 cyc      75 cyc   
i=24   60 cyc      118 cyc      184 cyc      95 cyc   
i=25   40 cyc      118 cyc      189 cyc      94 cyc   
i=26   40 cyc      118 cyc      194 cyc      89 cyc   
i=27   40 cyc      118 cyc      197 cyc      89 cyc   
i=28   44 cyc      119 cyc      203 cyc      90 cyc   
i=29   43 cyc      119 cyc      206 cyc      95 cyc   
i=30   43 cyc      119 cyc      211 cyc      96 cyc   
i=31   44 cyc      119 cyc      216 cyc      104 cyc   
i=32   53 cyc      128 cyc      221 cyc      115 cyc   
i=33   47 cyc      128 cyc      226 cyc      125 cyc   
i=34   47 cyc      128 cyc      230 cyc      108 cyc   
i=35   47 cyc      128 cyc      233 cyc      111 cyc   
i=36   51 cyc      130 cyc      237 cyc      114 cyc   
i=37   51 cyc      130 cyc      241 cyc      117 cyc   
i=38   51 cyc      130 cyc      246 cyc      120 cyc   
i=39   51 cyc      130 cyc      249 cyc      123 cyc   
i=40   54 cyc      134 cyc      253 cyc      127 cyc   

Press enter to exit...
regards

hutch--

Here are the results from my 2.8gig Northwood core PIV.


Comparing strings
Mismatch DWORD-wise  CMPSD       CMPSB      szCmp
i=0   7 cyc      197 cyc      125 cyc      7 cyc
i=1   43 cyc      197 cyc      128 cyc      13 cyc
i=2   42 cyc      197 cyc      132 cyc      12 cyc
i=3   41 cyc      197 cyc      136 cyc      17 cyc
i=4   22 cyc      164 cyc      141 cyc      21 cyc
i=5   21 cyc      164 cyc      145 cyc      24 cyc
i=6   28 cyc      164 cyc      150 cyc      28 cyc
i=7   22 cyc      163 cyc      153 cyc      33 cyc
i=8   27 cyc      168 cyc      158 cyc      33 cyc
i=9   27 cyc      168 cyc      162 cyc      50 cyc
i=10   28 cyc      169 cyc      166 cyc      66 cyc
i=11   26 cyc      168 cyc      170 cyc      66 cyc
i=12   32 cyc      172 cyc      174 cyc      70 cyc
i=13   31 cyc      172 cyc      178 cyc      73 cyc
i=14   32 cyc      172 cyc      182 cyc      76 cyc
i=15   32 cyc      172 cyc      187 cyc      79 cyc
i=16   37 cyc      177 cyc      191 cyc      83 cyc
i=17   36 cyc      177 cyc      195 cyc      87 cyc
i=18   36 cyc      176 cyc      199 cyc      88 cyc
i=19   38 cyc      176 cyc      204 cyc      96 cyc
i=20   40 cyc      180 cyc      207 cyc      100 cyc
i=21   40 cyc      180 cyc      211 cyc      100 cyc
i=22   41 cyc      180 cyc      215 cyc      101 cyc
i=23   40 cyc      181 cyc      221 cyc      104 cyc
i=24   53 cyc      185 cyc      223 cyc      108 cyc
i=25   52 cyc      184 cyc      228 cyc      113 cyc
i=26   52 cyc      186 cyc      232 cyc      114 cyc
i=27   52 cyc      184 cyc      237 cyc      118 cyc
i=28   58 cyc      188 cyc      240 cyc      122 cyc
i=29   58 cyc      188 cyc      245 cyc      123 cyc
i=30   57 cyc      189 cyc      248 cyc      126 cyc
i=31   58 cyc      189 cyc      253 cyc      129 cyc
i=32   75 cyc      193 cyc      258 cyc      132 cyc
i=33   74 cyc      192 cyc      262 cyc      135 cyc
i=34   75 cyc      192 cyc      267 cyc      140 cyc
i=35   80 cyc      193 cyc      270 cyc      155 cyc
i=36   131 cyc      197 cyc      274 cyc      147 cyc
i=37   143 cyc      197 cyc      278 cyc      148 cyc
i=38   128 cyc      197 cyc      282 cyc      160 cyc
i=39   138 cyc      197 cyc      287 cyc      157 cyc
i=40   136 cyc      201 cyc      290 cyc      157 cyc

Press enter to exit...


I noticed that you are using the older version of szCmp so I changed it to the version in the last service pack and got these timings.


Comparing strings
Mismatch DWORD-wise  CMPSD       CMPSB      szCmp
i=0   8 cyc      197 cyc      124 cyc      2 cyc
i=1   42 cyc      198 cyc      129 cyc      3 cyc
i=2   45 cyc      197 cyc      133 cyc      9 cyc
i=3   44 cyc      197 cyc      137 cyc      11 cyc
i=4   23 cyc      164 cyc      140 cyc      16 cyc
i=5   22 cyc      164 cyc      145 cyc      16 cyc
i=6   22 cyc      164 cyc      150 cyc      20 cyc
i=7   23 cyc      164 cyc      153 cyc      21 cyc
i=8   27 cyc      168 cyc      158 cyc      29 cyc
i=9   28 cyc      168 cyc      161 cyc      31 cyc
i=10   27 cyc      168 cyc      165 cyc      33 cyc
i=11   28 cyc      168 cyc      169 cyc      37 cyc
i=12   32 cyc      172 cyc      174 cyc      62 cyc
i=13   32 cyc      172 cyc      178 cyc      53 cyc
i=14   31 cyc      172 cyc      182 cyc      63 cyc
i=15   32 cyc      172 cyc      188 cyc      76 cyc
i=16   36 cyc      176 cyc      190 cyc      73 cyc
i=17   36 cyc      176 cyc      195 cyc      77 cyc
i=18   37 cyc      176 cyc      199 cyc      84 cyc
i=19   37 cyc      176 cyc      204 cyc      86 cyc
i=20   40 cyc      180 cyc      207 cyc      90 cyc
i=21   39 cyc      180 cyc      211 cyc      88 cyc
i=22   40 cyc      181 cyc      215 cyc      96 cyc
i=23   39 cyc      180 cyc      220 cyc      101 cyc
i=24   52 cyc      185 cyc      223 cyc      95 cyc
i=25   53 cyc      184 cyc      228 cyc      100 cyc
i=26   52 cyc      185 cyc      232 cyc      103 cyc
i=27   52 cyc      184 cyc      236 cyc      106 cyc
i=28   58 cyc      188 cyc      241 cyc      108 cyc
i=29   58 cyc      188 cyc      244 cyc      113 cyc
i=30   58 cyc      188 cyc      249 cyc      114 cyc
i=31   58 cyc      189 cyc      253 cyc      126 cyc
i=32   74 cyc      193 cyc      257 cyc      121 cyc
i=33   72 cyc      192 cyc      261 cyc      119 cyc
i=34   87 cyc      192 cyc      266 cyc      122 cyc
i=35   82 cyc      193 cyc      270 cyc      128 cyc
i=36   137 cyc      197 cyc      274 cyc      130 cyc
i=37   136 cyc      197 cyc      278 cyc      132 cyc
i=38   139 cyc      197 cyc      282 cyc      140 cyc
i=39   126 cyc      197 cyc      286 cyc      134 cyc
i=40   134 cyc      201 cyc      290 cyc      154 cyc

Press enter to exit...

Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

UlliN

Hello Hutch,

after installing servicepack2, I've got some nice results:
the new szCmp shows the same asymptotic behavior as CMPSB but without extra startup penalties.
In case of mismatching at a high position CMPSD seems the best choice ?

You've got to know, that string comparison is an integral part of most of our software. We daily sort, merge and process  huge files of address data (up to 30GB, 120mio Recs, flat files ;-)). In our environment(COBOL, host) zero-strings are unusual , so functions must accept a defined length. In addition comparison must result mostly in a "gt/eql/lt" not only "eq/ne". So some modifications are required to szCmp.

Please, can you explain the difference between the two szCmp versions and why the new one is faster?
Is it possible to improve the DWORD-algorithm that way?

Thank you

Regards,
Ulli


Comparing strings
Mismatch DWORD  CMPSD    CMPSB     szCmp_o   szCmp_n
i=0   7 cyc      34 cyc      34 cyc      6 cyc   6 cyc
i=1   13 cyc      34 cyc      36 cyc      10 cyc   10 cyc
i=2   14 cyc      34 cyc      38 cyc      18 cyc   12 cyc
i=3   13 cyc      34 cyc      40 cyc      20 cyc   14 cyc
i=4   15 cyc      36 cyc      42 cyc      25 cyc   18 cyc
i=5   15 cyc      36 cyc      45 cyc      30 cyc   21 cyc
i=6   15 cyc      36 cyc      47 cyc      35 cyc   21 cyc
i=7   15 cyc      36 cyc      49 cyc      40 cyc   27 cyc
i=8   16 cyc      38 cyc      52 cyc      45 cyc   29 cyc
i=9   16 cyc      38 cyc      53 cyc      65 cyc   31 cyc
....
i=100   101 cyc      91 cyc      256 cyc      526 cyc   259 cyc
i=101   101 cyc      91 cyc      257 cyc      528 cyc   260 cyc
i=102   100 cyc      90 cyc      259 cyc      535 cyc   262 cyc
i=103   101 cyc      90 cyc      261 cyc      538 cyc   259 cyc
i=104   103 cyc      92 cyc      263 cyc      545 cyc   262 cyc
i=105   103 cyc      92 cyc      266 cyc      548 cyc   269 cyc
i=106   103 cyc      92 cyc      268 cyc      555 cyc   272 cyc
i=107   103 cyc      92 cyc      270 cyc      558 cyc   268 cyc

hutch--

Ulli,

A couple of things, you need to test the DWORD algo on different data alignments as in most instances non aligned by 4 data in either string will drop the timings. The other is you need to calculate the time to test the length of the string as you pass it as a known length where the zero terminated types must check for the end.

I am in the middle of rebuilding the roof on my house at the moment so i will be busy for the next week or so but having a quick look at your DWORD algo you can get it slightly faster on low counts by removing the stack frame. This gives you an extra register EBP as well to work with as well. A DWORD algo on a long run if its aligned properly should be about 2 and a half times faster that a byte algo so there is probably room to get your sped up a bit.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

hutch--

Ulli,

Here is a test piece I whizzed up at about 3 am before going back to sleep. Its an aligned DWORD algo that runs at over 3 times faster on my dev PIV. Its not fully tested but seems to be working OK at the moment. It only returns a boolean value, 0 or 1 and not the greater, same or less than that you need for a string sort algo but it should give you some idea of the speed difference of a full DWORD algo if the data is aligned by 4 or greater.

[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

UlliN

hutch,

thank you for your efforts.
I've modified the code slightly to fit our requirements and have tested on different machines.  Compared to szCmp your dcmp-algo is first choice. The difference between these algos increases in favour of dcmp depending on the length of the matching run. The slower the machine, the more obvious. There are some outlier when mismatching at position > 32 (on AMD64) or > 16 (Athlon32). Prefetching?
Compared to DWORD-algo there's no significant difference until mismatch occurs at position 32(or 16). But when comparing longer (matching) strings  dcmp is faster.
Combined with the precheck of the first char and the change to result in (-1, 0, +1), the dcmp is the recommendation for string comparison.  ;-)

Thank you once again.
Regards,
Ulli


[attachment deleted by admin]

EduardoS

A sugestion:

StrCmpLng PROC uses edi esi str01:DWORD, str02:DWORD, strln:DWORD
    mov esi, str01
    mov edi, str02
    mov ecx, strln
    mov edx, ecx
    and ecx, 3
    je @F
    shl ecx, 2
    mov eax, [esi]
    sub eax, [edi]
    shl eax, cl
    jne diff

    @@:
    add esi, edx
    add edi, edx
    shr edx, 2
    jz equl
    neg edx
    xor ecx, ecx

    align 16
    @@:
    mov eax, [esi+edx*4]
    sub eax, [edi+edx*4]
    cmp ecx, eax
    inc edx
    ja @B
    jz equl

    diff:
    test eax, eax
    setg al
    movzx eax, al
    shl eax, 1
    dec eax

    equl:
    ret
StrCmpLng ENDP

Compare the results for 1000+ char strings with other methods ;)

hutch--

EduardoS,

Your algo does not preserve ESI and EDI, it could only be called by a procedure that has already preserved the two registers.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

EduardoS

hutch, i didn't use the PROLOGUE:none...

StrCmpLng PROC uses edi esi str01:DWORD, str02:DWORD, strln:DWORD

hutch--

 :bg

EduardoS,

My eyesight is failing in my old age, I jus did not see the USES syntax.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

UlliN

Hi Eduardo,

your algo seems slightly(not significant) faster than dcmp on long runs. But can you interpret the results of the following comparisons?

EAX
50396925         StrCmpLng : ABCD vs. DCBA   -> LESS         
-50396925        StrCmpLng : DCBA vs. ABCD   -> GREATER       
0                     StrCmpLng : ABCD vs. ABCD   -> EQUAL
-1                    StrCmpLng : ABCD vs. BBCD   -> LESS
1                     StrCmpLng : BBCD vs. ABCD   -> GREATER

IMO you've got to add a BWAP to the regs to be compared and also a tail processing routine if strlen != 0mod4.

Regards,
Ulli

EduardoS

I'm sorry for those bugs... Here the version properly debbuged:

StrCmpLng PROC uses edi esi str01:DWORD, str02:DWORD, strln:DWORD
    mov esi, str01
    mov edi, str02
    mov ecx, strln
    neg ecx
    mov edx, ecx
    and ecx, 3

    je @F
    shl ecx, 3
    mov eax, [esi]
    sub eax, [edi]
    shl eax, cl
    mov ecx, -1
    jne diff

    @@:
    sub esi, edx
    sub edi, edx
    sar edx, 2
    sub edx, ecx
    jz equl
    xor ecx, ecx

    align 16
    @@:
    mov eax, [esi+edx*4]
    sub eax, [edi+edx*4]
    cmp ecx, eax
    inc edx
    ja @B
    jnc equl

    diff:
    not ecx
    add edx, ecx
    and edx, ecx
    mov ecx, [esi+edx*4]
    mov edx, [edi+edx*4]
    bswap ecx
    bswap edx
    xor eax, eax
    cmp edx, ecx
    adc eax, 0
    shl eax, 1
    dec eax

    equl:
    ret
StrCmpLng ENDP


On my A64 3200+:

Comparing strings
Mismatch DWORD-wise       CMPSD         CMPSB         szCmp     CmpStrLng
i=1        13 cyc        34 cyc        36 cyc        10 cyc        19 cyc
i=2        13 cyc        34 cyc        38 cyc        18 cyc        19 cyc
i=4        15 cyc        51 cyc        42 cyc        25 cyc        22 cyc
i=8        16 cyc        53 cyc        51 cyc        45 cyc        21 cyc
i=16       22 cyc        57 cyc        69 cyc       110 cyc        27 cyc
i=32       34 cyc        52 cyc       104 cyc       191 cyc        47 cyc
i=64       73 cyc        70 cyc       185 cyc       342 cyc        61 cyc
i=128     121 cyc       111 cyc       315 cyc       671 cyc        90 cyc
i=256     217 cyc       175 cyc       608 cyc      1315 cyc       155 cyc
i=512     417 cyc       316 cyc      1179 cyc      2602 cyc       283 cyc
i=1024    806 cyc       613 cyc      2291 cyc      5158 cyc       549 cyc
i=2048   1566 cyc      1175 cyc      4551 cyc     10295 cyc      1065 cyc
i=4096   3106 cyc      2308 cyc      9072 cyc     20558 cyc      2087 cyc
i=8192   6190 cyc      4562 cyc     18110 cyc     41091 cyc      4144 cyc