Hi,
I've played around with some kind of string comparison including "szcmp" ,a DWORD-algo, REPE CMPSD and REPE CMPSB.
The result shows, that the DWORD-Algo seems to be the fastest, if mismatching doesn't occur at first byte. CMPSx obviously needs some extra cycles at startup but runs with constant additional amount of time depending on mismatch position.
Its remarkable, that each algo has its own characteristic timing. szcmp has a "time gap" between i=8 and i=9. Same for DWORD-algo between i=35 and i=36. Timing of CMPSD changes after QWORD. Any explanations?
I've attached a test scenario with all sources . The szcmp-routine is sligthly modified because of comparison reasons. Also added is a "write"-version of szMid. Run makeit.bat , test.exe.
Any improvement would be appreciated ;-)
Regards,
Ulli
Running on AMD64 FX-57
Comparing strings
Mism DWORD CMPSD CMPSB szCmp
i=0 7 cyc 35 cyc 34 cyc 6 cyc
i=1 11 cyc 35 cyc 36 cyc 10 cyc
i=2 11 cyc 35 cyc 38 cyc 18 cyc
i=3 11 cyc 35 cyc 40 cyc 20 cyc
i=4 14 cyc 40 cyc 42 cyc 25 cyc
i=5 14 cyc 40 cyc 45 cyc 30 cyc
i=6 14 cyc 40 cyc 48 cyc 35 cyc
i=7 14 cyc 40 cyc 49 cyc 40 cyc
i=8 16 cyc 40 cyc 51 cyc 45 cyc
i=9 16 cyc 40 cyc 53 cyc 66 cyc
i=10 16 cyc 40 cyc 56 cyc 72 cyc
i=11 16 cyc 40 cyc 58 cyc 75 cyc
i=12 19 cyc 45 cyc 61 cyc 83 cyc
i=13 19 cyc 45 cyc 62 cyc 85 cyc
i=14 19 cyc 45 cyc 64 cyc 92 cyc
i=15 19 cyc 45 cyc 67 cyc 96 cyc
i=16 23 cyc 45 cyc 69 cyc 102 cyc
i=17 23 cyc 45 cyc 72 cyc 106 cyc
i=18 23 cyc 45 cyc 73 cyc 113 cyc
i=19 23 cyc 45 cyc 75 cyc 116 cyc
i=20 27 cyc 51 cyc 78 cyc 123 cyc
i=21 27 cyc 51 cyc 81 cyc 126 cyc
i=22 27 cyc 51 cyc 82 cyc 133 cyc
i=23 27 cyc 51 cyc 85 cyc 136 cyc
i=24 29 cyc 51 cyc 87 cyc 143 cyc
i=25 28 cyc 51 cyc 90 cyc 146 cyc
i=26 29 cyc 51 cyc 91 cyc 153 cyc
i=27 29 cyc 51 cyc 93 cyc 156 cyc
i=28 31 cyc 56 cyc 95 cyc 163 cyc
i=29 31 cyc 56 cyc 97 cyc 166 cyc
i=30 31 cyc 56 cyc 100 cyc 173 cyc
i=31 31 cyc 56 cyc 102 cyc 176 cyc
i=32 34 cyc 57 cyc 104 cyc 183 cyc
i=33 34 cyc 56 cyc 107 cyc 186 cyc
i=34 34 cyc 56 cyc 109 cyc 193 cyc
i=35 34 cyc 56 cyc 112 cyc 196 cyc
i=36 54 cyc 61 cyc 114 cyc 203 cyc
i=37 54 cyc 61 cyc 116 cyc 206 cyc
i=38 55 cyc 62 cyc 118 cyc 213 cyc
i=39 54 cyc 61 cyc 120 cyc 216 cyc
i=40 57 cyc 61 cyc 123 cyc 223 cyc
Press enter to exit...
[attachment deleted by admin]
QuoteRunning on Intel(R) pentium(R) 1.4GHz
Comparing strings
Mismatch DWORD-wise CMPSD CMPSB szCmp
i=0 9 cyc 91 cyc 75 cyc 8 cyc
i=1 42 cyc 91 cyc 80 cyc 10 cyc
i=2 33 cyc 91 cyc 86 cyc 13 cyc
i=3 33 cyc 91 cyc 91 cyc 16 cyc
i=4 26 cyc 93 cyc 95 cyc 20 cyc
i=5 26 cyc 93 cyc 99 cyc 23 cyc
i=6 44 cyc 93 cyc 103 cyc 26 cyc
i=7 26 cyc 93 cyc 107 cyc 28 cyc
i=8 42 cyc 98 cyc 111 cyc 31 cyc
i=9 29 cyc 98 cyc 115 cyc 41 cyc
i=10 42 cyc 98 cyc 119 cyc 41 cyc
i=11 28 cyc 98 cyc 123 cyc 46 cyc
i=12 48 cyc 102 cyc 127 cyc 47 cyc
i=13 48 cyc 102 cyc 132 cyc 53 cyc
i=14 48 cyc 102 cyc 137 cyc 50 cyc
i=15 48 cyc 102 cyc 143 cyc 54 cyc
i=16 51 cyc 108 cyc 148 cyc 57 cyc
i=17 51 cyc 108 cyc 153 cyc 65 cyc
i=18 51 cyc 108 cyc 157 cyc 67 cyc
i=19 51 cyc 108 cyc 161 cyc 64 cyc
i=20 63 cyc 112 cyc 166 cyc 69 cyc
i=21 63 cyc 112 cyc 171 cyc 71 cyc
i=22 63 cyc 112 cyc 175 cyc 86 cyc
i=23 63 cyc 112 cyc 180 cyc 75 cyc
i=24 60 cyc 118 cyc 184 cyc 95 cyc
i=25 40 cyc 118 cyc 189 cyc 94 cyc
i=26 40 cyc 118 cyc 194 cyc 89 cyc
i=27 40 cyc 118 cyc 197 cyc 89 cyc
i=28 44 cyc 119 cyc 203 cyc 90 cyc
i=29 43 cyc 119 cyc 206 cyc 95 cyc
i=30 43 cyc 119 cyc 211 cyc 96 cyc
i=31 44 cyc 119 cyc 216 cyc 104 cyc
i=32 53 cyc 128 cyc 221 cyc 115 cyc
i=33 47 cyc 128 cyc 226 cyc 125 cyc
i=34 47 cyc 128 cyc 230 cyc 108 cyc
i=35 47 cyc 128 cyc 233 cyc 111 cyc
i=36 51 cyc 130 cyc 237 cyc 114 cyc
i=37 51 cyc 130 cyc 241 cyc 117 cyc
i=38 51 cyc 130 cyc 246 cyc 120 cyc
i=39 51 cyc 130 cyc 249 cyc 123 cyc
i=40 54 cyc 134 cyc 253 cyc 127 cyc
Press enter to exit...
Here are the results from my 2.8gig Northwood core PIV.
Comparing strings
Mismatch DWORD-wise CMPSD CMPSB szCmp
i=0 7 cyc 197 cyc 125 cyc 7 cyc
i=1 43 cyc 197 cyc 128 cyc 13 cyc
i=2 42 cyc 197 cyc 132 cyc 12 cyc
i=3 41 cyc 197 cyc 136 cyc 17 cyc
i=4 22 cyc 164 cyc 141 cyc 21 cyc
i=5 21 cyc 164 cyc 145 cyc 24 cyc
i=6 28 cyc 164 cyc 150 cyc 28 cyc
i=7 22 cyc 163 cyc 153 cyc 33 cyc
i=8 27 cyc 168 cyc 158 cyc 33 cyc
i=9 27 cyc 168 cyc 162 cyc 50 cyc
i=10 28 cyc 169 cyc 166 cyc 66 cyc
i=11 26 cyc 168 cyc 170 cyc 66 cyc
i=12 32 cyc 172 cyc 174 cyc 70 cyc
i=13 31 cyc 172 cyc 178 cyc 73 cyc
i=14 32 cyc 172 cyc 182 cyc 76 cyc
i=15 32 cyc 172 cyc 187 cyc 79 cyc
i=16 37 cyc 177 cyc 191 cyc 83 cyc
i=17 36 cyc 177 cyc 195 cyc 87 cyc
i=18 36 cyc 176 cyc 199 cyc 88 cyc
i=19 38 cyc 176 cyc 204 cyc 96 cyc
i=20 40 cyc 180 cyc 207 cyc 100 cyc
i=21 40 cyc 180 cyc 211 cyc 100 cyc
i=22 41 cyc 180 cyc 215 cyc 101 cyc
i=23 40 cyc 181 cyc 221 cyc 104 cyc
i=24 53 cyc 185 cyc 223 cyc 108 cyc
i=25 52 cyc 184 cyc 228 cyc 113 cyc
i=26 52 cyc 186 cyc 232 cyc 114 cyc
i=27 52 cyc 184 cyc 237 cyc 118 cyc
i=28 58 cyc 188 cyc 240 cyc 122 cyc
i=29 58 cyc 188 cyc 245 cyc 123 cyc
i=30 57 cyc 189 cyc 248 cyc 126 cyc
i=31 58 cyc 189 cyc 253 cyc 129 cyc
i=32 75 cyc 193 cyc 258 cyc 132 cyc
i=33 74 cyc 192 cyc 262 cyc 135 cyc
i=34 75 cyc 192 cyc 267 cyc 140 cyc
i=35 80 cyc 193 cyc 270 cyc 155 cyc
i=36 131 cyc 197 cyc 274 cyc 147 cyc
i=37 143 cyc 197 cyc 278 cyc 148 cyc
i=38 128 cyc 197 cyc 282 cyc 160 cyc
i=39 138 cyc 197 cyc 287 cyc 157 cyc
i=40 136 cyc 201 cyc 290 cyc 157 cyc
Press enter to exit...
I noticed that you are using the older version of szCmp so I changed it to the version in the last service pack and got these timings.
Comparing strings
Mismatch DWORD-wise CMPSD CMPSB szCmp
i=0 8 cyc 197 cyc 124 cyc 2 cyc
i=1 42 cyc 198 cyc 129 cyc 3 cyc
i=2 45 cyc 197 cyc 133 cyc 9 cyc
i=3 44 cyc 197 cyc 137 cyc 11 cyc
i=4 23 cyc 164 cyc 140 cyc 16 cyc
i=5 22 cyc 164 cyc 145 cyc 16 cyc
i=6 22 cyc 164 cyc 150 cyc 20 cyc
i=7 23 cyc 164 cyc 153 cyc 21 cyc
i=8 27 cyc 168 cyc 158 cyc 29 cyc
i=9 28 cyc 168 cyc 161 cyc 31 cyc
i=10 27 cyc 168 cyc 165 cyc 33 cyc
i=11 28 cyc 168 cyc 169 cyc 37 cyc
i=12 32 cyc 172 cyc 174 cyc 62 cyc
i=13 32 cyc 172 cyc 178 cyc 53 cyc
i=14 31 cyc 172 cyc 182 cyc 63 cyc
i=15 32 cyc 172 cyc 188 cyc 76 cyc
i=16 36 cyc 176 cyc 190 cyc 73 cyc
i=17 36 cyc 176 cyc 195 cyc 77 cyc
i=18 37 cyc 176 cyc 199 cyc 84 cyc
i=19 37 cyc 176 cyc 204 cyc 86 cyc
i=20 40 cyc 180 cyc 207 cyc 90 cyc
i=21 39 cyc 180 cyc 211 cyc 88 cyc
i=22 40 cyc 181 cyc 215 cyc 96 cyc
i=23 39 cyc 180 cyc 220 cyc 101 cyc
i=24 52 cyc 185 cyc 223 cyc 95 cyc
i=25 53 cyc 184 cyc 228 cyc 100 cyc
i=26 52 cyc 185 cyc 232 cyc 103 cyc
i=27 52 cyc 184 cyc 236 cyc 106 cyc
i=28 58 cyc 188 cyc 241 cyc 108 cyc
i=29 58 cyc 188 cyc 244 cyc 113 cyc
i=30 58 cyc 188 cyc 249 cyc 114 cyc
i=31 58 cyc 189 cyc 253 cyc 126 cyc
i=32 74 cyc 193 cyc 257 cyc 121 cyc
i=33 72 cyc 192 cyc 261 cyc 119 cyc
i=34 87 cyc 192 cyc 266 cyc 122 cyc
i=35 82 cyc 193 cyc 270 cyc 128 cyc
i=36 137 cyc 197 cyc 274 cyc 130 cyc
i=37 136 cyc 197 cyc 278 cyc 132 cyc
i=38 139 cyc 197 cyc 282 cyc 140 cyc
i=39 126 cyc 197 cyc 286 cyc 134 cyc
i=40 134 cyc 201 cyc 290 cyc 154 cyc
Press enter to exit...
Hello Hutch,
after installing servicepack2, I've got some nice results:
the new szCmp shows the same asymptotic behavior as CMPSB but without extra startup penalties.
In case of mismatching at a high position CMPSD seems the best choice ?
You've got to know, that string comparison is an integral part of most of our software. We daily sort, merge and process huge files of address data (up to 30GB, 120mio Recs, flat files ;-)). In our environment(COBOL, host) zero-strings are unusual , so functions must accept a defined length. In addition comparison must result mostly in a "gt/eql/lt" not only "eq/ne". So some modifications are required to szCmp.
Please, can you explain the difference between the two szCmp versions and why the new one is faster?
Is it possible to improve the DWORD-algorithm that way?
Thank you
Regards,
Ulli
Comparing strings
Mismatch DWORD CMPSD CMPSB szCmp_o szCmp_n
i=0 7 cyc 34 cyc 34 cyc 6 cyc 6 cyc
i=1 13 cyc 34 cyc 36 cyc 10 cyc 10 cyc
i=2 14 cyc 34 cyc 38 cyc 18 cyc 12 cyc
i=3 13 cyc 34 cyc 40 cyc 20 cyc 14 cyc
i=4 15 cyc 36 cyc 42 cyc 25 cyc 18 cyc
i=5 15 cyc 36 cyc 45 cyc 30 cyc 21 cyc
i=6 15 cyc 36 cyc 47 cyc 35 cyc 21 cyc
i=7 15 cyc 36 cyc 49 cyc 40 cyc 27 cyc
i=8 16 cyc 38 cyc 52 cyc 45 cyc 29 cyc
i=9 16 cyc 38 cyc 53 cyc 65 cyc 31 cyc
....
i=100 101 cyc 91 cyc 256 cyc 526 cyc 259 cyc
i=101 101 cyc 91 cyc 257 cyc 528 cyc 260 cyc
i=102 100 cyc 90 cyc 259 cyc 535 cyc 262 cyc
i=103 101 cyc 90 cyc 261 cyc 538 cyc 259 cyc
i=104 103 cyc 92 cyc 263 cyc 545 cyc 262 cyc
i=105 103 cyc 92 cyc 266 cyc 548 cyc 269 cyc
i=106 103 cyc 92 cyc 268 cyc 555 cyc 272 cyc
i=107 103 cyc 92 cyc 270 cyc 558 cyc 268 cyc
Ulli,
A couple of things, you need to test the DWORD algo on different data alignments as in most instances non aligned by 4 data in either string will drop the timings. The other is you need to calculate the time to test the length of the string as you pass it as a known length where the zero terminated types must check for the end.
I am in the middle of rebuilding the roof on my house at the moment so i will be busy for the next week or so but having a quick look at your DWORD algo you can get it slightly faster on low counts by removing the stack frame. This gives you an extra register EBP as well to work with as well. A DWORD algo on a long run if its aligned properly should be about 2 and a half times faster that a byte algo so there is probably room to get your sped up a bit.
Ulli,
Here is a test piece I whizzed up at about 3 am before going back to sleep. Its an aligned DWORD algo that runs at over 3 times faster on my dev PIV. Its not fully tested but seems to be working OK at the moment. It only returns a boolean value, 0 or 1 and not the greater, same or less than that you need for a string sort algo but it should give you some idea of the speed difference of a full DWORD algo if the data is aligned by 4 or greater.
[attachment deleted by admin]
hutch,
thank you for your efforts.
I've modified the code slightly to fit our requirements and have tested on different machines. Compared to szCmp your dcmp-algo is first choice. The difference between these algos increases in favour of dcmp depending on the length of the matching run. The slower the machine, the more obvious. There are some outlier when mismatching at position > 32 (on AMD64) or > 16 (Athlon32). Prefetching?
Compared to DWORD-algo there's no significant difference until mismatch occurs at position 32(or 16). But when comparing longer (matching) strings dcmp is faster.
Combined with the precheck of the first char and the change to result in (-1, 0, +1), the dcmp is the recommendation for string comparison. ;-)
Thank you once again.
Regards,
Ulli
[attachment deleted by admin]
A sugestion:
StrCmpLng PROC uses edi esi str01:DWORD, str02:DWORD, strln:DWORD
mov esi, str01
mov edi, str02
mov ecx, strln
mov edx, ecx
and ecx, 3
je @F
shl ecx, 2
mov eax, [esi]
sub eax, [edi]
shl eax, cl
jne diff
@@:
add esi, edx
add edi, edx
shr edx, 2
jz equl
neg edx
xor ecx, ecx
align 16
@@:
mov eax, [esi+edx*4]
sub eax, [edi+edx*4]
cmp ecx, eax
inc edx
ja @B
jz equl
diff:
test eax, eax
setg al
movzx eax, al
shl eax, 1
dec eax
equl:
ret
StrCmpLng ENDP
Compare the results for 1000+ char strings with other methods ;)
EduardoS,
Your algo does not preserve ESI and EDI, it could only be called by a procedure that has already preserved the two registers.
hutch, i didn't use the PROLOGUE:none...
StrCmpLng PROC uses edi esi str01:DWORD, str02:DWORD, strln:DWORD
:bg
EduardoS,
My eyesight is failing in my old age, I jus did not see the USES syntax.
Hi Eduardo,
your algo seems slightly(not significant) faster than dcmp on long runs. But can you interpret the results of the following comparisons?
EAX
50396925 StrCmpLng : ABCD vs. DCBA -> LESS
-50396925 StrCmpLng : DCBA vs. ABCD -> GREATER
0 StrCmpLng : ABCD vs. ABCD -> EQUAL
-1 StrCmpLng : ABCD vs. BBCD -> LESS
1 StrCmpLng : BBCD vs. ABCD -> GREATER
IMO you've got to add a BWAP to the regs to be compared and also a tail processing routine if strlen != 0mod4.
Regards,
Ulli
I'm sorry for those bugs... Here the version properly debbuged:
StrCmpLng PROC uses edi esi str01:DWORD, str02:DWORD, strln:DWORD
mov esi, str01
mov edi, str02
mov ecx, strln
neg ecx
mov edx, ecx
and ecx, 3
je @F
shl ecx, 3
mov eax, [esi]
sub eax, [edi]
shl eax, cl
mov ecx, -1
jne diff
@@:
sub esi, edx
sub edi, edx
sar edx, 2
sub edx, ecx
jz equl
xor ecx, ecx
align 16
@@:
mov eax, [esi+edx*4]
sub eax, [edi+edx*4]
cmp ecx, eax
inc edx
ja @B
jnc equl
diff:
not ecx
add edx, ecx
and edx, ecx
mov ecx, [esi+edx*4]
mov edx, [edi+edx*4]
bswap ecx
bswap edx
xor eax, eax
cmp edx, ecx
adc eax, 0
shl eax, 1
dec eax
equl:
ret
StrCmpLng ENDP
On my A64 3200+:
Comparing strings
Mismatch DWORD-wise CMPSD CMPSB szCmp CmpStrLng
i=1 13 cyc 34 cyc 36 cyc 10 cyc 19 cyc
i=2 13 cyc 34 cyc 38 cyc 18 cyc 19 cyc
i=4 15 cyc 51 cyc 42 cyc 25 cyc 22 cyc
i=8 16 cyc 53 cyc 51 cyc 45 cyc 21 cyc
i=16 22 cyc 57 cyc 69 cyc 110 cyc 27 cyc
i=32 34 cyc 52 cyc 104 cyc 191 cyc 47 cyc
i=64 73 cyc 70 cyc 185 cyc 342 cyc 61 cyc
i=128 121 cyc 111 cyc 315 cyc 671 cyc 90 cyc
i=256 217 cyc 175 cyc 608 cyc 1315 cyc 155 cyc
i=512 417 cyc 316 cyc 1179 cyc 2602 cyc 283 cyc
i=1024 806 cyc 613 cyc 2291 cyc 5158 cyc 549 cyc
i=2048 1566 cyc 1175 cyc 4551 cyc 10295 cyc 1065 cyc
i=4096 3106 cyc 2308 cyc 9072 cyc 20558 cyc 2087 cyc
i=8192 6190 cyc 4562 cyc 18110 cyc 41091 cyc 4144 cyc