The usual way to compare strings for equality or inequality is to call memcmp(s1, s2, count) and test its result against zero. But looking across the landscape of all the string comparisons I have witnessed in code, equal and not-equal seem to show up more often than all the other comparisons (greater-than, less-than, etc.) combined. Comparison for exact match is an inherently simpler operation than a relational string comparison, so I decided to look at just how much faster string equality could be computed by itself.
The result is the routine below, which outperforms my own MemCmp( ) by 2-to-1 and is well worth the effort. I now use MemEq( ) in a large-scale C++ string library, used by multiple applications, in every context that had used MemCmp( ) for equality. In particular, MemEq( ) is a core function called many times during string searches. I have fine-tuned this method on an Athlon64, but there is nothing in it to give any CPU trouble--even a P1 will run MemEq( ).
There is no doubt that you could use SSE instructions to compare vast strings for equality at ridiculous speeds, easily surpassing MemEq( ) at, say, 1000 bytes. However, such an attempt would be unwise, because the initial SSE setup cost cannot be paid off in under 60-100 bytes. For a call that mostly operates on short strings, with mismatches mostly occurring within the first few bytes, an SSE approach would likely slow down many applications, because its long-string advantage would rarely be realized. MemEq( ) is designed for fastest performance on 0 to 250 characters with favorable alignment, and respectable performance for all longer comparisons at any alignment.
The code below is in VC++ inline-assembly form, but it maps easily to MASM and no C-isms appear in the assembly itself. With that said--it works, it's fast, it computes a widely needed function, and I offer it up for anyone to use, comment on, benchmark or out-do...
// string compare optimized for exact match or not (no above, below, etc)
// compares a left string operand with a right, and returns 1/0 for equal/not equal
// fast (Athlon64) performance: strings of 1/5/20/64/256 bytes compare in 7/13/20/45/120 cycles
__declspec( naked ) int MemEq ( const void *left, const void *right, size_t count ) {
_asm {
mov eax, [esp + 4] ; eax = pointer to left operand
mov edx, [esp + 8] ; edx = pointer to right operand
mov ecx, [esp + 12] ; ecx = count
cmp ecx, 4 ; jmp if more than 4 bytes to compare
jg fixalign ; otherwise forget about alignment and just compare the 1-4 bytes
cmp ecx, 1
jl same ; equal if zero (or fewer) bytes to compare
neg ecx ; ecx = bytes past the end (0-3 in bits 0,1)
mov eax, [eax] ; load left dword (only 1-4 bytes of it matter; note: reads a full dword even when count < 4)
shl ecx, 3 ; ecx = top-end bits to ignore (0, 8, 16 or 24 in low5 bits)
xor eax, [edx] ; compare entire dword
shl eax, cl ; lose unwanted upper bytes and test what's left
sub eax, 1 ; carry => equal
sbb eax, eax ; eax = 0 or -1
neg eax ; return 1/0 for equal/not equal
ret
align 8
fixalign: push ebx ; save ebx and edi for use below
push edi
mov ebx, eax ; ebx = points to left operand
sub edx, eax ; edx = address of right operand relative to left operand
jz equal ; return equal for equal addresses
test ebx, 3 ; jmp if ebx already aligned
jz next4
mov edi, ebx ; compute leading bytes to ignore (1-3)
and ebx, not 3 ; chop ebx down to dword
and edi, 3 ; edi = bytes to ignore
mov eax, [ebx] ; compare left and right dwords
lea ecx, [edi * 8] ; ecx = low-end bits to ignore (8, 16 or 24)
xor eax, [ebx + edx] ; xor leaves comparison bits in eax
add ebx, 4 ; advance dword address (for both operands)
shr eax, cl ; lose the low cl bits from eax (sign-extension ok here)
mov ecx, [esp + 20] ; ecx = original count
jnz noteq ; if non-zero then not-equal
lea ecx, [ecx + edi - 4] ; ecx = remaining bytes to compare
align 8
next4: test ebx, 4 ; check for the odd dword
jz next8 ; jmp ahead if none
mov eax, [ebx] ; compare left and right dwords
xor eax, [ebx + edx] ; (don't use a cmp here, only use an xor)
jne tailchk4 ; jmp to tailchk when not equal
add ebx, 4 ; now we have 8-byte alignment
sub ecx, 4
jbe equal ; equal, if the count runs out
align 8
next8: mov eax, [ebx] ; load 8 bytes at a time
mov edi, [ebx + 4]
xor eax, [ebx + edx] ; (don't use a cmp here, only use an xor)
jne tailchk4 ; jmp to tailchk when not equal
xor edi, [ebx + edx + 4]
jne tailchk8 ; jmp to tailchk when not equal
add ebx, 8
sub ecx, 8
ja next8 ; repeat while 1 or more bytes remain
equal: pop edi
pop ebx
same: mov eax, 1 ; return equal
ret
noteq: pop edi
pop ebx
mov eax, 0 ; return not equal
ret
; evaluate a not-equal result
tailchk8: sub ecx, 4 ; account for the first of two dwords
jle equal ; equal if search already ended on the 1st dword match
mov eax, edi ; move comparison bits to eax for examination
tailchk4: sub ecx, 4 ; account for the 2nd dword
jge noteq ; not-equal if not off the end
neg ecx ; ecx = bytes too far (1-3)
shl ecx, 3 ; ecx = number upper eax comparison bits to lose (8-16-24)
pop edi
shl eax, cl ; lose unwanted comparison bytes and test what's left
pop ebx
sub eax, 1 ; carry => zeros => equality
sbb eax, eax ; eax = 0 or -1
neg eax ; return 1/0 for equal/not equal
ret
}
}
Posted by: Kernel_Gaddafi
Looking at it, I can say that it would run faster without the push/pop instructions--you could replace these with mov's to the stack, perhaps?
Also, some instructions there do not "pair".
Other than that, it looks good.
Points well taken, and I was able to rid the initial misalignment handling of its push/pop ecx pair. I also tried your idea and replaced all the push/pops with mov [esp+...] instructions. However, everything ran 1 or 2 cycles longer, :boohoo: which seems to settle the push/pop versus mov/mov question--push/pop is faster. I have updated the source code with these changes and others.
As far as pairing goes, I put no effort into pairing a la the P1, but rather took every opportunity to minimize dependencies between adjacent instructions through careful ordering. If there are any I missed that could be remedied for a net gain of at least 1 cycle, I would be grateful if you pointed them out.
PUSH and especially POP were slow on older CPUs, but the latency of EA calculations on newer ones makes push/pop more attractive again. PUSHing dword registers has always been relatively quick.
I noticed also that it took 1-2 cycles longer.
But I'm positive that where there are many push instructions--say, at least 4-5 in consecutive order--a mov is faster, but only when storing to the data section:
push eax
push ebx
push ecx
push edx
pop edx
pop ecx
pop ebx
pop eax
The following seemed to run a little faster for me, and it can add up to a large speed gain over at least millions of calls ;-) one after the other.
mov [temp_eax], eax
mov [temp_ebx], ebx
mov [temp_ecx], ecx
mov [temp_edx], edx
mov eax, [temp_eax]
mov ebx, [temp_ebx]
mov ecx, [temp_ecx]
mov edx, [temp_edx]
Although, I don't know if it would speed up your routine at all, because there are only 1 or 2 push instructions together.
I also want to stress that my tests were on a PII 350, and as zooba has said, this speed gain may only apply to the older Pentiums--so it may actually take longer on your AMD Athlon 64.
Kernel---
You can't be serious--would you really :eek just start storing registers to global memory to escape a couple of push/pops, damning my routine to a single-threaded library, forever doomed to a life of mutual exclusion? Besides, these are single-byte instructions that you want to turn into 5-6 byte instructions. No, the bottom line is that MemEq( ) has to preserve some registers in order to use them, so I pay the piper and push/pop I must.
However, there is a shorter code sequence hiding in the alignment handler of MemEq( )--can you see it, or is there still something I have missed about push/pops? Can you find Waldo? :lol
I know that replacing push/pop with mov instructions is all but pointless when a routine is not some time-critical component.
The point I was making is that in a critical piece of code, called many times, it would be worth replacing multiple push/pops with mov instructions--because, yes, it was faster for my program.
When I was writing an NT password cracker in assembly, the des_set_key routine used push/pop on entry/exit, as did des_encrypt and str_to_key, so I always tried to optimise that code, and to my surprise, a mov into the data section really sped it up.
When push/pop were replaced with mov, there was a gain of 30%; there were at least 4-5 of them on entry/exit for each routine.
You have to remember that these routines were called billions of times, so if I gained a 1-second speed-up, over time that became an hour saved, and so on.
You may think that saving 1 or 2 cycles doesn't make a difference, but it bloody well does when the code is called billions of times... that's all I'm saying!!
We're going off-topic now, and you are ranting a bit, man ;)
Kernel --
Sorry, but I was a little amused at my two little pushes becoming the object of so much attention, when I thought I had provided some juicy material for discussion. But let me address your comments:
(1) MemEq( ) is a time-critical routine, easily called millions of times, and 1 cycle very much matters here. Among its many uses, MemEq( ) is called in the inner loop of a pattern-matching algorithm, after every partial match, to complete the match. MemEq( ) is also a replacement for memcmp( ) in the C runtime library when only equality testing is needed. This is one of the heavy-lifting functions--the very definition of time-critical.
(2) I believe you, and have no doubt that your code has benefited from such changes, and that mine, in a different context, might as well. But I tried your suggestion in my code without prejudging it, and it ran slower, so I took it out.
(3) Every cycle counts in time-critical code like this, and in yours. But MemEq( ) is by design a thread-safe routine, callable from multiple threads. How many saved cycles are worth sacrificing the intended design? MemEq( )'s thread safety is non-negotiable; therefore, no global memory stores.
All that said, I do appreciate your focus and comments on this topic. Please forgive my "where's Waldo" comment, but it seemed so appropriate to the code optimization context.
:) its cool, Codewarp, i wasn't offended.
I got 118 cycles comparing 256 bytes of code; I chose 260 as this was MAX_SIZE for ANSI on Windows.
It wasn't faster than your routine for anything smaller.
Nice, Codewarp--thanks for sharing. I think we should have a string-compare speed-comparison thread like we had for string len; I'd be interested to see the fastest.