
Faster string compares

Started by Codewarp, July 18, 2005, 09:25:42 PM


Codewarp

The usual way to compare strings for equal or not-equal is to call memcmp(s1, s2, count) and test its result for zero or non-zero, respectively.  But when I look across the landscape of all the string comparisons I have witnessed in code, equal and not-equal seem to show up more often than all the other comparisons combined, i.e. greater-than, less-than, etc.  Comparison for exact match is an inherently simpler operation than a relational string comparison, so I decided to look at just how much faster string equality could be computed by itself.
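For reference, that usual pattern looks something like this (a trivial sketch; the function name is made up, just for illustration):

// the usual way: a full three-way memcmp( ), collapsed to equal/not-equal
#include <string.h>

int NamesMatch ( const char *s1, const char *s2, size_t count ) {
    return memcmp(s1, s2, count) == 0;      // ordering information is computed, then thrown away
}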

The result is the routine below, which outperforms my own MemCmp( ) by 2-to-1 and is well worth the effort.  I now use MemEq( ) in a large-scale C++ string library, used by multiple applications, in every context that had used MemCmp( ) for equality.  In particular, MemEq( ) is a core function called many times during string searches.  I have fine-tuned this method on an Athlon64, but there is nothing in it to give any CPU trouble--a P1 will run MemEq( ).

There is no doubt that you could use SSE instructions to compare vast strings for equality at ridiculous speeds, easily surpassing MemEq( ) at, say, 1000 bytes.  However, such attempts would be unwise, because the initial SSE setup cost could not be paid off for at least 60-100 bytes.  For a call that mostly operates on short strings, with mismatches mostly occurring within the first few bytes, an SSE approach would likely slow down many applications, because its long-string advantage would never be realized.  MemEq( ) is designed for fastest performance on 0 to 250 characters with favorable alignment, and respectable performance for all comparisons longer than that and at any alignment.
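To make the trade-off concrete, here is a rough sketch of the kind of SSE loop I mean--SSE2 intrinsics, 16 bytes per compare, with a scalar tail.  It is purely illustrative and not part of MemEq( ); the function name is made up:

// illustrative only: SSE2 equality check, 16 bytes per iteration
// the vector setup and the 0-15 byte scalar tail are the overheads that
// only pay off on long strings
#include <emmintrin.h>
#include <string.h>

int MemEqSSE2 ( const void *left, const void *right, size_t count ) {
    const unsigned char *a = (const unsigned char *)left;
    const unsigned char *b = (const unsigned char *)right;
    while (count >= 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
            return 0;                       // some byte differed in this 16-byte block
        a += 16;  b += 16;  count -= 16;
    }
    return memcmp(a, b, count) == 0;        // scalar tail for the last 0-15 bytes
}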

The code below is in VC++ form, but it maps easily to masm and no C-isms appear in the assembly code.  With that said--it works, it's fast, it computes a widely needed function, and I offer it up for anyone to use, comment on, benchmark or out-do...


// string compare optimized for exact match or not (no above, below, etc)
// compares a left string operand with a right, and returns 1/0 for equal/not equal
// fast (Athlon64) performance: strings of 1/5/20/64/256 bytes compare in 7/13/20/45/120 cycles

__declspec( naked ) int MemEq ( const void *left, const void *right, size_t count ) {
    _asm {
            mov     eax, [esp + 4]      ; eax = pointer to left operand
            mov     edx, [esp + 8]      ; edx = pointer to right operand
            mov     ecx, [esp + 12]     ; ecx = count
            cmp     ecx, 4              ; more than 4 bytes to compare?
            jg      fixalign            ; if so, handle alignment below; otherwise just compare the 1-4 bytes
            cmp     ecx, 1
            jl      same                ; equal if zero bytes or fewer bytes to compare
            neg     ecx                 ; ecx = bytes past the end (0-3 in bits 0,1)
            mov     eax, [eax]          ; compare one dword (i.e. 1-4 bytes of it)
            shl     ecx, 3              ; ecx = top-end bits to ignore (0, 8, 16 or 24 in low5 bits)
            xor     eax, [edx]          ; compare entire dword
            shl     eax, cl             ; lose unwanted upper bytes and test what's left
            sub     eax, 1              ; carry => equal
            sbb     eax, eax            ; eax = 0 or -1
            neg     eax                 ; return 1/0 for equal/not equal
            ret
               
            align   8
fixalign:   push    ebx                 ; save ebx and edi for use below
            push    edi
            mov     ebx, eax            ; ebx = pointer to left operand
            sub     edx, eax            ; edx = address of right operand relative to left operand
            jz      equal               ; return equal for equal addresses
            test    ebx, 3              ; jmp if ebx already aligned
            jz      next4

            mov     edi, ebx                ; compute leading bytes to ignore (1-3)
            and     ebx, not 3              ; chop ebx down to dword
            and     edi, 3                  ; edi = bytes to ignore
            mov     eax, [ebx]              ; compare left and right dwords
            lea     ecx, [edi * 8]          ; ecx = low-end bits to ignore (8, 16 or 24)
            xor     eax, [ebx + edx]        ; xor leaves comparison bits in eax
            add     ebx, 4                  ; advance dword address (for both operands)
            shr     eax, cl                 ; lose the low cl bits from eax (sign-extension ok here)
            mov     ecx, [esp + 20]         ; ecx = original count
            jnz     noteq                   ; if non-zero then not-equal
            lea     ecx, [ecx + edi - 4]    ; ecx = remaining bytes to compare

            align   8
next4:      test    ebx, 4                  ; check for the odd dword
            jz      next8                   ; jmp ahead if none
            mov     eax, [ebx]              ; compare left and right dwords
            xor     eax, [ebx + edx]        ; (don't use a cmp here, only use an xor)
            jne     tailchk4                ; jmp to tailchk when not equal
            add     ebx, 4                  ; now we have 8-byte alignment
            sub     ecx, 4
            jbe     equal                   ; equal, if the count runs out

            align   8
next8:      mov     eax, [ebx]              ; load 8 bytes at a time
            mov     edi, [ebx + 4]
            xor     eax, [ebx + edx]        ; (don't use a cmp here, only use an xor)
            jne     tailchk4                ; jmp to tailchk when not equal
            xor     edi, [ebx + edx + 4]
            jne     tailchk8                ; jmp to tailchk when not equal
            add     ebx, 8
            sub     ecx, 8
            ja      next8                   ; repeat while 1 or more bytes remain

equal:      pop     edi
            pop     ebx
same:       mov     eax, 1          ; return equal
            ret

noteq:      pop     edi
            pop     ebx
            mov     eax, 0          ; return not equal
            ret

            ; evaluate a not-equal result
tailchk8:   sub     ecx, 4          ; account for the first of two dwords
            jle     equal           ; equal if search already ended on the 1st dword match
            mov     eax, edi        ; move comparison bits to eax for examination
tailchk4:   sub     ecx, 4          ; account for the 2nd dword
            jge     noteq           ; not-equal if not off the end
            neg     ecx             ; ecx = bytes too far (1-3)
            shl     ecx, 3          ; ecx = number upper eax comparison bits to lose (8-16-24)
            pop     edi
            shl     eax, cl         ; lose unwanted comparison bytes and test what's left
            pop     ebx
            sub     eax, 1          ; carry => zeros => equality
            sbb     eax, eax        ; eax = 0 or -1
            neg     eax             ; return 1/0 for equal/not equal
            ret
    }
}
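For anyone who wants the contract spelled out before digging into the assembly, a byte-at-a-time reference with the same return convention would look like this (illustrative only--this is not how MemEq( ) works internally):

// reference semantics only: 1 if the first count bytes match, 0 otherwise
// MemEq( ) above computes the same answer a dword (and then 8 bytes) at a time
int MemEqRef ( const void *left, const void *right, size_t count ) {
    const unsigned char *a = (const unsigned char *)left;
    const unsigned char *b = (const unsigned char *)right;
    for (size_t i = 0; i < count; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}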


bozo

Looking at it, I can say that it would run faster without push/pop instructions; you could move these into the stack, perhaps?
Also, some instructions there do not "pair".
Other than that, it looks good.

Codewarp

Quote
Looking at it, I can say that it would run faster without push/pop instructions; you could move these into the stack, perhaps?
Also, some instructions there do not "pair".
Other than that, it looks good.

Points well taken, and I was able to remove the push/pop ecx pair from the initial misalignment handling.  I also tried your idea and replaced all the push/pops with mov [esp..] instructions.  However, everything ran 1 or 2 cycles longer, :boohoo: which seems to settle the push/pop versus mov/mov question--push/pop is faster.  I have updated the source code with these changes and others.

As far as pairing goes, I put no effort into P1-style pairing, but rather took every opportunity to minimize dependencies between adjacent instructions through careful ordering.  If there are any I missed that could be remedied for a net gain of at least 1 cycle, I would be grateful if you pointed them out.
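For anyone who wants to repeat the experiment, the substitution I timed was along these lines (a sketch only--the function names are made up, and only the prologue/epilogue is shown):

// the two register-preservation styles that were timed against each other;
// both spill to the stack, so thread safety is unaffected either way
__declspec( naked ) void SaveWithPushPop ( ) {
    _asm {
            push    ebx                 ; single-byte encodings
            push    edi
            ; ... routine body would go here ...
            pop     edi
            pop     ebx
            ret
    }
}

__declspec( naked ) void SaveWithMov ( ) {
    _asm {
            sub     esp, 8              ; reserve the same 8 bytes explicitly
            mov     [esp], ebx
            mov     [esp + 4], edi
            ; ... routine body would go here ...
            mov     edi, [esp + 4]
            mov     ebx, [esp]
            add     esp, 8
            ret
    }
}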

zooba

PUSH and especially POP were slow on older CPUs, but the latency of EA calculations on newer ones makes them more useful.  PUSHing DWORD registers has always been relatively quick.

bozo

I also noticed that it took 1-2 cycles longer.
But I'm positive that where there are many push instructions, say at least 4-5 or more in consecutive order,
a mov is faster, but only if using the DATA section:


   push eax
   push ebx
   push ecx
   push edx
   
   pop edx
   pop ecx
   pop ebx
   pop eax


The following seemed to run a little faster for me, and it can amount to a large speed
gain over at least millions of calls ;-) one after the other.


   mov [temp_eax], eax
   mov [temp_ebx], ebx
   mov [temp_ecx], ecx
   mov [temp_edx], edx

   mov eax, [temp_eax]
   mov ebx, [temp_ebx]
   mov ecx, [temp_ecx]
   mov edx, [temp_edx]


Although, I don't know if it would speed up your routine at all, because there are only
1 or 2 push instructions together.

I also want to stress that my tests were on a PII 350, and as zooba has said, this speed gain
may only apply to the older Pentiums, so it may take longer on your AMD Athlon 64.

Codewarp

Kernel---

You can't be serious--would you really :eek just start storing registers to global memory to escape a couple push/pops, damning my routine to the single-threaded library, forever doomed to a life of mutual exclusion?  And besides, these are single-byte instructions here--that you want to turn into 5-6 byte instructions.  No, the bottom line here is that MemEq( ) has to preserve some registers to use them, so I pay the piper and push/pop I must.
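The hazard with a fixed memory slot, in one picture (hypothetical names, nothing to do with the routine itself):

// why a fixed spill location is off the table: it is shared by every thread
int saved_reg;                          // one copy for the whole process

void NotThreadSafe ( int value ) {
    saved_reg = value;                  // thread B can store here...
    // ... do some work ...
    value = saved_reg;                  // ...before thread A reloads it, so A gets back B's value
}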

However, there is a shorter code sequence in the alignment handler of MemEq( )--can you see it, or is there still something I have missed about push/pops?  Can you find Waldo? :lol

bozo

I know that replacing push/pop with a mov instruction is all but pointless, given that your routine is not some time-critical
component.

The point I was making is that with a critical piece of code, called many times, it would be worth replacing
multiple push/pops with mov instructions, because, yes, it was faster for my program.

When I was writing an NT password cracker in assembly, the des_set_key routine used push/pop on entry/exit,
as did des_encrypt and str_to_key, so I always tried to optimise the code, and to my surprise, a mov
into the data section really sped it up.

When push/pop were replaced with mov, there was a gain of 30%; there were at least 4-5 on entry/exit for each routine.

You have to remember that these routines were called billions of times, so if I gained a 1-second speed-up,
over time that would be 1 hour saved, and so on.

You might think that saving just 1 or 2 cycles doesn't make a difference, but it bloody well does when the code is
called billions of times... that's all I'm saying!!

We're going off topic now, and you are ranting a bit, man ;)


Codewarp

Kernel --

Sorry, but I was a little amused at my two little pushes becoming the object of so much attention, when I thought I had provided some juicy material for discussion.  But let me address your comments:

(1)  MemEq( ) is a time-critical routine, easily called millions of times, and 1 cycle very much matters here.  Among its many uses, MemEq( ) is called in the inner loop of a pattern-matching algorithm, after every partial match, to complete the match (see the sketch after point (3) below).  MemEq( ) is also a replacement for memcmp( ) in the C runtime library when only equality testing is needed.  This is one of the heavy-lifting functions, the very definition of time-critical.

(2)  I believe you, and have no doubt that your code has benefitted from such changes, and that mine, in a different context, would as well.  But I tried your suggestion without judgement in my code, and it ran slower, so I took it out.

(3)  Every cycle counts in time-critical code like this, and in yours.  But MemEq( ) is a thread-safe routine that is callable from multiple threads by design.  How many cycles saved would be worth sacrificing the intended design?  MemEq( ) thread safety is non-negotiable; therefore, no global memory stores.
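As a hypothetical illustration of the pattern-matching use mentioned in (1)--the names are made up, only MemEq( ) is real--the caller scans for a first-byte hit and lets MemEq( ) confirm or reject the rest of the candidate:

// illustrative caller: naive substring search that uses MemEq( ) to
// complete every partial (first-byte) match
const char* FindPattern ( const char *hay, size_t haylen,
                          const char *pat, size_t patlen ) {
    if (patlen == 0 || patlen > haylen)
        return 0;
    for (size_t i = 0; i + patlen <= haylen; i++)
        if (hay[i] == pat[0] && MemEq(hay + i, pat, patlen))
            return hay + i;
    return 0;
}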

All that said, I do appreciate your focus and comments on this topic.  Please forgive my "where's Waldo" comment, but it seemed so appropriate to the code optimization context.

bozo

:) It's cool, Codewarp, I wasn't offended.

I got 118 cycles comparing 256 bytes of code; I chose 260 because this was MAX_SIZE for ANSI
on Windows.
It wasn't faster than your routine with anything smaller.

Webring

Nice, Codewarp, thanks for sharing.  I think we should have a string-compare speed comparison thread like we had for string len; I'd be interested to see which is fastest.