News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Which is faster?

Started by Neil, May 01, 2009, 10:56:52 AM

Previous topic - Next topic

jj2007

Quote from: Jimg on May 02, 2009, 04:40:15 PM
Also, in the real world, (lodsb) is slightly slower then (movzx eax,[esi]/add esi,1), however, also in the real world, and how the code is usually used, the difference is insignificant because the instructions end up in a non-optimal alignment, or are affected by the preceding and following instructions executing simultaneously, or any of several other variables.


Jim,

Out of boredom, I just replaced a movzx eax,[esi] with a lodsb - and it runs over 20 cycles faster. It's not even a speed-critical loop, but it consistently makes a difference. See UseLodsb switch in new attachment here.

   push esi
   lea ebx, [esi+ecx-16]            ; Mainstring (only ebx is free)
   mov esi, [esp+6*4+3*4+4]      ; lpPattern
   add ebx, ebp
   if UseLodsb
      dec esi
   endif
@@:
   if UseLodsb
      lodsb                           ; ca. 25 cycles faster!
   else
      inc esi
      movzx eax, byte ptr [esi]         ; esi=Substr (mov al is a few cycles slower)
   endif
   test al, al                  ; this could be shifted lower, but there is the rare case of equality after the zero delimiter:
   je @F                        ; db "Mainstr", 0, "abc" ... db "str", 0, "abx"  would crash
   inc ebx
   cmp al, [ebx]                  ; ebx=Mainstr
   je @B

@@:   pop esi

Jimg

Just from looking at the above post, your two possibilities don't seem equivalent.

Quoteif UseLodsb

   add ebx, ebp
   dec esi
@@:
   lodsb                           ; ca. 25 cycles faster!
   test al, al                  ; this could be shifted lower, but there is the rare case of equality after the zero delimiter:
   je @F                        ; db "Mainstr", 0, "abc" ... db "str", 0, "abx"  would crash
   inc ebx
   cmp al, [ebx]                  ; ebx=Mainstr
   je @B

else

   add ebx, ebp
@@:
   inc esi
   movzx eax, byte ptr [esi]         ; esi=Substr (mov al is a few cycles slower)
   test al, al                  ; this could be shifted lower, but there is the rare case of equality after the zero delimiter:
   je @F                        ; db "Mainstr", 0, "abc" ... db "str", 0, "abx"  would crash
   inc ebx
   cmp al, [ebx]                  ; ebx=Mainstr
   je @B

endif

I could be missing something, but it looks like you start two bytes earlier when using lodsb?

jj2007

Quote from: Jimg on May 09, 2009, 02:04:07 PM
Just from looking at the above post, your two possibilities don't seem equivalent.
...
I could be missing something, but it looks like you start two bytes earlier when using lodsb?

You are perfectly right, Jim :red

In addition, I found a bug that shows up for certain unusual patterns, so I better pull back that code until it's correct.

Jimg

Looking at your latest-
if UseLodsb
add esi, 6
else
add esi, 5
endif
@@:
if UseLodsb
lodsb ; some cycles faster!
else
inc esi
movzx eax, byte ptr [esi] ; esi=Substr (mov al is a few cycles slower)
endif
test al, al ; this cannot be shifted below because of the rare case of equality after the zero delimiter:
je @F ; db "Mainstr", 0, "abc" ... db "str", 0, "abx"  would crash
inc ebx
cmp al, [ebx] ; ebx=Mainstr
je @B

lodsb loads and then increments esi, so esi is pointing at the next character.

for your non lodsb code, you increment first then load, so esi is pointing at the current character.

therefore, the lodsb code will not end at the same esi as the non-lodsb code.

Does this matter?

If not, then use the same startup and move the "inc esi" below the "movzx eax, byte ptr [esi]" so the movzx doesn't have to wait for the inc esi to be done incrementing.  Should pick up a cycle.

If it does, you'll have to decrement esi later for the lodsb code.

Yes?

jj2007

Quote from: Jimg on May 10, 2009, 01:43:12 AM
Does this matter?

If not, then use the same startup and move the "inc esi" below the "movzx eax, byte ptr [esi]"


Jim,

Good point, thanks. It doesn't matter because soon after I pop esi, but it makes the code more readable.

add ebx, ebp
add esi, 6
@@:
if UseLodsb
lodsb ; some cycles faster!
test al, al ; this cannot be shifted below because of the rare case of equality after the zero delimiter:
je @F ; db "Mainstr", 0, "abc" ... db "str", 0, "abx"  would crash
else
movzx eax, byte ptr [esi] ; esi=Substr (mov al is a few cycles slower)
je @F
inc esi
endif
inc ebx
cmp al, [ebx] ; ebx=Mainstr
je @B

@@: pop ebx
pop esi
jne BadLuck ; test al, al not needed, flags still valid

Jimg

or simply
add esi, 6
@@:
if UseLodsb
    lodsb ; some cycles faster!
else
    movzx eax, byte ptr [esi] ; esi=Substr (mov al is a few cycles slower)
    inc esi
endif   
    test al, al ; this cannot be shifted below because of the rare case of equality after the zero delimiter:
    je @F
    inc ebx
    cmp al, [ebx] ; ebx=Mainstr
    je @B


jj2007

Oops, that should have been

      movzx eax, byte ptr [esi]      ; esi=Substr (mov al is a few cycles slower)
      test al, al
      je @F
      inc esi

My version is about half a cycle faster because it tests for zero before the inc esi :wink

hutch--

JJ,

See if "test eax, eax" is faster than the byte test.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on May 10, 2009, 03:08:19 PM
JJ,

See if "test eax, eax" is faster than the byte test.

Not on a Celeron M (code below):
664     cycles for 1000* test 1000*reg32, reg32
664     cycles for 1000* test reg8, reg8

0.7 cycles for an algo that needs about 2500, so "full algo" testing is simply impossible.
But I'll replace it, thanks - test eax after a movzx eax simply looks better.

The limits are elsewhere anyway. I have tested three variants now for the SSE2 Instring:
1. word scanner in SSE2, 3 more bytes through cmp edi, [mem+2]: 2470 cycles on average
2. byte scanner in SSE2, if match: word scanner, then 3 more bytes through cmp edi, [mem+2]: 2470 cycles on average
3. byte scanner in SSE2, 3 more bytes through cmp edi, [mem+1]: 2470 cycles on average

See the problem? :bg

No. 2+3 get really complex, so I'll stick with the word scanner posted in the laboratory.

Of course, No. 2+3 are blazing fast if you are searching for a pattern starting with a rare letter like X, but that is the exception...

Quote.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm         ; get them from the Masm32 Laboratory
   LOOP_COUNT   = 1000000

.code
start:
   counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 250
         test eax, eax
         test ebx, ebx
         test ecx, ecx
         test edx, edx
      ENDM
   counter_end
   print str$(eax), 9, "cycles for 1000* test reg32, reg32", 13, 10

   counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT 250
         test al, al
         test bl, bl
         test cl, cl
         test dl, dl
      ENDM
   counter_end
   print str$(eax), 9, "cycles for 1000* test reg8, reg8", 13, 10

   inkey chr$(13, 10, "--- ok ---", 13)
   exit
end start