Unrolled algo is slower

Started by jj2007, November 08, 2008, 07:11:53 PM


jj2007

I am fumbling with Unicode string len algos and have two questions:
1. According to Masm, the length of...
.data
WSTR ws, "my other brother darryl my other brother darryl my other brother darryl my other brother darryl xxxx"
...is:
  LENGTHOF:     40
  SIZEOF:       80
However, all algos return 100:
ucLenScasW retval:      100, size=22
lstrlenW return value:  100, size=0 (kernel32)
Masm32 ucLen retval:    100, size=30
ucsLen return value:    100, size=103
lenUC return value:     100, size=57
StrLenW return value:   100, size=103
AzmtStr return value:   100, size=243


2. Timings on a Celeron M:
ucLenScasW:     449 cycles
lstrlenW:       319 cycles
Masm32 ucLen:   230 cycles
ucsLen:         153 cycles
lenUC:          138 cycles
StrLenW:        132 cycles
AzmtStrLenW:    115 cycles

ucLenScasW:     449 cycles
lstrlenW:       319 cycles
Masm32 ucLen:   230 cycles
ucsLen:         153 cycles
lenUC:          137 cycles
StrLenW:        132 cycles
AzmtStrLenW:    114 cycles

ucLenScasW:     449 cycles
lstrlenW:       319 cycles
Masm32 ucLen:   231 cycles
ucsLen:         153 cycles
lenUC:          137 cycles
StrLenW:        132 cycles
AzmtStrLenW:    115 cycles


Now the second question. Here is the lenUC algo. I have tried all the counts after REPEAT, but instead of getting faster with higher counts, the optimum is at n=1:

align 2
lenUC proc pStr
mov eax, [esp+4]
.Repeat
REPEAT 1 ; 0=155, 1=137, 2=150, 3=141, 4=152 cycles
  test dword ptr [eax], 00000FFFFh ; one unroll yields best result
  je j0
  test dword ptr [eax], 0FFFF0000h
  je j2
  add eax, 4
ENDM
test dword ptr [eax], 00000FFFFh
je j0
test dword ptr [eax], 0FFFF0000h
lea eax, [eax+4] ; keep the zero flag
.Until Zero?
if 1
  sub eax, 4 ; equally fast and shorter
else
  sub eax, [esp+4] ; reverse order and
  sub eax, 2 ; lose 18 cycles...!
  shr eax, 1
  retn 4
endif
j2: add eax, 2
j0: sub eax, [esp+4]
shr eax, 1
retn 4
lenUC endp


So my question is simply: Why is there no improvement for unrolling?
Code attached.

[attachment deleted by admin]

DoomyD

That's exactly why I included that repeat block :)
I believe it has to do with branch prediction (on my CPU, the optimum was 3, which means x4 per loop).
My timings are still in favour of ucsLen:

  LENGTHOF:     40
  SIZEOF:       80

ucLenScasW retval: 100, size=22
lstrlenW return value: 100, size=0 (kernel32)
Masm32 ucLen retval: 100, size=30
ucsLen return value: 100, size=103
lenUC return value: 100, size=57
StrLenW return value: 100, size=103
AzmtStr return value: 100, size=243

ucLenScasW: 670 cycles
lstrlenW: 265 cycles
Masm32 ucLen: 188 cycles
ucsLen: 101 cycles
lenUC:    102 cycles
StrLenW: 104 cycles
AzmtStrLenW: 101 cycles

ucLenScasW: 663 cycles
lstrlenW: 265 cycles
Masm32 ucLen: 189 cycles
ucsLen: 101 cycles
lenUC:    103 cycles
StrLenW: 105 cycles
AzmtStrLenW: 101 cycles

ucLenScasW: 662 cycles
lstrlenW: 263 cycles
Masm32 ucLen: 187 cycles
ucsLen: 100 cycles
lenUC:    102 cycles
StrLenW: 104 cycles
AzmtStrLenW: 103 cycles

Hit any key to get outta here...
even                            ; minimal alignment (2)
ucsLen   proc   buf             ;
   mov   eax,[esp+4]            ;
   .repeat                      ;
      REPEAT 3                  ;\x1~x3
      mov    ecx,[eax]          ; read *once* from the memory to give it a boost
      add    eax,4              ; I wanted to avoid an index of -4, hence that's the only place it would fit to
      test   ecx, 00000FFFFh    ; low character
      jz     _EVEN_             ; > even amount of characters
      test   ecx, 0FFFF0000h    ; high character
      jz     _ODD_              ; > odd amount of characters
      ENDM                      ;/
      mov    ecx,[eax]          ;\x4
      add    eax,4              ;
      test   ecx,00000FFFFh     ; repeat with the exception of the loop
      jz     _EVEN_             ;
      test   ecx,0FFFF0000h     ;
   .until zero?                 ;/
   _ODD_:                       ;
   add   eax,2                  ; odd amount of characters: increase the offset by 2
   _EVEN_:                      ;
   sub   eax,4                  ; decrease by 4, as there was an extra addition
   sub   eax,[esp+4]            ; subtract the base offset from the current
   shr   eax,1                  ; divide by 2
   retn  4
ucsLen endp

jj2007

Quote from: DoomyD on November 08, 2008, 08:05:26 PM
That's exactly why I included that repeat block :)
I believe it has to do with branch prediction (on my CPU, the optimum was 3, which means x4 per loop).


Makes sense: fewer branches, more precise predictions (??) - but I am just guessing. My lenUC is inspired by ucsLen, of course - the test 0000FFFFh is a cute trick, compliments. I wanted to reduce the number of trashed registers while keeping the speed. Strange that the 4 "good" algos are so close on your CPU, though: on my Celeron, the (somewhat long) Azmt algo is clearly ahead. In the meantime, I added a macro:
uclen MACRO arg
or eax, -1
mov edx, arg
@@: inc eax
cmp word ptr [edx+2*eax], 0
jne @B
ENDM

This is a variant of the MasmLib ucLen, but instead of 30 bytes it is 16 bytes long and 10% faster. But at that point, I still prefer the lenUC with 57 bytes but 40% more speed and all registers preserved except eax.
By the way, I found that for lenUC alignment had absolutely no influence on speed, and the same was true for the old add eax, 1 vs inc eax debate.
It would be nice to see timings for other CPUs. New code with the macro is attached.

[attachment deleted by admin]

hutch--

JJ,

It is not uncommon for an unrolled algo to be slower than the short form; it's usually worth the effort to do a REPEAT block, then comment it in and out to see if there is a speed difference. The main gain is branch reduction, and I have seen unrolls by 16 work fine and unrolls by 4 slow. Either a register or an immediate appears to be the fastest operand, while any memory access is a lot slower, especially if it's sequential reads, and even worse if the reads are random.

If you ever bother to play with optimising C compiler output the changes that make the most difference are those that reduce memory reads and writes.

DoomyD

I have entered the timings for increasing buffer lengths and increasing REPEAT counts into a logarithmic chart:

  • tested: ucslen(as it fits on my machine)
  • tests -> 15
  • the maximum difference between the tests was about ±1.5
  • Max buffer length -> 200
  • Minimum speed -> 236 (x0)
  • Maximum speed -> 201 (x4)
  • Optimum -> x3 (speed | size)
As you can see, the more we branch forward, the better the code is executed (predicted?). However, the speed gain does reach an end point, beyond which the difference is trivial.

jj2007

Impressive graph, although I admit I did not fully understand the axes: what is buffer length? Is X the REPEAT count?
My timings are less elaborate, but as mentioned earlier, there is a clear speed optimum for REPEAT 1, at least for my Celeron M (a Core Duo afaik).

ucsLen return value:    100, size=103
lenUC1 return value:    100, size=57
AzmtStr return value:   100, size=243

DoomyD ucsLen:  153 cycles
lenUC0:         154 cycles
lenUC1:         140 cycles
lenUC2:         150 cycles
lenUC3:         140 cycles
lenUC4:         152 cycles
lenUC5:         146 cycles
lenUC6:         145 cycles
AzmtStrLenW:    116 cycles

DoomyD ucsLen:  153 cycles
lenUC0:         154 cycles
lenUC1:         139 cycles
lenUC2:         151 cycles
lenUC3:         141 cycles
lenUC4:         151 cycles
lenUC5:         146 cycles
lenUC6:         145 cycles
AzmtStrLenW:    116 cycles

DoomyD ucsLen:  153 cycles
lenUC0:         154 cycles
lenUC1:         138 cycles
lenUC2:         150 cycles
lenUC3:         142 cycles
lenUC4:         151 cycles
lenUC5:         146 cycles
lenUC6:         145 cycles
AzmtStrLenW:    207 cycles

Here again the routine, full code attached.
align 2
lenUC1 proc pStr ; ***************** why no improvement for unrolling? *****************
mov eax, [esp+4]
.Repeat
REPEAT 1 ; 0=155, 1=137, 2=150, 3=141, 4=152 cycles
  test dword ptr [eax], 00000FFFFh ; one unroll yields best result
  je j0
  test dword ptr [eax], 0FFFF0000h
  je j2
  add eax, 4
ENDM
test dword ptr [eax], 00000FFFFh
je j0
test dword ptr [eax], 0FFFF0000h
lea eax, [eax+4] ; keep the zero flag
.Until Zero?
sub eax, 4
j2: add eax, 2
j0: sub eax, [esp+4]
shr eax, 1
retn 4
lenUC1 endp

[attachment deleted by admin]

DoomyD

I uploaded a new picture; maybe I can explain it better: the numbers that follow the x are the counts used for the REPEAT block. I used a loop to time the procedure, appending another character to the Unicode string each time. I then loaded the values into Excel to see which count would be the first to give a linear, stable sequence (which is x3 - every number after that will not make the procedure's speed drastically lower or more stable).
My point is, there is a factor in the loop that slows its execution. The repetition avoids this factor, or delays its appearance in a manner that does not slow the loop down as it otherwise would. By including one repetition, you solved the problem (I took it a step further to avoid the unbalanced results given when using 1 or 2 for small strings).

jdoe

Another factor that makes long unrolls slower is code alignment. When the unroll reaches a certain size, instructions can get encoded differently. For example, a conditional jump that is normally encoded in 2 bytes can change in the middle of a long unroll to 6 bytes and turn out different from what you were expecting. AMD seems not to be affected as much by the 6-byte jumps (as long as you force them everywhere), while Intel seems to prefer the 2-byte jumps.

I don't have any proof of that but it is my deduction after having done many tests.


EDIT: If you add just a few bytes in the AzmtStrLenW algo (like ALIGN 8 after the loop), all jumps turn into 6-byte encodings and the CPU cycles needed rise noticeably.