I am fumbling with Unicode string len algos and have two questions:
1. According to Masm, the length of...
.data
WSTR ws, "my other brother darryl my other brother darryl my other brother darryl my other brother darryl xxxx"
...is:
LENGTHOF: 40
SIZEOF: 80
However, all algos return 100:
ucLenScasW retval: 100, size=22
lstrlenW return value: 100, size=0 (kernel32)
Masm32 ucLen retval: 100, size=30
ucsLen return value: 100, size=103
lenUC return value: 100, size=57
StrLenW return value: 100, size=103
AzmtStr return value: 100, size=243
2. Timings on a Celeron M:
ucLenScasW: 449 cycles
lstrlenW: 319 cycles
Masm32 ucLen: 230 cycles
ucsLen: 153 cycles
lenUC: 138 cycles
StrLenW: 132 cycles
AzmtStrLenW: 115 cycles
ucLenScasW: 449 cycles
lstrlenW: 319 cycles
Masm32 ucLen: 230 cycles
ucsLen: 153 cycles
lenUC: 137 cycles
StrLenW: 132 cycles
AzmtStrLenW: 114 cycles
ucLenScasW: 449 cycles
lstrlenW: 319 cycles
Masm32 ucLen: 231 cycles
ucsLen: 153 cycles
lenUC: 137 cycles
StrLenW: 132 cycles
AzmtStrLenW: 115 cycles
Now the second question. Here is the lenUC algo. I have tried various counts after REPEAT but, instead of getting faster with higher counts, there is an optimum at n=1:
align 2
lenUC proc pStr
mov eax, [esp+4]
.Repeat
REPEAT 1 ; 0=155, 1=137, 2=150, 3=141, 4=152 cycles
test dword ptr [eax], 00000FFFFh ; one unroll yields best result
je j0
test dword ptr [eax], 0FFFF0000h
je j2
add eax, 4
ENDM
test dword ptr [eax], 00000FFFFh
je j0
test dword ptr [eax], 0FFFF0000h
lea eax, [eax+4] ; keep the zero flag
.Until Zero?
if 1
sub eax, 4 ; equally fast and shorter
else
sub eax, [esp+4] ; reverse order and
sub eax, 2 ; lose 18 cycles...!
shr eax, 1
retn 4
endif
j2: add eax, 2
j0: sub eax, [esp+4]
shr eax, 1
retn 4
lenUC endp
So my question is simply: Why is there no improvement for unrolling?
Code attached.
[attachment deleted by admin]
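For readers without MASM at hand, the DWORD-at-a-time trick in lenUC can be sketched in C roughly like this (the function name and comments are mine, not from the thread; like the assembly, it assumes the terminator can be read as part of a pair of WCHARs):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the lenUC idea: read one 32-bit word per step and test its
 * low and high 16-bit halves for the terminating WCHAR 0. */
static size_t uclen_dword(const uint16_t *s)
{
    const uint16_t *p = s;
    for (;;) {
        /* load two WCHARs at once, as `test dword ptr [eax], ...` does */
        uint32_t w = (uint32_t)p[0] | ((uint32_t)p[1] << 16);
        if ((w & 0x0000FFFFu) == 0)     /* low WCHAR is the terminator  */
            return (size_t)(p - s);     /* even length */
        if ((w & 0xFFFF0000u) == 0)     /* high WCHAR is the terminator */
            return (size_t)(p - s) + 1; /* odd length */
        p += 2;
    }
}
```

Like the assembly version's final `shr eax, 1`, the return value is a count of WCHARs, not bytes.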
That's exactly why I included that repeat block :)
I believe it has to do with branch prediction (on my CPU, the optimum was 3 - which means x4 per loop).
My timings are still in favour of ucsLen:
LENGTHOF: 40
SIZEOF: 80
ucLenScasW retval: 100, size=22
lstrlenW return value: 100, size=0 (kernel32)
Masm32 ucLen retval: 100, size=30
ucsLen return value: 100, size=103
lenUC return value: 100, size=57
StrLenW return value: 100, size=103
AzmtStr return value: 100, size=243
ucLenScasW: 670 cycles
lstrlenW: 265 cycles
Masm32 ucLen: 188 cycles
ucsLen: 101 cycles
lenUC: 102 cycles
StrLenW: 104 cycles
AzmtStrLenW: 101 cycles
ucLenScasW: 663 cycles
lstrlenW: 265 cycles
Masm32 ucLen: 189 cycles
ucsLen: 101 cycles
lenUC: 103 cycles
StrLenW: 105 cycles
AzmtStrLenW: 101 cycles
ucLenScasW: 662 cycles
lstrlenW: 263 cycles
Masm32 ucLen: 187 cycles
ucsLen: 100 cycles
lenUC: 102 cycles
StrLenW: 104 cycles
AzmtStrLenW: 103 cycles
Hit any key to get outta here...
even ; minimal alignment (2)
ucsLen proc buf ;
mov eax,[esp+4] ;
.repeat ;
REPEAT 3 ;\x1~x3
mov ecx,[eax] ; read *once* from the memory to give it a boost
add eax,4 ; I wanted to avoid an index of -4, so this is the only place the add fits
test ecx, 00000FFFFh ; low character
jz _EVEN_ ; > even amount of characters
test ecx, 0FFFF0000h ; high character
jz _ODD_ ; > odd amount of characters
ENDM ;/
mov ecx,[eax] ;\x4
add eax,4 ;
test ecx,00000FFFFh ; repeat with the exception of the loop
jz _EVEN_ ;
test ecx,0FFFF0000h ;
.until zero? ;/
_ODD_: ;
add eax,2 ; odd amount of characters: increase the offset by 2
_EVEN_: ;
sub eax,4 ; decrease by 4, as there was an extra addition
sub eax,[esp+4] ; subtract the base offset from the current
shr eax,1 ; divide by 2
retn 4
ucsLen endp
Quote from: DoomyD on November 08, 2008, 08:05:26 PM
That's exactly why I included that repeat block :)
I believe it has to do with branch prediction (on my CPU, the optimum was 3 - which means x4 per loop).
Makes sense: fewer branches, more precise predictions (??) - but I am just guessing. My lenUC is inspired by ucsLen, of course - the test 0000FFFFh is a cute trick, compliments. I wanted to reduce the number of trashed registers while keeping the speed. Strange, though, that the four "good" algos are so close on your CPU: on my Celeron, the (somewhat long) Azmt algo is clearly ahead. In the meantime, I added a macro:
uclen MACRO arg
or eax, -1
mov edx, arg
@@: inc eax
cmp word ptr [edx+2*eax], 0
jne @B
ENDM
This is a variant of the MasmLib ucLen, but instead of 30 bytes it is only 16 bytes and 10% faster. Even so, I still prefer lenUC: 57 bytes, but 40% more speed, and all registers preserved except eax.
By the way, I found that for lenUC, alignment had absolutely no influence on speed, and the same was true for the old add eax, 1 vs inc eax debate.
It would be nice to see timings for other CPUs. New code with the macro is attached.
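In C terms, the uclen macro above amounts to nothing more than this plain WCHAR-by-WCHAR scan (my sketch, my naming):

```c
#include <stddef.h>
#include <stdint.h>

/* C equivalent of the uclen macro: advance one WCHAR at a time until
 * the terminating 0, just like the inc/cmp/jne loop. */
static size_t uclen_simple(const uint16_t *s)
{
    size_t i = 0;      /* `or eax, -1` plus the first `inc eax` gives 0 */
    while (s[i] != 0)
        i++;
    return i;          /* index of the terminator equals the length */
}
```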
[attachment deleted by admin]
JJ,
It is not uncommon for an unrolled algo to be slower than the short form; it's usually worth the effort to set up a REPEAT block and then comment it in and out to see if there is a speed difference. The main gain is branch reduction, and I have seen unrolls by 16 work fine and unrolls by 4 turn out slow. Either a register or an immediate appears to be the fastest operand, while any memory access is a lot slower, especially if it's sequential reads, and even worse if the reads are random.
If you ever bother to play with optimising C compiler output, the changes that make the most difference are those that reduce memory reads and writes.
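The REPEAT-style unrolling discussed above can be mimicked in C by stamping out the scan step several times per loop iteration, so the backward branch is taken only once every few DWORDs (this is my own illustration; the step macro and function name are not from the thread):

```c
#include <stddef.h>
#include <stdint.h>

/* One scan step: test both WCHAR halves of a DWORD, then advance. */
#define UC_STEP do {                                            \
        uint32_t w = (uint32_t)p[0] | ((uint32_t)p[1] << 16);   \
        if ((w & 0x0000FFFFu) == 0) return (size_t)(p - s);     \
        if ((w & 0xFFFF0000u) == 0) return (size_t)(p - s) + 1; \
        p += 2;                                                 \
    } while (0)

/* Analogue of REPEAT 3 plus the closing copy: four steps per
 * iteration, one backward branch. */
static size_t uclen_x4(const uint16_t *s)
{
    const uint16_t *p = s;
    for (;;) {
        UC_STEP; UC_STEP; UC_STEP; UC_STEP;
    }
}
```

Whether this is faster than the single-step form is exactly what the thread's timings probe; on some CPUs the extra code size and altered jump encodings eat the branch savings.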
I have entered the timings for increasing buffer lengths with increasing REPEAT counts into a logarithmic chart:
(http://img504.imageshack.us/img504/2637/ucslengi3.png)
- tested: ucslen(as it fits on my machine)
- tests -> 15
- the maximum difference between the tests was about ±1.5
- Max buffer length -> 200
- Minimum speed -> 236 (x0)
- Maximum speed -> 201 (x4)
- Optimum -> x3 (speed | size)
As you can see, the more we branch forward, the better the code executes (is predicted?). However, the speed gain does reach an end point, beyond which the difference is trivial.
Impressive graph, although I admit I did not fully understand the axes: what is buffer length? Is X the REPEAT count?
My timings are less elaborate but, as mentioned earlier, there is a clear speed optimum at REPEAT 1, at least for my Celeron M (a Core Duo, afaik).
ucsLen return value: 100, size=103
lenUC1 return value: 100, size=57
AzmtStr return value: 100, size=243
DoomyD ucsLen: 153 cycles
lenUC0: 154 cycles
lenUC1: 140 cycles
lenUC2: 150 cycles
lenUC3: 140 cycles
lenUC4: 152 cycles
lenUC5: 146 cycles
lenUC6: 145 cycles
AzmtStrLenW: 116 cycles
DoomyD ucsLen: 153 cycles
lenUC0: 154 cycles
lenUC1: 139 cycles
lenUC2: 151 cycles
lenUC3: 141 cycles
lenUC4: 151 cycles
lenUC5: 146 cycles
lenUC6: 145 cycles
AzmtStrLenW: 116 cycles
DoomyD ucsLen: 153 cycles
lenUC0: 154 cycles
lenUC1: 138 cycles
lenUC2: 150 cycles
lenUC3: 142 cycles
lenUC4: 151 cycles
lenUC5: 146 cycles
lenUC6: 145 cycles
AzmtStrLenW: 207 cycles
Here again the routine, full code attached.
align 2
lenUC1 proc pStr ; ***************** why no improvement for unrolling? *****************
mov eax, [esp+4]
.Repeat
REPEAT 1 ; 0=155, 1=137, 2=150, 3=141, 4=152 cycles
test dword ptr [eax], 00000FFFFh ; one unroll yields best result
je j0
test dword ptr [eax], 0FFFF0000h
je j2
add eax, 4
ENDM
test dword ptr [eax], 00000FFFFh
je j0
test dword ptr [eax], 0FFFF0000h
lea eax, [eax+4] ; keep the zero flag
.Until Zero?
sub eax, 4
j2: add eax, 2
j0: sub eax, [esp+4]
shr eax, 1
retn 4
lenUC1 endp
[attachment deleted by admin]
I uploaded a new picture; maybe I can explain it better: the numbers that follow the x are the counts used for the REPEAT block. I used a loop to time the procedure while appending another character to the Unicode string each time. I then loaded the values into Excel to see which count would be the first to produce a linear, stable sequence (which is x3 - every count after that will not make the procedure's speed drastically different or more stable).
My point is, there is a factor in the loop that slows its execution. The repetition avoids this factor, or delays its appearance in a manner that does not slow things down the way it otherwise would. By including one repetition, you solved the problem (I took it a step further to avoid the unbalanced results you get when using 1 or 2 on small strings).
Another factor that can make long unrolls slower is code alignment. When the unroll reaches a certain size, instructions can get encoded differently: for example, a conditional jump that is normally encoded in 2 bytes can, in the middle of a long unroll, grow to 6 bytes and behave differently than you expected. AMD does not seem to be much affected by the 6-byte jumps (as long as you force them everywhere), while Intel seems to prefer the 2-byte jumps.
I don't have any proof of that, but it is my deduction after having done many tests.
EDIT: If you add just a few bytes to the AzmtStrLenW algo (like an ALIGN 8 after the loop), all jumps turn into 6-byte jumps and the CPU cycles needed rise noticeably.