I am fumbling with Unicode string len algos and have two questions:
1. According to Masm, the length of...
.data
WSTR ws, "my other brother darryl my other brother darryl my other brother darryl my other brother darryl xxxx"
...is:
LENGTHOF: 40
SIZEOF: 80
However, all algos return 100:
ucLenScasW retval: 100, size=22
lstrlenW return value: 100, size=0 (kernel32)
Masm32 ucLen retval: 100, size=30
ucsLen return value: 100, size=103
lenUC return value: 100, size=57
StrLenW return value: 100, size=103
AzmtStr return value: 100, size=243
2. Timings on a Celeron M:
ucLenScasW: 449 cycles
lstrlenW: 319 cycles
Masm32 ucLen: 230 cycles
ucsLen: 153 cycles
lenUC: 138 cycles
StrLenW: 132 cycles
AzmtStrLenW: 115 cycles
ucLenScasW: 449 cycles
lstrlenW: 319 cycles
Masm32 ucLen: 230 cycles
ucsLen: 153 cycles
lenUC: 137 cycles
StrLenW: 132 cycles
AzmtStrLenW: 114 cycles
ucLenScasW: 449 cycles
lstrlenW: 319 cycles
Masm32 ucLen: 231 cycles
ucsLen: 153 cycles
lenUC: 137 cycles
StrLenW: 132 cycles
AzmtStrLenW: 115 cycles
Now the second question. Here is the lenUC algo. I have tried various counts after REPEAT but, instead of getting faster with higher counts, there is an optimum at n=1:
align 2
lenUC proc pStr
mov eax, [esp+4]
.Repeat
REPEAT 1 ; 0=155, 1=137, 2=150, 3=141, 4=152 cycles
test dword ptr [eax], 00000FFFFh ; one unroll yields best result
je j0
test dword ptr [eax], 0FFFF0000h
je j2
add eax, 4
ENDM
test dword ptr [eax], 00000FFFFh
je j0
test dword ptr [eax], 0FFFF0000h
lea eax, [eax+4] ; keep the zero flag
.Until Zero?
if 1
sub eax, 4 ; equally fast and shorter
else
sub eax, [esp+4] ; reverse order and
sub eax, 2 ; lose 18 cycles...!
shr eax, 1
retn 4
endif
j2: add eax, 2
j0: sub eax, [esp+4]
shr eax, 1
retn 4
lenUC endp
So my question is simply: Why is there no improvement for unrolling?
Code attached.
[attachment deleted by admin]
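For readers without MASM at hand, the DWORD-at-a-time trick in lenUC can be sketched in C roughly like this (the function name and comments are mine, not from the thread; like the assembly, it assumes the terminator can be read as part of a pair of WCHARs):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the lenUC idea: read one 32-bit word per step and test its
 * low and high 16-bit halves for the terminating WCHAR 0. */
static size_t uclen_dword(const uint16_t *s)
{
    const uint16_t *p = s;
    for (;;) {
        /* load two WCHARs at once, as `test dword ptr [eax], ...` does */
        uint32_t w = (uint32_t)p[0] | ((uint32_t)p[1] << 16);
        if ((w & 0x0000FFFFu) == 0)     /* low WCHAR is the terminator  */
            return (size_t)(p - s);     /* even length */
        if ((w & 0xFFFF0000u) == 0)     /* high WCHAR is the terminator */
            return (size_t)(p - s) + 1; /* odd length */
        p += 2;
    }
}
```

Like the assembly version's final `shr eax, 1`, the return value is a count of WCHARs, not bytes.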
That's exactly why I included that repeat block :)
I believe it has to do with branch prediction (on my CPU, the optimum was 3 - which means x4 per loop).
My timings are still in favour of ucsLen:
LENGTHOF: 40
SIZEOF: 80
ucLenScasW retval: 100, size=22
lstrlenW return value: 100, size=0 (kernel32)
Masm32 ucLen retval: 100, size=30
ucsLen return value: 100, size=103
lenUC return value: 100, size=57
StrLenW return value: 100, size=103
AzmtStr return value: 100, size=243
ucLenScasW: 670 cycles
lstrlenW: 265 cycles
Masm32 ucLen: 188 cycles
ucsLen: 101 cycles
lenUC: 102 cycles
StrLenW: 104 cycles
AzmtStrLenW: 101 cycles
ucLenScasW: 663 cycles
lstrlenW: 265 cycles
Masm32 ucLen: 189 cycles
ucsLen: 101 cycles
lenUC: 103 cycles
StrLenW: 105 cycles
AzmtStrLenW: 101 cycles
ucLenScasW: 662 cycles
lstrlenW: 263 cycles
Masm32 ucLen: 187 cycles
ucsLen: 100 cycles
lenUC: 102 cycles
StrLenW: 104 cycles
AzmtStrLenW: 103 cycles
Hit any key to get outta here...
even ; minimal alignment (2)
ucsLen proc buf ;
mov eax,[esp+4] ;
.repeat ;
REPEAT 3 ;\x1~x3
mov ecx,[eax] ; read *once* from the memory to give it a boost
add eax,4 ; I wanted to avoid an index of -4, so this is the only place the add fits
test ecx, 00000FFFFh ; low character
jz _EVEN_ ; > even amount of characters
test ecx, 0FFFF0000h ; high character
jz _ODD_ ; > odd amount of characters
ENDM ;/
mov ecx,[eax] ;\x4
add eax,4 ;
test ecx,00000FFFFh ; repeat with the exception of the loop
jz _EVEN_ ;
test ecx,0FFFF0000h ;
.until zero? ;/
_ODD_: ;
add eax,2 ; odd amount of characters: increase the offset by 2
_EVEN_: ;
sub eax,4 ; decrease by 4, as there was an extra addition
sub eax,[esp+4] ; subtract the base offset from the current
shr eax,1 ; divide by 2
retn 4
ucsLen endp
Quote from: DoomyD on November 08, 2008, 08:05:26 PM
That's exactly why I included that repeat block :)
I believe it has to do with branch prediction (on my CPU, the optimum was 3 - which means x4 per loop).
Makes sense: fewer branches, more precise predictions (??) - but I am just guessing. My lenUC is inspired by ucsLen, of course - the test 0000FFFFh is a cute trick, compliments. I wanted to reduce the number of trashed registers while keeping the speed. Strange, though, that the four "good" algos are so close on your CPU: on my Celeron, the (somewhat long) Azmt algo is clearly ahead. In the meantime, I added a macro:
uclen MACRO arg
or eax, -1
mov edx, arg
@@: inc eax
cmp word ptr [edx+2*eax], 0
jne @B
ENDM
This is a variant of the MasmLib ucLen, but instead of 30 bytes it is only 16 bytes and 10% faster. Even so, I still prefer lenUC: 57 bytes, but 40% more speed, and all registers preserved except eax.
By the way, I found that for lenUC, alignment had absolutely no influence on speed, and the same was true for the old add eax, 1 vs inc eax debate.
It would be nice to see timings for other CPUs. New code with the macro is attached.
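In C terms, the uclen macro above amounts to nothing more than this plain WCHAR-by-WCHAR scan (my sketch, my naming):

```c
#include <stddef.h>
#include <stdint.h>

/* C equivalent of the uclen macro: advance one WCHAR at a time until
 * the terminating 0, just like the inc/cmp/jne loop. */
static size_t uclen_simple(const uint16_t *s)
{
    size_t i = 0;      /* `or eax, -1` plus the first `inc eax` gives 0 */
    while (s[i] != 0)
        i++;
    return i;          /* index of the terminator equals the length */
}
```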
[attachment deleted by admin]
JJ,
It is not uncommon for an unrolled algo to be slower than the short form; it's usually worth the effort to set up a REPEAT block and then comment it in and out to see if there is a speed difference. The main gain is branch reduction, and I have seen unrolls by 16 work fine and unrolls by 4 turn out slow. Either a register or an immediate appears to be the fastest operand, while any memory access is a lot slower, especially if it's sequential reads, and even worse if the reads are random.
If you ever bother to play with optimising C compiler output, the changes that make the most difference are those that reduce memory reads and writes.
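The REPEAT-style unrolling discussed above can be mimicked in C by stamping out the scan step several times per loop iteration, so the backward branch is taken only once every few DWORDs (this is my own illustration; the step macro and function name are not from the thread):

```c
#include <stddef.h>
#include <stdint.h>

/* One scan step: test both WCHAR halves of a DWORD, then advance. */
#define UC_STEP do {                                            \
        uint32_t w = (uint32_t)p[0] | ((uint32_t)p[1] << 16);   \
        if ((w & 0x0000FFFFu) == 0) return (size_t)(p - s);     \
        if ((w & 0xFFFF0000u) == 0) return (size_t)(p - s) + 1; \
        p += 2;                                                 \
    } while (0)

/* Analogue of REPEAT 3 plus the closing copy: four steps per
 * iteration, one backward branch. */
static size_t uclen_x4(const uint16_t *s)
{
    const uint16_t *p = s;
    for (;;) {
        UC_STEP; UC_STEP; UC_STEP; UC_STEP;
    }
}
```

Whether this is faster than the single-step form is exactly what the thread's timings probe; on some CPUs the extra code size and altered jump encodings eat the branch savings.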
I have entered the timings for increasing buffer lengths with increasing REPEAT counts into a logarithmic chart:
(http://img504.imageshack.us/img504/2637/ucslengi3.png)
- tested: ucslen(as it fits on my machine)
- tests -> 15
- the maximum difference between the tests was about ±1.5
- Max buffer length -> 200
- Minimum speed -> 236 (x0)
- Maximum speed -> 201 (x4)
- Optimum -> x3 (speed | size)
As you can see, the more we branch forward, the better the code executes (is predicted?). However, the speed gain does reach an end point, beyond which the difference is trivial.
Impressive graph, although I admit I did not fully understand the axes: what is buffer length? Is X the REPEAT count?
My timings are less elaborate but, as mentioned earlier, there is a clear speed optimum at REPEAT 1, at least for my Celeron M (a Core Duo, afaik).
ucsLen return value: 100, size=103
lenUC1 return value: 100, size=57
AzmtStr return value: 100, size=243
DoomyD ucsLen: 153 cycles
lenUC0: 154 cycles
lenUC1: 140 cycles
lenUC2: 150 cycles
lenUC3: 140 cycles
lenUC4: 152 cycles
lenUC5: 146 cycles
lenUC6: 145 cycles
AzmtStrLenW: 116 cycles
DoomyD ucsLen: 153 cycles
lenUC0: 154 cycles
lenUC1: 139 cycles
lenUC2: 151 cycles
lenUC3: 141 cycles
lenUC4: 151 cycles
lenUC5: 146 cycles
lenUC6: 145 cycles
AzmtStrLenW: 116 cycles
DoomyD ucsLen: 153 cycles
lenUC0: 154 cycles
lenUC1: 138 cycles
lenUC2: 150 cycles
lenUC3: 142 cycles
lenUC4: 151 cycles
lenUC5: 146 cycles
lenUC6: 145 cycles
AzmtStrLenW: 207 cycles
Here again the routine, full code attached.
align 2
lenUC1 proc pStr ; ***************** why no improvement for unrolling? *****************
mov eax, [esp+4]
.Repeat
REPEAT 1 ; 0=155, 1=137, 2=150, 3=141, 4=152 cycles
test dword ptr [eax], 00000FFFFh ; one unroll yields best result
je j0
test dword ptr [eax], 0FFFF0000h
je j2
add eax, 4
ENDM
test dword ptr [eax], 00000FFFFh
je j0
test dword ptr [eax], 0FFFF0000h
lea eax, [eax+4] ; keep the zero flag
.Until Zero?
sub eax, 4
j2: add eax, 2
j0: sub eax, [esp+4]
shr eax, 1
retn 4
lenUC1 endp
[attachment deleted by admin]
I uploaded a new picture; maybe I can explain it better: the numbers that follow the x are the counts used for the REPEAT block. I used a loop to time the procedure while appending another character to the Unicode string each time. I then loaded the values into Excel to see which count would be the first to produce a linear, stable sequence (which is x3 - every count after that will not make the procedure's speed drastically different or more stable).
My point is, there is a factor in the loop that slows its execution. The repetition avoids this factor, or delays its appearance in a manner that does not slow things down the way it otherwise would. By including one repetition, you solved the problem (I took it a step further to avoid the unbalanced results you get when using 1 or 2 on small strings).
Another factor that can make long unrolls slower is code alignment. When the unroll reaches a certain size, instructions can get encoded differently: for example, a conditional jump that is normally encoded in 2 bytes can, in the middle of a long unroll, grow to 6 bytes and behave differently than you expected. AMD does not seem to be much affected by the 6-byte jumps (as long as you force them everywhere), while Intel seems to prefer the 2-byte jumps.
I don't have any proof of that, but it is my deduction after having done many tests.
EDIT: If you add just a few bytes to the AzmtStrLenW algo (like an ALIGN 8 after the loop), all jumps turn into 6-byte jumps and the CPU cycles needed rise noticeably.