Unicode string length

Started by jdoe, April 10, 2006, 01:32:54 AM


EduardoS

Are we measuring right? I mean, branch prediction isn't counted: running the algo with the same data millions of times allows the processor to predict all branches correctly, at least for small strings (as in this case). I measured the time for SebW2 with strings from 10 to 60 chars in length, using a simple rdtsc before and another after instead of the timing macros, and got this result:

23 cycles
25 cycles*
25 cycles
26 cycles
27 cycles
30 cycles*
29 cycles
30 cycles
31 cycles
53 cycles*
33 cycles
47 cycles*
35 cycles
36 cycles
37 cycles
38 cycles
39 cycles
40 cycles
41 cycles
55 cycles*
43 cycles
71 cycles*
45 cycles
46 cycles
47 cycles
48 cycles
49 cycles
63 cycles-
64 cycles
65 cycles
66 cycles
67 cycles
68 cycles
69 cycles
70 cycles
71 cycles
72 cycles
73 cycles
74 cycles
75 cycles
76 cycles
77 cycles
81 cycles-
82 cycles
83 cycles
84 cycles
85 cycles
86 cycles
87 cycles
88 cycles

I got one extra cycle for each extra character of length, with the exceptions marked *, which seem to be due to processor internal state, and those marked - at lengths 37 and 53, where misprediction seems to increase.

Finally, shouldn't branch prediction be taken into account?
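One way to take the kind of measurement described above is to read the timestamp counter once before and once after a single call, and to change the string length on every call so the branch predictor is not trained on one input. The C sketch below assumes GCC or Clang on x86 for `__rdtsc()` (elsewhere it falls back to `timespec_get`). The names `read_ticks`, `simple_len` and `time_each_length` are hypothetical, and `simple_len` merely stands in for SebW2. There is no serialising instruction around the reads, so single readings will be noisy.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>          /* __rdtsc() on GCC/Clang */
#endif

/* Timestamp source: rdtsc on x86, wall-clock nanoseconds elsewhere. */
uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    return __rdtsc();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Stand-in for the routine under test (hypothetical, not SebW2). */
size_t simple_len(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

/* One timed call per length, min_len..max_len inclusive.
   buf needs room for max_len + 1 characters; ticks_out and len_out
   need max_len - min_len + 1 slots each. */
void time_each_length(uint16_t *buf, size_t min_len, size_t max_len,
                      uint64_t *ticks_out, size_t *len_out)
{
    for (size_t len = min_len; len <= max_len; ++len) {
        for (size_t i = 0; i < len; ++i)
            buf[i] = (uint16_t)'x';
        buf[len] = 0;

        volatile size_t result;         /* keeps the call from being optimised away */
        uint64_t t0 = read_ticks();     /* timestamp before the single call */
        result = simple_len(buf);
        uint64_t t1 = read_ticks();     /* timestamp after the single call */

        ticks_out[len - min_len] = t1 - t0;
        len_out[len - min_len] = result;
    }
}
```

Called with lengths 10 to 60 it fills 51 slots, one per length. The tick values depend entirely on the machine.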

jdoe


Sorry for reviving this old topic, but I would like to compare timings on different CPUs. If a few members have time to post timing results for this, it would be appreciated. I added a new one (AzmtStrLenW) that is a good compromise between AMD and Intel. I can go faster on Intel, but AMD doesn't like it.


AMD Athlon XP 1800+


my other brother darryl my other brother darryl my other brother darryl
LENGTHOF : 40
SIZEOF : 80
StrLenW return value : 71
Lingo32W return value : 71
ucLen2 return value : 71
Edu32W return value : 71
Edu32W2 return value : 71
SebW return value : 71
SebW2 return value : 71
lstrlenW return value : 71
AzmtStrLenW return value : 71

StrLenW : 112 cycles
Lingo32W : 183 cycles
ucLen2 : 166 cycles
Edu32W : 177 cycles
Edu32W2 : 96 cycles
SebW : 143 cycles
SebW2 : 139 cycles
lstrlenW : 250 cycles
AzmtStrLenW : 77 cycles




[attachment deleted by admin]

sinsi

Q6600 2.4GHz

StrLenW : 75 cycles
Lingo32W : 327 cycles
ucLen2 : 72 cycles
Edu32W : 96 cycles
Edu32W2 : 88 cycles
SebW : 95 cycles
SebW2 : 95 cycles
lstrlenW : 197 cycles
AzmtStrLenW : 72 cycles

Light travels faster than sound, that's why some people seem bright until you hear them.

DoomyD

Core 2 Duo 6300 1.86GHz

my other brother darryl my other brother darryl my other brother darryl
LENGTHOF : 40
SIZEOF : 80
StrLenW return value : 71
Lingo32W return value : 71
ucLen2 return value : 71
Edu32W return value : 71
Edu32W2 return value : 71
SebW return value : 71
SebW2 return value : 71
lstrlenW return value : 71
AzmtStrLenW return value : 71

StrLenW : 76 cycles
Lingo32W : 331 cycles
ucLen2 : 73 cycles
Edu32W : 421 cycles
Edu32W2 : 352 cycles
SebW : 94 cycles
SebW2 : 96 cycles
lstrlenW : 201 cycles
AzmtStrLenW : 73 cycles

Press any key to exit...

jj2007

Pentium 4 3.4 GHz

StrLenW : 206 cycles
Lingo32W : 287 cycles
ucLen2 : 192 cycles
Edu32W : 182 cycles
Edu32W2 : 135 cycles
SebW : 227 cycles
SebW2 : 216 cycles
lstrlenW : 371 cycles
AzmtStrLenW : 140 cycles

DoomyD

Could someone time the following?
On my machine it yields 72 cycles
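OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
even
ucsLen   proc   buf
   mov   eax,[esp+4]          ; eax -> string (no stack frame, so the argument is at esp+4)
   .repeat
      ; unroll: three copies of the two-character check
      REPEAT 3
      mov    ecx,[eax]        ; load two UTF-16 characters at once
      add    eax,4
      test   ecx,00000FFFFh   ; first character zero?
      jz     _EVEN_
      test   ecx,0FFFF0000h   ; second character zero?
      jz     _ODD_
      ENDM
      mov    ecx,[eax]        ; fourth copy closes the loop
      add    eax,4
      test   ecx,00000FFFFh
      jz     _EVEN_
      test   ecx,0FFFF0000h
   .until zero?
   _ODD_:
   add   eax,2                ; terminator was the second character of the pair
   _EVEN_:
   sub   eax,4                ; undo the add that followed the last load
   sub   eax,[esp+4]          ; byte offset of the terminator
   shr   eax,1                ; bytes -> characters
   retn  4
ucsLen endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef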

jj2007

Quote from: jj2007 on November 06, 2008, 09:14:41 AM
Pentium 4 3.4 GHz

StrLenW : 206 cycles
Lingo32W : 287 cycles
ucLen2 : 192 cycles
Edu32W : 182 cycles
Edu32W2 : 135 cycles
SebW : 227 cycles
SebW2 : 216 cycles
lstrlenW : 371 cycles
AzmtStrLenW : 140 cycles


StrLenW : 201 cycles
Lingo32W : 278 cycles
ucLen2 : 193 cycles
Edu32W : 170 cycles
Edu32W2 : 132 cycles
SebW : 229 cycles
SebW2 : 213 cycles
lstrlenW : 452 cycles
AzmtStrLenW : 151 cycles (unchanged code, most values are a bit slower)

... with a minor modification:

@@:
    mov edx, [eax]
    mov ecx, [eax+4]
  if 1 ; faster
    add eax, 8
    test dx, dx
  else ; slower
    test dx, dx
    lea eax, [eax+8]
  endif
    jz sub8


jj2007

Quote from: DoomyD on November 06, 2008, 09:25:44 AM
Could someone time the following?
On my machine it yields 72 cycles

Edu32W2:        135 cycles
ucsLen:         161 cycles
AzmtStrLenW:    141 cycles

Edu32W2:        135 cycles
ucsLen:         161 cycles
AzmtStrLenW:    142 cycles

Edu32W2:        135 cycles
ucsLen:         162 cycles
AzmtStrLenW:    142 cycles

Edu32W2:        136 cycles
ucsLen:         161 cycles
AzmtStrLenW:    142 cycles

Edu32W2:        134 cycles
ucsLen:         162 cycles
AzmtStrLenW:    140 cycles


New code attached, Edu32W2 is slightly modified.

[attachment deleted by admin]

DoomyD

Interesting results :P
I wonder what makes the difference
StrLenW : 76 cycles
ucLen2 : 73 cycles

Edu32W2: 331 cycles
ucsLen: 71 cycles
AzmtStrLenW: 73 cycles

Edu32W2: 332 cycles
ucsLen: 72 cycles
AzmtStrLenW: 73 cycles

Edu32W2: 331 cycles
ucsLen: 72 cycles
AzmtStrLenW: 72 cycles

Edu32W2: 334 cycles
ucsLen: 72 cycles
AzmtStrLenW: 73 cycles

Edu32W2: 333 cycles
ucsLen: 71 cycles
AzmtStrLenW: 72 cycles

sinsi


StrLenW : 75 cycles
ucLen2 : 72 cycles

Edu32W2:        72 cycles
ucsLen:         71 cycles
AzmtStrLenW:    72 cycles

Edu32W2:        72 cycles
ucsLen:         71 cycles
AzmtStrLenW:    76 cycles

Edu32W2:        72 cycles
ucsLen:         71 cycles
AzmtStrLenW:    72 cycles

Edu32W2:        72 cycles
ucsLen:         71 cycles
AzmtStrLenW:    72 cycles

Edu32W2:        72 cycles
ucsLen:         71 cycles
AzmtStrLenW:    72 cycles

hmmm, what to choose, what to choose...

jj2007

Quote from: DoomyD on November 06, 2008, 01:51:46 PM
Interesting results :P
I wonder what makes the difference

Let me guess: You run a Core Duo?

DoomyD

Of course :U
Yet it's weird, because shifts are relatively slow compared to a test operation.

jdoe


Thanks guys.

I realized after many tests that AMD processors seem to like long unrolls (in the case of AzmtStrLenW, 16 characters in each loop) and perform well with them, while Intel performs well with a small unroll, provided it is well written.

Something that is likely to be faster on Intel for a function like that is a "test ax, ax" followed by a "shr eax, 16" in a small unroll. I did it on my Intel at the office and it was really fast, but I don't have the source anymore.
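Since the original source is gone, the following C sketch is only a guess at what a "test ax, ax" then "shr eax, 16" loop does per dword. The name `shr_len_sketch` is made up, and the unrolling mentioned above is left out. It assumes a little-endian CPU and a buffer that stays readable to the end of the 4-byte group containing the terminator. In assembly the shift sets the zero flag by itself, so the second character needs no separate test with a mask.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the test-then-shift idea; not the lost source.
   Returns the length in UTF-16 code units, terminator excluded. */
size_t shr_len_sketch(const uint16_t *str)
{
    size_t n = 0;
    for (;;) {
        uint32_t pair;
        memcpy(&pair, str + n, sizeof pair);  /* mov eax, [edx]          */
        if ((uint16_t)pair == 0)              /* test ax, ax             */
            return n;                         /* first of the pair is 0  */
        pair >>= 16;                          /* shr eax, 16             */
        if (pair == 0)                        /* zero flag from the shift */
            return n + 1;                     /* second of the pair is 0 */
        n += 2;                               /* both non-zero, next pair */
    }
}
```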

Anyway, I have the answer I wanted... AzmtStrLenW is a good compromise between AMD and Intel, but I think the time when AMD was slightly superior is gone, and my next computer will definitely have an Intel processor.

Thanks again.


jj2007

Quote from: jdoe on November 07, 2008, 03:50:27 AM
I realized after many tests that AMD processors seem to like long unrolls (in the case of AzmtStrLenW, 16 characters in each loop) and perform well with them, while Intel performs well with a small unroll, provided it is well written.

Differences between processors are surprisingly huge for this case. Your Azmt algo currently seems the fastest, although for my home-brew lib I would consider this one: tiny, and uses only eax:
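OPTION PROLOGUE:NONE    ; assumed: the argument is read at [esp+4], so no stack frame may be built
OPTION EPILOGUE:NONE
wsLen proc pStr
    mov eax, [esp+4]    ; mov edx, eax 3 cycles faster but trashes edx
    sub eax, 4
    .Repeat
        add eax, 4
        test dword ptr [eax], 00000FFFFh    ; first character of the pair zero?
        je @F
        test dword ptr [eax], 0FFFF0000h    ; second character zero?
    .Until Zero?
    add eax, 2
@@:
    sub eax, [esp+4]    ; sub eax, edx 3 cycles faster but...
    shr eax, 1
    retn 4
wsLen endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef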


Timings on a Core Duo Celeron M:
wsLen:          112 cycles (with 38 bytes)
AzmtStrLenW:    83 cycles (with 243 bytes)
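To cross-check the return value of wsLen, its loop can be modelled in C. This is a sketch under assumptions, not a replacement for the assembly: `ws_len_model` is a hypothetical name, the CPU is assumed little-endian, and the buffer must stay readable to the end of the 4-byte group that holds the terminator (the assembly has the same requirement).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference model of the wsLen loop: one 32-bit load covers two
   UTF-16 code units. Returns the length in code units. */
size_t ws_len_model(const uint16_t *str)
{
    const unsigned char *start = (const unsigned char *)str;
    const unsigned char *p = start;
    for (;;) {
        uint32_t pair;
        memcpy(&pair, p, sizeof pair);      /* dword ptr [eax]              */
        if ((pair & 0x0000FFFFu) == 0)      /* terminator in the low word   */
            break;
        if ((pair & 0xFFFF0000u) == 0) {    /* terminator in the high word  */
            p += 2;                         /* add eax, 2                   */
            break;
        }
        p += 4;                             /* add eax, 4                   */
    }
    return (size_t)(p - start) / 2;         /* sub eax, [esp+4] / shr eax, 1 */
}
```

For the 71-character test string used in the timings, the terminator sits in the second half of the last pair, so the model leaves the loop through the `p += 2` path.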

jdoe

Quote from: jj2007 on November 07, 2008, 09:45:42 PM
Your Azmt algo currently seems the fastest, although for my home-brew lib I would consider this one: tiny, and uses only eax:

jj2007,

You are absolutely right that using only EAX would be a little bit faster, but it is by design that almost all the string functions in my library return the string length in EAX and the destination pointer in EDX. Also, I don't optimize for small code because I don't think it is relevant anymore. It is not my goal anyway to write the fastest functions; my satisfaction comes from being faster than the Windows functions and doing something that equals or beats the MASM32 library when run on AMD or Intel. Hutch writes pretty good functions and I can hardly be faster than him on Intel, but on AMD his functions are not a good compromise between the two CPUs.

I will never be as good as Lingo when it comes to optimizing functions, because this guy can write amazing stuff in ASM (even though I think he sometimes tries to optimize when it is not necessary). I don't want to know every CPU's technology anyway, and I won't read manuals like Agner Fog's on optimizing ASM (on that subject, I liked Bogdan's comments on his forum saying that optimizing is about experiments and measurements... I fully agree, and this is the way I work anyway).

You are also right that the differences between processors are surprisingly huge. When I started playing with optimization, it gave me headaches trying to be the fastest on my computer, and when I tried my functions on Intel I realized that my efforts were losing their meaning. Now I only try for a good compromise between the two CPUs without trying to be the fastest. If I hadn't started doing it like that, I would have gone insane and would always live in a kind of dissatisfaction.