Hi,
While converting ANSI functions to Unicode for fun, I ended up with a Unicode string length function that is faster than ucLen from the MASM32 library, by applying the same unrolling that gives szLen its advantage.
align 4
;
; Return the length of lpszStr in characters, excluding the terminating zero
;
StrLenW proc public p_lpszStr:dword
mov eax, p_lpszStr
sub eax, 8
@@:
add eax, 8
cmp word ptr [eax], 0
je Add0
cmp word ptr [eax+2], 0
je Add1
cmp word ptr [eax+4], 0
je Add2
cmp word ptr [eax+6], 0
jne @B
sub eax, p_lpszStr
shr eax, 1
add eax, 3
ret
Add2:
sub eax, p_lpszStr
shr eax, 1
add eax, 2
ret
Add1:
sub eax, p_lpszStr
shr eax, 1
add eax, 1
ret
Add0:
sub eax, p_lpszStr
shr eax, 1
ret
StrLenW endp
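For anyone who doesn't read MASM, here is a minimal C sketch of the same 4x unrolling idea. It is only an illustration under assumptions: str_len_w_c is a made-up name (not part of MASM32), and uint16_t stands in for one UTF-16 code unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C rendering of the 4x-unrolled scan.
   Each pass tests four consecutive 16-bit units and only then advances,
   so the loop overhead is paid once per four characters. */
size_t str_len_w_c(const uint16_t *s)
{
    assert(s != NULL);
    const uint16_t *p = s;
    for (;;) {
        if (p[0] == 0) return (size_t)(p - s);
        if (p[1] == 0) return (size_t)(p - s) + 1;
        if (p[2] == 0) return (size_t)(p - s) + 2;
        if (p[3] == 0) return (size_t)(p - s) + 3;
        p += 4;
    }
}
```

Because every unit is tested before the next one is touched, this version never reads beyond the terminator.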
Compliments,
Looks good. :U
As for me, I would store 0 in a register, bx for example.
Quote from: asmfan on April 10, 2006, 05:55:30 AM
As for me, I would store 0 in a register, bx for example.
I did the test and there is no gain doing so.
I used edx instead of ebx because it doesn't need to be preserved. :wink
Just to be really annoying, I'll point out that it will only work for a limited subset of unicode strings (those converted from ansi.) And also, only for little-endian unicode!
The null terminator is two zeroes for a reason :P
Quote from: Tedd on April 10, 2006, 05:05:06 PM
The null terminator is two zeroes for a reason :P
That's why it is
cmp word ptr [eax], 0
Could you explain what's wrong? I don't get it.
jdoe,
Good work :U
As it is currently coded, when running on a P3 your procedure is sensitive to alignment. Using the test string "my other brother darryl", by varying the alignment ahead of the 'align 4' I can cause the cycles to vary from 56 to 77. For example, this will cause the procedure to run in 77 cycles:
align 16
nops 1
align 4
...
And this will cause it to run in 56 cycles:
align 16
nops 5
align 4
...
If I replace the 'align 4' with an 'align 16', varying the alignment ahead of the 'align 16' has only a small effect on the cycles, with the procedure running in 56 or 57 cycles for all of the nop counts that I tried.
For reference, the MASM32 ucLen procedure runs in 85 cycles.
In case you are not familiar with it, nops is a MASM32 macro.
:lol
OPTION PROLOGUE:NONE ; turn it off
OPTION EPILOGUE:NONE
Lingo32W proc lpst:DWORD
mov eax, [esp+4]
mov edx, 80008000h
mov ecx, [eax]
@@:
add eax, 4
add ecx, 0FEFFFF00h
test edx, ecx
mov ecx, [eax]
je @b
test word ptr [eax-4], 0FFFFh
je C_minus4
test word ptr [eax-4+2], 0FFFFh
je C_minus2
@@:
and ecx, 07F7F7F7FH
add eax, 4
add ecx, 0FEFFFF00h
test ecx, edx
mov ecx, [eax]
je @b
test word ptr [eax-4], 0FFFFh
je C_minus4
test word ptr [eax-4+2], 0FFFFh
jne @b
C_minus2:
sub eax, [esp+4]
add eax, 2-4
shr eax,1
ret 4
C_minus4:
sub eax, [esp+4]
add eax,0-4
shr eax,1
ret 4
Lingo32W endp
OPTION PROLOGUE:PROLOGUEDEF ; turn back on the defaults
OPTION EPILOGUE:EPILOGUEDEF
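A related, well-known way to ask "is either 16-bit half of this dword zero?" with a single test is the classic zero-lane bit trick. The C sketch below is not Lingo's exact arithmetic (his routine uses a quick filter followed by explicit word checks); it only illustrates the general idea, and has_zero_u16, pack_pair and first_zero_lane are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two UTF-16 units the way a little-endian dword load would see them. */
uint32_t pack_pair(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Nonzero result iff at least one 16-bit lane of x is zero.
   Subtracting 1 from each lane borrows into the lane's top bit only when
   the lane was zero (or already had its top bit set, which ~x masks out). */
int has_zero_u16(uint32_t x)
{
    return ((x - 0x00010001u) & ~x & 0x80008000u) != 0;
}

/* After the filter fires, an explicit check picks the lane:
   0 = low word is the terminator, 1 = high word is. */
int first_zero_lane(uint32_t x)
{
    assert(has_zero_u16(x));
    return (x & 0xFFFFu) == 0 ? 0 : 1;
}
```

The filter answers "somewhere in this dword" in one test; the per-word check is then only needed once, at the end of the string.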
Lingo,
I assume your code does substantially better on a P4 than it does on my P3 :eek
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 85 cycles
StrLenW : 56 cycles
Lingo32W : 163 cycles
crt_wcslen : 84 cycles
Here are the timings on my Prescott PIV.
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 80 cycles
StrLenW : 86 cycles
Lingo32W : 121 cycles
crt_wcslen : 123 cycles
Press any key to exit...
AMD Athlon 1800+
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 88 cycles
StrLenW : 38 cycles
Lingo32W : 78 cycles
crt_wcslen : 101 cycles
Press any key to exit...
@MichaelW
The impact of alignment seems to be processor-specific. On my AMD Athlon, adding "align 16" gives only a one-cycle decrease (and 10 ms less).
Speed seems to be better with align 8 before the loop.
align 16
;
; Return the length of lpszStr in characters, excluding the terminating zero
;
StrLenW proc public p_lpszStr:dword
mov eax, p_lpszStr
sub eax, 8
align 8
@@:
add eax, 8
cmp word ptr [eax], 0
je Add0
cmp word ptr [eax+2], 0
je Add1
cmp word ptr [eax+4], 0
je Add2
cmp word ptr [eax+6], 0
jne @B
sub eax, p_lpszStr
shr eax, 1
add eax, 3
ret
Add2:
sub eax, p_lpszStr
shr eax, 1
add eax, 2
ret
Add1:
sub eax, p_lpszStr
shr eax, 1
add eax, 1
ret
Add0:
sub eax, p_lpszStr
shr eax, 1
ret
StrLenW endp
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 88 cycles
StrLenW : 37 cycles
Lingo32W : 78 cycles
crt_wcslen : 97 cycles
Press any key to exit...
Here is a test algo for Unicode string length. I have used an idea of Lingo's to cut the memory reads in half, using a ROL to read the other end of the register. On this PIV it times faster than the version in the library, 343 ms versus 390 ms. It works on the theory that the processor is still a lot faster than memory.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
ucLen2 proc lpwstr:DWORD
mov ecx, [esp+4] ; lpwstr
xor eax, eax
sub ecx, 4
lbl0:
add ecx, 4
mov eax, [ecx]
cmp ax, 0
je lbl1
rol eax, 16
cmp ax, 0
jne lbl0
sub ecx, [esp+4] ; lpwstr
mov eax, ecx
shr eax, 1
add eax, 1
ret 4
lbl1:
sub ecx, [esp+4] ; lpwstr
mov eax, ecx
shr eax, 1
ret 4
ucLen2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
another test
Quote
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
ucLen2 return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 71 cycles
ucLen2 : 51 cycles
StrLenW : 38 cycles
Lingo32W : 134 cycles
crt_wcslen : 64 cycles
Press enter to exit...
I updated the original attachment to include all of the procedures so far. To help level the playing field I eliminated the stack frame from StrLenW, and for the P6 family of processors, placed an align 16 in front of all the procedures.
For my P3:
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
crt_wcslen return value : 23
ucLen : 85 cycles
StrLenW : 53 cycles
Lingo32W : 141 cycles
ucLen2 : 59 cycles
crt_wcslen : 84 cycles
And the timings only for my old K5:
ucLen : 87 cycles
StrLenW : 46 cycles
Lingo32W : 69 cycles
ucLen2 : 49 cycles
crt_wcslen : 78 cycles
jdoe,
AFAIK alignment has a greater effect on the P6 family of processors (PPro, P2, P3) than on the P1, PMMX, and P4. For a P3 I can improve on align 16 slightly by adding a sufficient number of nops after the align 16 to place the jump label at or close to a 16-byte boundary. To actually know which alignment is best for your processor I think you should try changing the alignment of the align 8 statement by putting varying numbers of nops in front of it. For a P3, at certain alignments an align 8 slows the procedure down substantially.
Very fast...
I tried to reduce the number of loads, but it doesn't help...
Edu32W proc src:DWORD
mov eax, [esp+4]
@@:
mov edx, [eax]
mov ecx, [eax+4]
test dx, dx
lea eax, [eax+2]
jz @F
shr edx, 16
lea eax, [eax+2]
jz @F
test cx, cx
lea eax, [eax+2]
jz @F
shr ecx, 16
lea eax, [eax+2]
jnz @B
@@:
sub eax, [esp+4]
shr eax, 1
dec eax
ret 4
Edu32W endp
On an Athlon 64
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
Edu32W return value : 23
crt_wcslen return value : 23
ucLen : 87 cycles
StrLenW : 29 cycles
Lingo32W : 68 cycles
Edu32W : 35 cycles
crt_wcslen : 92 cycles
Press any key to exit...
Thanks for your comments Michael, they help a lot. I'm going to read up on code alignment as much as I can to know exactly what I'm doing, because for now it is still a bit of "unsure stuff" for me.
I'm at the office right now and I work on a P4
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
crt_wcslen return value : 23
Edu32W return value : 23
ucLen : 74 cycles
StrLenW : 75 cycles
Lingo32W : 124 cycles
ucLen2 : 69 cycles
crt_wcslen : 110 cycles
Edu32W : 61 cycles
Press any key to exit...
StrLenW seems to give better results on AMD processors (I will post the results when I'm at home).
Everything is together in the attachment, including the new one from EduardoS :U
The last version:
Quote
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
crt_wcslen return value : 23
Edu32W return value : 23
ucLen : 87 cycles
StrLenW : 26 cycles
Lingo32W : 63 cycles
ucLen2 : 58 cycles
crt_wcslen : 92 cycles
Edu32W : 35 cycles
Press any key to exit...
Hi all,
I tried two other algos.
One is my last one "joined" with jdoe's.
The other uses SSE (3DNow+, so it runs on the first Athlon).
Can someone with a P4 test them?
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
Edu32W return value : 23
Edu32W2 return value : 23
EduSSE return value : 23
StrLenW : 26 cycles
Lingo32W : 63 cycles
ucLen2 : 58 cycles
Edu32W : 35 cycles
Edu32W2 : 24 cycles
EduSSE : 27 cycles
Press any key to exit...
This is on a 2.8 gig Prescott PIV.
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
Edu32W return value : 23
Edu32W2 return value : 23
EduSSE return value : 23
StrLenW : 82 cycles
Lingo32W : 122 cycles
ucLen2 : 76 cycles
Edu32W : 72 cycles
Edu32W2 : 59 cycles
EduSSE : 18 cycles
Press any key to exit...
I do have a comment on the test sample though. While it makes sense to test a short string, as it tells you the takeoff speed of each algo, it does not address the algo's speed on much longer strings, where the stack frame overhead does not matter and where you are more interested in its linear forward speed. A string over 64k would probably be a good idea as well, as it avoids the considerations that best suit a short string.
Quote from: EduardoS on April 12, 2006, 12:58:56 AM
I try two other algos,
One is my last one "joined" with jdoe's one,
In fact, StrLenW is just StrLen from the MASM32 library that I played with to make it fit Unicode. :wink
On my AMD Athlon 1800+
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
Edu32W return value : 23
Edu32W2 return value : 23
EduSSE return value : 23
StrLenW : 35 cycles
Lingo32W : 77 cycles
ucLen2 : 72 cycles
Edu32W : 60 cycles
Edu32W2 : 29 cycles
EduSSE : 33 cycles
Press any key to exit...
:clap:
Here is a benchmark using the windows.inc file.
These are the times I get on the PIV.
1047 ucLen
968 ucLen2
797 EduSSE
906 Edu32W2
1032 StrLenW
1453 Lingo32W
1032 ucLen
968 ucLen2
797 EduSSE
907 Edu32W2
1031 StrLenW
1453 Lingo32W
1047 ucLen
969 ucLen2
796 EduSSE
891 Edu32W2
1047 StrLenW
1453 Lingo32W
1047 ucLen
969 ucLen2
797 EduSSE
906 Edu32W2
1031 StrLenW
1453 Lingo32W
Press any key to continue ...
Thank you for testing :U,
Quote from: hutch-- on April 12, 2006, 01:30:46 AM
I do have a comment on the test sample though. While it makes sense to test a short string, as it tells you the takeoff speed of each algo, it does not address the algo's speed on much longer strings, where the stack frame overhead does not matter and where you are more interested in its linear forward speed. A string over 64k would probably be a good idea as well, as it avoids the considerations that best suit a short string.
hutch, when I tried testing a strlen with big strings, a guy told me "no one wants the strlen of big strings"...
I'm happy to see someone who thinks differently...
Yeah,
Me. It's not used all that often, but being able to test on big stuff has its place from time to time. It does tell you whether an algo has more than startup speed, though: it tells you the core speed without stuff like stack entry and exit.
And what if you want to know the length of text in an Edit Control? (Assuming Unicode versions exist.) That could easily be >64kB.
Algo idea: perhaps read a dword with LODSD (STOSD would store, not load), then AND the two "null" positions and break on either being zero?
Quote from: jdoe on April 10, 2006, 09:47:23 PM
That's why it is
cmp word ptr [eax], 0
Could you explain whats wrong because I don't get it.
Sorry, my bad - I obviously can't read ::)
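The dword-read idea above (load two characters at once, then mask each half) can be sketched in C as follows. This is an illustration under assumptions, not anyone's posted routine: wlen_dword is a hypothetical name, and the caller is assumed to pass a buffer whose element count is even (zero padded), because the high half is loaded together with the low half.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: one combined 32-bit value per two UTF-16 units.
   Precondition (assumption): the buffer holds an even number of elements,
   so s[i + 1] is always readable when s[i] is. */
size_t wlen_dword(const uint16_t *s)
{
    assert(s != NULL);
    size_t i = 0;
    for (;;) {
        /* Compose the pair the way a little-endian dword load would. */
        uint32_t d = (uint32_t)s[i] | ((uint32_t)s[i + 1] << 16);
        if ((d & 0x0000FFFFu) == 0) return i;      /* low word is the terminator  */
        if ((d & 0xFFFF0000u) == 0) return i + 1;  /* high word is the terminator */
        i += 2;
    }
}
```

Composing the pair from two 16-bit values keeps the sketch independent of host byte order; an assembly version would simply load the dword.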
Hi,
A question came to my mind about reading past the end of a buffer. Agner Fog's StrLen algo reads up to 3 characters past the end. Is there any danger if I read, for example, 7 characters past the end (for ASCII)? Doing so gives a good speed improvement.
This question came up while I was playing with StrLenW and reading 3 characters (Unicode) past the end.
This is what I have for StrLenW...
.586
.MODEL FLAT, STDCALL
OPTION CASEMAP:NONE
.CODE
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
ALGN BYTE 0CCh
BYTE 0CCh
BYTE 0CCh
BYTE 0CCh
BYTE 0CCh
BYTE 0CCh
BYTE 0CCh
BYTE 0CCh
BYTE 0CCh
BYTE 0CCh
BYTE 0CCh
BYTE 0CCh
StrLenW PROC p_lpszStr:DWORD
mov eax, dword ptr [esp+4]
@@:
mov ecx, dword ptr [eax]
mov edx, dword ptr [eax+4]
add eax, 8
test ecx, 0FFFFh
jz @0
test ecx, 0FFFF0000h
jz @2
test edx, 0FFFFh
jz @4
test edx, 0FFFF0000h
jnz @B
@6:
sub eax, 2
sub eax, dword ptr [esp+4]
shr eax, 1
ret 4
@4:
sub eax, 4
sub eax, dword ptr [esp+4]
shr eax, 1
ret 4
@2:
sub eax, 6
sub eax, dword ptr [esp+4]
shr eax, 1
ret 4
@0:
sub eax, 8
sub eax, dword ptr [esp+4]
shr eax, 1
ret 4
StrLenW ENDP
OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF
END
1) Do you see any problems with the StrLenW algo above (2 dword reads, ecx and edx)?
2) What is the danger of doing the same (2 dword reads, ecx and edx) in StrLenA?
Thanks
JDoe,
I do not think it will pose a problem as long as you know you are doing it and you ignore the extra bytes when processing the information. The thing you should not do is write past the end of the buffer. No problem there, though, right?
Paul
The only time reading past the end of a buffer (whether it holds strings or not) is a problem is when you cross a page boundary and the next page isn't marked as accessible to your application. Then you will generate an exception, even if it is only one byte past. So as long as you are aware of this, you can read past a buffer as long as the read stays within accessible pages. :P
Relvnian
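To make the page-boundary point concrete, here is a small C sketch that tells whether a read of a given size starting at a given address would touch a second page. It is only an illustration: read_crosses_page and PAGE_SIZE_ASSUMED are hypothetical names, and the 4 KiB page size is an assumption (it is the usual size on 32-bit x86 Windows).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB pages */

/* Returns 1 if reading nbytes starting at addr touches more than one page,
   0 if the whole read stays inside the page that contains addr. */
int read_crosses_page(uintptr_t addr, size_t nbytes)
{
    assert(nbytes > 0 && nbytes <= PAGE_SIZE_ASSUMED);
    return (addr & (PAGE_SIZE_ASSUMED - 1)) + nbytes > PAGE_SIZE_ASSUMED;
}
```

One consequence worth noting: a naturally aligned 4-byte read can never cross a page boundary, so an over-read done with aligned dword loads stays in the page that already holds valid string data.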
I added my two newbie attempts at calculating the length of a Unicode string. I didn't test them very much, but I hope they are alright and they seem to be working correctly. I also integrated my functions in hutch's windows.inc test. Results are shown below:
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
Edu32W return value : 23
Edu32W2 return value : 23
EduSSE return value : 23
SebW return value : 23
SebW2 return value : 23
StrLenW : 25 cycles
Lingo32W : 64 cycles
ucLen2 : 57 cycles
Edu32W : 36 cycles
Edu32W2 : 25 cycles
EduSSE : 28 cycles
SebW : 37 cycles
SebW2 : 29 cycles
Press any key to exit...
1672 ucLen
1453 ucLen2
985 EduSSE
1125 Edu32W2
1125 StrLenW
1656 Lingo32W
1078 SebW
1109 SebW2
1641 ucLen
1484 ucLen2
969 EduSSE
1125 Edu32W2
1110 StrLenW
1671 Lingo32W
1094 SebW
1110 SebW2
1656 ucLen
1469 ucLen2
968 EduSSE
1125 Edu32W2
1110 StrLenW
1672 Lingo32W
1093 SebW
1110 SebW2
1656 ucLen
1469 ucLen2
984 EduSSE
1125 Edu32W2
1109 StrLenW
1657 Lingo32W
1093 SebW
1110 SebW2
Press any key to continue ...
OPTION PROLOGUE:NONE ; turn it off
OPTION EPILOGUE:NONE
SebW proc src:DWORD
mov eax,[esp+4]
xor ecx,ecx
align 16
@@:
cmp word ptr [eax],0
jz @F
add ecx,1
cmp word ptr [eax+2],0
jz @F
add ecx,1
cmp word ptr [eax+4],0
jz @F
add ecx,1
cmp word ptr [eax+6],0
jz @F
add ecx,1
add eax,8
jmp @B
@@:
mov eax,ecx
ret 4
SebW endp
OPTION PROLOGUE:PROLOGUEDEF ; turn back on the defaults
OPTION EPILOGUE:EPILOGUEDEF
OPTION PROLOGUE:NONE ; turn it off
OPTION EPILOGUE:NONE
SebW2 proc src:DWORD
mov eax,[esp+4]
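mov ecx,eax
sub eax,2 ; pre-decrement so the first add lands on the first character (otherwise an empty string is missed)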
align 16
@@:
add eax,2
cmp word ptr [eax],0
jz @F
add eax,2
cmp word ptr [eax],0
jz @F
add eax,2
cmp word ptr [eax],0
jz @F
add eax,2
cmp word ptr [eax],0
jnz @B
@@:
sub eax,ecx
shr eax,1
ret 4
SebW2 endp
OPTION PROLOGUE:PROLOGUEDEF ; turn back on the defaults
OPTION EPILOGUE:EPILOGUEDEF
Oh, by the way, I'm on a Athlon 64 X2 Dual.
Are we measuring this right? I mean that branch prediction isn't counted: running the algo on the same data millions of times lets the processor predict all branches correctly, at least for small strings (as in this case). I measured the time for SebW2 with strings from 10 to 60 chars in length, using a simple rdtsc before and another after instead of the timing macros, and got this result:
23 cycles
25 cycles*
25 cycles
26 cycles
27 cycles
30 cycles*
29 cycles
30 cycles
31 cycles
53 cycles*
33 cycles
47 cycles*
35 cycles
36 cycles
37 cycles
38 cycles
39 cycles
40 cycles
41 cycles
55 cycles*
43 cycles
71 cycles*
45 cycles
46 cycles
47 cycles
48 cycles
49 cycles
63 cycles-
64 cycles
65 cycles
66 cycles
67 cycles
68 cycles
69 cycles
70 cycles
71 cycles
72 cycles
73 cycles
74 cycles
75 cycles
76 cycles
77 cycles
81 cycles-
82 cycles
83 cycles
84 cycles
85 cycles
86 cycles
87 cycles
88 cycles
I got one extra cycle for each extra character of length, with the exceptions marked *, which seem to be due to the processor's internal state, and those marked - at lengths 37 and 53, where mispredictions seem to increase.
Finally, shouldn't branch prediction be taken into account?
Sorry for reviving this old topic, but I would like to compare timings on different CPUs. If a few members have time to post timing results for this, it would be appreciated. I added a new one (AzmtStrLenW) that is a good compromise between AMD and Intel. I can go faster on Intel, but AMD doesn't like it.
AMD Athlon XP 1800+
my other brother darryl my other brother darryl my other brother darryl
LENGTHOF : 40
SIZEOF : 80
StrLenW return value : 71
Lingo32W return value : 71
ucLen2 return value : 71
Edu32W return value : 71
Edu32W2 return value : 71
SebW return value : 71
SebW2 return value : 71
lstrlenW return value : 71
AzmtStrLenW return value : 71
StrLenW : 112 cycles
Lingo32W : 183 cycles
ucLen2 : 166 cycles
Edu32W : 177 cycles
Edu32W2 : 96 cycles
SebW : 143 cycles
SebW2 : 139 cycles
lstrlenW : 250 cycles
AzmtStrLenW : 77 cycles
Q6600 2.4GHz
StrLenW : 75 cycles
Lingo32W : 327 cycles
ucLen2 : 72 cycles
Edu32W : 96 cycles
Edu32W2 : 88 cycles
SebW : 95 cycles
SebW2 : 95 cycles
lstrlenW : 197 cycles
AzmtStrLenW : 72 cycles
Core 2 Duo 6300 1.86GHz
my other brother darryl my other brother darryl my other brother darryl
LENGTHOF : 40
SIZEOF : 80
StrLenW return value : 71
Lingo32W return value : 71
ucLen2 return value : 71
Edu32W return value : 71
Edu32W2 return value : 71
SebW return value : 71
SebW2 return value : 71
lstrlenW return value : 71
AzmtStrLenW return value : 71
StrLenW : 76 cycles
Lingo32W : 331 cycles
ucLen2 : 73 cycles
Edu32W : 421 cycles
Edu32W2 : 352 cycles
SebW : 94 cycles
SebW2 : 96 cycles
lstrlenW : 201 cycles
AzmtStrLenW : 73 cycles
Press any key to exit...
Pentium 4 3.4 GHz
StrLenW : 206 cycles
Lingo32W : 287 cycles
ucLen2 : 192 cycles
Edu32W : 182 cycles
Edu32W2 : 135 cycles
SebW : 227 cycles
SebW2 : 216 cycles
lstrlenW : 371 cycles
AzmtStrLenW : 140 cycles
Could someone time the following?
On my machine it yields 72 cycles
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
even
ucsLen proc buf
mov eax,[esp+4]
.repeat
REPEAT 3
mov ecx,[eax]
add eax,4
test ecx,00000FFFFh
jz _EVEN_
test ecx,0FFFF0000h
jz _ODD_
ENDM
mov ecx,[eax]
add eax,4
test ecx,00000FFFFh
jz _EVEN_
test ecx,0FFFF0000h
.until zero?
_ODD_:
add eax,2
_EVEN_:
sub eax,4
sub eax,[esp+4]
shr eax,1
retn 4
ucsLen endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Quote from: jj2007 on November 06, 2008, 09:14:41 AM
Pentium 4 3.4 GHz
StrLenW : 206 cycles
Lingo32W : 287 cycles
ucLen2 : 192 cycles
Edu32W : 182 cycles
Edu32W2 : 135 cycles
SebW : 227 cycles
SebW2 : 216 cycles
lstrlenW : 371 cycles
AzmtStrLenW : 140 cycles
StrLenW : 201 cycles
Lingo32W : 278 cycles
ucLen2 : 193 cycles
Edu32W : 170 cycles
Edu32W2 : 132 cycles
SebW : 229 cycles
SebW2 : 213 cycles
lstrlenW : 452 cycles
AzmtStrLenW : 151 cycles (unchanged code, most values are a bit slower)
... with a minor modification:
@@:
mov edx, [eax]
mov ecx, [eax+4]
if 1 ; faster
add eax, 8
test dx, dx
else ; slower
test dx, dx
lea eax, [eax+8]
endif
jz sub8
Quote from: DoomyD on November 06, 2008, 09:25:44 AM
Could someone time the following?
On my machine it yields 72 cycles
Edu32W2: 135 cycles
ucsLen: 161 cycles
AzmtStrLenW: 141 cycles
Edu32W2: 135 cycles
ucsLen: 161 cycles
AzmtStrLenW: 142 cycles
Edu32W2: 135 cycles
ucsLen: 162 cycles
AzmtStrLenW: 142 cycles
Edu32W2: 136 cycles
ucsLen: 161 cycles
AzmtStrLenW: 142 cycles
Edu32W2: 134 cycles
ucsLen: 162 cycles
AzmtStrLenW: 140 cycles
New code attached, Edu32W2 is slightly modified.
Interesting results :P
I wonder what makes the difference
StrLenW : 76 cycles
ucLen2 : 73 cycles
Edu32W2: 331 cycles
ucsLen: 71 cycles
AzmtStrLenW: 73 cycles
Edu32W2: 332 cycles
ucsLen: 72 cycles
AzmtStrLenW: 73 cycles
Edu32W2: 331 cycles
ucsLen: 72 cycles
AzmtStrLenW: 72 cycles
Edu32W2: 334 cycles
ucsLen: 72 cycles
AzmtStrLenW: 73 cycles
Edu32W2: 333 cycles
ucsLen: 71 cycles
AzmtStrLenW: 72 cycles
StrLenW : 75 cycles
ucLen2 : 72 cycles
Edu32W2: 72 cycles
ucsLen: 71 cycles
AzmtStrLenW: 72 cycles
Edu32W2: 72 cycles
ucsLen: 71 cycles
AzmtStrLenW: 76 cycles
Edu32W2: 72 cycles
ucsLen: 71 cycles
AzmtStrLenW: 72 cycles
Edu32W2: 72 cycles
ucsLen: 71 cycles
AzmtStrLenW: 72 cycles
Edu32W2: 72 cycles
ucsLen: 71 cycles
AzmtStrLenW: 72 cycles
hmmm, what to choose, what to choose...
Quote from: DoomyD on November 06, 2008, 01:51:46 PM
Interesting results :P
I wonder what makes the difference
Let me guess: You run a Core Duo?
Of course :U
Yet it's weird, because shifts are relatively slow compared to a test operation.
Thanks guys.
After many tests I realized that AMD processors seem to like long unrolls (in the case of AzmtStrLenW, 16 characters per loop iteration) and perform well with them, whereas Intel performs well with a small unroll when it's well written.
Something that is likely to be faster on INTEL for a function like that is a "test ax, ax" followed by a "shr eax, 16" in a small unroll. I did it on my Intel at the office and it was real fast but I don't have the source anymore.
Anyway, I had the answer I wanted... AzmtStrLenW is a good compromise between AMD and INTEL but I think the time AMD was slightly superior is gone and my next computer will definitely have an Intel processor.
Thanks again.
Quote from: jdoe on November 07, 2008, 03:50:27 AM
After many tests I realized that AMD processors seem to like long unrolls (in the case of AzmtStrLenW, 16 characters per loop iteration) and perform well with them, whereas Intel performs well with a small unroll when it's well written.
Differences between processors are surprisingly huge for this case. Your Azmt algo seems currently the fastest, although I would consider for my home brew lib this one - tiny, and uses only eax:
wsLen proc pStr
mov eax, [esp+4] ; mov edx, eax 3 cycles faster but trashes edx
sub eax, 4
.Repeat
add eax, 4
test dword ptr [eax], 00000FFFFh
je @F
test dword ptr [eax], 0FFFF0000h
.Until Zero?
add eax, 2
@@:
sub eax, [esp+4] ; sub eax, edx 3 cycles faster but...
shr eax, 1
retn 4
wsLen endp
Timings on a Core Duo Celeron M:
wsLen: 112 cycles (with 38 bytes)
AzmtStrLenW: 83 cycles (with 243 bytes)
Quote from: jj2007 on November 07, 2008, 09:45:42 PM
Your Azmt algo seems currently the fastest, although I would consider for my home brew lib this one - tiny, and uses only eax:
jj2007,
You are absolutely right that using only EAX would be a little bit faster, but it is by design that almost all the string functions in my library return the string length in EAX and the destination pointer in EDX. Also, I don't optimize for small code because I don't think it is relevant anymore. It is not my goal anyway to write the fastest functions: my satisfaction comes from being faster than the Windows functions and from doing something that equals or beats the MASM32 library when run on AMD or Intel. Hutch writes pretty good functions and I can hardly be faster than him on Intel, but on AMD his functions are not a good compromise between the two CPUs.
I will never be as good as Lingo when it comes to optimizing functions, because this guy can write amazing stuff in ASM (even though I think he sometimes tries to optimize when it is not necessary). I don't want to know all of the CPU technology anyway, and I won't read manuals like Agner Fog's on optimizing ASM (on that subject, I liked Bogdan's comments on his forum saying that optimizing is about experiments and measurements... I fully agree, and this is the way I work anyway).
You are also right that the differences between processors are surprisingly huge. When I started playing with optimization, it gave me headaches trying to be the fastest on my computer, and when I tried my functions on Intel I realized that my efforts were losing their meaning. Now I only try to reach a good compromise between the two CPUs without trying to be the fastest. If I hadn't started doing it like that, I would have gone insane and would always live in a kind of dissatisfaction.
Quote from: jdoe on November 08, 2008, 03:44:42 AM
it gave me headaches trying to be the fastest on my computer, and when I tried my functions on Intel I realized that my efforts were losing their meaning. Now I only try to reach a good compromise between the two CPUs without trying to be the fastest.
Ok, let's wait then for the Super Dual Core - half Intel, half AMD, with automatic speed-optimising thread switching :green2