The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: jdoe on April 10, 2006, 01:32:54 AM

Title: Unicode string length
Post by: jdoe on April 10, 2006, 01:32:54 AM
Hi,

While converting ANSI functions to Unicode for fun, I came up with a Unicode string length function that is faster than ucLen from the masm32 lib, by applying the loop unrolling used in szLen.



align 4
;
; Returns the length of lpszStr in characters, excluding the terminating null char
;
StrLenW proc public p_lpszStr:dword

   mov eax, p_lpszStr
   sub eax, 8

@@:
   add eax, 8
   cmp word ptr [eax], 0
   je Add0
   cmp word ptr [eax+2], 0
   je Add1
   cmp word ptr [eax+4], 0
   je Add2
   cmp word ptr [eax+6], 0
   jne @B

   sub eax, p_lpszStr
   shr eax, 1
   add eax, 3
   ret

Add2:
   sub eax, p_lpszStr
   shr eax, 1
   add eax, 2
   ret

Add1:
   sub eax, p_lpszStr
   shr eax, 1
   add eax, 1
   ret

Add0:
   sub eax, p_lpszStr
   shr eax, 1
   ret

StrLenW endp
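For readers who don't read MASM, here is a rough C model of the unrolled loop above. It is a sketch that assumes WCHAR is a 16-bit unsigned unit (uint16_t) and that the string is NUL-terminated, and the name str_len_w_unrolled is hypothetical. C pointer subtraction already divides by the element size, which is what `shr eax, 1` does in the asm, and each early return corresponds to one of the exits (Add0, Add1, Add2, fall-through +3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C model of the 4x-unrolled scan: check four 16-bit code units per
   iteration and return the index of the first zero unit.
   Assumes s points to a NUL-terminated UTF-16 string (WCHAR = uint16_t). */
size_t str_len_w_unrolled(const uint16_t *s)
{
    assert(s != NULL);
    const uint16_t *p = s;
    for (;;) {
        if (p[0] == 0) return (size_t)(p - s);      /* Add0 */
        if (p[1] == 0) return (size_t)(p - s) + 1;  /* Add1 */
        if (p[2] == 0) return (size_t)(p - s) + 2;  /* Add2 */
        if (p[3] == 0) return (size_t)(p - s) + 3;  /* fall-through, +3 */
        p += 4;                                     /* add eax, 8 */
    }
}
```

Because every unit is compared on its own, neither this model nor the asm reads past the terminator.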



Title: Re: Unicode string length
Post by: hutch-- on April 10, 2006, 02:11:01 AM
Compliments,

Looks good.  :U
Title: Re: Unicode string length
Post by: asmfan on April 10, 2006, 05:55:30 AM
Personally, I would keep 0 in a register, bx for example.
Title: Re: Unicode string length
Post by: jdoe on April 10, 2006, 06:57:03 AM
Quote from: asmfan on April 10, 2006, 05:55:30 AM
Personally, I would keep 0 in a register, bx for example.

I did the test and there is no gain doing so.
I used edx instead of ebx because it doesn't need to be preserved.   :wink

Title: Re: Unicode string length
Post by: Tedd on April 10, 2006, 05:05:06 PM
Just to be really annoying, I'll point out that it will only work for a limited subset of Unicode strings (those converted from ANSI). And also, only for little-endian Unicode!
The null terminator is two zeroes for a reason :P
Title: Re: Unicode string length
Post by: jdoe on April 10, 2006, 09:47:23 PM
Quote from: Tedd on April 10, 2006, 05:05:06 PM
The null terminator is two zeroes for a reason :P

That's why it is

cmp word ptr [eax], 0


Could you explain what's wrong, because I don't get it.

Title: Re: Unicode string length
Post by: MichaelW on April 11, 2006, 12:06:15 AM
jdoe,

Good work :U

As it is currently coded, when running on a P3 your procedure is sensitive to alignment. Using the test string "my other brother darryl", by varying the alignment ahead of the 'align 4' I can cause the cycles to vary from 56 to 77. For example, this will cause the procedure to run in 77 cycles:

align 16
nops 1

align 4
...


And this will cause it to run in 56 cycles:

align 16
nops 5

align 4
...


If I replace the 'align 4' with an 'align 16', varying the alignment ahead of the 'align 16' has only a small effect on the cycles, with the procedure running in 56 or 57 cycles for all of the nop counts that I tried.

For reference, the MASM32 ucLen procedure runs in 85 cycles.

In case you are not familiar with it, nops is a MASM32 macro.

Title: Re: Unicode string length
Post by: lingo on April 11, 2006, 03:17:23 AM
 :lol
OPTION PROLOGUE:NONE          ; turn it off
OPTION EPILOGUE:NONE
Lingo32W                proc  lpst:DWORD
                        mov   eax, [esp+4]
                        mov   edx, 80008000h
                        mov   ecx, [eax]
@@:
                        add   eax, 4
                        add   ecx, 0FEFFFF00h
                        test  edx, ecx
                        mov   ecx, [eax]
                        je    @b

                        test  word ptr [eax-4], 0FFFFh
                        je    C_minus4
                        test  word ptr [eax-4+2], 0FFFFh
                        je    C_minus2
@@:
                        and   ecx, 07F7F7F7FH
                        add   eax, 4
                        add   ecx, 0FEFFFF00h
                        test  ecx, edx
                        mov   ecx, [eax]
                        je    @b

                        test  word ptr [eax-4], 0FFFFh
                        je    C_minus4
                        test  word ptr [eax-4+2], 0FFFFh
                        jne   @b
C_minus2:
                        sub   eax, [esp+4]
                        add   eax, 2-4
                        shr   eax,1
                        ret   4
C_minus4:
                        sub   eax, [esp+4]
                        add   eax,0-4
                        shr   eax,1
                        ret   4    
Lingo32W                endp

OPTION PROLOGUE:PROLOGUEDEF   ; turn back on the defaults
OPTION EPILOGUE:EPILOGUEDEF
Title: Re: Unicode string length
Post by: MichaelW on April 11, 2006, 07:37:49 AM
Lingo,

I assume your code does substantially better on a P4 than it does on my P3 :eek

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 85 cycles
StrLenW : 56 cycles
Lingo32W : 163 cycles
crt_wcslen : 84 cycles



Title: Re: Unicode string length
Post by: hutch-- on April 11, 2006, 08:05:27 AM
Here are the timings on my Prescott PIV.

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 80 cycles
StrLenW : 86 cycles
Lingo32W : 121 cycles
crt_wcslen : 123 cycles
Press any key to exit...
Title: Re: Unicode string length
Post by: jdoe on April 11, 2006, 08:39:11 AM
AMD Athlon 1800+

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 88 cycles
StrLenW : 38 cycles
Lingo32W : 78 cycles
crt_wcslen : 101 cycles
Press any key to exit...



@MichaelW

The impact of alignment seems to be processor-specific. On my AMD Athlon, adding "align 16" gives only a one-cycle decrease (and 10 ms less).
Title: Re: Unicode string length
Post by: jdoe on April 11, 2006, 09:54:46 AM
Speed seems to be better with align 8 before the loop.


align 16
;
; Returns the length of lpszStr in characters, excluding the terminating null char
;
StrLenW proc public p_lpszStr:dword

   mov eax, p_lpszStr
   sub eax, 8

   align 8
@@:
   add eax, 8
   cmp word ptr [eax], 0
   je Add0
   cmp word ptr [eax+2], 0
   je Add1
   cmp word ptr [eax+4], 0
   je Add2
   cmp word ptr [eax+6], 0
   jne @B

   sub eax, p_lpszStr
   shr eax, 1
   add eax, 3
   ret

Add2:
   sub eax, p_lpszStr
   shr eax, 1
   add eax, 2
   ret

Add1:
   sub eax, p_lpszStr
   shr eax, 1
   add eax, 1
   ret

Add0:
   sub eax, p_lpszStr
   shr eax, 1
   ret

StrLenW endp





my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 88 cycles
StrLenW : 37 cycles
Lingo32W : 78 cycles
crt_wcslen : 97 cycles
Press any key to exit...


Title: Re: Unicode string length
Post by: hutch-- on April 11, 2006, 12:35:10 PM
Here is a test algo for Unicode string length. I have utilised an idea of Lingo's to halve the number of memory reads, using a ROL to get at the other half of the register. On this PIV it times faster than the version in the library, 343 ms versus 390 ms. It works on the theory that the processor is still a lot faster than memory.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

ucLen2 proc lpwstr:DWORD

    mov ecx, [esp+4]    ; lpwstr
    xor eax, eax
    sub ecx, 4

  lbl0:
    add ecx, 4
    mov eax, [ecx]
    cmp ax, 0
    je lbl1
    rol eax, 16
    cmp ax, 0
    jne lbl0

    sub ecx, [esp+4]    ; lpwstr
    mov eax, ecx
    shr eax, 1
    add eax, 1
    ret 4

  lbl1:
    sub ecx, [esp+4]    ; lpwstr
    mov eax, ecx
    shr eax, 1
    ret 4

ucLen2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
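A C model of the ucLen2 idea (one 32-bit read per two characters, then a rotate to test the other half) follows, as a sketch only. It assumes a little-endian CPU such as x86, where the low word of the loaded dword is the first character. Like the asm, it may read one character past the terminator, so the buffer must be readable in whole 4-byte steps. The names uclen_dword_rol and rol32 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rotate left; used with r = 16 to swap the two 16-bit halves. */
static uint32_t rol32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32u - r));
}

/* Little-endian model of ucLen2: load a dword, test the low word,
   rotate by 16, test the other word. The buffer must be readable up to
   the end of the dword holding the terminator (possible 2-byte over-read). */
size_t uclen_dword_rol(const uint16_t *s)
{
    assert(s != NULL);
    const unsigned char *base = (const unsigned char *)s;
    size_t off = 0;                                 /* byte offset, like ecx - lpwstr */
    for (;;) {
        uint32_t v;
        memcpy(&v, base + off, sizeof v);           /* mov eax, [ecx] */
        if ((uint16_t)v == 0) return off / 2;       /* cmp ax, 0 / je lbl1 */
        v = rol32(v, 16);                           /* rol eax, 16 */
        if ((uint16_t)v == 0) return off / 2 + 1;   /* second char was zero */
        off += 4;                                   /* add ecx, 4 */
    }
}
```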
Title: Re: Unicode string length
Post by: six_L on April 11, 2006, 01:07:11 PM
another test
Quote
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
ucLen2 return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 71 cycles
ucLen2 : 51 cycles
StrLenW : 38 cycles
Lingo32W : 134 cycles
crt_wcslen : 64 cycles

Press enter to exit...
Title: Re: Unicode string length
Post by: MichaelW on April 11, 2006, 05:26:15 PM
I updated the original attachment to include all of the procedures so far. To help level the playing field I eliminated the stack frame from StrLenW, and for the P6 family of processors, placed an align 16 in front of all the procedures.
For my P3:

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
crt_wcslen return value : 23
ucLen : 85 cycles
StrLenW : 53 cycles
Lingo32W : 141 cycles
ucLen2 : 59 cycles
crt_wcslen : 84 cycles


And the timings only for my old K5:

ucLen : 87 cycles
StrLenW : 46 cycles
Lingo32W : 69 cycles
ucLen2 : 49 cycles
crt_wcslen : 78 cycles


jdoe,

AFAIK alignment has a greater effect on the P6 family of processors (PPro, P2, P3) than on the P1, PMMX, and P4. For a P3 I can improve on align 16 slightly by adding a sufficient number of nops after the align 16 to place the jump label at or close to a 16-byte boundary. To actually know which alignment is best for your processor I think you should try changing the alignment of the align 8 statement by putting varying numbers of nops in front of it. For a P3, at certain alignments an align 8 slows the procedure down substantially.


Title: Re: Unicode string length
Post by: EduardoS on April 11, 2006, 07:39:18 PM
Very fast...
I tried to reduce the number of loads, but it doesn't help...

Edu32W proc src:DWORD
    mov eax, [esp+4]
    @@:
    mov edx, [eax]
    mov ecx, [eax+4]
    test dx, dx
    lea eax, [eax+2]
    jz @F
    shr edx, 16
    lea eax, [eax+2]
    jz @F
    test cx, cx
    lea eax, [eax+2]
    jz @F
    shr ecx, 16
    lea eax, [eax+2]
    jnz @B
    @@:
    sub eax, [esp+4]
    shr eax, 1
    dec eax
    ret 4
Edu32W endp


On an Athlon 64

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
Edu32W return value : 23
crt_wcslen return value : 23
ucLen : 87 cycles
StrLenW : 29 cycles
Lingo32W : 68 cycles
Edu32W : 35 cycles
crt_wcslen : 92 cycles
Press any key to exit...

Title: Re: Unicode string length
Post by: jdoe on April 11, 2006, 08:16:29 PM
Thanks for your comments, Michael; they help a lot. I'm going to read up on code alignment as much as I can so I know exactly what I'm doing, because for now it is a bit of "unsure stuff" for me.


I'm at the office right now, and I work on a P4:

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
crt_wcslen return value : 23
Edu32W return value : 23
ucLen : 74 cycles
StrLenW : 75 cycles
Lingo32W : 124 cycles
ucLen2 : 69 cycles
crt_wcslen : 110 cycles
Edu32W : 61 cycles

Press any key to exit...


StrLenW seems to give better results on AMD processors (I will post the results when I'm at home).


Everything is together in the attachment, including the new one from EduardoS  :U



Title: Re: Unicode string length
Post by: EduardoS on April 11, 2006, 09:17:25 PM
The last version:
Quote
my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
crt_wcslen return value : 23
Edu32W return value : 23
ucLen : 87 cycles
StrLenW : 26 cycles
Lingo32W : 63 cycles
ucLen2 : 58 cycles
crt_wcslen : 92 cycles
Edu32W : 35 cycles

Press any key to exit...
Title: Re: Unicode string length
Post by: EduardoS on April 12, 2006, 12:58:56 AM
Hi all,
I tried two other algos:
One is my last one "joined" with jdoe's,
The other uses SSE (3DNow+, so it runs on the first Athlon).

Can someone with a P4 test them?


my other brother darryl
LENGTHOF : 24
SIZEOF : 48
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
Edu32W return value : 23
Edu32W2 return value : 23
EduSSE return value : 23

StrLenW : 26 cycles
Lingo32W : 63 cycles
ucLen2 : 58 cycles
Edu32W : 35 cycles
Edu32W2 : 24 cycles
EduSSE : 27 cycles

Press any key to exit...

Title: Re: Unicode string length
Post by: hutch-- on April 12, 2006, 01:30:46 AM
This is on a 2.8 gig Prescott PIV.


my other brother darryl
LENGTHOF : 24
SIZEOF : 48
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
Edu32W return value : 23
Edu32W2 return value : 23
EduSSE return value : 23

StrLenW : 82 cycles
Lingo32W : 122 cycles
ucLen2 : 76 cycles
Edu32W : 72 cycles
Edu32W2 : 59 cycles
EduSSE : 18 cycles

Press any key to exit...


I do have a comment on the test sample, though. While it makes sense to test a short string, as it tells you the takeoff speed of each algo, it does not address the algo speed on much longer strings, where the stack frame overhead does not matter and you are more interested in linear forward speed. A string over 64k would probably be a good test as well, as it avoids the considerations that best suit a short string.
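A minimal sketch of the long-string test suggested above follows. It assumes a plain C scan stands in for the routine under test and that clock() is precise enough for a rough comparison. The names bench_long_string and ref_len_w are hypothetical. A real harness would call the asm routine instead and make sure the compiler cannot compute the length once and reuse it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Plain reference scan standing in for the routine under test. */
static size_t ref_len_w(const uint16_t *s)
{
    const uint16_t *p = s;
    while (*p) ++p;
    return (size_t)(p - s);
}

/* Builds a NUL-terminated wide string of n units ('a'..'z' repeating),
   runs the scan reps times and stores the average clock() ticks per call.
   Returns the measured length, or (size_t)-1 if allocation fails. */
size_t bench_long_string(size_t n, int reps, double *ticks_per_call)
{
    assert(reps > 0);
    uint16_t *buf = malloc((n + 1) * sizeof *buf);
    if (buf == NULL) return (size_t)-1;
    for (size_t i = 0; i < n; ++i) buf[i] = (uint16_t)('a' + i % 26);
    buf[n] = 0;

    /* volatile: every result is stored; a real harness should also keep
       the call itself out of the optimizer's reach */
    volatile size_t len = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; ++r) len = ref_len_w(buf);
    clock_t stop = clock();

    if (ticks_per_call != NULL)
        *ticks_per_call = (double)(stop - start) / reps;
    free(buf);
    return len;
}
```

With n = 70000, the string is well past 64K characters, so startup cost is negligible next to the linear scan.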
Title: Re: Unicode string length
Post by: jdoe on April 12, 2006, 01:53:17 AM
Quote from: EduardoS on April 12, 2006, 12:58:56 AM
I tried two other algos:
One is my last one "joined" with jdoe's,

In fact, StrLenW is just StrLen from the masm32 library, which I adapted to fit Unicode.   :wink


On my AMD Athlon 1800+

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
Edu32W return value : 23
Edu32W2 return value : 23
EduSSE return value : 23

StrLenW : 35 cycles
Lingo32W : 77 cycles
ucLen2 : 72 cycles
Edu32W : 60 cycles
Edu32W2 : 29 cycles
EduSSE : 33 cycles

Press any key to exit...


:clap:

Title: Re: Unicode string length
Post by: hutch-- on April 12, 2006, 03:23:27 AM
Here is a benchmark using the windows.inc file.

These are the times I get on the PIV.

1047 ucLen
968 ucLen2
797 EduSSE
906 Edu32W2
1032 StrLenW
1453 Lingo32W
1032 ucLen
968 ucLen2
797 EduSSE
907 Edu32W2
1031 StrLenW
1453 Lingo32W
1047 ucLen
969 ucLen2
796 EduSSE
891 Edu32W2
1047 StrLenW
1453 Lingo32W
1047 ucLen
969 ucLen2
797 EduSSE
906 Edu32W2
1031 StrLenW
1453 Lingo32W
Press any key to continue ...

Title: Re: Unicode string length
Post by: EduardoS on April 12, 2006, 01:20:03 PM
Thank you for testing  :U,

Quote from: hutch-- on April 12, 2006, 01:30:46 AM
I do have a comment on the test sample, though. While it makes sense to test a short string, as it tells you the takeoff speed of each algo, it does not address the algo speed on much longer strings, where the stack frame overhead does not matter and you are more interested in linear forward speed. A string over 64k would probably be a good test as well, as it avoids the considerations that best suit a short string.

hutch, when I tried testing a strlen with big strings, someone told me "no one wants the strlen of big strings"...
I'm happy to see someone who thinks differently...
Title: Re: Unicode string length
Post by: hutch-- on April 12, 2006, 01:25:21 PM
Yeah,

Me. It's not used all that often, but being able to test on big data has its place from time to time. It tells you whether an algo has more than startup speed: you get the core speed without overhead like stack entry and exit.
Title: Re: Unicode string length
Post by: Mark Jones on April 12, 2006, 03:11:39 PM
And what if you want to know the length of text in an Edit Control? (Assuming Unicode versions exist.) That could easily be >64kB.

Algo idea: perhaps read a dword with LODSD, then AND out each of the two word ("null") positions and break if either is zero?
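The dword test above, modeled in C: load 32 bits (LODSD in asm; STOSD is the store counterpart), then check whether either 16-bit half is zero. The first function masks each half separately, as described. The second is a swapped-in technique not taken from this thread: the well-known "haszero" bit trick, applied to 16-bit lanes, which answers the same yes/no question without branches. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if either 16-bit half of x is zero: mask each half separately. */
int has_zero_half_masked(uint32_t x)
{
    return (x & 0x0000FFFFu) == 0 || (x & 0xFFFF0000u) == 0;
}

/* Branch-free variant: the classic haszero bit trick on 16-bit lanes.
   Subtracting 1 from a zero lane borrows and sets the lane's top bit;
   "& ~x" discards lanes whose top bit was already set. The overall
   result is nonzero exactly when at least one half is zero. */
int has_zero_half_bittrick(uint32_t x)
{
    return ((x - 0x00010001u) & ~x & 0x80008000u) != 0;
}
```

The bit trick only tells you that a zero exists somewhere in the dword; you still need a follow-up test to find which half it is.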
Title: Re: Unicode string length
Post by: Tedd on April 13, 2006, 10:30:44 AM
Quote from: jdoe on April 10, 2006, 09:47:23 PM
That's why it is

cmp word ptr [eax], 0


Could you explain what's wrong, because I don't get it.

Sorry, my bad - I obviously can't read ::)
Title: Re: Unicode string length
Post by: jdoe on December 09, 2006, 06:29:30 PM
Hi,

One question came to mind about reading past the end of a buffer. Agner Fog's StrLen algo reads up to 3 characters past the end. Is there any danger if I read, for example, 7 characters past the end (for ASCII)? Doing so gives a good speed improvement.

This question came to me while playing with StrLenW, which reads up to 3 characters (UNICODE) past the end.

This is what I have for StrLenW...


.586

.MODEL FLAT, STDCALL

OPTION CASEMAP:NONE

.CODE

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

ALGN BYTE 0CCh
     BYTE 0CCh
     BYTE 0CCh
     BYTE 0CCh
     BYTE 0CCh
     BYTE 0CCh
     BYTE 0CCh
     BYTE 0CCh
     BYTE 0CCh
     BYTE 0CCh
     BYTE 0CCh
     BYTE 0CCh

StrLenW PROC p_lpszStr:DWORD

   mov eax, dword ptr [esp+4]

@@:
   mov ecx, dword ptr [eax]
   mov edx, dword ptr [eax+4]
   add eax, 8
   test ecx, 0FFFFh
   jz @0
   test ecx, 0FFFF0000h
   jz @2
   test edx, 0FFFFh
   jz @4
   test edx, 0FFFF0000h
   jnz @B

@6:
   sub eax, 2
   sub eax, dword ptr [esp+4]
   shr eax, 1
   ret 4

@4:
   sub eax, 4
   sub eax, dword ptr [esp+4]
   shr eax, 1
   ret 4

@2:
   sub eax, 6
   sub eax, dword ptr [esp+4]
   shr eax, 1
   ret 4

@0:
   sub eax, 8
   sub eax, dword ptr [esp+4]
   shr eax, 1
   ret 4

StrLenW ENDP

OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF

END



1) Do you see any problems with the StrLenW algo above (two dword reads, into ecx and edx)?

2) What is the danger of doing the same (two dword reads, into ecx and edx) in StrLenA?


Thanks
Title: Re: Unicode string length
Post by: PBrennick on December 09, 2006, 10:58:22 PM
JDoe,
I do not think it will pose a problem as long as you know you are doing it and you ignore the extra bytes when processing the information. The thing you should not do is write past the end of the buffer. No problem there, though, right?

Paul
Title: Re: Unicode string length
Post by: Relvinian on December 10, 2006, 09:17:55 AM
The only time reading past the end of a buffer (whether it holds a string or not) causes trouble is when you cross a page boundary and the next page isn't marked as accessible to your application. Then you will generate an exception, even if it is only one byte past. So as long as you are aware of this, you can read any length past a buffer. :P

Relvnian
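Relvinian's rule can be turned into a simple address check. The sketch below assumes 4 KB pages, the x86 small-page size; real code should ask the OS for the page size (on Windows, GetSystemInfo reports it in dwPageSize). It also shows why aligning the pointer first makes over-reads safe: an aligned power-of-two read no larger than a page can never straddle a page boundary. The StrLenW posted above reads unaligned dwords, so it does not get that guarantee. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size (x86 small pages); query the OS in real code. */
#define PAGE_SIZE_ASSUMED 4096u

/* Nonzero if reading n bytes starting at addr stays inside one page,
   so the read cannot touch the next (possibly inaccessible) page. */
int read_stays_in_page(uintptr_t addr, size_t n)
{
    assert(n > 0 && n <= PAGE_SIZE_ASSUMED);
    return (addr & (PAGE_SIZE_ASSUMED - 1)) <= PAGE_SIZE_ASSUMED - n;
}

/* Nonzero if an n-byte read at addr is naturally aligned.
   n must be a power of two. */
int read_is_aligned(uintptr_t addr, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    return (addr & (n - 1)) == 0;
}
```

For example, an 8-byte read at 0x1FF8 ends exactly at the page boundary and is safe, while one at 0x1FFA spills two bytes into the next page.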
Title: Re: Unicode string length
Post by: Seb on December 25, 2006, 02:15:21 PM
I added my two newbie attempts at calculating the length of a Unicode string. I didn't test them very much, but I hope they are alright and they seem to be working correctly. I also integrated my functions in hutch's windows.inc test. Results are shown below:


my other brother darryl
LENGTHOF : 24
SIZEOF : 48
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
Edu32W return value : 23
Edu32W2 return value : 23
EduSSE return value : 23
SebW return value : 23
SebW2 return value : 23

StrLenW : 25 cycles
Lingo32W : 64 cycles
ucLen2 : 57 cycles
Edu32W : 36 cycles
Edu32W2 : 25 cycles
EduSSE : 28 cycles
SebW : 37 cycles
SebW2 : 29 cycles

Press any key to exit...



1672 ucLen
1453 ucLen2
985 EduSSE
1125 Edu32W2
1125 StrLenW
1656 Lingo32W
1078 SebW
1109 SebW2
1641 ucLen
1484 ucLen2
969 EduSSE
1125 Edu32W2
1110 StrLenW
1671 Lingo32W
1094 SebW
1110 SebW2
1656 ucLen
1469 ucLen2
968 EduSSE
1125 Edu32W2
1110 StrLenW
1672 Lingo32W
1093 SebW
1110 SebW2
1656 ucLen
1469 ucLen2
984 EduSSE
1125 Edu32W2
1109 StrLenW
1657 Lingo32W
1093 SebW
1110 SebW2
Press any key to continue ...



OPTION PROLOGUE:NONE          ; turn it off
OPTION EPILOGUE:NONE

SebW proc src:DWORD
mov eax,[esp+4]
xor ecx,ecx

align 16
@@:
cmp word ptr [eax],0
jz @F
add ecx,1
cmp word ptr [eax+2],0
jz @F
add ecx,1
cmp word ptr [eax+4],0
jz @F
add ecx,1
cmp word ptr [eax+6],0
jz @F
add ecx,1
add eax,8
jmp @B
@@:
mov eax,ecx
ret 4
SebW endp

OPTION PROLOGUE:PROLOGUEDEF   ; turn back on the defaults
OPTION EPILOGUE:EPILOGUEDEF

OPTION PROLOGUE:NONE          ; turn it off
OPTION EPILOGUE:NONE

SebW2 proc src:DWORD
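mov eax,[esp+4]
mov ecx,eax      ; ecx = start of string
sub eax,2        ; back up one char so the first check tests [src] itself (empty string)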

align 16
@@:
add eax,2
cmp word ptr [eax],0
jz @F
add eax,2
cmp word ptr [eax],0
jz @F
add eax,2
cmp word ptr [eax],0
jz @F
add eax,2
cmp word ptr [eax],0
jnz @B
@@:
sub eax,ecx
shr eax,1
ret 4
SebW2 endp

OPTION PROLOGUE:PROLOGUEDEF   ; turn back on the defaults
OPTION EPILOGUE:EPILOGUEDEF


Oh, by the way, I'm on an Athlon 64 X2 Dual.

Title: Re: Unicode string length
Post by: EduardoS on December 25, 2006, 07:31:00 PM
Are we measuring correctly? I mean, branch misprediction isn't being counted: running the algo on the same data millions of times allows the processor to predict all branches correctly, at least for small strings like this one. I measured the time for SebW2 with strings from 10 to 60 characters long, using a simple rdtsc before and another after instead of the timing macros, and got this result:

23 cycles
25 cycles*
25 cycles
26 cycles
27 cycles
30 cycles*
29 cycles
30 cycles
31 cycles
53 cycles*
33 cycles
47 cycles*
35 cycles
36 cycles
37 cycles
38 cycles
39 cycles
40 cycles
41 cycles
55 cycles*
43 cycles
71 cycles*
45 cycles
46 cycles
47 cycles
48 cycles
49 cycles
63 cycles-
64 cycles
65 cycles
66 cycles
67 cycles
68 cycles
69 cycles
70 cycles
71 cycles
72 cycles
73 cycles
74 cycles
75 cycles
76 cycles
77 cycles
81 cycles-
82 cycles
83 cycles
84 cycles
85 cycles
86 cycles
87 cycles
88 cycles

I got one extra cycle for each extra character of length, with the exceptions marked *, which seem to be due to the processor's internal state, and those marked -, at lengths 37 and 53, where misprediction seems to increase.

Finally, shouldn't branch prediction be taken into account?
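One way to address this question is sketched below; it was not used in this thread. Instead of timing one string millions of times, the harness cycles through strings of many lengths in shuffled order, so the branch predictor cannot learn a single exit pattern. The sketch assumes lengths 10 to 60 (matching the test above), a plain C scan as a stand-in for the routine under test, and a small LCG for a deterministic shuffle. All names are hypothetical, and the timing calls (rdtsc or a clock) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_LEN 10
#define MAX_LEN 60
#define NUM_STR (MAX_LEN - MIN_LEN + 1)

/* Plain scan standing in for the routine under test. */
static size_t scan_len_w(const uint16_t *s)
{
    const uint16_t *p = s;
    while (*p) ++p;
    return (size_t)(p - s);
}

/* Fill order[] with a deterministic pseudo-random permutation of 0..n-1
   (Fisher-Yates driven by a small LCG). */
static void shuffle_order(size_t *order, size_t n, uint32_t seed)
{
    assert(n > 0);
    for (size_t i = 0; i < n; ++i) order[i] = i;
    for (size_t i = n - 1; i > 0; --i) {
        seed = seed * 1664525u + 1013904223u;    /* Numerical Recipes LCG */
        size_t j = (size_t)(seed >> 8) % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
}

/* Builds strings of length MIN_LEN..MAX_LEN, visits them in shuffled order
   and returns the sum of the measured lengths. A timing harness would read
   the cycle counter around this loop and divide by NUM_STR. */
size_t sum_lengths_shuffled(uint32_t seed)
{
    static uint16_t strs[NUM_STR][MAX_LEN + 1];
    size_t order[NUM_STR];
    for (size_t k = 0; k < NUM_STR; ++k) {
        size_t len = MIN_LEN + k;
        for (size_t i = 0; i < len; ++i) strs[k][i] = (uint16_t)'x';
        strs[k][len] = 0;
    }
    shuffle_order(order, NUM_STR, seed);
    size_t total = 0;
    for (size_t k = 0; k < NUM_STR; ++k) total += scan_len_w(strs[order[k]]);
    return total;
}
```

Changing the seed between runs changes the visiting order but not the total, which makes the result easy to sanity-check.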
Title: Re: Unicode string length
Post by: jdoe on November 06, 2008, 04:26:46 AM

Sorry for reviving this old topic, but I would like to compare timings on different CPUs. If a few members have time to post timing results for this, it would be appreciated. I added a new one (AzmtStrLenW) that is a good compromise between AMD and INTEL. I can make it faster on INTEL, but AMD doesn't like it.


AMD Athlon XP 1800+


my other brother darryl my other brother darryl my other brother darryl
LENGTHOF : 40
SIZEOF : 80
StrLenW return value : 71
Lingo32W return value : 71
ucLen2 return value : 71
Edu32W return value : 71
Edu32W2 return value : 71
SebW return value : 71
SebW2 return value : 71
lstrlenW return value : 71
AzmtStrLenW return value : 71

StrLenW : 112 cycles
Lingo32W : 183 cycles
ucLen2 : 166 cycles
Edu32W : 177 cycles
Edu32W2 : 96 cycles
SebW : 143 cycles
SebW2 : 139 cycles
lstrlenW : 250 cycles
AzmtStrLenW : 77 cycles




Title: Re: Unicode string length
Post by: sinsi on November 06, 2008, 06:25:29 AM
Q6600 2.4GHz

StrLenW : 75 cycles
Lingo32W : 327 cycles
ucLen2 : 72 cycles
Edu32W : 96 cycles
Edu32W2 : 88 cycles
SebW : 95 cycles
SebW2 : 95 cycles
lstrlenW : 197 cycles
AzmtStrLenW : 72 cycles

Title: Re: Unicode string length
Post by: DoomyD on November 06, 2008, 07:24:50 AM
Core 2 Duo 6300 1.86GHz

my other brother darryl my other brother darryl my other brother darryl
LENGTHOF : 40
SIZEOF : 80
StrLenW return value : 71
Lingo32W return value : 71
ucLen2 return value : 71
Edu32W return value : 71
Edu32W2 return value : 71
SebW return value : 71
SebW2 return value : 71
lstrlenW return value : 71
AzmtStrLenW return value : 71

StrLenW : 76 cycles
Lingo32W : 331 cycles
ucLen2 : 73 cycles
Edu32W : 421 cycles
Edu32W2 : 352 cycles
SebW : 94 cycles
SebW2 : 96 cycles
lstrlenW : 201 cycles
AzmtStrLenW : 73 cycles

Press any key to exit...
Title: Re: Unicode string length
Post by: jj2007 on November 06, 2008, 09:14:41 AM
Pentium 4 3.4 GHz

StrLenW : 206 cycles
Lingo32W : 287 cycles
ucLen2 : 192 cycles
Edu32W : 182 cycles
Edu32W2 : 135 cycles
SebW : 227 cycles
SebW2 : 216 cycles
lstrlenW : 371 cycles
AzmtStrLenW : 140 cycles
Title: Re: Unicode string length
Post by: DoomyD on November 06, 2008, 09:25:44 AM
Could someone time the following?
On my machine it yields 72 cycles
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
even
ucsLen   proc   buf
   mov   eax,[esp+4]
   .repeat
      REPEAT 3
      mov    ecx,[eax]
      add    eax,4
      test   ecx,00000FFFFh
      jz     _EVEN_
      test   ecx,0FFFF0000h
      jz     _ODD_
      ENDM
      mov    ecx,[eax]
      add    eax,4
      test   ecx,00000FFFFh
      jz     _EVEN_
      test   ecx,0FFFF0000h
   .until zero?
   _ODD_:
   add   eax,2
   _EVEN_:
   sub   eax,4
   sub   eax,[esp+4]
   shr   eax,1
   retn  4
ucsLen endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Title: Re: Unicode string length
Post by: jj2007 on November 06, 2008, 09:50:47 AM
Quote from: jj2007 on November 06, 2008, 09:14:41 AM
Pentium 4 3.4 GHz

StrLenW : 206 cycles
Lingo32W : 287 cycles
ucLen2 : 192 cycles
Edu32W : 182 cycles
Edu32W2 : 135 cycles
SebW : 227 cycles
SebW2 : 216 cycles
lstrlenW : 371 cycles
AzmtStrLenW : 140 cycles


StrLenW : 201 cycles
Lingo32W : 278 cycles
ucLen2 : 193 cycles
Edu32W : 170 cycles
Edu32W2 : 132 cycles
SebW : 229 cycles
SebW2 : 213 cycles
lstrlenW : 452 cycles
AzmtStrLenW : 151 cycles (unchanged code, most values are a bit slower)

... with a minor modification:

@@:
    mov edx, [eax]
    mov ecx, [eax+4]
  if 1 ; faster
    add eax, 8
    test dx, dx
  else ; slower
    test dx, dx
    lea eax, [eax+8]
  endif
    jz sub8

Title: Re: Unicode string length
Post by: jj2007 on November 06, 2008, 10:37:21 AM
Quote from: DoomyD on November 06, 2008, 09:25:44 AM
Could someone time the following?
On my machine it yields 72 cycles

Edu32W2:        135 cycles
ucsLen:         161 cycles
AzmtStrLenW:    141 cycles

Edu32W2:        135 cycles
ucsLen:         161 cycles
AzmtStrLenW:    142 cycles

Edu32W2:        135 cycles
ucsLen:         162 cycles
AzmtStrLenW:    142 cycles

Edu32W2:        136 cycles
ucsLen:         161 cycles
AzmtStrLenW:    142 cycles

Edu32W2:        134 cycles
ucsLen:         162 cycles
AzmtStrLenW:    140 cycles


New code attached, Edu32W2 is slightly modified.

Title: Re: Unicode string length
Post by: DoomyD on November 06, 2008, 01:51:46 PM
Interesting results :P
I wonder what makes the difference
StrLenW : 76 cycles
ucLen2 : 73 cycles

Edu32W2: 331 cycles
ucsLen: 71 cycles
AzmtStrLenW: 73 cycles

Edu32W2: 332 cycles
ucsLen: 72 cycles
AzmtStrLenW: 73 cycles

Edu32W2: 331 cycles
ucsLen: 72 cycles
AzmtStrLenW: 72 cycles

Edu32W2: 334 cycles
ucsLen: 72 cycles
AzmtStrLenW: 73 cycles

Edu32W2: 333 cycles
ucsLen: 71 cycles
AzmtStrLenW: 72 cycles
Title: Re: Unicode string length
Post by: sinsi on November 06, 2008, 02:03:23 PM

StrLenW : 75 cycles
ucLen2 : 72 cycles

Edu32W2:        72 cycles
ucsLen:         71 cycles
AzmtStrLenW:    72 cycles

Edu32W2:        72 cycles
ucsLen:         71 cycles
AzmtStrLenW:    76 cycles

Edu32W2:        72 cycles
ucsLen:         71 cycles
AzmtStrLenW:    72 cycles

Edu32W2:        72 cycles
ucsLen:         71 cycles
AzmtStrLenW:    72 cycles

Edu32W2:        72 cycles
ucsLen:         71 cycles
AzmtStrLenW:    72 cycles

hmmm, what to choose, what to choose...
Title: Re: Unicode string length
Post by: jj2007 on November 06, 2008, 02:08:09 PM
Quote from: DoomyD on November 06, 2008, 01:51:46 PM
Interesting results :P
I wonder what makes the difference

Let me guess: You run a Core Duo?
Title: Re: Unicode string length
Post by: DoomyD on November 06, 2008, 03:01:26 PM
Of course :U
Yet it's weird, because shifts are relatively slow compared to a test operation.
Title: Re: Unicode string length
Post by: jdoe on November 07, 2008, 03:50:27 AM

Thanks guys.

After many tests I realized that AMD processors seem to like long unrolls (in the case of AzmtStrLenW, 16 characters per loop iteration) and perform well with them, while INTEL performs best with a small unroll when it's well written.

Something likely to be faster on INTEL for a function like this is a "test ax, ax" followed by a "shr eax, 16" in a small unroll. I tried it on my Intel at the office and it was really fast, but I don't have the source anymore.
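Since the original source is gone, the following is only a hypothetical C model of the idea: test the low word, shift the high word down and test again, with a small unroll of two dwords per iteration. It assumes little-endian loads. Like the two-dword StrLenW posted in December 2006, it reads up to three characters past the terminator, so the buffer must be readable in 8-byte steps. The name len_w_test_shr is hypothetical, and a compiler will not necessarily emit these exact instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of a "test ax,ax / shr eax,16" scan, two dwords (four code units)
   per iteration. Assumes little-endian loads and a buffer readable up to
   the end of the 8-byte block holding the terminator. */
size_t len_w_test_shr(const uint16_t *s)
{
    assert(s != NULL);
    const unsigned char *base = (const unsigned char *)s;
    for (size_t off = 0; ; off += 8) {
        uint32_t a, b;
        memcpy(&a, base + off, 4);
        memcpy(&b, base + off + 4, 4);
        if ((a & 0xFFFFu) == 0) return off / 2;      /* asm analog: test ax, ax */
        a >>= 16;                                     /* asm analog: shr eax, 16 (sets ZF) */
        if (a == 0) return off / 2 + 1;
        if ((b & 0xFFFFu) == 0) return off / 2 + 2;
        b >>= 16;
        if (b == 0) return off / 2 + 3;
    }
}
```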

Anyway, I got the answer I wanted... AzmtStrLenW is a good compromise between AMD and INTEL, but I think the days when AMD was slightly superior are gone, and my next computer will definitely have an Intel processor.

Thanks again.

Title: Re: Unicode string length
Post by: jj2007 on November 07, 2008, 09:45:42 PM
Quote from: jdoe on November 07, 2008, 03:50:27 AM
After many tests I realized that AMD processors seem to like long unrolls (in the case of AzmtStrLenW, 16 characters per loop iteration) and perform well with them, while INTEL performs best with a small unroll when it's well written.

Differences between processors are surprisingly huge in this case. Your Azmt algo currently seems to be the fastest, although for my home-brew lib I would consider this one, which is tiny and uses only eax:
wsLen proc pStr
mov eax, [esp+4] ; mov edx, eax 3 cycles faster but trashes edx
sub eax, 4
.Repeat
add eax, 4
test dword ptr [eax], 00000FFFFh
je @F
test dword ptr [eax], 0FFFF0000h
.Until Zero?
add eax, 2
@@:
sub eax, [esp+4] ; sub eax, edx 3 cycles faster but...
shr eax, 1
retn 4
wsLen endp


Timings on a Core Duo Celeron M:
wsLen:          112 cycles (with 38 bytes)
AzmtStrLenW:    83 cycles (with 243 bytes)
Title: Re: Unicode string length
Post by: jdoe on November 08, 2008, 03:44:42 AM
Quote from: jj2007 on November 07, 2008, 09:45:42 PM
Your Azmt algo currently seems to be the fastest, although for my home-brew lib I would consider this one, which is tiny and uses only eax:

jj2007,

You are absolutely right that using only EAX would be a little bit faster, but it is by design that in my library almost all string functions return the string length in EAX and the destination pointer in EDX. Also, I don't optimize for small code size, because I don't think it is relevant anymore. In any case, my goal is not to write the fastest functions: my satisfaction comes from being faster than the Windows functions and equalling or beating the MASM32 library when running on AMD or INTEL. Hutch writes pretty good functions and I can hardly be faster than him on Intel, but on AMD his functions are not a good compromise between the two CPUs.

I will never be as good as Lingo when it comes to optimizing functions, because this guy can write amazing stuff in ASM (even though I think he sometimes optimizes when it is not necessary). I don't want to know every CPU's technology anyway, and I won't read manuals like Agner Fog's on optimizing ASM (on that subject, I liked Bogdan's comments on his forum saying that optimizing is about experiments and measurements... I fully agree, and that is the way I work anyway).

You are also right when saying that the differences between processors are surprisingly huge. When I started playing with optimization, it caused me headaches trying to be the fastest on my computer, and when I tried my functions on Intel I realized that my efforts were losing their meaning. Now I only try to make a good compromise between the two CPUs without trying to be the fastest. If I hadn't started doing it like that, I would go insane and always live in a kind of dissatisfaction.

Title: Re: Unicode string length
Post by: jj2007 on November 08, 2008, 03:53:03 AM
Quote from: jdoe on November 08, 2008, 03:44:42 AM
it caused me headaches trying to be the fastest on my computer, and when I tried my functions on Intel I realized that my efforts were losing their meaning. Now I only try to make a good compromise between the two CPUs without trying to be the fastest.

Ok, let's wait then for the Super Dual Core - half Intel, half AMD, with automatic speed-optimising thread switching :green2