Unicode string length

Started by jdoe, April 10, 2006, 01:32:54 AM


jdoe

Hi,

While converting ANSI functions to Unicode for fun, I ended up with a Unicode string length function that is faster than ucLen from the masm32 lib, by applying the same loop unrolling that szLen uses.



align 4
;
; Return the length of lpszStr in characters, excluding the terminating zero
;
StrLenW proc public p_lpszStr:dword

   mov eax, p_lpszStr
   sub eax, 8

@@:
   add eax, 8
   cmp word ptr [eax], 0
   je Add0
   cmp word ptr [eax+2], 0
   je Add1
   cmp word ptr [eax+4], 0
   je Add2
   cmp word ptr [eax+6], 0
   jne @B

   sub eax, p_lpszStr
   shr eax, 1
   add eax, 3
   ret

Add2:
   sub eax, p_lpszStr
   shr eax, 1
   add eax, 2
   ret

Add1:
   sub eax, p_lpszStr
   shr eax, 1
   add eax, 1
   ret

Add0:
   sub eax, p_lpszStr
   shr eax, 1
   ret

StrLenW endp
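
For readers who do not follow the assembly, here is a rough C model of the unrolled loop above. It is only a sketch: the function name is made up for illustration, and a compiler will not necessarily produce the same instruction sequence. It shows the idea, which is to test four 16-bit units per pass and return from whichever test finds the terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of StrLenW (the name is an assumption).
   Returns the number of 16-bit units before the first zero unit. */
size_t str_len_w_model(const uint16_t *s)
{
    assert(s != NULL);
    const uint16_t *p = s;
    for (;;) {
        /* p - s is already a count of 16-bit units, which is what
           "sub eax, p_lpszStr / shr eax, 1" computes in the assembly. */
        if (p[0] == 0) return (size_t)(p - s);      /* Add0 */
        if (p[1] == 0) return (size_t)(p - s) + 1;  /* Add1 */
        if (p[2] == 0) return (size_t)(p - s) + 2;  /* Add2 */
        if (p[3] == 0) return (size_t)(p - s) + 3;  /* fall-through case */
        p += 4;                                     /* add eax, 8 */
    }
}
```

Because the four tests run in order and the function returns at the first zero unit, this model never reads past the terminator.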




hutch--

Compliments,

Looks good.  :U
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

asmfan

As for me, I would store 0 in some register, in bx for example.
Russia is a weird place

jdoe

Quote from: asmfan on April 10, 2006, 05:55:30 AM
As for me, I would store 0 in some register, in bx for example.

I did the test and there is no gain doing so.
I used edx instead of ebx because it doesn't need to be preserved.   :wink


Tedd

Just to be really annoying, I'll point out that it will only work for a limited subset of Unicode strings (those converted from ANSI). And also, only for little-endian Unicode!
The null terminator is two zeroes for a reason :P
No snowflake in an avalanche feels responsible.

jdoe

Quote from: Tedd on April 10, 2006, 05:05:06 PM
The null terminator is two zeroes for a reason :P

That's why it is

cmp word ptr [eax], 0


Could you explain what's wrong, because I don't get it?
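
A small C sketch of the point jdoe is making, assuming 16-bit code units. The function names are made up for illustration. A word compare only matches a unit whose two bytes are both zero, so a character such as U+0100, which has one zero byte, does not end the scan, while a byte-wise scan stops early on it. A unit that is entirely zero also compares equal to 0 in either byte order. The sketch covers only the terminator test, not the rest of Tedd's objection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts 16-bit units up to the first unit that is entirely zero,
   like "cmp word ptr [eax], 0". */
size_t len_by_units(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Deliberately wrong for wide strings: stops at the first zero BYTE,
   so it returns a byte count that falls short of the real string. */
size_t len_by_bytes(const void *s)
{
    assert(s != NULL);
    const unsigned char *b = (const unsigned char *)s;
    size_t n = 0;
    while (b[n] != 0)
        n++;
    return n;
}
```

For the units `0x0100, 0x0041, 0x0000`, `len_by_units` returns 2, while `len_by_bytes` stops inside the first unit.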


MichaelW

jdoe,

Good work :U

As it is currently coded, when running on a P3 your procedure is sensitive to alignment. Using the test string "my other brother darryl", by varying the alignment ahead of the 'align 4' I can cause the cycles to vary from 56 to 77. For example, this will cause the procedure to run in 77 cycles:

align 16
nops 1

align 4
...


And this will cause it to run in 56 cycles:

align 16
nops 5

align 4
...


If I replace the 'align 4' with an 'align 16', varying the alignment ahead of the 'align 16' has only a small effect on the cycles, with the procedure running in 56 or 57 cycles for all of the nop counts that I tried.

For reference, the MASM32 ucLen procedure runs in 85 cycles.

In case you are not familiar with it, nops is a MASM32 macro.

eschew obfuscation

lingo

 :lol
OPTION PROLOGUE:NONE          ; turn it off
OPTION EPILOGUE:NONE
Lingo32W                proc  lpst:DWORD
                        mov   eax, [esp+4]
                        mov   edx, 80008000h
                        mov   ecx, [eax]
@@:
                        add   eax, 4
                        add   ecx, 0FEFFFF00h
                        test  edx, ecx
                        mov   ecx, [eax]
                        je    @b

                        test  word ptr [eax-4], 0FFFFh
                        je    C_minus4
                        test  word ptr [eax-4+2], 0FFFFh
                        je    C_minus2
@@:
                        and   ecx, 07F7F7F7FH
                        add   eax, 4
                        add   ecx, 0FEFFFF00h
                        test  ecx, edx
                        mov   ecx, [eax]
                        je    @b

                        test  word ptr [eax-4], 0FFFFh
                        je    C_minus4
                        test  word ptr [eax-4+2], 0FFFFh
                        jne   @b
C_minus2:
                        sub   eax, [esp+4]
                        add   eax, 2-4
                        shr   eax,1
                        ret   4
C_minus4:
                        sub   eax, [esp+4]
                        add   eax,0-4
                        shr   eax,1
                        ret   4    
Lingo32W                endp

OPTION PROLOGUE:PROLOGUEDEF   ; turn back on the defaults
OPTION EPILOGUE:EPILOGUEDEF
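
The general idea in the code above is to load two characters as one dword and use a mask test to decide whether either half might be zero before confirming with word tests. The C sketch below is not a translation of Lingo's constants. It swaps in the widely known exact "zero lane" test, applied to 16-bit lanes, and all names are made up for illustration. It also assumes the buffer holds an even number of 16-bit units, so that the unit next to the terminator is readable (the dword reads in the assembly have a similar requirement).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero exactly when at least one 16-bit half of v is zero.
   This is the classic "has zero byte" trick widened to 16-bit lanes. */
int has_zero_u16(uint32_t v)
{
    return ((v - 0x00010001u) & ~v & 0x80008000u) != 0;
}

/* Assumption: s points into a buffer with an even number of units,
   so s[i + 1] is readable even when s[i] is the terminator. */
size_t len_two_at_a_time(const uint16_t *s)
{
    assert(s != NULL);
    size_t i = 0;
    for (;;) {
        /* Combine two units the way a little-endian dword load would. */
        uint32_t pair = (uint32_t)s[i] | ((uint32_t)s[i + 1] << 16);
        if (has_zero_u16(pair))
            return (pair & 0xFFFFu) == 0 ? i : i + 1;
        i += 2;
    }
}
```

The mask test replaces two compares with one in the common case where neither half is zero. Whether that wins in practice depends on the processor, as the timings in this thread show.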

MichaelW

Lingo,

I assume your code does substantially better on a P4 than it does on my P3 :eek

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 85 cycles
StrLenW : 56 cycles
Lingo32W : 163 cycles
crt_wcslen : 84 cycles
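
The first lines of that report are easy to cross-check. LENGTHOF is the element count of the test string including its terminator, SIZEOF is the same in bytes, and every routine should return one less than LENGTHOF. A minimal C sketch of that arithmetic follows. The helper names are made up, and the string is widened from ASCII only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TEST_TEXT "my other brother darryl"

/* Element count including the terminator, and the size in bytes
   once each element is a 16-bit unit. */
enum { TEST_LENGTHOF = sizeof(TEST_TEXT) };
enum { TEST_SIZEOF = TEST_LENGTHOF * sizeof(uint16_t) };

/* Widens an ASCII string, terminator included, into dst.
   dst must have room for strlen(src) + 1 units. */
void widen_ascii(const char *src, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    do {
        *dst++ = (uint16_t)(unsigned char)*src;
    } while (*src++ != '\0');
}

/* Plain reference length, one unit at a time. */
size_t ref_len16(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

With a `uint16_t buf[TEST_LENGTHOF]` filled by `widen_ascii(TEST_TEXT, buf)`, `ref_len16(buf)` gives 23, matching the return values listed above.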



[attachment deleted by admin]

hutch--

Here are the timings on my Prescott PIV.

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 80 cycles
StrLenW : 86 cycles
Lingo32W : 121 cycles
crt_wcslen : 123 cycles
Press any key to exit...

jdoe

AMD Athlon 1800+

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 88 cycles
StrLenW : 38 cycles
Lingo32W : 78 cycles
crt_wcslen : 101 cycles
Press any key to exit...



@MichaelW

The impact of alignment seems to be specific to the processor. On my AMD Athlon, adding "align 16" gives only a one clock cycle decrease (and 10 ms less).

jdoe

Speed seems to be better with align 8 before the loop.


align 16
;
; Return the length of lpszStr in characters, excluding the terminating zero
;
StrLenW proc public p_lpszStr:dword

   mov eax, p_lpszStr
   sub eax, 8

   align 8
@@:
   add eax, 8
   cmp word ptr [eax], 0
   je Add0
   cmp word ptr [eax+2], 0
   je Add1
   cmp word ptr [eax+4], 0
   je Add2
   cmp word ptr [eax+6], 0
   jne @B

   sub eax, p_lpszStr
   shr eax, 1
   add eax, 3
   ret

Add2:
   sub eax, p_lpszStr
   shr eax, 1
   add eax, 2
   ret

Add1:
   sub eax, p_lpszStr
   shr eax, 1
   add eax, 1
   ret

Add0:
   sub eax, p_lpszStr
   shr eax, 1
   ret

StrLenW endp





my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 88 cycles
StrLenW : 37 cycles
Lingo32W : 78 cycles
crt_wcslen : 97 cycles
Press any key to exit...



hutch--

Here is a test algo for Unicode string length. I have used an idea of Lingo's to cut the memory reads in half, reading a DWORD at a time and using a ROL to get at the other half of the register. On this PIV it is timing faster than the version in the library, 343 ms to 390 ms. It works on the theory that the processor is still a lot faster than memory.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

ucLen2 proc lpwstr:DWORD

    mov ecx, [esp+4]    ; lpwstr
    xor eax, eax
    sub ecx, 4

  lbl0:
    add ecx, 4
    mov eax, [ecx]
    cmp ax, 0
    je lbl1
    rol eax, 16
    cmp ax, 0
    jne lbl0

    sub ecx, [esp+4]    ; lpwstr
    mov eax, ecx
    shr eax, 1
    add eax, 1
    ret 4

  lbl1:
    sub ecx, [esp+4]    ; lpwstr
    mov eax, ecx
    shr eax, 1
    ret 4

ucLen2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
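
A rough C model of ucLen2, for anyone who wants to follow the logic without the assembly. The names are made up for illustration. The two units are combined the way a little-endian dword load would see them, the low half is tested, and a 16-bit rotate brings the high half down for the second test. Like the assembly, it reads two units at a time, so the sketch assumes the buffer holds an even number of 16-bit units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate left; n must be in the range 1..31. */
uint32_t rol32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32u - n));
}

/* Hypothetical C model of ucLen2 (the name is an assumption).
   Assumption: s[i + 1] is readable even when s[i] is the terminator. */
size_t uc_len2_model(const uint16_t *s)
{
    assert(s != NULL);
    size_t i = 0;
    for (;;) {
        /* mov eax, [ecx] : one read fetches two characters */
        uint32_t v = (uint32_t)s[i] | ((uint32_t)s[i + 1] << 16);
        if ((uint16_t)v == 0)          /* cmp ax, 0 / je lbl1 */
            return i;
        v = rol32(v, 16);              /* rol eax, 16 */
        if ((uint16_t)v == 0)          /* cmp ax, 0 / jne lbl0 */
            return i + 1;
        i += 2;                        /* add ecx, 4 */
    }
}
```

The point of the rotate is that the second character is tested from the register rather than from a second memory read.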

six_L

Another test:

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
ucLen2 return value : 23
StrLenW return value : 23
Lingo32W return value : 23
crt_wcslen return value : 23
ucLen : 71 cycles
ucLen2 : 51 cycles
StrLenW : 38 cycles
Lingo32W : 134 cycles
crt_wcslen : 64 cycles

Press enter to exit...
regards

MichaelW

I updated the original attachment to include all of the procedures so far. To help level the playing field I eliminated the stack frame from StrLenW, and for the P6 family of processors, placed an align 16 in front of all the procedures.
For my P3:

my other brother darryl
LENGTHOF : 24
SIZEOF : 48
ucLen return value : 23
StrLenW return value : 23
Lingo32W return value : 23
ucLen2 return value : 23
crt_wcslen return value : 23
ucLen : 85 cycles
StrLenW : 53 cycles
Lingo32W : 141 cycles
ucLen2 : 59 cycles
crt_wcslen : 84 cycles


And the timings only for my old K5:

ucLen : 87 cycles
StrLenW : 46 cycles
Lingo32W : 69 cycles
ucLen2 : 49 cycles
crt_wcslen : 78 cycles


jdoe,

AFAIK alignment has a greater effect on the P6 family of processors (PPro, P2, P3) than on the P1, PMMX, and P4. For a P3 I can improve on align 16 slightly by adding a sufficient number of nops after the align 16 to place the jump label at or close to a 16-byte boundary. To actually know which alignment is best for your processor I think you should try changing the alignment of the align 8 statement by putting varying numbers of nops in front of it. For a P3, at certain alignments an align 8 slows the procedure down substantially.

