SzCpy vs. lstrcpy

Started by Mark Jones, May 09, 2005, 01:23:45 AM


Jimg

Hutch-

Best I could get with your latest is 284.  Other than lingo's and its mods, the best I've found is 255 with:

align 16
SzCpy12 proc uses ebx SzDest:DWORD, SzSource:DWORD

    xor ecx, ecx
    mov eax, -4
    mov ebx, SzSource
    mov edx, SzDest

  align 4
@@:
    add eax, 4
    mov cx, [ebx+eax]
    test cl, cl
    jz move1            ; done, go move zero byte
    mov [edx+eax], cx   ; save both
    test ch, ch
    jz @f               ; all done

    mov cx, [ebx+eax+2]
    test cl, cl
    jz move2
    mov [edx+eax+2], cx
    test ch, ch
    jnz @b
@@:
    ret

move1:
    mov [edx+eax], cl   ; store the terminator to the destination
    ret
move2:
    mov [edx+eax+2], cl
    ret
SzCpy12 endp


Mark_Larson

  Still tweaking.  Hutch's latest runs in 343 cycles.  I made some changes to use REPEAT macros and MOVZX; with the same number of unrolls it runs in 284 cycles.


hutch's code above   343
4 unrolls movzx      284
8 unrolls movzx      284
16 unrolls movzx     283
32 unrolls movzx     282
64 unrolls movzx     271
128 unrolls movzx    262  (128-byte string being tested, so further unrolls yield no speedup)




align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

szCopyx2 proc src:DWORD,dst:DWORD

    mov [esp-4],  esi

    xor eax, eax
    mov esi, [esp+4]
    mov edx, [esp+8]

UNROLL_AMTx equ 4

    mov ecx, -UNROLL_AMTx

  align 4
  @@:
    add ecx, UNROLL_AMTx
looper = 0
REPEAT UNROLL_AMTx
    movzx eax, byte ptr [esi+ecx+looper]
    mov [edx+ecx+looper], al
    test al, al
    jz @F
looper = looper + 1
endm

    jmp @B

  @@:
    mov esi, [esp-4]

    ret 8

szCopyx2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
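For readers following the algorithm rather than the MASM, the unrolled loop above amounts to this C sketch (the function name is mine and the REPEAT expansion is written out by hand; it is an illustration, not the thread's code):

```c
#include <stddef.h>
#include <string.h>

/* C sketch of szCopyx2: copy up to four bytes per loop iteration,
   testing each byte for the terminating zero as it is stored. */
static void sz_copy_unrolled4(char *dst, const char *src)
{
    size_t i = 0;
    for (;;) {
        /* one REPEAT expansion = four copy/test pairs */
        if ((dst[i] = src[i]) == '\0') return; i++;
        if ((dst[i] = src[i]) == '\0') return; i++;
        if ((dst[i] = src[i]) == '\0') return; i++;
        if ((dst[i] = src[i]) == '\0') return; i++;
    }
}
```

The early-out on each byte is what keeps the unrolling safe for short strings: the loop never runs past the terminator.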

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

lingo

Here is the fastest variant:  :lol
OPTION  PROLOGUE:NONE
OPTION  EPILOGUE:NONE
align   16
db 8Dh, 0A4h,24h, 0, 0, 0, 0, 8Dh, 0A4h, 24h, 0, 0, 0, 0, 90h
SzCpy15 proc  SzDest:DWORD, SzSource:DWORD
        push  ebx
         xor  eax, eax
         mov  ebx, [esp+2*4+1*4]       ; ebx = source
         mov  edx, [esp+1*4+1*4]       ; edx= destination
        movq  MM0, [ebx]
        pxor  MM7, MM7
@@:
movq  [edx+eax], MM0     
         add  eax, 8
     pcmpeqb  MM0, MM7
    packsswb  MM0, MM0
        movd  ecx, MM0
        movq  MM0, [ebx+eax]
       jecxz  @B
         pop  ebx
        emms                           ; femms for AMD
         ret  2*4
SzCpy15 endp
OPTION  PROLOGUE:PrologueDef
OPTION  EPILOGUE:EpilogueDef


"Note that this version like most of the rest does not require any zero filled buffer and can be considered general purpose."

Regards,
Lingo

hutch--

Lingo,

It looks good. How does it perform on unaligned string data as source and destination, and on lengths that are not on 4-byte boundaries?

Mark,

The MOVZX mod looks like a good one, I will have a play with it a bit later.

LATER : Yes, this method is noticeably faster; the test piece is now twice as fast as the version in the library. What I will be interested to see is whether this mod works on AMD hardware as well.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Petroizki

Quote from: Mark_Larson on May 10, 2005, 08:28:11 PM
Quote from: Petroizki on May 10, 2005, 05:46:19 PM
Quote from: Mark_Larson on May 10, 2005, 04:30:28 PM
5) comment below.
The three nops make the loop itself aligned to 16 bytes.
No they don't.  They force it to be unaligned.  They were put after an "align 16".  The code will be aligned to a 16 byte boundary up to the ALIGN 16, and then the code will be 3 bytes off of a 16 byte boundary.  On a P4 you won't notice the difference, because it doesn't really suffer from code alignment problems.
Check it in a debugger: the entry point is 3 bytes off, but the inner loop is aligned to 16 bytes.  :eek

hutch--

This is the version that uses Mark's mod. It is testing nearly twice as fast as the original library module with its unrolled form. I tested it on the Sempron 2.4 and it retains the same time-difference ratio to the timings I get on my PIV, so it seems Mark's mod works on both Intel and AMD hardware.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

szCopyx proc src:DWORD,dst:DWORD

    mov [esp-4],  esi

    mov esi, [esp+4]
    mov edx, [esp+8]

    mov ecx, -4

  align 4
  @@:
    add ecx, 4
    movzx eax, BYTE PTR [esi+ecx]
    mov [edx+ecx], al
    test al, al
    jz @F

    movzx eax, BYTE PTR [esi+ecx+1]
    mov [edx+ecx+1], al
    test al, al
    jz @F

    movzx eax, BYTE PTR [esi+ecx+2]
    mov [edx+ecx+2], al
    test al, al
    jz @F

    movzx eax, BYTE PTR [esi+ecx+3]
    mov [edx+ecx+3], al
    test al, al
    jnz @B

  @@:
    mov esi, [esp-4]

    ret 8

szCopyx endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««


The next one is for novelty value; it's slow but it works. It requires a DWORD-aligned target buffer.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

xorcopy proc src:DWORD,dst:DWORD

    push edi
    push ebx

    mov edx, src
    mov edi, dst
    mov ecx, -4
    xor ebx, ebx

  align 4
  @@:
    add ecx, 4
    mov DWORD PTR [edi+ecx], ebx

    xor al, al
    xor al, [edx+ecx]
    xor [edi+ecx], al
    jz @F
    xor al, al
    xor al, [edx+ecx+1]
    xor [edi+ecx+1], al
    jz @F
    xor al, al
    xor al, [edx+ecx+2]
    xor [edi+ecx+2], al
    jz @F
    xor al, al
    xor al, [edx+ecx+3]
    xor [edi+ecx+3], al
    jnz @B
  @@:

    pop ebx
    pop edi

    ret

xorcopy endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
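The trick, sketched in C (names mine, not thread code): because each destination dword is zeroed first, the XOR acts as a plain store (0 ^ c == c), and the stored value being zero doubles as the end-of-string test.

```c
#include <stddef.h>
#include <string.h>

/* C sketch of xorcopy: zero the destination dword, then XOR each source
   byte into it.  0 ^ c == c, so the XOR is a store, and a stored value
   of zero is the terminator test.  Like the assembly, this can write up
   to 3 zero bytes past the terminator, hence the DWORD-aligned (and
   sufficiently large) target buffer requirement. */
static void xor_copy(unsigned char *dst, const unsigned char *src)
{
    for (size_t i = 0; ; i += 4) {
        dst[i] = dst[i+1] = dst[i+2] = dst[i+3] = 0;  /* pre-zero dword */
        for (size_t j = 0; j < 4; j++) {
            dst[i+j] ^= src[i+j];   /* store via XOR */
            if (dst[i+j] == 0)
                return;             /* terminator copied: done */
        }
    }
}
```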
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

MichaelW

I didn't have very long to work on this, but it seems to do well even with a stack frame and with no specific alignment. I didn't have time to include the last 4 (or 5?) versions posted in my timing test.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    include \masm32\macros\macros.asm
    include timers.asm

    SzCopy_dw PROTO :DWORD,:DWORD
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data     
      str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
              "pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
      str2 db 128 dup(0)             
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT EQU 10000000

    invoke SzCopy_dw,ADDR str1,ADDR str2
    print ustr$(eax)
    print chr$(13,10)
    print ADDR str1
    print chr$("<",13,10)
    print ADDR str2
    print chr$("<",13,10)
   
    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke SzCopy_dw,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" clocks",13,10)
    invoke Sleep,250            ; reduced cycle count by ~26 ??

    mov   eax,input(13,10,"Press enter to exit...")
    exit

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; The technique used to find the terminating null was adapted
; from the MASM32 STRLEN procedure, which was adapted from an
; algorithm written by Agner Fog.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
SzCopy_dw proc uses ebx lpsrc:DWORD,lpdest:DWORD
    mov   eax,lpsrc
    mov   edx,lpdest
    sub   eax,4
    sub   edx,4
  @@:   
    mov   ebx,[eax+4]             ; get next 4 bytes
    add   eax,4
    lea   ecx,[ebx-01010101h]     ; subtract 1 from each byte
    add   edx,4
    test  ecx,80808080h
    jnz   @F                      ; jump if zero byte
    mov   [edx],ebx
    jmp   @B
  @@:
    mov   cl,[eax]
    add   eax,1
    mov   [edx],cl
    add   edx,1
    test  cl,cl
    jnz   @B
    sub   eax,lpsrc
    sub   eax,1                    ; return length w/o null
    ret
SzCopy_dw endp

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
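The `lea ecx,[ebx-01010101h]` / `test ecx,80808080h` pair is the zero-in-dword probe adapted from Agner Fog's STRLEN: subtracting 1 from every byte makes a zero byte borrow into its top bit. In C the same test looks like this (my function name, not thread code):

```c
#include <stdint.h>

/* Zero-byte probe used in SzCopy_dw: if any byte of v is 0x00, the
   per-byte subtract of 1 borrows and leaves that byte's top bit set.
   Bytes that already sit near the top of the range (0x81 and up, or
   0x80 fed by a borrow) can trigger it too, so a hit means "maybe a
   zero byte" and the byte-wise tail loop decides for real. */
static int maybe_has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & 0x80808080u) != 0;
}
```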


This was after I reversed the order of the arguments for szCopyx, szCopyP, and SzCopy_dw, and running on a P3:

lstrcpy                 : 671 clocks

SzCpy1 (movsb/cmp)      : 653 clocks
SzCpy2 (mov/cmp/inc)    : 406 clocks
SzCpy3 (mov/cmp/add)    : 529 clocks
SzCpy4 (mov/test/inc)   : 401 clocks
SzCpy5 (mov/test/add)   : 401 clocks
szCopyx (hutch unroll 4): 289 clocks
SzCpy8 (lingo unroll 8) : 147 clocks
szCopyP (Phoenix)       : 1261 clocks
szCopy_dw               : 162 clocks

eschew obfuscation

Mark_Larson

   It's funny how things get revisited on this board.  We've had lots of discussion about string length algorithms, including using MMX and SSE2.  Check this thread: http://www.old.masmforum.com/viewtopic.php?t=2632. I even posted a version that used MMX and SSE2 in that thread (along with some other people who did MMX).  My account had gotten messed up, so all my posts there will show as from "hutch--", but the account type will say "guest", whereas the real ones from hutch-- say "Site Admin" and have the big red crest with the gold lion.

  Hutch, as far as alignment goes, I posted a trick to get around that, when we did either a string length or string copy routine.  Basically you do a byte routine until the data is the appropriate alignment, and then you switch to the faster method.
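In C form, the byte-prologue Mark describes might look like this (a sketch under my own names; 16-byte alignment is chosen as the target since that is what the SSE routines need):

```c
#include <stdint.h>

/* Alignment prologue sketch: handle bytes one at a time until the source
   pointer reaches 16-byte alignment, then a wide (MMX/SSE) inner loop
   can take over.  Returns nonzero if the terminator was hit during the
   prologue, i.e. the fast path is not needed at all. */
static int copy_until_aligned(unsigned char **dst, const unsigned char **src)
{
    while (((uintptr_t)*src & 15u) != 0) {
        unsigned char c = *(*src)++;
        *(*dst)++ = c;
        if (c == 0)
            return 1;            /* string ended inside the prologue */
    }
    return 0;                    /* *src is now 16-byte aligned */
}
```

Note this aligns only the source; as with the routines in this thread, the destination may still be unaligned, which the wide loop has to tolerate (or handle with unaligned stores).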
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Jimg

Hutch-
Quote: This is the version that uses Mark's mod. It is testing nearly twice as fast as the original library module with its unrolled form. I tested it on the Sempron 2.4 and it retains the same time-difference ratio to the timings I get on my PIV, so it seems Mark's mod works on both Intel and AMD hardware.

On my Athlon it runs at 284 with best alignment, exactly the same as your previous routine, i.e. MOVZX doesn't seem to be any faster or slower on AMD chips.

Jimg

Lingo-

Your last version is excellent.  Runs in 112 cycles (with three extra bytes inserted after the align 16).  Better than twice as fast as any other so far.  Good Job :U

Now if I only understood what the heck it was doing :wink

Later.....

Ok, now I see what's going on.  Too bad that it requires, in the worst case, at least 7 extra bytes in the destination string. :(

Mark_Larson

  This is the SSE2 version that I posted on that other thread.  The test string length was 228 bytes instead of the 128 bytes we are using here, so keep that in mind when comparing timings.  It also doesn't require that you pad the end with 8 or 16 zeros, but it does require the strings to have 16-byte alignment.  I'll see if I can dig up the MMX version I wrote later, because I didn't post it to the forums (I checked).  For the MMX version we can do a quick conversion and see how many cycles per byte we are talking about, because I give timing info for both the MMX and SSE2 versions.

168 cycles / 228 bytes ~ 0.737 cycles per byte for the MMX version I need to dig up off my home computer (I didn't post it to the thread).
68 cycles / 228 bytes ~ 0.298 cycles per byte for the SSE2 version I posted below.

Scaling to the 128-byte string used in this thread (cycles per byte assumed constant):

MMX:  168 * 128 / 228 ~ 94.3 cycles
SSE2:  68 * 128 / 228 ~ 38.2 cycles

  Obviously putting an SSE2 version of szCopy in the library is out of the question, since not as many people could use it.  But I think doing an MMX version won't be a problem; I don't know anyone who has a CPU that doesn't support MMX.
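The cycle estimates above are plain proportional scaling (assuming cycles per byte stays constant across string lengths), which a couple of lines of C can double-check:

```c
/* Proportional rescaling of a cycle count measured on a 228-byte string
   to an estimated count for a different string length. */
static double scale_cycles(double cycles_at_228, double new_length)
{
    return cycles_at_228 * new_length / 228.0;
}
```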

Quote
Donkey posted some strlen code under the GoAsm forum. I shaved 48 cycles off his code by doing something similar to what I did with your stuff, Michael. I just now converted his MMX code to SSE2; it won't work on a P3 or below. The MMX version of Donkey's code, changed by me to run faster on the P4, finds the strlen of a 228-byte string in 168 cycles. The version below using SSE2 finds it in 68 cycles!!! So BIG improvement.

Code:

szLenSSE2 proc near pString:ptr byte

   mov eax,[pString] ; mov the offset of the string into eax

   pxor xmm0,xmm0 ;remove dependencies
   pxor   xmm2,xmm2 ;remove dependencies
   pxor xmm1,xmm1 ;fill XMM reg 1 with NULLs
   pxor   xmm3,xmm3 ;FILL XMM reg 3 with NULLs

;align 16 ; align for code isn't needed on the P4
   ; The following loop starts at exactly 16 bytes into the proc
   ; it is also exactly 16 bytes long, this makes for optimal execution
scan_loop:
   movdqa xmm0,[eax] ; mov 16 bytes of the string into XMM reg 0
   pshufd xmm2,[eax+16],0e4h ; mov the next 16 bytes of the string into XMM reg 2
   pcmpeqb xmm0,xmm1 ; cmp each byte with NULL, set the matching byte to 0FFh if equal
   pcmpeqb xmm2,xmm3 ; cmp each byte with NULL, set the matching byte to 0FFh if equal

   pmovmskb ecx,xmm0 ; This moves the most signf. bit of each byte to ecx
   add eax,32 ; increment the string address
   pmovmskb edx,xmm2 ; This moves the most signf. bit of each byte to edx
   or ecx,ecx ; Was any bit set ?
   jnz end_loop
   or edx,edx
   jz scan_loop ; no then continue the loop, no zero was found
   mov   ecx,edx     ; Move EDX into ECX for the BSF later
   add   eax,16      ; the NULL is in the second 16-byte block, so bias by 16

end_loop:

   sub eax,[pString] ; subtract the original address

   bsf ecx,ecx ; find out which bit is set (where the NULL is)
   sub eax,32 ; subtract the last increment from eax
   add eax,ecx ; add the number of bytes to the NULL

   RET
szLenSSE2 endp
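For comparison, the same compare / movemask idea written with SSE2 intrinsics in C (a simplified one-register sketch of my own, not Mark's two-register unroll; it assumes the string's buffer is padded so a 16-byte read past the terminator is safe):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Scan 16 bytes at a time: pcmpeqb against zero, pmovmskb to collect one
   bit per byte, then the lowest set bit locates the terminator (the
   while loop below plays the role of BSF). */
static size_t sse2_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask) {
            size_t j = 0;
            while (!(mask & 1)) { mask >>= 1; j++; }
            return i + j;
        }
    }
}
```

Using `_mm_loadu_si128` sidesteps the 16-byte alignment requirement of the MOVDQA version above, at some cost on older hardware.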
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

MichaelW

I got my version down to 127 cycles on a P3, using 486-compatible code. I have no way to test on a P4.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    include \masm32\macros\macros.asm
    include timers.asm

    SzCopy_dw PROTO :DWORD,:DWORD
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data     
      str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
              "pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
      str2 db 128 dup(0)             
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT EQU 10000000

    invoke SzCopy_dw,ADDR str1,ADDR str2
    print ustr$(eax)
    print chr$(13,10)
    print ADDR str1
    print chr$("<",13,10)
    print ADDR str2
    print chr$("<",13,10,13,10)
   
    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke SzCopy_dw,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" clocks",13,10)
    invoke Sleep,250            ; reduced cycle count by ~26 ??

    mov   eax,input(13,10,"Press enter to exit...")
    exit

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; The technique used to find the terminating null was adapted
; from the MASM32 STRLEN procedure, which was adapted from an
; algorithm written by Agner Fog.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
SzCopy_dw proc uses ebx lpsrc:DWORD,lpdest:DWORD
    mov   eax,lpsrc
    mov   edx,lpdest
    sub   eax,4
    sub   edx,4
  @@:
    REPEAT 8                    ; unroll x 8
      mov   ebx,[eax+4]         ; get next 4 bytes
      add   eax,4
      lea   ecx,[ebx-01010101h] ; subtract 1 from each byte
      add   edx,4
      test  ecx,80808080h       ; test for any -1 bytes
      jnz   @F                  ; jump if zero byte present
      mov   [edx],ebx           ; otherwise copy 4 bytes
    ENDM
    jmp   @B
  @@:
    mov   cl,[eax]              ; finish up moving bytes
    add   eax,1
    mov   [edx],cl
    add   edx,1
    test  cl,cl
    jnz   @B
    sub   eax,lpsrc
    sub   eax,1                 ; return length w/o null
    ret
SzCopy_dw endp

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start




[attachment deleted by admin]
eschew obfuscation

Jimg

MichaelW

That's a real nice routine.  Unfortunately, it gets much slower if there is a high-ASCII character near the beginning of the string :(  Maybe replace the last part with one of the other routines when high ASCII is found?
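Jimg's slowdown has a concrete cause: the `(v - 01010101h) and 80808080h` probe fires on any byte of 81h or above (and on an 80h byte fed by a borrow), not only on zero bytes, so high-ASCII data keeps kicking SzCopy_dw out to its byte-wise tail loop. The classic repair ANDs in `not v`, which rejects bytes whose top bit was already set and reports only genuine zeros. A C sketch of both forms (names mine):

```c
#include <stdint.h>

/* The probe SzCopy_dw uses: false-positives on high-ASCII bytes. */
static int probe_loose(uint32_t v)
{
    return ((v - 0x01010101u) & 0x80808080u) != 0;
}

/* The exact form: AND with ~v masks off bytes whose top bit was set to
   begin with, so only genuine zero bytes report. */
static int probe_exact(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}
```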

Mark_Larson

  Michael, here are the timings on my work P4.  I believe it's a 1.5 GHz.


127
Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwX
xYyZz Now I Know My ABC's, Won't You Come Play <
Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwX
xYyZz Now I Know My ABC's, Won't You Come Play <

148 clocks

Press enter to exit...




  I'll take a stab at speeding it up some too.  Yours is real generic like Hutch's, and faster.

EDIT:  I haven't tried this since I have a P4, but Michael, have you tried adding an align for the inner loop and for the procedure?
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Mark_Larson


  I tried a number of things, and a few of them gave me some speedups.

  First, the last part re-reads the last dword a byte at a time when it is already in a register:


hi_mom:
    shr   ebx,8
  @@:
;   the dword is already IN a register, so why re-read it from memory?
;   mov   cl,[eax]              ; finish up moving bytes
;   add   eax,1
    mov   [edx],bl
    add   edx,1
    test  bl,bl
    jnz   hi_mom
    sub   eax,lpsrc
    sub   eax,1                 ; return length w/o null
    ret
SzCopy_dw endp


  Second, I tried unrolling the loop 128 times with the original algorithm (not including any other changes), and I got it from 148 cycles down to 114 cycles.  So if we pick some middle ground, we can probably get a decent speedup from this.

  Third, we can try removing the Epilogue and Prologue.  That is usually good for a few cycles.

  Fourth I re-wrote the inner loop to work differently.  It now handles two bytes in semi-parallel.  It runs about 8 cycles faster than the original code.  But with all the other stuff I posted above, it adds up.


SzCopy_dw proc uses ebx esi edi lpsrc:DWORD,lpdest:DWORD
    mov   eax,lpsrc
    mov   edx,lpdest

  @@:

    REPEAT 128                  ; unroll x 128
      mov   ebx,[eax]           ; get next 4 bytes
      mov   esi,[eax+4]         ; and the 4 bytes after that
      lea   ecx,[ebx-01010101h] ; subtract 1 from each byte
      add   eax,4
      test  ecx,80808080h       ; test for any -1 bytes
      jnz   @F                  ; jump if zero byte present
      lea   edi,[esi-01010101h] ; subtract 1 from each byte
      add   eax,4
      mov   [edx],ebx           ; otherwise copy 4 bytes
      add   edx,4
      test  edi,80808080h       ; test for any -1 bytes
      jnz   @F                  ; jump if zero byte present
      mov   [edx],esi           ; otherwise copy 4 bytes
      add   edx,4
    ENDM
    jmp   @B
    jmp   @B
  @@:
    sub   eax,4                 ; back up to the dword containing the zero byte
  @@:
    mov   cl,[eax]              ; finish up moving bytes
    add   eax,1
    mov   [edx],cl
    add   edx,1
    test  cl,cl
    jnz   @B
    sub   eax,lpsrc
    sub   eax,1                 ; return length w/o null
    ret
SzCopy_dw endp


  Fifth, for non-P4 processors aligning the inner loop should hopefully give a speed up.

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm