SzCpy vs. lstrcpy

Started by Mark Jones, May 09, 2005, 01:23:45 AM


Jimg

Hutch-

Best I could get with your latest is 284.  Other than lingo's and its mods, the best I've found is 255 with:

align 16
SzCpy12 proc uses ebx SzDest:DWORD, SzSource:DWORD

    xor ecx, ecx
    mov eax, -4
    mov ebx, SzSource
    mov edx, SzDest

  align 4
@@:
    add eax, 4
    mov cx, [ebx+eax]
    test cl, cl
    jz move1            ; done, go move zero byte
    mov [edx+eax], cx   ; save both
    test ch, ch
    jz @f               ; all done

    mov cx, [ebx+eax+2]
    test cl, cl
    jz move2
    mov [edx+eax+2], cx
    test ch, ch
    jnz @b
@@:
    ret

move1:
    mov [edx+eax], cl   ; store the terminator to the destination
    ret
move2:
    mov [edx+eax+2], cl
    ret
SzCpy12 endp


Mark_Larson

  Still tweaking.  Hutch's latest runs in 343 cycles.  I made some changes to use REPEAT macros and MOVZX; with the same number of unrolls it runs in 284 cycles.


hutch's code above   343
4 unrolls movzx      284
8 unrolls movzx      284
16 unrolls movzx     283
32 unrolls movzx     282
64 unrolls movzx     271
128 unrolls movzx    262  (128-byte string being tested, so further unrolls yield no speedup)




align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

szCopyx2 proc src:DWORD,dst:DWORD

    mov [esp-4],  esi

    xor eax, eax
    mov esi, [esp+4]
    mov edx, [esp+8]

UNROLL_AMTx equ 4

    mov ecx, -UNROLL_AMTx

  align 4
  @@:
    add ecx, UNROLL_AMTx
looper = 0
REPEAT UNROLL_AMTx
    movzx eax, byte ptr [esi+ecx+looper]
    mov [edx+ecx+looper], al
    test al, al
    jz @F
looper = looper + 1
endm

    jmp @B

  @@:
    mov esi, [esp-4]

    ret 8

szCopyx2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
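For readers following the algorithm rather than the MASM, the unrolled loop above amounts to this C sketch (the function name is mine and the REPEAT expansion is written out by hand; it is an illustration, not the thread's code):

```c
#include <stddef.h>
#include <string.h>

/* C sketch of szCopyx2: copy up to four bytes per loop iteration,
   testing each byte for the terminating zero as it is stored. */
static void sz_copy_unrolled4(char *dst, const char *src)
{
    size_t i = 0;
    for (;;) {
        /* one REPEAT expansion = four copy/test pairs */
        if ((dst[i] = src[i]) == '\0') return; i++;
        if ((dst[i] = src[i]) == '\0') return; i++;
        if ((dst[i] = src[i]) == '\0') return; i++;
        if ((dst[i] = src[i]) == '\0') return; i++;
    }
}
```

The early-out on each byte is what keeps the unrolling safe for short strings: the loop never runs past the terminator.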

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

lingo

Here is the fastest variant:  :lol
OPTION  PROLOGUE:NONE
OPTION  EPILOGUE:NONE
align   16
db 8Dh, 0A4h,24h, 0, 0, 0, 0, 8Dh, 0A4h, 24h, 0, 0, 0, 0, 90h
SzCpy15 proc  SzDest:DWORD, SzSource:DWORD
        push  ebx
         xor  eax, eax
         mov  ebx, [esp+2*4+1*4]       ; ebx = source
         mov  edx, [esp+1*4+1*4]       ; edx= destination
        movq  MM0, [ebx]
        pxor  MM7, MM7
@@:
movq  [edx+eax], MM0     
         add  eax, 8
     pcmpeqb  MM0, MM7
    packsswb  MM0, MM0
        movd  ecx, MM0
        movq  MM0, [ebx+eax]
       jecxz  @B
         pop  ebx
        emms                           ; femms for AMD
         ret  2*4
SzCpy15 endp
OPTION  PROLOGUE:PrologueDef
OPTION  EPILOGUE:EpilogueDef


"Note that this version like most of the rest does not require any zero filled buffer and can be considered general purpose."

Regards,
Lingo

hutch--

Lingo,

It looks good. How does it perform on unaligned string data as source and destination, and on lengths that are not on 4-byte boundaries?

Mark,

The MOVZX mod looks like a good one, I will have a play with it a bit later.

LATER : Yes, this method is noticeably faster; the test piece is now twice as fast as the version in the library. What I will be interested to see is whether this mod works on AMD hardware as well.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Petroizki

Quote from: Mark_Larson on May 10, 2005, 08:28:11 PM
Quote from: Petroizki on May 10, 2005, 05:46:19 PM
Quote from: Mark_Larson on May 10, 2005, 04:30:28 PM
5) comment below.
The three nops make the loop itself aligned to 16 bytes.
No they don't.  They force it to be unaligned.  They were put after an "align 16".  The code will be aligned to a 16 byte boundary up to the ALIGN 16, and then the code will be 3 bytes off of a 16 byte boundary.  On a P4 you won't notice the difference, because it doesn't really suffer from code alignment problems.
Check it in a debugger: the entry point is 3 bytes off, but the inner loop is aligned to 16 bytes.  :eek

hutch--

This is the version that uses Mark's mod. It is testing nearly twice as fast as the original library module with its unrolled form. I tested it on the Sempron 2.4 and it retains the same time-difference ratio to the timings I get on my PIV, so it seems Mark's mod works on both Intel and AMD hardware.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

szCopyx proc src:DWORD,dst:DWORD

    mov [esp-4],  esi

    mov esi, [esp+4]
    mov edx, [esp+8]

    mov ecx, -4

  align 4
  @@:
    add ecx, 4
    movzx eax, BYTE PTR [esi+ecx]
    mov [edx+ecx], al
    test al, al
    jz @F

    movzx eax, BYTE PTR [esi+ecx+1]
    mov [edx+ecx+1], al
    test al, al
    jz @F

    movzx eax, BYTE PTR [esi+ecx+2]
    mov [edx+ecx+2], al
    test al, al
    jz @F

    movzx eax, BYTE PTR [esi+ecx+3]
    mov [edx+ecx+3], al
    test al, al
    jnz @B

  @@:
    mov esi, [esp-4]

    ret 8

szCopyx endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««


The next one is for novelty value; it's slow but it works. It requires a DWORD-aligned target buffer.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

xorcopy proc src:DWORD,dst:DWORD

    push edi
    push ebx

    mov edx, src
    mov edi, dst
    mov ecx, -4
    xor ebx, ebx

  align 4
  @@:
    add ecx, 4
    mov DWORD PTR [edi+ecx], ebx

    xor al, al
    xor al, [edx+ecx]
    xor [edi+ecx], al
    jz @F
    xor al, al
    xor al, [edx+ecx+1]
    xor [edi+ecx+1], al
    jz @F
    xor al, al
    xor al, [edx+ecx+2]
    xor [edi+ecx+2], al
    jz @F
    xor al, al
    xor al, [edx+ecx+3]
    xor [edi+ecx+3], al
    jnz @B
  @@:

    pop ebx
    pop edi

    ret

xorcopy endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
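The trick, sketched in C (names mine, not thread code): because each destination dword is zeroed first, the XOR acts as a plain store (0 ^ c == c), and the stored value being zero doubles as the end-of-string test.

```c
#include <stddef.h>
#include <string.h>

/* C sketch of xorcopy: zero the destination dword, then XOR each source
   byte into it.  0 ^ c == c, so the XOR is a store, and a stored value
   of zero is the terminator test.  Like the assembly, this can write up
   to 3 zero bytes past the terminator, hence the DWORD-aligned (and
   sufficiently large) target buffer requirement. */
static void xor_copy(unsigned char *dst, const unsigned char *src)
{
    for (size_t i = 0; ; i += 4) {
        dst[i] = dst[i+1] = dst[i+2] = dst[i+3] = 0;  /* pre-zero dword */
        for (size_t j = 0; j < 4; j++) {
            dst[i+j] ^= src[i+j];   /* store via XOR */
            if (dst[i+j] == 0)
                return;             /* terminator copied: done */
        }
    }
}
```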
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

MichaelW

I didn't have very long to work on this, but it seems to do well even with a stack frame and with no specific alignment. I didn't have time to include the last 4 (or 5?) versions posted in my timing test.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    include \masm32\macros\macros.asm
    include timers.asm

    SzCopy_dw PROTO :DWORD,:DWORD
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data     
      str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
              "pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
      str2 db 128 dup(0)             
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT EQU 10000000

    invoke SzCopy_dw,ADDR str1,ADDR str2
    print ustr$(eax)
    print chr$(13,10)
    print ADDR str1
    print chr$("<",13,10)
    print ADDR str2
    print chr$("<",13,10)
   
    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke SzCopy_dw,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" clocks",13,10)
    invoke Sleep,250            ; reduced cycle count by ~26 ??

    mov   eax,input(13,10,"Press enter to exit...")
    exit

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; The technique used to find the terminating null was adapted
; from the MASM32 STRLEN procedure, which was adapted from an
; algorithm written by Agner Fog.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
SzCopy_dw proc uses ebx lpsrc:DWORD,lpdest:DWORD
    mov   eax,lpsrc
    mov   edx,lpdest
    sub   eax,4
    sub   edx,4
  @@:   
    mov   ebx,[eax+4]             ; get next 4 bytes
    add   eax,4
    lea   ecx,[ebx-01010101h]     ; subtract 1 from each byte
    add   edx,4
    test  ecx,80808080h
    jnz   @F                      ; jump if zero byte
    mov   [edx],ebx
    jmp   @B
  @@:
    mov   cl,[eax]
    add   eax,1
    mov   [edx],cl
    add   edx,1
    test  cl,cl
    jnz   @B
    sub   eax,lpsrc
    sub   eax,1                    ; return length w/o null
    ret
SzCopy_dw endp

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
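The `lea ecx,[ebx-01010101h]` / `test ecx,80808080h` pair is the zero-in-dword probe adapted from Agner Fog's STRLEN: subtracting 1 from every byte makes a zero byte borrow into its top bit. In C the same test looks like this (my function name, not thread code):

```c
#include <stdint.h>

/* Zero-byte probe used in SzCopy_dw: if any byte of v is 0x00, the
   per-byte subtract of 1 borrows and leaves that byte's top bit set.
   Bytes that already sit near the top of the range (0x81 and up, or
   0x80 fed by a borrow) can trigger it too, so a hit means "maybe a
   zero byte" and the byte-wise tail loop decides for real. */
static int maybe_has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & 0x80808080u) != 0;
}
```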


This was after I reversed the order of the arguments for szCopyx, szCopyP, and SzCopy_dw, and running on a P3:

lstrcpy                 : 671 clocks

SzCpy1 (movsb/cmp)      : 653 clocks
SzCpy2 (mov/cmp/inc)    : 406 clocks
SzCpy3 (mov/cmp/add)    : 529 clocks
SzCpy4 (mov/test/inc)   : 401 clocks
SzCpy5 (mov/test/add)   : 401 clocks
szCopyx (hutch unroll 4): 289 clocks
SzCpy8 (lingo unroll 8) : 147 clocks
szCopyP (Phoenix)       : 1261 clocks
szCopy_dw               : 162 clocks

eschew obfuscation

Mark_Larson

   It's funny how things get revisited on this board.  We've had lots of discussion about string length algorithms, including using MMX and SSE2.  Check this thread: http://www.old.masmforum.com/viewtopic.php?t=2632. I even posted a version that used MMX and SSE2 in that thread (along with some other people who did MMX).  My account had gotten messed up, so all my posts there will show as from "hutch--", but the account type will say "guest", whereas the real ones from hutch-- say "Site Admin" and have the big red crest with the gold lion.

  Hutch, as far as alignment goes, I posted a trick to get around that, when we did either a string length or string copy routine.  Basically you do a byte routine until the data is the appropriate alignment, and then you switch to the faster method.
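In C form, the byte-prologue Mark describes might look like this (a sketch under my own names; 16-byte alignment is chosen as the target since that is what the SSE routines need):

```c
#include <stdint.h>

/* Alignment prologue sketch: handle bytes one at a time until the source
   pointer reaches 16-byte alignment, then a wide (MMX/SSE) inner loop
   can take over.  Returns nonzero if the terminator was hit during the
   prologue, i.e. the fast path is not needed at all. */
static int copy_until_aligned(unsigned char **dst, const unsigned char **src)
{
    while (((uintptr_t)*src & 15u) != 0) {
        unsigned char c = *(*src)++;
        *(*dst)++ = c;
        if (c == 0)
            return 1;            /* string ended inside the prologue */
    }
    return 0;                    /* *src is now 16-byte aligned */
}
```

Note this aligns only the source; as with the routines in this thread, the destination may still be unaligned, which the wide loop has to tolerate (or handle with unaligned stores).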
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Jimg

Hutch-
Quote: This is the version that uses Mark's mod. It is testing nearly twice as fast as the original library module with its unrolled form. I tested it on the Sempron 2.4 and it retains the same time-difference ratio to the timings I get on my PIV, so it seems Mark's mod works on both Intel and AMD hardware.

On my Athlon it runs at 284 with best alignment, exactly the same as your previous routine, i.e. MOVZX doesn't seem to be any faster or slower on AMD chips.

Jimg

Lingo-

Your last version is excellent.  Runs in 112 cycles (with three extra bytes inserted after the align 16).  Better than twice as fast as any other so far.  Good Job :U

Now if I only understood what the heck it was doing :wink

Later.....

Ok, now I see what's going on.  Too bad that it requires, in the worst case, at least 7 extra bytes in the destination string. :(

Mark_Larson

  This is the SSE2 version that I posted on that other thread.  The test string length was 228 bytes instead of the 128 bytes we are using here, so keep that in mind when comparing timings.  It also doesn't require that you pad the end with 8 or 16 zeros, but it does require the strings to have 16-byte alignment.  I'll see if I can dig up the MMX version I wrote later, because I didn't post it to the forums (I checked).  For the MMX version we can do a quick conversion and see how many cycles per byte we are talking about, because I give timing info for both the MMX and SSE2 versions.

168 cycles / 228 bytes ~ 0.737 cycles per byte for the MMX version I need to dig up off my home computer (I didn't post it to the thread).
68 cycles / 228 bytes ~ 0.298 cycles per byte for the SSE2 version I posted below.

Scaling to the 128-byte string used in this thread (cycles per byte assumed constant):

MMX:  168 * 128 / 228 ~ 94.3 cycles
SSE2:  68 * 128 / 228 ~ 38.2 cycles

  Obviously putting an SSE2 version of szCopy in the library is out of the question, since not as many people could use it.  But I think doing an MMX version won't be a problem; I don't know anyone who has a CPU that doesn't support MMX.
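The cycle estimates above are plain proportional scaling (assuming cycles per byte stays constant across string lengths), which a couple of lines of C can double-check:

```c
/* Proportional rescaling of a cycle count measured on a 228-byte string
   to an estimated count for a different string length. */
static double scale_cycles(double cycles_at_228, double new_length)
{
    return cycles_at_228 * new_length / 228.0;
}
```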

Quote
Donkey posted some strlen code under the GoAsm forum. I shaved 48 cycles off his code by doing something similar to what I did with your stuff, Michael. I just now converted his MMX code to SSE2; it won't work on a P3 or below. The MMX version of Donkey's code, changed by me to run faster on the P4, finds the strlen of a 228-byte string in 168 cycles. The version below using SSE2 finds it in 68 cycles!!! So BIG improvement.

Code:

szLenSSE2 proc near pString:ptr byte

   mov eax,[pString] ; mov the offset of the string into eax

   pxor xmm0,xmm0 ;remove dependencies
   pxor   xmm2,xmm2 ;remove dependencies
   pxor xmm1,xmm1 ;fill XMM reg 1 with NULLs
   pxor   xmm3,xmm3 ;FILL XMM reg 3 with NULLs

;align 16 ; align for code isn't needed on the P4
   ; The following loop starts at exactly 16 bytes into the proc
   ; it is also exactly 16 bytes long, this makes for optimal execution
scan_loop:
   movdqa xmm0,[eax] ; mov 16 bytes of the string into XMM reg 0
   pshufd xmm2,[eax+16],0e4h ; mov the next 16 bytes of the string into XMM reg 2
   pcmpeqb xmm0,xmm1 ; cmp each byte with NULL, set the matching byte to 0FFh if equal
   pcmpeqb xmm2,xmm3 ; cmp each byte with NULL, set the matching byte to 0FFh if equal

   pmovmskb ecx,xmm0 ; This moves the most signf. bit of each byte to ecx
   add eax,32 ; increment the string address
   pmovmskb edx,xmm2 ; This moves the most signf. bit of each byte to edx
   or ecx,ecx ; Was any bit set ?
   jnz end_loop
   or edx,edx
   jz scan_loop ; no then continue the loop, no zero was found
   mov   ecx,edx     ; Move EDX into ECX for the BSF later
   add   eax,16      ; the NULL is in the second 16-byte block, so bias by 16

end_loop:

   sub eax,[pString] ; subtract the original address

   bsf ecx,ecx ; find out which bit is set (where the NULL is)
   sub eax,32 ; subtract the last increment from eax
   add eax,ecx ; add the number of bytes to the NULL

   RET
szLenSSE2 endp
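For comparison, the same compare / movemask idea written with SSE2 intrinsics in C (a simplified one-register sketch of my own, not Mark's two-register unroll; it assumes the string's buffer is padded so a 16-byte read past the terminator is safe):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Scan 16 bytes at a time: pcmpeqb against zero, pmovmskb to collect one
   bit per byte, then the lowest set bit locates the terminator (the
   while loop below plays the role of BSF). */
static size_t sse2_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask) {
            size_t j = 0;
            while (!(mask & 1)) { mask >>= 1; j++; }
            return i + j;
        }
    }
}
```

Using `_mm_loadu_si128` sidesteps the 16-byte alignment requirement of the MOVDQA version above, at some cost on older hardware.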
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

MichaelW

I got my version down to 127 cycles on a P3, using 486-compatible code. I have no way to test on a P4.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    include \masm32\macros\macros.asm
    include timers.asm

    SzCopy_dw PROTO :DWORD,:DWORD
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data     
      str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
              "pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
      str2 db 128 dup(0)             
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT EQU 10000000

    invoke SzCopy_dw,ADDR str1,ADDR str2
    print ustr$(eax)
    print chr$(13,10)
    print ADDR str1
    print chr$("<",13,10)
    print ADDR str2
    print chr$("<",13,10,13,10)
   
    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      invoke SzCopy_dw,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" clocks",13,10)
    invoke Sleep,250            ; reduced cycle count by ~26 ??

    mov   eax,input(13,10,"Press enter to exit...")
    exit

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; The technique used to find the terminating null was adapted
; from the MASM32 STRLEN procedure, which was adapted from an
; algorithm written by Agner Fog.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
SzCopy_dw proc uses ebx lpsrc:DWORD,lpdest:DWORD
    mov   eax,lpsrc
    mov   edx,lpdest
    sub   eax,4
    sub   edx,4
  @@:
    REPEAT 8                    ; unroll x 8
      mov   ebx,[eax+4]         ; get next 4 bytes
      add   eax,4
      lea   ecx,[ebx-01010101h] ; subtract 1 from each byte
      add   edx,4
      test  ecx,80808080h       ; test for any -1 bytes
      jnz   @F                  ; jump if zero byte present
      mov   [edx],ebx           ; otherwise copy 4 bytes
    ENDM
    jmp   @B
  @@:
    mov   cl,[eax]              ; finish up moving bytes
    add   eax,1
    mov   [edx],cl
    add   edx,1
    test  cl,cl
    jnz   @B
    sub   eax,lpsrc
    sub   eax,1                 ; return length w/o null
    ret
SzCopy_dw endp

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start




[attachment deleted by admin]
eschew obfuscation

Jimg

MichaelW

That's a real nice routine.  Unfortunately, it gets much slower if there is a high-ASCII character near the beginning of the string :(  Maybe replace the last part with one of the other routines when high ASCII is found?
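Jimg's slowdown has a concrete cause: the `(v - 01010101h) and 80808080h` probe fires on any byte of 81h or above (and on an 80h byte fed by a borrow), not only on zero bytes, so high-ASCII data keeps kicking SzCopy_dw out to its byte-wise tail loop. The classic repair ANDs in `not v`, which rejects bytes whose top bit was already set and reports only genuine zeros. A C sketch of both forms (names mine):

```c
#include <stdint.h>

/* The probe SzCopy_dw uses: false-positives on high-ASCII bytes. */
static int probe_loose(uint32_t v)
{
    return ((v - 0x01010101u) & 0x80808080u) != 0;
}

/* The exact form: AND with ~v masks off bytes whose top bit was set to
   begin with, so only genuine zero bytes report. */
static int probe_exact(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}
```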

Mark_Larson

  Michael, here are the timings on my work P4.  I believe it's a 1.5 GHz.


127
Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwX
xYyZz Now I Know My ABC's, Won't You Come Play <
Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwX
xYyZz Now I Know My ABC's, Won't You Come Play <

148 clocks

Press enter to exit...




  I'll take a stab at speeding it up some too.  Yours is real generic like Hutch's, and faster.

EDIT:  I haven't tried this since I have a P4, but Michael, have you tried adding an align for the inner loop and for the procedure?
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Mark_Larson


  I tried a number of things, and a few of them gave me some speedups.

  First, the last part re-reads the last dword a byte at a time when it is already in a register:


hi_mom:
    shr   ebx,8
  @@:
;   the dword is already IN a register, so why re-read it from memory?
;   mov   cl,[eax]              ; finish up moving bytes
;   add   eax,1
    mov   [edx],bl
    add   edx,1
    test  bl,bl
    jnz   hi_mom
    sub   eax,lpsrc
    sub   eax,1                 ; return length w/o null
    ret
SzCopy_dw endp


  Second, I tried unrolling the loop 128 times with the original algorithm (not including any other changes), and I got it from 148 cycles down to 114 cycles.  So if we pick some middle ground, we can probably get a decent speedup from this.

  Third, we can try removing the Epilogue and Prologue.  That is usually good for a few cycles.

  Fourth I re-wrote the inner loop to work differently.  It now handles two bytes in semi-parallel.  It runs about 8 cycles faster than the original code.  But with all the other stuff I posted above, it adds up.


SzCopy_dw proc uses ebx esi edi lpsrc:DWORD,lpdest:DWORD
    mov   eax,lpsrc
    mov   edx,lpdest

  @@:

    REPEAT 128                  ; unroll x 128
      mov   ebx,[eax]           ; get next 4 bytes
      mov   esi,[eax+4]         ; and the 4 bytes after that
      lea   ecx,[ebx-01010101h] ; subtract 1 from each byte
      add   eax,4
      test  ecx,80808080h       ; test for any -1 bytes
      jnz   @F                  ; jump if zero byte present
      lea   edi,[esi-01010101h] ; subtract 1 from each byte
      add   eax,4
      mov   [edx],ebx           ; otherwise copy 4 bytes
      add   edx,4
      test  edi,80808080h       ; test for any -1 bytes
      jnz   @F                  ; jump if zero byte present
      mov   [edx],esi           ; otherwise copy 4 bytes
      add   edx,4
    ENDM
    jmp   @B
    jmp   @B
  @@:
    sub   eax,4                 ; back up to the dword containing the zero byte
  @@:
    mov   cl,[eax]              ; finish up moving bytes
    add   eax,1
    mov   [edx],cl
    add   edx,1
    test  cl,cl
    jnz   @B
    sub   eax,lpsrc
    sub   eax,1                 ; return length w/o null
    ret
SzCopy_dw endp


  Fifth, for non-P4 processors aligning the inner loop should hopefully give a speed up.

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm