SzCpy vs. lstrcpy

Started by Mark Jones, May 09, 2005, 01:23:45 AM

Phoenix

#15
Results for W2K SP4 / Athlon64 3000+....

128-byte string copy timing results:

lstrcpy                       : 355 clocks

SzCpy1 (movsb/cmp)    MJ      : 663 clocks
SzCpy2 (mov/cmp/inc)  MJ      : 534 clocks
SzCpy3 (mov/cmp/add)  MJ      : 661 clocks
SzCpy4 (mov/test/inc) MJ      : 281 clocks
SzCpy5 (mov/test/add) MJ      : 402 clocks
SzCopy (mov/test/add) Hutch   : 282 clocks
SzCpy7 (mov/add/lea)  Lingo   : 161 clocks
szCopyx (4x unroll)   Hutch   : 230 clocks

Edit:

Same box, but WindowsXP 64bit beta:

lstrcpy                       : 311 clocks

SzCpy1 (movsb/cmp)    MJ      : 661 clocks
SzCpy2 (mov/cmp/inc)  MJ      : 532 clocks
SzCpy3 (mov/cmp/add)  MJ      : 659 clocks
SzCpy4 (mov/test/inc) MJ      : 279 clocks
SzCpy5 (mov/test/add) MJ      : 401 clocks
SzCopy (mov/test/add) Hutch   : 281 clocks
SzCpy7 (mov/add/lea)  Lingo   : 161 clocks
szCopyx (4x unroll)   Hutch   : 230 clocks

I wonder why lstrcpy gets faster? I expected it to be slower because of WOW...

Jimg

Quote
Mark, you missed the trick that Lingo was trying to do.  If the destination buffer is not all 0's, the code will run ok, but will produce incorrect results.  He adds the destination and source together in place of a move, and that only works if the destination is all 0's.

Duh...  I get it now: you can't run it more than once on the same buffer, so the normal test-a-million-times loop won't work.  No way to test effectively.

Quote
1) On the P4 that you ran this code on, both MOV and ADD run in the same time (0.5 cycles).  So timing-wise, if you change your add below to a MOV it should theoretically perform the same.  Both MOV and ADD go through the same execution unit (the ALU), so they should run at the same speed.  If they don't, that would be really surprising.  That will also get rid of your dependency on having an all-0 destination buffer.
Then the jnz  @B won't have a condition to test (from the add).

Adding any form of zeroing the output or comparing bl puts it in the same speed range as the rest.

lingo

Mark Jones,  :lol

Here is a faster (unrolled by 8) variant:
OPTION  PROLOGUE:NONE
OPTION  EPILOGUE:NONE
align   16
db 90h,90h,90h
SzCpy8  proc  SzDest:DWORD, SzSource:DWORD
        mov   ecx, [esp+8]     ; ecx = source
        mov   edx, [esp+4]     ; edx = destination
        push  ebx
        xor   eax, eax
        mov   bl, [ecx]        ; read byte
@@:     
        add   [edx+eax], bl    ; write byte
        mov   bl, [ecx+eax+1]  ; read byte
        jz    @F               ; if zero exit
        add   [edx+eax+1], bl  ; write byte
        mov   bl, [ecx+eax+2]  ; read byte
        jz    @F               ; if zero exit
        add   [edx+eax+2], bl  ; write byte
        mov   bl, [ecx+eax+3]  ; read byte
        jz    @F               ; if zero exit
        add   [edx+eax+3], bl  ; write byte
        mov   bl, [ecx+eax+4]  ; read byte
        jz    @F               ; if zero exit
        add   [edx+eax+4], bl  ; write byte
        mov   bl, [ecx+eax+5]  ; read byte
        jz    @F               ; if zero exit
        add   [edx+eax+5], bl  ; write byte
        mov   bl, [ecx+eax+6]  ; read byte
        jz    @F               ; if zero exit
        add   [edx+eax+6], bl  ; write byte
        mov   bl, [ecx+eax+7]  ; read byte
        jz    @F               ; if zero exit
        add   [edx+eax+7], bl  ; write byte
        lea   eax, [eax+8]
        mov   bl, [ecx+eax]    ; read byte
        jnz   @B               ; keep going
@@:     
        pop   ebx
        ret   2*4
SzCpy8  endp
OPTION  PROLOGUE:PrologueDef
OPTION  EPILOGUE:EpilogueDef


Regards,
Lingo

lingo

Jimg,  :lol

Quote
Duh...  I get it now, you can't test it multiple times so the normal test a million times loops won't work.  No way to test effectively.


; -----------------------------------------------   
    print  addr SzSzCpy8
      xor  ecx, ecx
      mov  edx, 128
      mov  eax, offset String2
@@:
      mov  [eax+edx-4], ecx
      add  edx, -4
      jnz  @B
   
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
    invoke SzCpy8, addr String2, addr String1
    counter_end
    print ustr$(eax)
    print chr$(" clocks",13,10)
    invoke Sleep,250
; -----------------------------------------------

If you add zero or any other value to the register,
the time is the same and the test is correct.

Regards,
Lingo



Mark_Larson

Quote from: Petroizki on May 10, 2005, 05:46:19 PM
Quote from: Mark_Larson on May 10, 2005, 04:30:28 PM
5) comment below.
The three nop's make the loop itself aligned to 16.

  No they don't.  They force it to be unaligned.  They were put after an "align 16".  The code will be aligned to a 16 byte boundary up to the ALIGN 16, and then the code will be 3 bytes off of a 16 byte boundary.  On a P4 you won't notice the difference, because it doesn't really suffer from code alignment problems.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Jimg

Quote
If you add zero or other value to the register
the time is the same and the test is correct.
Logically, you are correct.  Realistically, since everything seems to have some strange effect on timing, I want to see a correct result after the million-times loop!

After more testing, I am convinced that the repeated overflow condition is affecting the test somehow, but it's just too many steps to trace by hand.  Changing the add to a mov followed by a test bl,bl shouldn't double the execution time.  What's going on here?

Phoenix

What do you think about this one?

szCopyP proc src:DWORD,dst:DWORD

    mov ecx,src      ; ecx = source
    mov edx,dst      ; edx = destination

    xor eax,eax
@@:
    add al,[ecx]     ; read byte
    jz isByte        ; Result = 0?
    shl eax,8        ; shift al into ah
    add al,[ecx+1]   ; read byte   
    jz isWord
    shl eax,8
    add al,[ecx+2]
    jz isByteWord
    shl eax,8
    add al,[ecx+3]
    jz isDWord
    mov [edx],eax
    add ecx,4        ; inc pointers
    add edx,4
    jmp @B

isByte:
    mov [edx],al
    jmp done
isWord:
    mov [edx],ax
    jmp done
isByteWord:
    mov [edx+1],ax
    shr eax,16
    mov [edx],al
    jmp done
isDWord:
    mov [edx],eax

done:
   Ret
szCopyP EndP


Results (W2K SP4 / Athlon64 3000+):

128-byte string copy timing results:

lstrcpy                       : 354 clocks

SzCpy1 (movsb/cmp)    MJ      : 663 clocks
SzCpy2 (mov/cmp/inc)  MJ      : 534 clocks
SzCpy3 (mov/cmp/add)  MJ      : 661 clocks
SzCpy4 (mov/test/inc) MJ      : 281 clocks
SzCpy5 (mov/test/add) MJ      : 403 clocks
SzCopy (mov/test/add) Hutch   : 282 clocks
SzCpy7 (mov/add/lea)  Lingo   : 161 clocks
szCopyx (4x unroll)   Hutch   : 230 clocks
szCopyP (DWORD mov)   Phoenix : 250 clocks


Not too bad, I think?

lingo

Phoenix,  :lol
Quote
Not too bad, I think?

Not too bad, but you must move the @@ label
to before the xor eax, eax,

and please include my last procedure in your test...

Regards,
Lingo

hutch--

Lingo's technique is an interesting one and while it is not properly general purpose, it will be very useful where you are preallocating a zero filled buffer then repeatedly appending byte data to it at the end of the last write. To benchmark it you would have to allocate a large enough zero filled buffer then progressively append string data to the end of it for a duration of about a half a second and you should get a reasonable idea of how fast it is in that context against the others.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

Hutch, :lol

Quote
"..for a duration of about a half a second and you should get a reasonable idea of how fast it is.."

If we use
invoke HeapAlloc, hHeap, dwFlags, dwBytes

it is very interesting how you measured  "a half a second" difference
between the usage of  the two parameters:

dwFlags = 0
and
dwFlags = HEAP_ZERO_MEMORY

Regards,
Lingo

hutch--

Lingo,

The comment was to make the test piece large enough that the test would run for about half a second, so that the results were more reliable.  They don't improve much in accuracy beyond that.

Phoenix

Lingo,

Quote
Not too bad but you must move
@@ before xor eax, eax

:red Of course, you are right.... but with the fix, the result is the same: 250 clocks (?)
With an align 16, the result is 216 clocks...

The result for SzCpy8 is 161 clocks on my box, but...

Did you check the resulting strings from SzCpy7 and SzCpy8?  I added a print addr String2 after your procs, and the result does not seem to be the expected one.

Edit: The result string is      Ë♂V­ýºÓ▬©­↑¶‗áÓz<∟♦∟♦

Regards, Phoenix


Mark_Larson

Quote from: Jimg on May 10, 2005, 06:39:12 PM
Quote
1) On the P4 that you ran this code on, both MOV and ADD run in the same time (0.5 cycles).  So timing-wise, if you change your add below to a MOV it should theoretically perform the same.  Both MOV and ADD go through the same execution unit (the ALU), so they should run at the same speed.  If they don't, that would be really surprising.  That will also get rid of your dependency on having an all-0 destination buffer.
Then the jnz  @B won't have a condition to test (from the add).

Adding any form of zeroing the output or comparing bl puts it in the same speed range as the rest.

I guess I wasn't clear enough.  I wanted to change it to use ECX and MOVZX, and then do a JECXZ.  I'm not sure of the speed difference.

This uses a MOV instead of an ADD, and a MOVZX instead of an 8-bit move, to get rid of partial register stalls.  And I got rid of the LEA, since I don't need to preserve the flags and LEA is slow.  That's 3 of my points above.


;esi points to the source
        xor    eax, eax
        movzx  ecx, byte ptr [esi]      ; read byte
@@:
        mov    [edx+eax], cl            ; I use the fact that the
;       lea    eax, [eax+1]             ;   destination buffer is zero filled!!!
        add    eax, 8                   ; we can do that because we don't have to
                                        ;   worry about the flags, since we are using JECXZ
        jecxz  @F
        movzx  ecx, byte ptr [esi+eax]  ; read byte
        jmp    @B                       ; keep going




Lingo, you can use a REPEAT macro to make the code smaller.  I didn't unroll it the full 8 times using REPEAT because you have the LEA in the middle of the sequence.  You could technically swap the LEA and the line after it and do the full 8 lines of unrolling.  This also lets you easily play with different amounts of unrolling without having to do cutting and pasting and modifying offsets (ewww).


@@:     
UNROLL_AMT  equ  7
looper = 0
REPEAT UNROLL_AMT
        add   [edx+eax+looper], bl    ; write byte
        mov   bl, [ecx+eax+looper+1]  ; read byte
        jz    @F               ; if zero exit
looper = looper + 1
endm

        add   [edx+eax+UNROLL_AMT], bl  ; write byte
        lea   eax, [eax+UNROLL_AMT+1]
        mov   bl, [ecx+eax]    ; read byte
        jnz   @B               ; keep going


Jimg

#28
Mark_Larson-

Quote
;esi points to the source
        xor    eax, eax
        movzx  ecx, byte ptr [esi]      ; read byte
@@:
        mov    [edx+eax], cl            ; I use the fact that the
;       lea    eax, [eax+1]             ;   destination buffer is zero filled!!!
        add    eax, 8                   ; we can do that because we don't have to
                                        ;   worry about the flags, since we are using JECXZ
        jecxz  @F
        movzx  ecx, byte ptr [esi+eax]  ; read byte
        jmp    @B                       ; keep going

Looks OK, except add eax,8 needs to be add eax,1, doesn't it?

I used this-
align 16
db 8 dup (0)   ; 8=411
SzCpy16 proc SzDest:DWORD, SzSource:DWORD
    mov    edx, SzDest
    mov    esi, SzSource
    xor    eax, eax
    movzx  ecx,byte ptr [esi]   ; read byte
@@:     
    mov    [edx+eax], cl
    add    eax,1
    jecxz  @F
    movzx  ecx, byte ptr [esi+eax]  ; read byte
    jmp    @B                       ; keep going
@@:
    ret
SzCpy16 endp


best time was 411 cycles.


You are right about the lea, though.  This mod of Lingo's runs in 130 cycles:

align 16
db 3 dup (0)
SzCpy17   proc uses ebx SzDest:DWORD, SzSource:DWORD
    mov ecx, SzSource       ; ecx = source
    mov edx, SzDest     ;  edx= destination
    xor eax, eax
    mov bl, [ecx]  ; read byte
    sub eax,1
@@:   
    add eax, 1
    add [edx+eax], bl  ; write byte
    mov bl, [ecx+eax]  ; read byte
    jnz  @B               ; keep going
    ret
SzCpy17 endp


But I still can't figure out why the add should be three times faster than a mov followed by a test bl,bl.

hutch--

I played with the unrolled version a little more.  The register alternation was not making it any faster, so I reduced the entry and exit overhead and removed the stack frame, and it is showing a little faster on my PIV.  Note that this version, like most of the rest, does not require a zero-filled buffer and can be considered general purpose.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

szCopyx proc src:DWORD,dst:DWORD

    mov [esp-4],  esi

    xor eax, eax
    mov esi, [esp+4]
    mov edx, [esp+8]

    mov ecx, -4

  align 4
  @@:
    add ecx, 4
    mov al, [esi+ecx]
    mov [edx+ecx], al
    test al, al
    jz @F

    mov al, [esi+ecx+1]
    mov [edx+ecx+1], al
    test al, al
    jz @F

    mov al, [esi+ecx+2]
    mov [edx+ecx+2], al
    test al, al
    jz @F

    mov al, [esi+ecx+3]
    mov [edx+ecx+3], al
    test al, al
    jnz @B

  @@:
    mov esi, [esp-4]

    ret 8

szCopyx endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««