Hello, I tried to improve on the lstrcpy function but only had marginal results. It appears that INC/DEC is faster than ADD/SUB on AMD processors. Can anyone suggest any further refinements? These results are with Windows XP Pro SP1, AMD XP 1800+
128-byte string copy timing results:
lstrcpy : 432 clocks
SzCpy1 (movsb/cmp) : 540 clocks
SzCpy2 (mov/cmp/inc) : 540 clocks
SzCpy3 (mov/cmp/add) : 666 clocks
SzCpy4 (mov/test/inc): 344 clocks
SzCpy5 (mov/test/add): 479 clocks
[attachment deleted by admin]
These are the results I got, Mark.
128-byte string copy timing results:
lstrcpy : 755 clocks
SzCpy1 (movsb/cmp) : 1185 clocks
SzCpy2 (mov/cmp/inc) : 582 clocks
SzCpy3 (mov/cmp/add) : 454 clocks
SzCpy4 (mov/test/inc): 593 clocks
SzCpy5 (mov/test/add): 445 clocks
I am a bit busy at the moment, but if you have the time, will you plug this MASM32 library module into your test piece to see how it works? It uses an index in the addressing mode and ends up with a lower loop instruction count.
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * -----------------------------------------------
copied length minus terminator is returned in EAX
----------------------------------------------- *
align 16
szCopy proc src:DWORD,dst:DWORD
push ebp
mov edx, [esp+8]
mov ebp, [esp+12]
xor ecx, ecx ; ensure no previous content in ECX
mov eax, -1
@@:
add eax, 1
mov cl, [edx+eax]
mov [ebp+eax], cl
test cl, cl
jnz @B
pop ebp
ret 8
szCopy endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
The box is a 2.8 GHz Prescott PIV.
Hi Mark-
The Athlon is very sensitive to alignment. I tried some alignment tests. Before the changes, my times were virtually identical to yours. Here are my results with alignment changes.
(By the way, the last two, SzCpy4 and SzCpy5, copied from destination to source; I changed them for my tests.)
lstrcpy : 432 clocks
SzCpy1 (movsb/cmp) : 540 clocks
SzCpy2 (mov/cmp/inc) : 526 clocks
SzCpy3 (mov/cmp/add) : 408 clocks
SzCpy4 (mov/test/inc): 325 clocks
SzCpy5 (mov/test/add): 342 clocks
[attachment deleted by admin]
Hi Mark,
Here are the results from my machine (AMD Athlon 500 MHz):
128-byte string copy timing results:
lstrcpy : 378 clocks
SzCpy1 (movsb/cmp) : 536 clocks
SzCpy2 (mov/cmp/inc) : 535 clocks
SzCpy3 (mov/cmp/add) : 660 clocks
SzCpy4 (mov/test/inc): 341 clocks
SzCpy5 (mov/test/add): 421 clocks
Here are the times on my Sempron 2.4
128-byte string copy timing results:
lstrcpy : 398 clocks
SzCpy1 (movsb/cmp) : 532 clocks
SzCpy2 (mov/cmp/inc) : 534 clocks
SzCpy3 (mov/cmp/add) : 655 clocks
SzCpy4 (mov/test/inc): 339 clocks
SzCpy5 (mov/test/add): 424 clocks
My Celeron:
128-byte string copy timing results:
lstrcpy : 608 clocks
SzCpy1 (movsb/cmp) : 1366 clocks
SzCpy2 (mov/cmp/inc) : 644 clocks
SzCpy3 (mov/cmp/add) : 511 clocks
SzCpy4 (mov/test/inc): 575 clocks
SzCpy5 (mov/test/add): 456 clocks
P3-500, including the procedure that Hutch posted:
lstrcpy : 672 clocks
SzCpy1 (movsb/cmp) : 654 clocks
SzCpy2 (mov/cmp/inc) : 406 clocks
SzCpy3 (mov/cmp/add) : 529 clocks
SzCpy4 (mov/test/inc): 402 clocks
SzCpy5 (mov/test/add): 402 clocks
szCopy : 289 clocks
And some strange and not very useful timings for an AMD K5:
lstrcpy : 531 clocks
SzCpy1 (movsb/cmp) : 984 clocks
SzCpy2 (mov/cmp/inc) : 984 clocks
SzCpy3 (mov/cmp/add) : 984 clocks
SzCpy4 (mov/test/inc): 984 clocks
SzCpy5 (mov/test/add): 984 clocks
:lol
WinXP Pro SP2 / Pentium 4 560J, 3.6 GHz (Prescott), including Hutch's procedure
and my procedure SzCpy7.
It is listed first because I want to use the zeroes from the
destination buffer...:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 90h,90h,90h
SzCpy7 proc SzDest:DWORD, SzSource:DWORD
mov ecx, [esp+8] ; ecx = source
mov edx, [esp+4] ; edx= destination
push ebx
xor eax, eax
mov bl, [ecx] ; read byte
@@:
add [edx+eax], bl ; I use the fact ->the destination
lea eax, [eax+1] ; buffer is zero filled!!!
mov bl, [ecx+eax] ; read byte
jnz @B ; keep going
pop ebx
ret 2*4
SzCpy7 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Microsoft (R) Macro Assembler Version 7.10.3077
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: SzCpy.asm
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation. All rights reserved.
Press any key to continue . . .
SzCpy/lstrcpy speed comparison by Mark Jones (gzscuqn02ATsneakemailD0Tcom)
Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.
128-byte string copy timing results:
SzCpy7 (mov/add/lea->Lingo): 172 clocks
lstrcpy : 646 clocks
SzCpy1 (movsb/cmp) : 1236 clocks
SzCpy2 (mov/cmp/inc) : 667 clocks
SzCpy3 (mov/cmp/add) : 532 clocks
SzCpy4 (mov/test/inc): 602 clocks
SzCpy5 (mov/test/add): 453 clocks
SzCpy6 (mov/test/add->Hutch): 545 clocks
Press ENTER to exit...
Regards,
Lingo
[attachment deleted by admin]
Athlon64 2800+
128-byte string copy timing results:
SzCpy7 (mov/add/lea->Lingo): 157 clocks
lstrcpy : 327 clocks
SzCpy1 (movsb/cmp) : 714 clocks
SzCpy2 (mov/cmp/inc) : 574 clocks
SzCpy3 (mov/cmp/add) : 712 clocks
SzCpy4 (mov/test/inc): 301 clocks
SzCpy5 (mov/test/add): 432 clocks
SzCpy6 (mov/test/add->Hutch): 303 clocks
Here is a quick play; this one is faster on my PIV by a ratio of 270 to 200 ms against the library version of szCopy. The idea was register alternation with the source and destination addresses and a 4-times unroll. The first dropped the times a little and, in conjunction with the unroll, dropped some more.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
szCopyx proc src:DWORD,dst:DWORD
push ebx
push esi
push edi
xor ecx, ecx
mov esi, src
mov edi, dst
mov ebx, src
mov edx, dst
mov eax, -4
align 4
@@:
add eax, 4
mov cl, [esi+eax]
mov [edi+eax], cl
test cl, cl
jz @F
mov cl, [ebx+eax+1]
mov [edx+eax+1], cl
test cl, cl
jz @F
mov cl, [esi+eax+2]
mov [edi+eax+2], cl
test cl, cl
jz @F
mov cl, [ebx+eax+3]
mov [edx+eax+3], cl
test cl, cl
jnz @B
@@:
pop edi
pop esi
pop ebx
ret
szCopyx endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
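For readers following along in a higher-level language, the unrolled loop above can be sketched in C (an illustrative sketch only; `szcopy_unrolled4` is a made-up name, and a compiler will not reproduce the exact register alternation):

```c
#include <stddef.h>

/* Illustrative C version of the 4x-unrolled copy loop: each slot
   copies one byte and exits as soon as the zero terminator is stored,
   so the loop branch is taken only once per four bytes. */
void szcopy_unrolled4(const char *src, char *dst)
{
    size_t i = 0;
    for (;;) {
        if ((dst[i]     = src[i])     == 0) return;
        if ((dst[i + 1] = src[i + 1]) == 0) return;
        if ((dst[i + 2] = src[i + 2]) == 0) return;
        if ((dst[i + 3] = src[i + 3]) == 0) return;
        i += 4;
    }
}
```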
Wow, we have some interesting results so far! Lingo's algorithm didn't seem to slow down any when the destination string was not all zeroes. Added is Hutch's 4x unroll:
XP SP1/AMD XP 1800+
128-byte string copy timing results:
lstrcpy : 433 clocks
SzCpy1 (movsb/cmp) MJ : 540 clocks
SzCpy2 (mov/cmp/inc) MJ : 541 clocks
SzCpy3 (mov/cmp/add) MJ : 666 clocks
SzCpy4 (mov/test/inc) MJ : 344 clocks
SzCpy5 (mov/test/add) MJ : 477 clocks
SzCopy (mov/test/add) Hutch : 367 clocks
SzCpy7 (mov/add/lea) Lingo : 148 clocks
szCopyx (4x unroll) Hutch : 300 clocks
Press ENTER to exit...
[attachment deleted by admin]
The latest one gives me
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
masm32.lib(DW2A.obj) : error LNK2001: unresolved external symbol _wsprintfA
SzCpy.exe : fatal error LNK1120: 1 unresolved externals
How about moving a word at a time?
align 16
SzCpy12 proc uses ebx SzDest:DWORD, SzSource:DWORD
xor ecx, ecx
mov eax, -4
mov ebx, SzSource
mov edx, SzDest
align 4
@@:
add eax,4
mov cx, [ebx+eax]
test cl, cl
jz move1 ; done, go move zero byte
mov [edx+eax], cx ; save both
test ch,ch
jz @f ; all done4
mov cx, [ebx+eax+2]
test cl, cl
jz move2
mov [edx+eax+2], cx
test ch,ch
jnz @b
@@:
ret
move1:
mov [edx+eax],cl
ret
move2:
mov [edx+eax+2],cl
ret
SzCpy12 endp
runs in 255 on my Athlon
p.s.
Did anyone try Lingo's routine? It doesn't give me the correct results; I must be doing something wrong.
Mark, you missed the trick that Lingo was trying to do. If the destination buffer is not all 0's, the code will run ok, but will produce incorrect results. He adds the destination and source together in place of a move, and that only works if the destination is all 0's.
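In other words, Lingo replaces the store with an addition so that one instruction both writes the byte and sets the zero flag, which is only correct when every destination byte starts at zero. A rough C sketch of the idea (`add_copy` is a hypothetical name, not Lingo's code):

```c
#include <stddef.h>

/* Sketch of Lingo's trick: dst[i] += src[i] equals a plain copy only
   when dst[i] was 0 beforehand, and the sum is 0 exactly when the
   terminator has just been written. */
void add_copy(const char *src, unsigned char *dst)
{
    size_t i = 0;
    do {
        dst[i] = (unsigned char)(dst[i] + src[i]); /* add, not mov */
    } while (src[i++] != 0);
}
```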
Lingo, a couple of points. I haven't tried any of this, just things that could probably make it faster.
1) On the P4 that you ran this code on, both MOV and ADD run in the same time (0.5 cycles). So timing-wise, if you change your ADD below to a MOV it should theoretically perform the same. Both MOV and ADD go through the same execution unit (ALU), so they should run at the same speed. If not, that would be really surprising. That will also get rid of your dependency on having an all-zero destination buffer.
2) "mov bl, [ecx]" is going to cause a partial register stall. Even on the P4 you can still get "false partial register stalls". Change it to a MOVZX, or add a "xor ebx,ebx" before the move.
3) LEA is the slowest way to increment a number on a P4; ADD/SUB is the fastest. Also try INC/DEC.
lea eax, [eax+1] ; buffer is zero filled!!!
mov bl, [ecx+eax] ; read byte
4) The above code snippet has a stall (a read-after-write dependency) on the previous line. Instead, try switching the two lines and using "ecx+eax+1" for the memory address. That should free up a stall.
5) comment below.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 90h,90h,90h - I am clueless why you added 3 NOPs before the procedure; this makes it unaligned. For the P4 that is not a big deal, but for non-P4s it is going to cause a loss in performance.
SzCpy7 proc SzDest:DWORD, SzSource:DWORD
6) Advanced trick - try interleaving two instances of the code that use different register sets to avoid stalls. You just need to free up enough registers. If you'd like me to explain this in more detail, just ask.
I'll post a bit of it. I have not verified this runs faster, but it should help break up some stalls. I also used your original code, so I didn't make any of the changes I suggested above.
;ecx is source
;edi is destination - changed this from EDX to EDI, to free up a byte accessible register.
;esi is the same as eax, just a different offset
;all new code left justified
xor eax, eax
mov esi,1
mov bl, [ecx] ; read byte
mov dl,[ecx+1]
@@:
add [edi+eax], bl ; I use the fact ->the destination
add [edi+esi],dl
lea eax, [eax+1] ; buffer is zero filled!!!
lea esi,[esi+1]
mov bl, [ecx+eax] ; read byte
mov dl,[ecx+esi]
Quote from: Mark_Larson on May 10, 2005, 04:30:28 PM
5) comment below.
The three NOPs make the loop itself aligned to 16.
Results for W2K SP4 / Athlon64 3000+....
128-byte string copy timing results:
lstrcpy : 355 clocks
SzCpy1 (movsb/cmp) MJ : 663 clocks
SzCpy2 (mov/cmp/inc) MJ : 534 clocks
SzCpy3 (mov/cmp/add) MJ : 661 clocks
SzCpy4 (mov/test/inc) MJ : 281 clocks
SzCpy5 (mov/test/add) MJ : 402 clocks
SzCopy (mov/test/add) Hutch : 282 clocks
SzCpy7 (mov/add/lea) Lingo : 161 clocks
szCopyx (4x unroll) Hutch : 230 clocks
Edit:
Same box, but WindowsXP 64bit beta:
lstrcpy : 311 clocks
SzCpy1 (movsb/cmp) MJ : 661 clocks
SzCpy2 (mov/cmp/inc) MJ : 532 clocks
SzCpy3 (mov/cmp/add) MJ : 659 clocks
SzCpy4 (mov/test/inc) MJ : 279 clocks
SzCpy5 (mov/test/add) MJ : 401 clocks
SzCopy (mov/test/add) Hutch : 281 clocks
SzCpy7 (mov/add/lea) Lingo : 161 clocks
szCopyx (4x unroll) Hutch : 230 clocks
I wonder why lstrcpy gets faster? I expected it to be slower because of WOW...
QuoteMark, you missed the trick that Lingo was trying to do. If the destination buffer is not all 0's, the code will run ok, but will produce incorrect results. He adds the destination and source together in place of a move, and that only works if the destination is all 0's.
Duh... I get it now. You can't test it multiple times, so the normal test-a-million-times loop won't work. No way to test effectively.
Quote1) on the P4 that you ran this code on both MOV and ADD run in the same time ( 0.5 cycles). So timing-wise if you change your add below to a MOV it should theoretically perform the same. Both MOV and ADD go through the same execution unit ( ALU), so it should run in the same speed. If it doesn't, that would be really surprising. That will also get rid of your dependency on having an all 0 destination register.
Then the jnz @B won't have a condition to test (from the add).
Adding any form of zeroing the output or comparing bl puts it in the same speed range as the rest.
Mark Jones, :lol
Here is faster (unrolled by 8) variant:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 90h,90h,90h
SzCpy8 proc SzDest:DWORD, SzSource:DWORD
mov ecx, [esp+8] ; ecx = source
mov edx, [esp+4] ; edx= destination
push ebx
xor eax, eax
mov bl, [ecx] ; read byte
@@:
add [edx+eax], bl ; write byte
mov bl, [ecx+eax+1] ; read byte
jz @F ; if zero exit
add [edx+eax+1], bl ; write byte
mov bl, [ecx+eax+2] ; read byte
jz @F ; if zero exit
add [edx+eax+2], bl ; write byte
mov bl, [ecx+eax+3] ; read byte
jz @F ; if zero exit
add [edx+eax+3], bl ; write byte
mov bl, [ecx+eax+4] ; read byte
jz @F ; if zero exit
add [edx+eax+4], bl ; write byte
mov bl, [ecx+eax+5] ; read byte
jz @F ; if zero exit
add [edx+eax+5], bl ; write byte
mov bl, [ecx+eax+6] ; read byte
jz @F ; if zero exit
add [edx+eax+6], bl ; write byte
mov bl, [ecx+eax+7] ; read byte
jz @F ; if zero exit
add [edx+eax+7], bl ; write byte
lea eax, [eax+8]
mov bl, [ecx+eax] ; read byte
jnz @B ; keep going
@@:
pop ebx
ret 2*4
SzCpy8 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Regards,
Lingo
Jimg, :lol
Quote
Duh... I get it now. You can't test it multiple times, so the normal test-a-million-times loop won't work. No way to test effectively.
; -----------------------------------------------
print addr SzSzCpy8
xor ecx, ecx
mov edx, 128
mov eax, offset String2
@@:
mov [eax+edx-4], ecx
add edx, -4
jnz @B
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke SzCpy8, addr String2, addr String1
counter_end
print ustr$(eax)
print chr$(" clocks",13,10)
invoke Sleep,250
; -----------------------------------------------
If you add zero or other value to the register
the time is the same and the test is correct.
Regards,
Lingo
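Lingo's harness re-zeroes the destination before every timed call so the zero-filled precondition holds on each pass. The same shape in C, as a rough sketch (all names here are placeholders, not the actual MASM32 macros):

```c
#include <string.h>
#include <stddef.h>

/* add_copy_stub stands in for the add-based routine under test
   (a placeholder, not Lingo's actual code). */
static void add_copy_stub(const char *src, unsigned char *dst)
{
    size_t i = 0;
    do { dst[i] = (unsigned char)(dst[i] + src[i]); } while (src[i++] != 0);
}

/* Sketch of the fix: wipe the destination before every iteration so
   the zero-filled precondition holds on each timed pass.  iters plays
   the role of LOOP_COUNT in the real harness. */
void bench_zeroed(const char *src, unsigned char *dst, size_t len, int iters)
{
    for (int i = 0; i < iters; i++) {
        memset(dst, 0, len);        /* restore the precondition */
        add_copy_stub(src, dst);    /* the timed call */
    }
}
```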
Quote from: Petroizki on May 10, 2005, 05:46:19 PM
Quote from: Mark_Larson on May 10, 2005, 04:30:28 PM
5) comment below.
The three NOPs make the loop itself aligned to 16.
No they don't. They force it to be unaligned. They were put after an "align 16". The code will be aligned to a 16 byte boundary up to the ALIGN 16, and then the code will be 3 bytes off of a 16 byte boundary. On a P4 you won't notice the difference, because it doesn't really suffer from code alignment problems.
QuoteIf you add zero or other value to the register
the time is the same and the test is correct.
Logically, you are correct. Realistically, since everything seems to have some strange effect on timing, I want to see the result correct after the million times loop!
After more testing, I am convinced that the repeated overflow condition is affecting the test somehow, but it's just too many steps to trace by hand. Changing the add to a mov followed by a test bl,bl shouldn't double the execution time. What's going on here?
What do you think about this one?
szCopyP proc src:DWORD,dst:DWORD
mov ecx,src ; ecx = source
mov edx,dst ; edx = destination
xor eax,eax
@@:
add al,[ecx] ; read byte
jz isByte ; Result = 0?
shl eax,8 ; shift al > ah
add al,[ecx+1] ; read byte
jz isWord
shl eax,8
add al,[ecx+2]
jz isByteWord
shl eax,8
add al,[ecx+3]
jz isDWord
mov [edx],eax
add ecx,4 ; inc pointers
add edx,4
jmp @B
isByte:
mov [edx],al
jmp done
isWord:
mov [edx],ax
jmp done
isByteWord:
mov [edx+1],ax
shr eax,16
mov [edx],al
jmp done
isDWord:
mov [edx],eax
done:
Ret
szCopyP EndP
Results (W2K SP4 / Athlon64 3000+):
128-byte string copy timing results:
lstrcpy : 354 clocks
SzCpy1 (movsb/cmp) MJ : 663 clocks
SzCpy2 (mov/cmp/inc) MJ : 534 clocks
SzCpy3 (mov/cmp/add) MJ : 661 clocks
SzCpy4 (mov/test/inc) MJ : 281 clocks
SzCpy5 (mov/test/add) MJ : 403 clocks
SzCopy (mov/test/add) Hutch : 282 clocks
SzCpy7 (mov/add/lea) Lingo : 161 clocks
szCopyx (4x unroll) Hutch : 230 clocks
szCopyP (DWORD mov) Phoenix : 250 clocks
Not too bad, I think?
Phoenix, :lol
QuoteNot too bad, I think?
Not too bad, but you must move the @@ label before the xor eax, eax, and please include my last procedure in your test...
Regards,
Lingo
Lingo's technique is an interesting one, and while it is not properly general purpose, it will be very useful where you preallocate a zero-filled buffer and then repeatedly append byte data at the end of the last write. To benchmark it you would have to allocate a large enough zero-filled buffer, then progressively append string data to the end of it for about half a second; that should give a reasonable idea of how fast it is in that context against the others.
Hutch, :lol
Quote
"..for a duration of about a half a second and you should get a reasonable idea of how fast it is.."
If we use
invoke HeapAlloc, hHeap, dwFlags, dwBytes
it is very interesting how you measured "a half a second" difference
between the usage of the two parameters:
dwFlags = 0
and
dwFlags = HEAP_ZERO_MEMORY
Regards,
Lingo
Lingo,
The comment was to make the test piece large enough so the test would run for about half a second so that the test results were more reliable. They don't improve in accuracy much over that.
Lingo,
QuoteNot too bad, but you must move the @@ label before the xor eax, eax
:red of course, you are right.... but with the fix, the result is the same: 250 clocks (?)
An align 16 returns the result: 216 clocks...
The result for SzCpy8 is 161 clocks on my box, but...
did you check the resulting strings from SzCpy7 and SzCpy8? I added
print addr String2 after your procs, and the result seems not to be the expected one.
Edit: The result string is Ë♂VýºÓ▬©↑¶‗áÓz<∟♦∟♦
Regards, Phoenix
Quote from: Jimg on May 10, 2005, 06:39:12 PM
Quote1) on the P4 that you ran this code on both MOV and ADD run in the same time ( 0.5 cycles). So timing-wise if you change your add below to a MOV it should theoretically perform the same. Both MOV and ADD go through the same execution unit ( ALU), so it should run in the same speed. If it doesn't, that would be really surprising. That will also get rid of your dependency on having an all 0 destination register.
Then the jnz @B won't have a condition to test (from the add).
Adding any form of zeroing the output or comparing bl puts it in the same speed range as the rest.
I guess I wasn't clear enough. I wanted to change it to use ECX and MOVZX, and then do a JECXZ. Not sure of the speed difference.
This uses a MOV instead of an ADD, and a MOVZX instead of an 8-bit move to get rid of partial register stalls. And I got rid of the LEA, since I don't need to preserve the flags and LEA is slow. That's 3 of my points above.
;esi points to the source
xor eax, eax
movzx ecx,byte ptr [esi] ; read byte
@@:
mov [edx+eax], cl ; I use the fact ->the destination
; lea eax, [eax+1] ; buffer is zero filled!!!
add eax,8 ; we can do that because we don't have to worry about the flags ,since we are using JECXZ
jecxz @F
movzx ecx, byte ptr [esi+eax] ; read byte
jmp @B ; keep going
Lingo, you can use a REPEAT macro to make the code smaller. I didn't unroll it the full 8 times using REPEAT due to the fact you have the LEA in the middle of the sequence. You could technically swap LEA and the line after 8, and do the full 8 lines of unrolling. This also lets you easily play with different amounts of unrolling without having to do cut and pasting, and modifying offsets ( ewww).
@@:
UNROLL_AMT equ 7
looper = 0
REPEAT UNROLL_AMT
add [edx+eax+looper], bl ; write byte
mov bl, [ecx+eax+looper+1] ; read byte
jz @F ; if zero exit
looper = looper + 1
endm
add [edx+eax+UNROLL_AMT], bl ; write byte
lea eax, [eax+UNROLL_AMT+1]
mov bl, [ecx+eax] ; read byte
jnz @B ; keep going
Mark_Larson-
Quote;esi points to the source
xor eax, eax
movzx ecx,byte ptr [esi] ; read byte
@@:
mov [edx+eax], cl ; I use the fact ->the destination
; lea eax, [eax+1] ; buffer is zero filled!!!
add eax,8 ; we can do that because we don't have to worry about the flags ,since we are using JECXZ
jecxz @F
movzx ecx, byte ptr [esi+eax] ; read byte
jmp @B ; keep going
Looks ok, except add eax,8 needs to be add eax,1, doesn't it?
I used this-
align 16
db 8 dup (0) ; 8=411
SzCpy16 proc SzDest:DWORD, SzSource:DWORD
mov edx, SzDest
mov esi, SzSource
xor eax, eax
movzx ecx,byte ptr [esi] ; read byte
@@:
mov [edx+eax], cl
add eax,1
jecxz @F
movzx ecx, byte ptr [esi+eax] ; read byte
jmp @B ; keep going
@@:
ret
SzCpy16 endp
best time was 411 cycles.
You are right about the lea though. This mod of lingo runs in 130 cycles
align 16
db 3 dup (0)
SzCpy17 proc uses ebx SzDest:DWORD, SzSource:DWORD
mov ecx, SzSource ; ecx = source
mov edx, SzDest ; edx= destination
xor eax, eax
mov bl, [ecx] ; read byte
sub eax,1
@@:
add eax, 1 ;[eax+1]
add [edx+eax], bl ; write byte
mov bl, [ecx+eax] ; read byte
jnz @B ; keep going
ret
SzCpy17 endp
but I still can't figure out why the add should be three times faster than a mov followed by test bl,bl
I played with the unrolled version a little more; the register alternation was not making it any faster, so I reduced the entry and exit overhead and removed the stack frame, and it is showing a little faster on my PIV. Note that this version, like most of the rest, does not require any zero-filled buffer and can be considered general purpose.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
szCopyx proc src:DWORD,dst:DWORD
mov [esp-4], esi
xor eax, eax
mov esi, [esp+4]
mov edx, [esp+8]
mov ecx, -4
align 4
@@:
add ecx, 4
mov al, [esi+ecx]
mov [edx+ecx], al
test al, al
jz @F
mov al, [esi+ecx+1]
mov [edx+ecx+1], al
test al, al
jz @F
mov al, [esi+ecx+2]
mov [edx+ecx+2], al
test al, al
jz @F
mov al, [esi+ecx+3]
mov [edx+ecx+3], al
test al, al
jnz @B
@@:
mov esi, [esp-4]
ret 8
szCopyx endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Hutch-
Best I could get with your latest is 284. Other than Lingo's and its mods, the best I've found is 255 with-
align 16
SzCpy12 proc uses ebx SzDest:DWORD, SzSource:DWORD
xor ecx, ecx
mov eax, -4
mov ebx, SzSource
mov edx, SzDest
align 4
@@:
add eax,4
mov cx, [ebx+eax]
test cl, cl
jz move1 ; done, go move zero byte
mov [edx+eax], cx ; save both
test ch,ch
jz @f ; all done4
mov cx, [ebx+eax+2]
test cl, cl
jz move2
mov [edx+eax+2], cx
test ch,ch
jnz @b
@@:
ret
move1:
mov [edx+eax],cl
ret
move2:
mov [edx+eax+2],cl
ret
SzCpy12 endp
Still tweaking. Hutch's latest runs in 343 cycles. I made some changes to use REPEAT macros and use MOVZX. With the same number of unrolls and MOVZX it runs in 284 cycles.
hutch's code above 343
4 unrolls movzx 284
8 unrolls movzx 284
16 unrolls movzx 283
32 unrolls movzx 282
64 unrolls movzx 271
128 unrolls movzx 262 - a 128-byte string is being tested, so further unrolling yields no speedup.
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
szCopyx2 proc src:DWORD,dst:DWORD
mov [esp-4], esi
xor eax, eax
mov esi, [esp+4]
mov edx, [esp+8]
mov ecx, -UNROLL_AMTx
align 4
@@:
UNROLL_AMTx equ 4
add ecx, UNROLL_AMTx
looper = 0
REPEAT UNROLL_AMTx
movzx eax, byte ptr [esi+ecx+looper]
mov [edx+ecx+looper], al
test al, al
jz @F
looper = looper + 1
endm
jmp @B
@@:
mov esi, [esp-4]
ret 8
szCopyx2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Here is the fastest variant: :lol
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 8Dh, 0A4h,24h, 0, 0, 0, 0, 8Dh, 0A4h, 24h, 0, 0, 0, 0, 90h
SzCpy15 proc SzDest:DWORD, SzSource:DWORD
push ebx
xor eax, eax
mov ebx, [esp+2*4+1*4] ; ebx = source
mov edx, [esp+1*4+1*4] ; edx= destination
movq MM0, [ebx]
pxor MM7, MM7
@@:
movq [edx+eax], MM0
add eax, 8
pcmpeqb MM0, MM7
packsswb MM0, MM0
movd ecx, MM0
movq MM0, [ebx+eax]
jecxz @B
pop ebx
emms ; femms for AMD
ret 2*4
SzCpy15 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
"Note that this version like most of the rest does not require any zero filled buffer and can be considered general purpose."
Regards,
Lingo
Lingo,
It looks good; how does it perform on unaligned string data as source and destination, and lengths that are not on 4-byte boundaries?
Mark,
The MOVZX mod looks like a good one, I will have a play with it a bit later.
LATER : Yes, this method is noticeably faster; the test piece is now twice as fast as the version in the library. What I will be interested to see is if this mod works on AMD hardware as well.
Quote from: Mark_Larson on May 10, 2005, 08:28:11 PM
Quote from: Petroizki on May 10, 2005, 05:46:19 PM
Quote from: Mark_Larson on May 10, 2005, 04:30:28 PM
5) comment below.
The three NOPs make the loop itself aligned to 16.
No they don't. They force it to be unaligned. They were put after an "align 16". The code will be aligned to a 16 byte boundary up to the ALIGN 16, and then the code will be 3 bytes off of a 16 byte boundary. On a P4 you won't notice the difference, because it doesn't really suffer from code alignment problems.
Check it in a debugger; the entry point is 3 bytes off, but the inner loop is aligned to 16 bytes.. :eek
This is the version that has used Mark's mod. It is testing nearly twice as fast as the original library module with its unrolled form. I tested it on the Sempron 2.4 and it retains the same time difference ratio to the timings I get on my PIV, so it seems Mark's mod works on both Intel and AMD hardware.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
szCopyx proc src:DWORD,dst:DWORD
mov [esp-4], esi
mov esi, [esp+4]
mov edx, [esp+8]
mov ecx, -4
align 4
@@:
add ecx, 4
movzx eax, BYTE PTR [esi+ecx]
mov [edx+ecx], al
test al, al
jz @F
movzx eax, BYTE PTR [esi+ecx+1]
mov [edx+ecx+1], al
test al, al
jz @F
movzx eax, BYTE PTR [esi+ecx+2]
mov [edx+ecx+2], al
test al, al
jz @F
movzx eax, BYTE PTR [esi+ecx+3]
mov [edx+ecx+3], al
test al, al
jnz @B
@@:
mov esi, [esp-4]
ret 8
szCopyx endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
The next one is for novelty value; it's slow but it works. It requires a DWORD-aligned target buffer.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
xorcopy proc src:DWORD,dst:DWORD
push edi
push ebx
mov edx, src
mov edi, dst
mov ecx, -4
xor ebx, ebx
align 4
@@:
add ecx, 4
mov DWORD PTR [edi+ecx], ebx
xor al, al
xor al, [edx+ecx]
xor [edi+ecx], al
jz @F
xor al, al
xor al, [edx+ecx+1]
xor [edi+ecx+1], al
jz @F
xor al, al
xor al, [edx+ecx+2]
xor [edi+ecx+2], al
jz @F
xor al, al
xor al, [edx+ecx+3]
xor [edi+ecx+3], al
jnz @B
@@:
pop ebx
pop edi
ret
xorcopy endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
I didn't have very long to work on this, but it seems to do well even with a stack frame and with no specific alignment. I didn't have time to include the last 4 (or 5?) versions posted in my timing test.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
include timers.asm
SzCopy_dw PROTO :DWORD,:DWORD
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
str2 db 128 dup(0)
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
LOOP_COUNT EQU 10000000
invoke SzCopy_dw,ADDR str1,ADDR str2
print ustr$(eax)
print chr$(13,10)
print ADDR str1
print chr$("<",13,10)
print ADDR str2
print chr$("<",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke SzCopy_dw,ADDR str1,ADDR str2
counter_end
print ustr$(eax)
print chr$(" clocks",13,10)
invoke Sleep,250 ; reduced cycle count by ~26 ??
mov eax,input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; The technique used to find the terminating null was adapted
; from the MASM32 STRLEN procedure, which was adapted from an
; algorithm written by Agner Fog.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
SzCopy_dw proc uses ebx lpsrc:DWORD,lpdest:DWORD
mov eax,lpsrc
mov edx,lpdest
sub eax,4
sub edx,4
@@:
mov ebx,[eax+4] ; get next 4 bytes
add eax,4
lea ecx,[ebx-01010101h] ; subtract 1 from each byte
add edx,4
test ecx,80808080h
jnz @F ; jump if zero byte
mov [edx],ebx
jmp @B
@@:
mov cl,[eax]
add eax,1
mov [edx],cl
add edx,1
test cl,cl
jnz @B
sub eax,lpsrc
sub eax,1 ; return length w/o null
ret
SzCopy_dw endp
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
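The dword null-scan that SzCopy_dw borrows from Agner Fog can be written out in C. Note that the listing's `test ecx,80808080h` is the looser form of the test: it can report a false positive for bytes >= 80h, which is harmless here because the byte loop that follows finds the real terminator. A sketch of both forms (function names are mine, for illustration):

```c
#include <stdint.h>

/* Strict form of the dword null-byte test: nonzero iff some byte
   of v is zero. */
int has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

/* Looser form, matching the listing (no ~v term): may also fire for
   bytes >= 0x80, so it only says "a zero byte might be in here". */
int maybe_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & 0x80808080u) != 0;
}
```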
This was after I reversed the order of the arguments for szCopyx, szCopyP, and SzCopy_dw, and running on a P3:
lstrcpy : 671 clocks
SzCpy1 (movsb/cmp) : 653 clocks
SzCpy2 (mov/cmp/inc) : 406 clocks
SzCpy3 (mov/cmp/add) : 529 clocks
SzCpy4 (mov/test/inc) : 401 clocks
SzCpy5 (mov/test/add) : 401 clocks
szCopyx (hutch unroll 4): 289 clocks
SzCpy8 (lingo unroll 8) : 147 clocks
szCopyP (Phoenix) : 1261 clocks
szCopy_dw : 162 clocks
It's funny how things get revisited on this board. We've had lots of discussion about string length algorithms, including using MMX and SSE2. Check this thread. http://www.old.masmforum.com/viewtopic.php?t=2632. I even posted a version that used MMX and SSE2 in that thread ( along with some other people who did MMX). My account had gotten messed up, so all my posts will be from "hutch--", but the account type will say "guest". Whereas the real ones from hutch-- say "Site Admin" and have the big red crest with the gold lion.
Hutch, as far as alignment goes, I posted a trick to get around that when we did either a string length or string copy routine. Basically you do a byte routine until the data is at the appropriate alignment, and then you switch to the faster method.
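That align-then-go-wide pattern might look like this in C (a sketch under the assumption that reading the rest of the aligned dword containing the terminator is acceptable, as it is for the asm versions in this thread; `aligned_copy` is an invented name):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* aligned_copy sketches the pattern: byte loop until src is 4-byte
   aligned, dword loop while no terminator is near, byte loop to
   finish.  Returns the copied length without the terminator. */
size_t aligned_copy(const char *src, char *dst)
{
    size_t n = 0;
    while (((uintptr_t)(src + n) & 3u) != 0) {   /* align the source */
        if ((dst[n] = src[n]) == 0) return n;
        n++;
    }
    for (;;) {                                   /* dword at a time */
        uint32_t v;
        memcpy(&v, src + n, 4);                  /* aligned 4-byte load */
        if ((v - 0x01010101u) & ~v & 0x80808080u)
            break;                               /* a zero byte is inside */
        memcpy(dst + n, &v, 4);
        n += 4;
    }
    while ((dst[n] = src[n]) != 0)               /* byte tail */
        n++;
    return n;
}
```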
Hutch-
QuoteThis is the version that has used Mark's mod. It is testing nearly twice as fast as the original library module with its unrolled form. I tested it on the Sempron 2.4 and it retains the same time difference ratio to the timings I get on my PIV, so it seems Mark's mod works on both Intel and AMD hardware.
On my Athlon, it runs at 284 with best alignment. Exactly the same as your previous routine, i.e. MOVZX doesn't seem to be any faster or slower on AMD chips.
Lingo-
Your last version is excellent. Runs in 112 cycles (with three extra bytes inserted after the align 16). Better than twice as fast as any other so far. Good Job :U
Now if I only understood what the heck it was doing :wink
Later.....
Ok, now I see what's going on. Too bad that it requires, in the worst case, at least 7 extra bytes in the destination string. :(
This is the SSE2 version that I posted on that other thread. The test string length was 228 bytes instead of the 128 bytes we are using here, so keep that in mind when comparing timings. It also doesn't require that you pad the end with 8 or 16 zeros, but it does require the strings to have 16-byte alignment. I'll see if I can dig up the MMX version later, because I didn't post it to the forums; I checked. But for the MMX version we can do a quick conversion and see how many cycles per byte we are talking about, because I give timing info for both the MMX and SSE2 versions.
168 cycles / 228 bytes ≈ 0.737 cycles per byte for the MMX version I need to dig up off my home computer (I didn't post it to the thread).
68 cycles / 228 bytes ≈ 0.298 cycles per byte for the SSE2 version I posted below.
So theoretically the MMX version should do 128 bytes in 168 * 128 / 228 ≈ 94.3 cycles,
and the SSE2 version in 68 * 128 / 228 ≈ 38.2 cycles.
Obviously putting an SSE2 version of szCopy into the library is out of the question; not as many people could use it. But I think doing an MMX version won't be a problem. I don't know anyone who has a CPU that doesn't support MMX.
Quote
Donkey posted some strlen code under the GoAsm forum. I shaved 48 cycles off his code by doing something similar to what I did with your stuff, Michael. I just now converted his MMX code to SSE2; it won't work on a P3 or below. The MMX version of Donkey's code that I changed to run faster on the P4 finds the strlen of a 228-byte string in 168 cycles. The version below using SSE2 finds it in 68 cycles!!! So a BIG improvement.
Code:
szLenSSE2 proc near pString:ptr byte
mov eax,[pString] ; mov the offset of the string into eax
pxor xmm0,xmm0 ;remove dependencies
pxor xmm2,xmm2 ;remove dependencies
pxor xmm1,xmm1 ;fill XMM reg 1 with NULLs
pxor xmm3,xmm3 ;FILL XMM reg 3 with NULLs
;align 16 ; align for code isn't needed on the P4
; The following loop starts at exactly 16 bytes into the proc
; it is also exactly 16 bytes long, this makes for optimal execution
szlen_loop:
movdqa xmm0,[eax] ; mov 16 bytes of the string into XMM reg 0
pshufd xmm2,[eax+16],0e4h ; mov the next 16 bytes of the string into XMM reg 2
pcmpeqb xmm0,xmm1 ; cmp each byte with NULL, set the byte to FFh if equal
pcmpeqb xmm2,xmm3 ; cmp each byte with NULL, set the byte to FFh if equal
pmovmskb ecx,xmm0 ; This moves the most signf. bit of each byte to ecx
add eax,32 ; increment the string address
pmovmskb edx,xmm2 ; This moves the most signf. bit of each byte to edx
or ecx,ecx ; Was any bit set ?
jnz end_loop
or edx,edx
jz szlen_loop ; no zero was found, so continue the loop
mov ecx,edx ; Move EDX into ECX for the BSF later
add eax,16 ; the NULL is in the second 16-byte block
end_loop:
sub eax,[pString] ; subtract the original address
bsf ecx,ecx ; find out which bit is set (where the NULL is)
sub eax,32 ; subtract the last increment from eax
add eax,ecx ; add the number of bytes to the NULL
RET
szLenSSE2 endp
I got my version down to 127 cycles on a P3, using 486-compatible code. I have no way to test on a P4.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
include timers.asm
SzCopy_dw PROTO :DWORD,:DWORD
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
str2 db 128 dup(0)
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
LOOP_COUNT EQU 10000000
invoke SzCopy_dw,ADDR str1,ADDR str2
print ustr$(eax)
print chr$(13,10)
print ADDR str1
print chr$("<",13,10)
print ADDR str2
print chr$("<",13,10,13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke SzCopy_dw,ADDR str1,ADDR str2
counter_end
print ustr$(eax)
print chr$(" clocks",13,10)
invoke Sleep,250 ; reduced cycle count by ~26 ??
mov eax,input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; The technique used to find the terminating null was adapted
; from the MASM32 STRLEN procedure, which was adapted from an
; algorithm written by Agner Fog.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
SzCopy_dw proc uses ebx lpsrc:DWORD,lpdest:DWORD
mov eax,lpsrc
mov edx,lpdest
sub eax,4
sub edx,4
@@:
REPEAT 8 ; unroll x 8
mov ebx,[eax+4] ; get next 4 bytes
add eax,4
lea ecx,[ebx-01010101h] ; subtract 1 from each byte
add edx,4
test ecx,80808080h ; test for any -1 bytes
jnz @F ; jump if zero byte present
mov [edx],ebx ; otherwise copy 4 bytes
ENDM
jmp @B
@@:
mov cl,[eax] ; finish up moving bytes
add eax,1
mov [edx],cl
add edx,1
test cl,cl
jnz @B
sub eax,lpsrc
sub eax,1 ; return length w/o null
ret
SzCopy_dw endp
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
[attachment deleted by admin]
MichaelW
That's a real nice routine. Unfortunately, it gets much slower if there is a high ascii character near the beginning of the string :( Maybe if you replace the last part with one of the other routines when high ascii is found?
Michael, here are the timings on my work P4. I believe it's a 1.5
127
Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwX
xYyZz Now I Know My ABC's, Won't You Come Play <
Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwX
xYyZz Now I Know My ABC's, Won't You Come Play <
148 clocks
Press enter to exit...
I'll take a stab at speeding it up some too. Yours is as generic as Hutch's, and faster.
EDIT: I haven't tried this since I have a P4, but Michael, have you tried adding an align for the inner loop and for the procedure?
I tried a number of things, and a few of them gave me some speedups.
First, the last part re-reads the last dword a byte at a time when it is already in a register.
hi_mom:
shr ebx,8
@@:
;the dword is already IN a register, so why re-read it from memory?
; mov cl,[eax] ; finish up moving bytes
; add eax,1
mov [edx],bl
add edx,1
test bl,bl
jnz hi_mom
sub eax,lpsrc
sub eax,1 ; return length w/o null
ret
SzCopy_dw endp
Second, I tried unrolling the loop 128 times with the original algorithm (not including any other changes), and I got it from 148 cycles down to 114 cycles. So if we pick some middle ground, we can probably get a decent speedup from this.
Third, we can try removing the Epilogue and Prologue. That is usually good for a few cycles.
Fourth, I re-wrote the inner loop to work differently. It now handles two dwords in semi-parallel. It runs about 8 cycles faster than the original code, and with all the other stuff I posted above, it adds up.
SzCopy_dw proc uses ebx esi edi lpsrc:DWORD,lpdest:DWORD ; esi and edi are used below, so preserve them
mov eax,lpsrc
mov edx,lpdest
@@:
REPEAT 128 ; unroll x 128
mov ebx,[eax] ; get next 4 bytes
mov esi,[eax+4]
lea ecx,[ebx-01010101h] ; subtract 1 from each byte
add eax,4
test ecx,80808080h ; test for any -1 bytes
jnz @F ; jump if zero byte present
lea edi,[esi-01010101h] ; subtract 1 from each byte
add eax,4
mov [edx],ebx ; otherwise copy 4 bytes
add edx,4
test edi,80808080h ; test for any -1 bytes
jnz @F ; jump if zero byte present
mov [edx],esi ; otherwise copy 4 bytes
add edx,4
ENDM
jmp @B
@@:
sub eax,4
@@:
mov cl,[eax] ; finish up moving bytes
add eax,1
mov [edx],cl
add edx,1
test cl,cl
jnz @B
sub eax,lpsrc
sub eax,1 ; return length w/o null
ret
SzCopy_dw endp
Fifth, for non-P4 processors aligning the inner loop should hopefully give a speed up.
Hutch :lol
"Lingo,
It looks good, how does it perform on
unaligned string data as source and destination and lengths that are not 4 byte boundaries ?"
My algos are for ASM professionals rather than for newbies...
MichaelW :lol
1. This is buliaNaza's algo rather than A.Fog's algo
A.Fog's algo has 2 additional instructions in the main loop:
XOR EBX, -1 ; invert all bytes
AND ECX, EBX ; and these two
I know it very well because a long time ago I improved buliaNaza's algo:
http://board.win32asmcommunity.net/index.php?topic=8330.msg77056#msg77056
2. As
Jimg stated above
Quote
.... Unfortunately, it gets much slower if there is a high ascii character near the beginning of the string
Maybe if you replace the last part with one of the other routines when high ascii is found?
That is true. "High ascii characters" are characters above ASCII code 80h, and
if we have them in our string your algo will drop into the "slow" last part, copying one byte at a time rather than a dword...
Regards,
Lingo
:bg
Lingo,
> My algos are for ASM professionals rather than for newbies...
I gather this means that it does not handle byte-aligned data and does not handle lengths that are not multiples of 4. This is not a criticism, as the algo looks good and it should be fast as the testing shows, but it's not a general-purpose algorithm, because byte data is usually of random length. One of the normal tasks for zero-terminated string copy is to append to the end of a buffer where the location is not 4-byte aligned and the string being copied to it is not a multiple of 4 in length.
Compliments though, for its task range it is a very good algo design.
Jim,
just for my reference, what model Athlon are you testing on? My last ancient AMD died recently and I have a Sempron 2.4 in its place, which delivers similar timing spacing to the two PIVs I use, so it's worth me knowing this for writing general-purpose code that is basically mixed-model algo design.
Hey Hutch--, there was one other problem with Lingo's code. It writes in multiples of 8 bytes and doesn't handle string sizes that are not divisible by 8, so it copies extra data past the end of the source string into the destination. In addition, if the destination string was dynamically allocated you might be writing to unowned memory and cause a crash. In either case you will be writing past the destination string unless it is always 7 or more bytes longer than the source.
Here's the MMX code. I accidentally posted a string length routine in the previous post. I haven't finished optimizing this yet, but I figured I'd post it to see if anyone can use it for their own ideas. It does properly handle string sizes that are not divisible by 8. It runs in 107 cycles on my P4 for the 128-byte sample string.
szCopyMMX proc near uses esi src_str:DWORD, dst_str:DWORD ; esi is clobbered below, so preserve it
mov eax,[src_str]
mov esi,[dst_str]
align 16
qword_copy:
pxor mm1,mm1
movq mm0,[eax]
pcmpeqb mm1,mm0
add eax,8
pmovmskb ecx,mm1
or ecx,ecx
jnz finish_rest
movq [esi],mm0
add esi,8
jmp qword_copy
finish_rest:
sub eax,8
byte_copy:
movzx ecx,byte ptr [eax]
add eax,1
mov [esi],cl
add esi,1
test cl,cl
jz zero_byte
jmp byte_copy
zero_byte:
; emms
RET
szCopyMMX endp
Mark,
This is looking good. Very few machines lack MMX capability, so it would be close to general purpose and certainly fast enough. Now if I understand correctly, the use of MOVQ would require at least 8-byte-aligned memory in both source and destination, so it cannot be used with byte-aligned data.
Jim,
Thanks for pointing out my stupidity :bg As it turns out, and as I see Lingo has now pointed out, that is what the extra instructions in Agner Fog's algorithm were for. Using another method after the first high-ascii character did not seem too promising, so I added the two additional instructions from Agner Fog's algorithm; the result bytes are now 80h for a null byte and zero for any other byte value.
Mark,
Thanks for the suggestions. Running on a P3, with the sample strings, the optimum unroll is 8. Higher values slow the code down. The current code copies the sample string in 170 cycles. The dword loop still contains a dependency, but I can't see any reasonable way to eliminate it.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
include timers.asm
SzCopy_dw PROTO :DWORD,:DWORD
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
strx db 80h,"ample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
str2 db 128 dup(0)
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
LOOP_COUNT EQU 10000000
invoke SzCopy_dw,ADDR str1,ADDR str2
print ustr$(eax)
print chr$(13,10)
print ADDR str1
print chr$("<",13,10)
print ADDR str2
print chr$("<",13,10,13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke SzCopy_dw,ADDR str1,ADDR str2
counter_end
print ustr$(eax)
print chr$(" clocks",13,10)
invoke Sleep,250 ; reduced cycle count by ~26 ??
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
invoke SzCopy_dw,ADDR strx,ADDR str2
counter_end
print ustr$(eax)
print chr$(" clocks",13,10)
invoke Sleep,250 ; reduced cycle count by ~26 ??
mov eax,input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; The technique used to detect the terminating null, without
; detecting the first high-ascii character as a null, was
; adapted from the MASM32 STRLEN procedure, which was adapted
; from an algorithm written by Agner Fog.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
; Corrected another mistake that MASM was actually ignoring:
; SzCopy_dw proc uses ebx edi lpsrc:DWORD,lpdest:DWORD
SzCopy_dw proc lpsrc:DWORD,lpdest:DWORD
push ebx
mov eax,[esp+8] ; lpsrc
mov edx,[esp+12] ; lpdest
align 4
@@:
REPEAT 8 ; unroll x 8
mov ebx,[eax] ; get next 4 bytes
add eax,4
lea ecx,[ebx-01010101h] ; subtract 1 from each byte
not ebx ; invert all bytes
and ecx,ebx ; and each byte with inversion
not ebx ; restore bytes
and ecx,80808080h ; bytes=80h for null,otherwise 0
jnz @F ; jump if zero byte present
mov [edx],ebx ; copy 4 bytes
add edx,4
ENDM
jmp @B
@@:
mov [edx],bl ; move first byte
add edx,1
test bl,bl
jz @F ; jump if was null
mov [edx],bh ; move next byte
add edx,1
test bh,bh
jz @F ; jump if was null
shr ebx,16
mov [edx],bl ; move next byte
add edx,1
test bl,bl
jz @F ; jump if was null
mov [edx],bh ; move last byte
@@:
mov eax,edx ; return length w/o null
sub eax,[esp+12] ; lpdest
pop ebx
ret 8
SzCopy_dw endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
[attachment deleted by admin]
Quote from: hutch-- on May 12, 2005, 04:13:06 AM
Mark,
This is looking good, very few machines don't have MMX capacity so it would be close to general purpose and certainly fast enough. Now if I understand correctly, the use of MOVQ would require at least 8 byte aligned memory in both source and destination so it cannot be used with byte aligned data.
It doesn't require 8-byte-aligned source and destination strings; it just runs slower, like a dword algorithm on unaligned dword data. Only SSE and SSE2 require aligned source and destination, otherwise they generate an exception. I also got it down to 86 cycles last night after posting it. I need to check for bugs and make sure it works correctly before posting the changes.
Michael-
Good job. The best non-fpu routine so far (that works in all conditions). On my athlon, the best was 7 repeats,7=162 cycles, 8=171, 9=168, 6=168, 4=166, 3=188.
Mark_Larson
Mark - is it ever possible to get a page fault or other error reading the string with movq, when there is only 1 byte left and it happens to be at the end of one's address space? Or is there always enough extra space that it could never happen?
Quote from: Jimg on May 12, 2005, 02:00:08 PM
Michael-
Good job. The best non-fpu routine so far (that works in all conditions). On my athlon, the best was 7 repeats,7=162 cycles, 8=171, 9=168, 6=168, 4=166, 3=188.
Jim, MMX is also non-fpu. I think you meant "486 compatible" code.
Hutch, what about having two routines, an szCopy and an szCopyMMX, and letting the programmer pick which one to use? The programmer best knows the minimum processor his code has to support.
I did some bench testing of unaligned MMX. I used the current version I have and wrote some fancy batch files to auto-check all the cases. The current version I have runs in 87 cycles at work, so I am going to use that as my baseline. I am using Michael's posted code, except I changed the timing macros to use REALTIME.
I made some changes to the ASM file and to the batch file that builds the code. First let's look at the changes to the assembler code. You pass DO_TESTING to ML.EXE using the /D switch to make it do the "special" build that forces misalignment. You also pass FRED_VALUE to ML.EXE with values ranging from 1 to 7. So the code snippet below gets both DO_TESTING and FRED_VALUE from the command line via the /D switch. Because of the "ALIGN 16" right before the DB, this forces the string to be 1 to 7 bytes off of a 16-byte alignment.
align 16
ifdef DO_TESTING
db FRED_VALUE dup (0)
endif ; ifdef DO_TESTING
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
I created 7 different batch files that each pass in the values ranging from 1 to 7 for FRED_VALUE. Here is an example of setting FRED_VALUE=2, and it also turns DO_TESTING on.
\masm32\bin\ml /D"DO_TESTING=1" /D"FRED_VALUE=2" /c /coff "szcopytest.asm"
So now let's look at some timings. I ran each one 5 times and took the fastest time running at REALTIME priority.
1 byte off of 16 byte alignment 275 clocks
2 byte off of 16 byte alignment 266 clocks
3 byte off of 16 byte alignment 283 clocks
4 byte off of 16 byte alignment 256 clocks
5 byte off of 16 byte alignment 284 clocks
6 byte off of 16 byte alignment 282 clocks
7 byte off of 16 byte alignment 284 clocks
8 byte off of 16 byte alignment 87 - ran this one manually outside of the process to verify it goes back to being the same speed.
8 bytes off of a 16 byte alignment means it's 8 byte aligned, which is all MMX requires for speed.
So the moral of the story: ALIGN, ALIGN, ALIGN (which is exactly what I said on my webpage). However, the penalty for an unaligned dword is a lot smaller than for unaligned MMX.
MarkL-
QuoteJim, MMX is also non-fpu. I think you meant "486 compatible" code.
:red (too old I guess)
pmovmskb is an instruction requiring .xmm pseudo, correct? (xmm is not sse, right? it's so confusing ::) )
Also - I tried inserting various numbers of bytes before the test string to observe the effect. On an Athlon, there was minimal difference with or without the extra bytes. I was using the routine you entered above, which runs at about 150 cycles; where is the new one that runs in 57 cycles?
.mmx = MMX as normal
.xmm = MMX, SSE and SSE2
Quote from: Jimg on May 12, 2005, 04:10:19 PM
pmovmskb is an instruction requiring .xmm pseudo, correct? (xmm is not sse, right? it's so confusing ::) )
Yep pmovmskb was added with SSE support. If I wanted to make it exclusively MMX I'd have to change that to some other instruction.
XMM is for both SSE and SSE2. Things get confusing because when Intel added SSE support they also added additional MMX instructions, and they did a similar thing with SSE2, adding some new MMX instructions along with the new double-precision floating point support. So when I say "SSE" support I mean both the MMX instructions they added for SSE as well as the single-precision floating point instructions. You have to use .xmm to use "pmovmskb", even though it's not a single-precision floating point instruction, because it was added as part of the P3's "SSE support". Make sense? The way you tell is to pull up book 2 (the instruction set reference) and go to Appendix B; they have all the instructions broken up by type. The ones added with "SSE" support, whether floating point or MMX, are under the sub-heading "SSE" in Appendix B. Very handy. I use that all the time to see which processors support which instructions.
Quote
Also - I tried inserting various numbers of bytes before the test string to observe the effect. On an Athlon, there was minimal difference with or without the extra bytes. I was using the routine you entered above, which runs at about 150 cycles; where is the new one that runs in 57 cycles?
Thanks for trying that. I had already planned on switching to AMD with my next purchase, but that's another reason why. From a programmer's perspective, AMD is a lot better. It's 87 cycles, and it hasn't been fully checked for bugs, so I hate posting. It isn't very professional, and people usually jump all over you :). I guess I can post it if everyone realizes I haven't finished checking it for bugs yet.
Quote from: AeroASM on May 12, 2005, 05:33:20 PM
.mmx = MMX as normal
.xmm = MMX, SSE and SSE2
This is correct. I use .xmm in all my programs and it enables MMX, SSE, and SSE2, which is just a lot easier. This, in my opinion, is a bug: try the following, and it'll give you errors if you try doing SSE or SSE2 code. The reason is that .xmm and .mmx aren't additive; only the last one found in the code takes effect. I found that out the hard way, so I just use .xmm in all my code to be extra sure. You also need the ".xmm" AFTER your "option casemap" or it will start complaining that the code isn't in all CAPITAL letters (ewww).
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
.xmm
.mmx ; ignores .xmm and just enables MMX support.
Take the code with a grain of salt, since I have not finished debugging it yet. The only string I have tested it with is the one in Michael's test code. It works fine with that, but with all the "special cases" I added, I need to do a lot more testing before it's ready.
I am only showing the portion I changed. I modified the second loop, which copies byte by byte when there are fewer than 8 bytes left to copy.
Because of PCMPEQB and PMOVMSKB, each bit in ECX is 1 where the corresponding byte was a 0, and 0 everywhere else. So BSF lets me find which bit is set in a single instruction, which corresponds to where the 0 is. If there is more than one 0 in a row, the first one is the one we care about, and BSF scans from the low end of the DWORD, so it finds the very first one (which is the only one we care about). After determining which bit is set, I jump to special-case code to handle the different numbers of bytes left.
bsf ecx,ecx
cmp ecx,7
je do_7
cmp ecx,6
je do_6
cmp ecx,5
je do_5
cmp ecx,4
je do_4
cmp ecx,3
je do_3
cmp ecx,2
je do_2
jmp do_1 ; ecx is 0 or 1 here; falling into do_7 would be wrong
do_7:
mov ecx,[eax-8]
mov edx,[eax+4-8] ;really copy 8 bytes to include the 0
mov [esi],ecx
mov [esi+4],edx
ret 8 ; ret
do_6:
mov ecx,[eax-8]
movzx edx,word ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dx
mov byte ptr [esi+6],0
ret 8 ; ret
do_5:
mov ecx,[eax-8]
movzx edx,byte ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dl
mov byte ptr [esi+5],0
ret 8 ; ret
do_4:
mov ecx,[eax-8]
mov [esi],ecx
mov byte ptr [esi+4],0
ret 8 ; ret
do_3:
movzx ecx,word ptr[eax-8]
movzx edx,byte ptr[eax+2-8]
mov [esi],cx
mov [esi+2],dl
mov byte ptr [esi+3],0
ret 8 ; ret
do_2:
movzx ecx,word ptr[eax-8]
mov [esi],cx ; the two bytes were loaded but never stored
mov byte ptr [esi+2],0
ret 8 ; ret
do_1:
cmp ecx,1
jne do_0
movzx ecx,word ptr [eax-8] ; one byte plus its terminator
mov [esi],cx
ret 8 ; ret
do_0:
mov byte ptr [esi],0 ; empty string: just the terminator
ret 8 ; ret
QuoteThanks for trying that. I had already planned on switching to AMD with my next purchase, but that's another reason why. From a programmer's perspective, AMD is a lot better.
The Athlon works well enough for me, but it's really sensitive to code alignment, which is much more difficult to fix than data alignment, imho. And I can't get anywhere near your 87 cycles. The code change you just presented dropped it from 149 to about 132. I'm using an Athlon XP 3000+ at 2.16 GHz with 1 GB of RAM. Perhaps I have something else amiss.
Quote from: Jimg on May 13, 2005, 12:58:30 AM
QuoteThanks for trying that. I had already planned on switching to AMD with my next purchase, but that's another reason why. From a programmer's perspective, AMD is a lot better.
The Athlon works well enough for me, but it's really sensitive to code alignment, which is much more difficult to fix than data alignment, imho. And I can't get anywhere near your 87 cycles. The code change you just presented dropped it from 149 to about 132. I'm using an Athlon XP 3000+ at 2.16 GHz with 1 GB of RAM. Perhaps I have something else amiss.
In general the P4 runs MMX/SSE/SSE2 a lot faster than other processors. Intel actually did a good job with it. I also did one other thing to speed it up. I got rid of the epilogue and prologue. I forgot to post that. It usually saves some cycles. In case I didn't post it, align the inner loop to 16 bytes, and see how well it does.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
mov eax,[esp+4]
mov esi,[esp+8]
szCopyMMX endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Quote from: Mark_Larson on May 13, 2005, 01:28:31 AM
In general the P4 runs MMX/SSE/SSE2 a lot faster than other processors. Intel actually did a good job with it. I also did one other thing to speed it up. I got rid of the epilogue and prologue. I forgot to post that. It usually saves some cycles. In case I didn't post it, align the inner loop to 16 bytes, and see how well it does.
There is only a cycle or two difference when using the prologue/epilogue code on the Athlon. Certainly not enough to bother with.
After extensive testing, inserting this runs the fastest-
align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
mov eax,src_str
mov esi,dst_str
jmp qword_copy
db 7 dup (0) ;7=159
qword_copy:
pxor mm1,mm1
movq mm0,[eax]
.
.
.
Quote
That is equivalent to doing an "ALIGN 16" on the inner loop, which is what I kept trying to get you to try :) Here's why it is the same as an align (without the jump of course). Let's look at all the code before the first loop; it consists of exactly two lines of code.
align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
mov eax,[src_str]
mov esi,[dst_str]
;THIS gets translated into the following. The first two lines are the "prologue"
; the second two lines are the first two lines of the code. You will notice that it
; also gives the opcodes for the instructions.
00401260 55 push ebp
00401261 8B EC mov ebp,esp
00401263 8B 45 08 mov eax,dword ptr [ebp+8]
00401266 8B 75 0C mov esi,dword ptr [ebp+0Ch]
push ebp = 55h
mov ebp,esp = 8Bh ECh
mov eax,dword ptr [ebp+8] = 8Bh 45h 08h
mov esi,dword ptr [ebp+0Ch] = 8Bh 75h 0Ch
If you add up the sizes of the opcodes in those 4 instructions, it comes out to 9 bytes. Now if you remember, you added 7 bytes just before the label "qword_copy"; 9+7 = 16 bytes. Since the first line of the routine is 16-byte aligned, this forces the "qword_copy" label to also be 16-byte aligned. It's just easier to use ALIGN 16 in front of the label.
align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
mov eax,[src_str]
mov esi,[dst_str]
align 16
qword_copy:
Right. But it measures 8 cycles longer to work through the nop's rather than the jump.
Quote from: Jimg on May 13, 2005, 04:43:59 AM
Right. But it measures 8 cycles longer to work through the nop's rather than the jump.
That's a common fallacy people have about how ALIGN works in assemblers. They don't always stick NOPs in there to "pad" to a certain alignment; they use different-sized "NOP"s for different numbers of pad bytes. One of the 7-byte ones they use looks like this. That's all the code it added to mine to align the first loop to 16 bytes, because it needed exactly 7 bytes to do that. Basically, when I say "NOP" in regards to ALIGN, I mean any instruction that does nothing, not necessarily the NOP instruction. The instruction below does nothing, but it uses up 7 bytes.
00401269 8D A4 24 00 00 00 00 lea esp,[esp]
So if it is coming out 8 cycles faster I'd try a few things.
1) Make sure your timings are accurate to within 8 cycles. Try two different things to make sure: move the priority class from HIGH to REALTIME, and bump up the number of loops by a factor of 10. I have seen cases where the timing code wasn't accurate enough and you could get +/-10 cycles or even more. It should take 5 to 10 or more seconds to run with the extra loops and REALTIME priority class. If it's still running a lot faster than that, bump up the number of loops again. It takes a long time to run, but it will give you an extremely accurate number of cycles.
2) Do the ALIGN 16 at the loop but add a JMP right before it. That does exactly the same thing you are doing with the JMP and DBs. You did TWO things to try and make it faster, not one, so I am removing one of them to see which one made it faster. The advantage of ALIGN is that if I go back and add any other instruction between the routine's entry and the first loop, it will still work, whereas you'd have to re-tweak the number of DBs to get it to work right. Make sense? I just woke up, so my brain isn't awake yet, so I am not sure I am explaining it right.
jmp qword_copy
align 16
qword_copy:
Getting back to optimization, also make sure the strings are aligned as well. I do it on a 16-byte boundary in case I do SSE/SSE2 later.
align 16
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
align 16
str2 db 128 dup(0),0
Quote from: Mark_Larson on May 13, 2005, 01:23:25 PM
That's a common fallacy people have about how ALIGN works in assemblers. They don't always stick NOPs in there to "pad" to a certain alignment; they use different-sized "NOP"s for different numbers of pad bytes. One of the 7-byte ones they use looks like this. That's all the code it added to mine to align the first loop to 16 bytes, because it needed exactly 7 bytes to do that. Basically, when I say "NOP" in regards to ALIGN, I mean any instruction that does nothing, not necessarily the NOP instruction. The instruction below does nothing, but it uses up 7 bytes.
00401269 8D A4 24 00 00 00 00 lea esp,[esp]
Yes, that's exactly the code being inserted.
QuoteSo if it is coming out 8 cycles faster I'd try a few things.
1) Make sure your timings are accurate to within 8 cycles. Try two different things to make sure: move the priority class from HIGH to REALTIME, and bump up the number of loops by a factor of 10. I have seen cases where the timing code wasn't accurate enough and you could get +/-10 cycles or even more. It should take 5 to 10 or more seconds to run with the extra loops and REALTIME priority class. If it's still running a lot faster than that, bump up the number of loops again. It takes a long time to run, but it will give you an extremely accurate number of cycles.
Yes, I was using Realtime, and went from 1000000 to 10000000 with no difference.
Quote
2) Do the ALIGN 16 at the loop but add a JMP right before it. That does exactly the same thing you are doing with the JMP and DBs. You did TWO things to try and make it faster, not one, so I am removing one of them to see which one made it faster. The advantage of ALIGN is that if I go back and add any other instruction between the routine's entry and the first loop, it will still work, whereas you'd have to re-tweak the number of DBs to get it to work right. Make sense? I just woke up, so my brain isn't awake yet, so I am not sure I am explaining it right.
jmp qword_copy
align 16
qword_copy:
Ok, I tried that. Still slower. In this case, it inserted the instruction
05 00000000 ADD EAX,0
Quote
Getting back to optimization, also make sure the strings are aligned as well. I do it on a 16-byte boundary in case I do SSE/SSE2 later.
align 16
str1 db "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoP",\
"pQqRrSsTtUuVvWwXxYyZz Now I Know My ABC's, Won't You Come Play ",0
align 16
str2 db 128 dup(0),0
Yes, I did that also. I've attached my test code so you can see what I'm doing. Sorry for the mess, it's a work in progress :wink
My results are printed in the description when the program runs.
align 16 only - 158
jmp and db 7 dup (0) - 149
jmp and align 16 - 159
The only explanation I can think of is that the Athlon loads the lea esp,[esp] instruction in the prefetch and thinks about it for awhile, but the zeros require no thought :dazzled:
[attachment deleted by admin]
For shoots and grins I compared both the "JMP and DB" and the "JMP and ALIGN" to the normal code execution time. They all execute at the same speed on the P4, which was what I was expecting. I know the P4 extremely well, but I don't know AMD as well, since I have never owned an AMD. I am willing to bet that it is in AMD's optimization manual or you can use their Code Analyst. Have you tried either to see if you can find a clue in there?
Quote
I am willing to bet that it is in AMD's optimization manual or you can use their Code Analyst. Have you tried either to see if you can find a clue in there?
Nope. I'll take a look tonight.
Found the optimization PDF, and an online HTML page for optimization. The one I grabbed was for the AMD64; it should be similar to the Athlon XP.
http://63.204.158.36/amd/optimization/wwhelp/wwhimpl/js/html/wwhelp.htm
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
I think I found the slowdown in the central loop for AMD. I'll post more shortly.
EDIT: Here's the update.
One of the things I do when trying to optimize for a specific processor is write down timing information (latency, execution unit, etc.). For AMD you don't want to use instructions that go through VectorPath. If you notice, PMOVMSKB uses VectorPath.
qword_copy:
pxor mm1,mm1 ; DirectPath FADD/FMUL 2
movq mm0,[eax] ; DirectPath FADD/FMUL/FSTORE 4
pcmpeqb mm1,mm0 ; DirectPath FADD/FMUL 4
add eax,8 ; DirectPath ALU 1
pmovmskb ecx,mm1 ; VectorPath FADD/FMUL 3
or ecx,ecx ; DirectPath ALU 1
jnz finish_rest ; DirectPath ALU 1
movq [esi],mm0 ; DirectPath FSTORE 2
add esi,8 ; DirectPath ALU 1
jmp qword_copy ; DirectPath ALU 1
finish_rest:
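For readers who don't live in MMX, here is what the scan loop above computes, sketched in portable C (the function name is mine, not from the thread): PCMPEQB against an all-zero register turns each byte into FFh when it equals zero, and PMOVMSKB collapses the top bit of each byte into one mask bit, so a nonzero mask means the 8-byte group contains the terminator.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates PCMPEQB-against-zero followed by PMOVMSKB on one 8-byte
   group: bit i of the result is set exactly when byte i is zero. */
static unsigned zero_byte_mask(const uint8_t b[8])
{
    unsigned mask = 0;
    for (int i = 0; i < 8; i++)
        if (b[i] == 0)        /* pcmpeqb byte -> FFh, top bit set */
            mask |= 1u << i;  /* pmovmskb gathers the top bits */
    return mask;
}
```

The `jnz finish_rest` then fires on any nonzero mask, and BSF on the mask (used in the endings posted later in the thread) yields the index of the first zero byte.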
You can try using PSADBW instead. It is DirectPath and runs in 3 cycles. It subtracts two MMX registers, taking the absolute value, and then sums the bytes in a register. By making the second register all 0's, you basically get a way of doing a horizontal sum within a register (COOL!). After the PCMPEQB every single byte in MM1 is going to be an "FF" or a "00", nothing else. So after the PSADBW you can move the result into a CPU register using MOVD. If the result is not 7F8h then you know you have a zero-value in there somewhere. The 7F8h comes from adding 8 FF's together, which is the result if all the bytes are non-zero. I have not verified the code works, so it might need testing.
pcmpeqb mm1,mm0 ; DirectPath FADD/FMUL 4
; you can actually move the PXOR outside the loop.
pxor mm2,mm2 ; DirectPath FADD/FMUL 2
add eax,8 ; DirectPath ALU 1
; pmovmskb ecx,mm1 ; VectorPath FADD/FMUL 3
psadbw mm1,mm2 ; DirectPath FADD 3
movd ecx,mm1 ; Double 4
; or ecx,ecx ; DirectPath ALU 1
cmp ecx,7F8h
; jnz finish_rest ; DirectPath ALU 1
jb finish_rest
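To sanity-check the 7F8h constant in the snippet above, here is the PSADBW arithmetic emulated in portable C (a sketch of the byte math only, not the MMX code; the function name is mine). Each byte of the compare mask is FFh or 00h, and summing eight FFh bytes gives 8 * 255 = 7F8h; note that the sum reaches 7F8h only when every byte of the mask is FFh, which is why a later post in the thread switches to a plain nonzero test on the result.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates PCMPEQB against zero followed by PSADBW against an all-zero
   register: build the FFh/00h compare mask, then take the sum of
   absolute differences from zero, i.e. the sum of the mask bytes. */
static unsigned mask_sad(const uint8_t b[8])
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (b[i] == 0) ? 0xFFu : 0x00u;
    return sum; /* 0x7F8 only when all eight data bytes are zero */
}
```

With no zero byte in the group the sum is 0, a single zero byte gives 0xFF, and so on, so any nonzero sum signals that the terminator is in this group.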
You can also use PACKUSWB, which is also DirectPath and takes 2 cycles, and follow that by a MOVD to a CPU register and compare it to 0FFFFFFFFh, which is what it should be if no bytes are 0. I also have not verified this works, so take it with a grain of salt.
pcmpeqb mm1,mm0 ; DirectPath FADD/FMUL 4
add eax,8 ; DirectPath ALU 1
; pmovmskb ecx,mm1 ; VectorPath FADD/FMUL 3
packuswb mm1,mm2 ; DirectPath FADD/FMUL 2
movd ecx,mm1 ; Double 4
; or ecx,ecx ; DirectPath ALU 1
cmp ecx,0FFFFFFFFh
jnz finish_rest ; DirectPath ALU 1
Quite entertaining, but several times slower :'(
Edited-
Scratch that, I need to do more testing.
Ok, I think I got it now.
First example:
pxor mm1,mm1
pxor mm2,mm2
qword_copy:
movq mm0,[eax]
pcmpeqb mm1,mm0 ;DirectPath FADD/FMUL 4
add eax,8 ; DirectPath ALU 1
; pmovmskb ecx,mm1 VectorPath FADD/FMULl 3
psadbw mm1,mm2 ;DirectPath FADD 3
movd ecx,mm1 ;Double 4
or ecx,ecx ; DirectPath ALU 1
; cmp ecx,7F8h
jnz finish_rest ;DirectPath ALU 1
; jb finish_rest
movq [esi],mm0
add esi,8
jmp qword_copy
the cmp ecx,7F8h/jb always jumped so I changed it as above.
runs in 156 vs. previous best of 149
Example 2:
qword_copy:
pxor mm1,mm1
movq mm0,[eax]
pcmpeqb mm1,mm0 ;DirectPath FADD/FMUL 4
add eax,8 ;DirectPath ALU 1
; pmovmskb ecx,mm1 VectorPath FADD/FMULl 3
packuswb mm1,mm2 ;DirectPath FADD/FMUL 2
movd ecx,mm1 ;Double 4
or ecx,ecx ;DirectPath ALU 1
; cmp ecx,0FFFFFFFFh
jnz finish_rest
movq [esi],mm0
add esi,8
jmp qword_copy
I either don't understand how PACKUSWB is supposed to work or it doesn't work in this context.
packuswb of the word 7700h gives 00h, not 77h
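As far as I can tell the culprit is saturation. PACKUSWB packs signed words to unsigned bytes, so any compare-mask word whose high byte is FFh is a negative word and saturates to 00h, losing the match. PACKSSWB (the signed pack, which appears in code later in the thread) keeps every nonzero mask word nonzero. Here is a portable C sketch of the per-word rules, under my reading of the pack instructions' semantics (the function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Per-word saturation rules:
   PACKUSWB: signed 16-bit word -> unsigned byte (clamp to 0..255)
   PACKSSWB: signed 16-bit word -> signed byte  (clamp to -128..127) */
static uint8_t packus_word(int16_t w)
{
    if (w < 0)   return 0x00;
    if (w > 255) return 0xFF;
    return (uint8_t)w;
}

static uint8_t packss_word(int16_t w)
{
    if (w < -128) return 0x80; /* -128 */
    if (w > 127)  return 0x7F;
    return (uint8_t)(int8_t)w;
}
```

A compare-mask word can only be 0000h, 00FFh, FF00h, or FFFFh. PACKUSWB maps those to 00h, FFh, 00h, 00h, silently dropping two of the three "found it" cases, while PACKSSWB maps them to 00h, 7Fh, 80h, FFh, all distinguishable from zero.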
I analyzed your new ending, and here is the same thing rewritten as an alternate ending for you. Probably slightly slower on strings that are not a multiple of 4 bytes including the 0-byte terminator. Runs in 130 here.
finish_rest:
bsf ecx,ecx
cmp ecx,4
jb lowerx
mov edx,[eax-8]
mov [esi],edx
lowerx:
mov edx,[eax+ecx-11] ; misaligned 3/4 of the time but a lot less code.
mov [esi+ecx-3],edx
ret
93 cycles with MMX
96 cycles with XMM
Why longer with XMM?
Test piece attached: I nicked the MMX algorithm from Mark Larson and optimised it and commented it and converted it to XMM.
Timings are for my Pentium M 1.5GHz
[attachment deleted by admin]
I just realized my previous post of a different ending could corrupt the string preceding the destination for source strings less than 4 bytes long. Not good at all :(
Aero,
Did you try to run the code that you posted? On my system (a P3) the MMX procedure hangs, and when it does, with REALTIME_PRIORITY_CLASS, it takes Windows down with it. This is why I have recently been posting code with HIGH_PRIORITY_CLASS instead of REALTIME_PRIORITY_CLASS, to help the user recover from any mistakes I may have made.
BTW, you could save at least some of us some time and effort if you would indicate that the code requires MASM 7, or add-on macro support for the earlier versions.
Ok, here's an ending without the bugs:
finish_rest:
bsf ecx,ecx
cmp ecx,3
jb lowerx
mov edx,[eax-8] ; save first 4 bytes
mov [esi],edx
mov edx,[eax+ecx-11] ; do the rest
mov [esi+ecx-3],edx ; faster to do it than test
ret
lowerx: ; was 2 or 1 or 0
test cl,cl
jnz ItsOneOrTwo
mov byte ptr [esi],0 ; was zero, just save terminator
ret
ItsOneOrTwo:
movzx edx,word ptr[eax-8] ; either xx0? or x0??
mov [esi],dx
cmp ecx,2 ; ecx still holds the BSF index (the movzx must not clobber it)
je do_2
ret
do_2: ; xx0? ????
mov byte ptr [esi+2],0
ret ; ret
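For anyone trying to follow the offsets in the ending above, here is the same tail logic in portable C (a sketch with my own names; memcpy stands in for the dword moves, and idx is the BSF result, i.e. the offset of the zero byte within the 8-byte block just scanned, with the source pointer already advanced 8 past the block start):

```c
#include <assert.h>
#include <string.h>

/* dst: where the current 8-byte group would have been stored
   src: source pointer already advanced 8 past the block start
   idx: offset of the terminator within the block (0..7)       */
static void copy_tail(char *dst, const char *src, unsigned idx)
{
    const char *blk = src - 8;               /* start of the scanned block */
    if (idx >= 3) {
        memcpy(dst, blk, 4);                 /* first dword */
        memcpy(dst + idx - 3, blk + idx - 3, 4); /* overlapping last dword, includes the 0 */
    } else if (idx == 0) {
        dst[0] = '\0';                       /* empty: just the terminator */
    } else {                                 /* idx is 1 or 2 */
        memcpy(dst, blk, 2);                 /* either xx0? or x0?? */
        if (idx == 2)
            dst[2] = '\0';
    }
}
```

The two dword stores overlap whenever idx is below 7, which is the "misaligned 3/4 of the time but a lot less code" trade mentioned earlier in the thread; the write never goes past the terminator's position plus the dword that covers it.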
AeroASM, :lol
You can use my zip file...
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment *MMX by Lingo*
align 16
@@:
movq MM1, [ecx+eax]
movq [edx+eax-8], MM0
@@2:
movq MM0, [ecx+eax]
pcmpeqb MM1, MM7
packsswb MM1, MM1
movd ebx, MM1
test bl, bl
lea eax, [eax+8]
je @B
@@:
movzx ebx, byte ptr [ecx+eax-8]
add eax, 1
test bl, bl
mov [edx+eax-9], bl
jnz @B
mov ebx, [esp+2*4]
ret 2*4
align 16
SzCpy10 proc SzDest:DWORD, SzSource:DWORD
mov ecx, [esp+2*4] ; ecx = source
xor eax, eax
mov edx, [esp+1*4] ; edx= destination
pxor MM7, MM7
movq MM1, [ecx]
mov [esp+2*4], ebx ; save ebx register
je @@2
SzCpy10 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Here are the results:
P4 3.6GHz Prescott
Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.
128-byte string copy timing results:
SzCpy10 (Lingo -> MMX): 120 clocks
SzCpy19 (Mark Larson ->MMX):171 clocks
SzCpy11 (Lingo-> SSE): 116 clocks
SzCpy18 (Mark Larson ->SSE):132 clocks
Press ENTER to exit...
Regards,
Lingo
[attachment deleted by admin]
MichaelW: I did test it many times, and it takes about 15 seconds each time on a 1.5GHz.
Aero,
Correction, the procedure does not hang, but with REALTIME_PRIORITY_CLASS Windows did. After a change to HIGH_PRIORITY_CLASS the procedure takes ~40 seconds to run (P3-500), so perhaps this was just too long for REALTIME_PRIORITY_CLASS.
I added a function test using your source string, and another one that substituted a high-ascii character at the start of the source string, and in both cases the string copied OK. The cycle count for the MMX version was a uniform 139. The XMM version generated errors, so I'm guessing it contains at least one instruction that my P3 does not support.
Quote from: lingo on May 16, 2005, 02:35:51 AM
SzCpy10 (Lingo -> MMX): 120 clocks
SzCpy19 (Mark Larson ->MMX):171 clocks
SzCpy11 (Lingo-> SSE): 116 clocks
SzCpy18 (Mark Larson ->SSE):132 clocks
Press ENTER to exit...
Regards,
Lingo
Three things Lingo.
1) You modified my code, yet still list the timings as mine. If you are going to benchmark my code and list it as my code, don't change the code (the SzCpy18 routine). Not to mention this throws off any other comparative timings people are doing with my code.
2) I made an update on the previous page to my code. With the update my code runs in 87 cycles. You didn't pick that up either.
3) You need to run your code in realtime with a few more loops. There's no way the SSE code I posted is running in 132 cycles on your P4, considering the original timing without the update was running at about 100 cycles. When I updated your timing code to use REALTIME and more loops, the code I posted last week runs in 87 cycles, as I was expecting.
EDIT: Here are the timings after I changed my code back to its original form and added the code update from last week, along with the more accurate timings. Now when I run it multiple times, I only get a 1-cycle variance.
128-byte string copy timing results:
SzCpy10 (Lingo -> MMX): 116 clocks
SzCpy19 (Mark Larson ->MMX):119 clocks
SzCpy11 (Lingo-> SSE): 81 clocks
SzCpy18 (Mark Larson ->SSE):87 clocks
Good morning Mark-
May I have a full copy of your latest 87-clock version? The bits and pieces needed to construct it are confusing me. Also, doesn't Lingo's version still suffer from the problem of writing outside the boundary of the destination string if it's exactly the size to receive the source string, and it is not a multiple of 8 bytes?
Mark Larson, :(
Quote
1) You modified my code, yet still list the timings as mine.....
No, I didn't modify it because your code is not so valuable for me and it is not in my style..
I just used "your" code from the file szCopyXMMX.zip
posted by AeroASM in prev page
Quote
Why longer with XMM?
Test piece attached: I nicked the MMX algorithm from Mark Larson and optimised it and commented it and converted it to XMM.
Timings are for my Pentium M 1.5GHz
So it should be Mark Larson@AeroASM or not.. ?..
Quote
2) I made an update on the previous page to my code. With the update my code runs in 87 cycles.
Congratulations
Quote
3) You need to run your code in realtime with a few more loops. There's no way the SSE code I posted is running in 132 cycles on your P4
"There's no way the SSE code I posted"
I don't know your "original" code
I just use "your" code from the file szCopyXMMX.zip
posted by AeroASM in prev page
" is running in 132 cycles on your P4"
It is the truth... There are people here with P4 Prescott and WinXP Pro SP2 and they can test my file too...
Regards,
Lingo
Quote from: lingo on May 16, 2005, 05:14:38 PM
There is an error in usage of source and destination parameters
in the beginning of the "your" or AeroASM code and that is the reason
for "Why longer with XMM?" question.
It is my implementation of Mark's algorithm.
What is the error?
I have always found some humour in the direction that such topics develop and how far they wander from the original design. The first examples from Mark Jones were general purpose, byte aligned on both source and destination, and could be used under almost any circumstances, which is what unaligned string data requires.
I now see code designs that require complicated starting alignment correction on the source yet still require aligned targets, which reduces the algos to more or less novelties for general purpose byte aligned string data.
I would like to have the luxury of performing OWORD data transfers in an efficient manner, but the latest hardware only barely supports it with a fudge in 64 bit, and all of the later hardware requires at least natural alignment of the data size to run properly.
I would like to thank those who helped with the BYTE aligned code testing, as I have ended up with an algo that is about 90% faster than the original I had in the library and it still satisfies the criterion of being a general purpose algorithm.
:lol
Quote
Hey Hutch--, there was one other problem with Lingo's code. It writes in multiples of 8 bytes, and doesn't handle non-divisible-by-8 string sizes. So it copies extra data past the end of the source string to the destination string. In addition, if the destination string was dynamically allocated you might be writing to unowned memory and cause a crash. In either case you will be writing to variables that exist past the destination string if it isn't always 7 or more bytes longer than the source.
1. What is the length of the buffer with the source string, and is it OK to read data past the end of the source string? (We don't know the length of the source string.)
2. How long is the buffer with the destination string and how was it allocated? Is it equal to the source buffer or not?
3. Who decided the length of the destination buffer should equal the length of the source string (not the buffer)? (We don't know the length of the source string.)
A. Mark Larson, and I see a lack of programmer's experience here... Why?
Because as a result we must write additional slow code (to copy the last bytes byte by byte). Who is guilty of the slower code?
Mark Larson and all the people that believe him... :lol
Regards,
Lingo
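Whatever one thinks of the tone, the sizing question is real: a copier that always stores whole 8-byte groups can write up to 7 bytes past the terminator, so the destination must be sized for that. A one-line sketch of the rule (the function name is mine, not from the thread):

```c
#include <assert.h>

/* Round len+terminator up to the next multiple of the store width
   (8 here), giving the minimum safe destination size for a copier
   that always writes whole qwords. len excludes the 0 terminator. */
static unsigned long safe_dst_size(unsigned long len)
{
    return ((len + 1) + 7) & ~7UL;
}
```

For the 128-byte test string this gives 136 bytes, i.e. the destination in the benchmark needs 7 bytes of slack after the terminator for the qword versions to be safe.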
Here is the latest incarnation of szCopyXMMX. MMX runs in 81 and XMM in 98 on my Pentium M 1.5GHz (have I said that too many times?)
I am still puzzling over alignment: align szCopyMMX to 16 slows it down to 107.
I am also still puzzling over why tf the XMM is far slower than the MMX.
Aero,
I think it's because internally a 32 bit processor emulates 64 and 128 bit data transfers in 32 bit chunks.
Actually the later Pentiums have 64 bit data buses but I think you are right. Also I think that XMM is not yet very well developed and refined.
AeroASM, :lol
It is the result from your "old" file szCopyXMMX.asm
Quote
C:\0A1>szcopy~1.exe
173 cycles with MMX
132 cycles with XMM
Regards,
Lingo
Quote from: lingo on May 16, 2005, 06:14:04 PM
1. What is the length of the buffer with the source string, and is it OK to read data past the end of the source string?
That's a very good question. Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek
Quote from: AeroASM on May 16, 2005, 06:16:21 PM
Here is the latest incarnation of szCopyXMMX. MMX runs in 81 and XMM in 98 on my Pentium M 1.5GHz (have I said that too many times?)
I am still puzzling over alignment: align szCopyMMX to 16 slows it down to 107.
I am also still puzzling over why tf the XMM is far slower than the MMX.
That's because the latency on the P4 for SSE2 is a lot worse than MMX. You can't convert it and always expect it to run faster; there are additional tricks you have to do. I'll take a look at it later, I'm swamped at work at the moment. The general trick is to do TWO of whatever you are doing in the main loop to help break dependencies and give a speed up.
Quote from: lingo on May 16, 2005, 05:14:38 PM
Quote3) You need to run your code in realtime with a few more loops. There's no way the SSE code I posted is running in 132 cycles on your P4
It is the truth... There are people here with P4 Prescott and WinXP Pro SP2 and they can test my file too...
You missed the point Lingo. I compiled and ran your code multiple times and I got large differences in the number of cycles it took to execute. In your own SSE code, I saw a 12 cycle variance. If you are going to do code optimization for low cycle count procedures you shouldn't have more than a 1 or 2 cycle variance.
Quote from: lingo on May 16, 2005, 05:14:38 PM
Mark Larson, :(
Quote
1) You modified my code, yet still list the timings as mine.....
No, I didn't modify it because your code is not so valuable for me and it is not in my style..
I just used "your" code from the file szCopyXMMX.zip
posted by AeroASM in prev page
That's cuz Aero made changes to my code. So when you post timings, you need to post that it's my code but modified by Aero.
JimG, here's my latest. I haven't had much chance to tweak since I've been super busy lately. I should be able to drop another 10-15 cycles maybe more after playing with it. I wanted to try PSADBW in place of PMOVMSKB on P4 and see if it's faster. And a few other things.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
szCopyMMX proc near src_str:DWORD, dst_str:DWORD
mov eax,[esp+4]
mov esi,[esp+8]
align 16
qword_copy:
pxor mm1,mm1
movq mm0,[eax]
pcmpeqb mm1,mm0
add eax,8
pmovmskb ecx,mm1
or ecx,ecx
jnz finish_rest
movq [esi],mm0
add esi,8
jmp qword_copy
finish_rest:
;if 0
bsf ecx,ecx
cmp ecx,7
je do_7
cmp ecx,6
je do_6
cmp ecx,5
je do_5
cmp ecx,4
je do_4
cmp ecx,3
je do_3
cmp ecx,2
je do_2
cmp ecx,1
je do_1
mov byte ptr [esi],0 ; index was 0: empty string, just store the terminator
ret 8 ; ret
do_7:
mov ecx,[eax-8]
mov edx,[eax+4-8] ;really copy 8 bytes to include the 0
mov [esi],ecx
mov [esi+4],edx
ret 8 ; ret
do_6:
mov ecx,[eax-8]
movzx edx,word ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dx
mov byte ptr [esi+6],0
ret 8 ; ret
do_5:
mov ecx,[eax-8]
movzx edx,byte ptr[eax+4-8]
mov [esi],ecx
mov [esi+4],dl
mov byte ptr [esi+5],0
ret 8 ; ret
do_4:
mov ecx,[eax-8]
mov [esi],ecx
mov byte ptr [esi+4],0
ret 8 ; ret
do_3:
movzx ecx,word ptr[eax-8]
movzx edx,byte ptr[eax+2-8]
mov [esi],cx
mov [esi+2],dl
mov byte ptr [esi+3],0
ret 8 ; ret
do_2:
movzx ecx,word ptr[eax-8]
mov [esi],cx ; store the two characters (this store was missing)
mov byte ptr [esi+2],0
ret 8 ; ret
do_1:
movzx ecx,byte ptr [eax-8]
mov [esi],cl
mov byte ptr [esi+1],0 ; don't forget the terminator
ret 8 ; ret
szCopyMMX endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Quote from: AeroASM on May 15, 2005, 05:06:03 PM
93 cycles with MMX
96 cycles with XMM
Why longer with XMM?
Test piece attached: I nicked the MMX algorithm from Mark Larson and optimised it and commented it and converted it to XMM.
Timings are for my Pentium M 1.5GHz
Took me a while to run your code, because I was so busy. I forgot earlier that you said you had a Pentium 4 M. They are more like a P3 in optimization than a P4. A lot of the instructions with long cycle counts on a P4 are lower latency on a P4 M. However you still might want to try doing 2 per loop to break dependencies. I'll play with it if I get a chance.
117 cycles with MMX
97 cycles with XMM
Your modified version of my MMX routine runs 10 cycles slower on my P4 than my original code. But I am sure it runs faster on your P4 M or you wouldn't have posted it. I am better at optimizing for a standard P4, since that's what I have :)
Not Pentium 4-M, but Pentium M. After the P3, Intel souped it up to make the P4, added mobility (battery saving) stuff and made the Pentium M, then put some of the mobility stuff on the P4 to make the P4-M (AFAIK).
You can save clocks by organising the qword loop like this:
Instead of this:
;preliminary stuff
qword_copy:
;stuff 1
jnz somewhere_else
;stuff 2
jmp qword_copy
somewhere_else:
Put this:
;preliminary stuff
jmp copy_start
qword_copy:
;stuff 2
copy_start:
;stuff 1
jz qword_copy
somewhere_else:
This is faster because each pass through the loop executes only the single conditional jump at the bottom, instead of falling through the jnz and then taking the unconditional jmp back.
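The same transformation can be seen outside of asm (a C sketch of my own; the summing body is arbitrary, only the branch structure matters): rotating a top-tested loop into a bottom-tested one entered by a single jump removes the unconditional branch from every iteration.

```c
#include <assert.h>

/* Top-tested shape: in asm this costs a conditional branch plus an
   unconditional jmp back on every iteration. */
static int sum_top_tested(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Rotated shape: jump once to the test, then each iteration falls
   through the body and takes a single conditional branch. */
static int sum_rotated(const int *a, int n)
{
    int s = 0, i = 0;
    goto loop_test;      /* the one-time "jmp copy_start" */
loop_body:
    s += a[i];           /* "stuff 2" */
    i++;
loop_test:
    if (i < n)           /* "stuff 1" */
        goto loop_body;  /* single conditional branch per iteration */
    return s;
}
```

Both shapes compute the same result; optimizing compilers perform exactly this rotation on hot loops for the same reason.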
The Pentium 4-M and Pentium M are the same processor. I just like using "Pentium 4 M" for people who aren't familiar with the different processor lines, so they can see it's a P4 part (has SSE2 support). Though it was designed more from a P3 than a P4, whereas the P4 is a significant departure from the P3. Intel has gotten worse and worse with their processors. Prescott has even longer instruction latencies (eww) than the original P4, and slower L1 and L2 cache access times. One of the things that Intel has always beaten AMD on is L1 cache access latency (2 clock cycles versus AMD's 3). The current Prescotts have 3 cycles of L1 cache latency. The just-released Intel parts are even worse with 4 cycles of L1 cache latency. So reads from memory, even if they are in the L1 cache, are getting more expensive.
I swear they are not, but I am probably mistaken.
Currently there are no desktop systems based off a Pentium M (which is what you were saying). Tom's Hardware modified a desktop motherboard to plug in one of the Pentium M chips so they could use it on the desktop. That's the closest a desktop version has come. The Pentium 4 M and Pentium M are the same processor. Intel has made several comments in the press that they will eventually release a Pentium M for the desktop, but none has come out yet. Because of my job I have to know low-level hardware very well.
You might find this link interesting. I did a search on "Pentium 4-M" on Google. Tom's Hardware is comparing 4 mobile (read: for laptops) Pentium 4-M parts (it calls them by that exact name).
http://www.tomshardware.com/mobile/20030212/
I also did a search on "Pentium M", and found another link on Tom's Hardware doing reviews of laptop systems, where they call it "Pentium M"
http://www.tomshardware.com/mobile/20030205/
Both reviews are on the same site, and both reviews use both names.
still slow (without preserving esi) :lol
Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.
128-byte string copy timing results:
SzCpy10 - > Lingo -> MMX: 123 clocks
szCopyMMX ->Mark Larson ->MMX: 135 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 113 clocks
szCopyMMX1 - > Mark Larson -> MMX-> Fast: 124 clocks
Press ENTER to exit...
[attachment deleted by admin]
Quote from: Jimg on May 16, 2005, 07:20:29 PM
Quote from: lingo on May 16, 2005, 06:14:04 PM
1. What is the length of the buffer with the source string, and is it OK to read data past the end of the source string?
That's a very good question. Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek
Would someone please address this for the clueless?
Quote from: Jimg on May 17, 2005, 02:21:41 PM
Quote from: Jimg on May 16, 2005, 07:20:29 PM
Quote from: lingo on May 16, 2005, 06:14:04 PM
1. What is the length of the buffer with the source string, and is it OK to read data past the end of the source string?
That's a very good question. Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek
Would someone please address this for the clueless?
I missed this the first time it was posted. I've been super busy at work and working a lot of overtime. That always happens when you are at a new place, because you spend a lot of time ramping up. So that cuts down on my ability to optimize; I can't spend as much time on it as I'd like to.
I have heard of cases where reading past the end of a buffer can cause a fault ( dynamically allocated buffer). However, I have not experienced it for myself. And Agner Fog's string length routine actually reads past the end of the buffer since it reads data a dword at a time. In general you should be much safer doing a read past the end of the buffer versus doing a write. I'd never do a write that could potentially go past the end of the buffer. In the Intel Optimization manual I believe it says something about reads going past a page boundary causing a fault. But I could be misremembering. It's been a while, and I've never observed the behavior. Agner Fog's strlen routine has been in use in MASM32 for some time. Hutch-- made some modifications, and I've never heard of anyone reporting a problem. So if you do read past the end of some data, just be aware that there could be a problem, and be cautious.
Quote from: lingo on May 16, 2005, 10:11:22 PM
still slow (without preserving esi) :lol
Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.
128-byte string copy timing results:
SzCpy10 - > Lingo -> MMX: 123 clocks
szCopyMMX ->Mark Larson ->MMX: 135 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 113 clocks
szCopyMMX1 - > Mark Larson -> MMX-> Fast: 124 clocks
Press ENTER to exit...
Your code is using SSE2 not SSE. SSE doesn't support the full 16 byte ALU instructions such as MOVDQA, PXOR, etc. I had been limiting myself to SSE since that's what I thought you were using, and I wanted to keep apples to apples. I just converted my SSE code to use SSE2 like yours uses, and now I am running in 62 cycles. Yours runs in 76 cycles. You can do the same conversion yourself on my code by simply converting the instruction to use SSE2 ( movdqa, etc).
For those that want a visual aid, here's the main loop.
align 16
dqword_copy:
movdqa xmm0,[eax]
pxor xmm1,xmm1
pcmpeqb xmm1,xmm0
add eax,16
pmovmskb ecx,xmm1
test ecx,ecx
jnz finish_rest
movdqa [esi],xmm0
add esi,16
jmp dqword_copy
I don't know what you're talking about? :naughty:
In my zip file I have just MMX code...
Where is your zip file with tests, so I can understand what you mean?
Reading past the end of a buffer (under Windows 2000 SP4) causes no problems that I can see, at least without a debugger. I didn't have time to test buffers allocated by other means, or writes past the end.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.486 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
teststr db 4097 dup('X'),0
lpMem dd 0
junkstr db 0,0,0,0,0
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov eax,input("Read past end of statically allocated buffer, press enter to continue...")
invoke StrLen,ADDR teststr
mov ebx,eax
print chr$("Returned string length :")
print ustr$(ebx)
print chr$(13,10)
mov ebx,OFFSET teststr
add ebx,4097+16
mov eax,[ebx]
invoke GlobalAlloc,GMEM_FIXED or GMEM_ZEROINIT,4096
mov lpMem,eax
invoke GlobalSize,lpMem
mov ebx,eax
print chr$("Size of dynamically allocated buffer :")
print ustr$(ebx)
print chr$(13,10)
invoke memfill,lpMem,4096,'LLIF'
mov eax,input("Read past end of dynamically allocated buffer, press enter to continue...")
mov ebx,lpMem
add ebx,4095
mov eax,[ebx]
mov edx,[ebx+16]
mov DWORD PTR junkstr,eax
print ADDR junkstr
free lpMem
mov eax,input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Quote from: lingo on May 17, 2005, 05:16:39 PM
Where is your zip file with tests to understand what you mean?
For timing purposes I added my routine to your code, so that we used the same test data. I just modified the timing to use REALTIME and added 10 times as many loops. That way I can get more accurate benchmarking against your code since we are using the same test data.
Quote from: lingo on May 17, 2005, 05:16:39 PM
I don't know what you talking about? :naughty:
In my zip file I have just MMX code...
Not true. Here's the text you print for your szCpy11 routine:
SzSzCpy11 DB "SzCpy11 (Lingo-> SSE): ",0
So you are saying it is just SSE. However here is some code from your szCpy11 routine.
comment * XMM by Lingo*
align 16
@@:
movdqa [edx+eax],XMM0
add eax,16
@@XM:
pcmpeqb XMM1, [ecx+eax]
movdqa XMM0, [ecx+eax]
MOVDQA is an SSE2 instruction, not SSE. Doing PCMPEQB with an XMM register is an SSE2 instruction, not SSE. Doing any of the old MMX instructions on an XMM register instead of an MMX register requires SSE2 support. If you are unsure in the future, check Appendix B in the second volume of the P4 instruction manual. It gives a complete listing of what instructions were added with each type of support; it has a section for all the MMX instructions, SSE, and SSE2, so you can clearly see what processor you need when you use a given instruction. If someone without SSE2 support tries to run your code, it'll die the big death (invalid opcode). That means most people with AMDs, since unless you have an AMD64 you won't have SSE2 support, and anyone with a P3 or older Intel processor.
Bla, bla... blah..... :naughty:
I'm afraid you mean my older zip file with AeroASM's code in it...
Use my last zip file (named StillSlow.zip) at this page (page 7)
Where is your zip file with tests, so all the people can understand how slow your code is?
No file, just bla, bla blah :naughty:
Quote from: lingo on May 17, 2005, 08:35:08 PM
Bla, bla... blah..... :naughty:
I'm afraid you mean my older zip file with AeroASM's code in it...
Use my last zip file (named StillSlow.zip) at this page (page 7)
Where is your zip file with tests, so all the people can understand how slow your code is?
No file, just bla, bla blah :naughty:
I got better things to do than waste my time on someone who is rude and arrogant. You are just upset because my SSE2 code is faster than yours. Grow up. And BTW the timings on my computer for the latest code you posted are the same speed. And you still haven't gotten it right, your latest code says it's MMX, and it's not. It's SSE. If you spent more time listening instead of being rude, you would have already learned that fact.
Quote from: Jimg on May 17, 2005, 02:21:41 PM
Quote from: Jimg on May 16, 2005, 07:20:29 PM
Quote from: lingo on May 16, 2005, 06:14:04 PM
1. What is the length of the buffer with the source string, and is it OK to read data past the end of the source string?
That's a very good question. Is it possible to have the source string right at the end of your allotted space and cause a page fault by trying to read past it? Every one of these routines has this possible problem :eek
Would someone please address this for the clueless?
Well, I had to answer my own question....
; assemble and link parameters
;Assemble=/c /coff /Cp /nologo /Sn
;Link=/SUBSYSTEM:WINDOWS /RELEASE /VERSION:4.0
.586
.model flat, stdcall
option casemap :none ; case sensitive
include user32.inc
includelib user32.lib
include kernel32.inc
includelib kernel32.lib
.data
bln db ' ',0
db 4087 dup (0)
done db 'done',0
badstring db '0'
.code
Program:
invoke GetModuleHandle, 0
mov edx,offset badstring
@@: mov eax,[edx] ; get 4 bytes as most of these routines are doing
invoke MessageBox,0,addr done,addr bln,0
invoke ExitProcess, eax
end Program
The mov eax,[edx] bombs because it tries to read past its own address space.
This could be a very hard problem to debug.
So, what to do about all these general purpose routines?
[attachment deleted by admin]
Could the routine used at compile time depend on the alignment of the data structures? Say we made a macro called szCopy which would choose between a general-purpose and custom algorithm(s) at compile time. If the data is not aligned then a default byte-wise copy algorithm is used, whereas if the coder aligned all the referenced data to DWORD offsets, then an MMX routine would be used instead? Just an idle thought, I know very little about macro capability. :red
I am locking this topic because of a number of insults that have been passed in it, not because the topic has not been interesting or useful.
This is a technical forum with rules about good manners among members and this is not going to change. Please respect this, as the alternative drives skilled and useful people away, which is a loss for everyone.