memory copy ...

Started by James Ladd, May 15, 2005, 12:36:49 AM


James Ladd

Is there a simple memory copy routine out there for MASM?

mnemonic

Hi,

Yes, there is one shipped with the masm32 package:
MemCopy proc public uses esi edi Source:PTR BYTE,Dest:PTR BYTE,ln:DWORD

Regards
Be kind. Everyone you meet is fighting a hard battle.--Plato
-------
How To Ask Questions The Smart Way

Tedd

* Tedd waits for the torrent of memcopy implementations by members..  :bdg
No snowflake in an avalanche feels responsible.

AeroASM

Can we move this to the Lab so we can indeed have competitions about who can copy memory the fastest?

hutch--

For what it's worth, the REP MOVSD version in the masm32 library works OK as a general purpose memory copy, but last time I played with a 4 DWORD version that paid attention to avoiding read after write stalls by using more registers, it was a bit faster on both the PIVs I work on. The REP MOVSD versions all suffer the same problem in that they are slow for the first 64 bytes or so, until the special case circuitry kicks in.

If you are repeatedly hammering memory copies of under about 250 bytes, an incremented pointer version is a lot faster, but the REP MOVSD method catches up fast after that. MOVQ versions are faster when done properly and, if the hardware support is there, an XMM version distinguishing between temporal and non-temporal reads and writes is faster again.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Mark_Larson


  Striker, this issue comes up all the time. The general answer is: it depends on how much data you are copying.

  1) less than 64 bytes: use MOV, MOVQ, or MOVDQA
  2) 64+ bytes: use REP MOVSD
  3) 1 KB plus: use MOVQ or MOVDQA
  4) 2 MB plus: use MOVNTDQ or other non-temporal stores.

  This is a rough estimate, and might vary from processor to processor. So look at your data size, and then try a few different ways to see which works best for your specific application. A lot of it is dependent on memory speed. The biggest gain is with non-temporal stores in the multi-megabyte range. I was showing someone on win32 how to do that for their graphics program. They did different types of copies for different sized sets of data, and had all the results tabulated, but for really large sizes it was running slow, so I showed them how to speed it up using non-temporal stores (scroll down)
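Those thresholds can be sketched as a size dispatch in C (the cutoffs are the rough numbers from the list above; the function name is made up for illustration):

```c
#include <stddef.h>

/* Pick an instruction family by copy size, following the rough
   thresholds above. The cutoffs are approximate and should be
   re-benchmarked per processor. */
const char *pick_copy_strategy(size_t n)
{
    if (n < 64)
        return "mov/movq/movdqa";        /* 1) small copies     */
    if (n < 1024)
        return "rep movsd";              /* 2) 64 bytes and up  */
    if (n < 2u * 1024 * 1024)
        return "movq/movdqa loop";       /* 3) 1 KB and up      */
    return "movntdq (non-temporal)";     /* 4) 2 MB and up      */
}
```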


http://board.win32asmcommunity.net/index.php?topic=20846.0

Here's a bit of a cut and paste showing different sized buffers and how much faster MOVNTPS is (that's what I showed him to use, since it does 16-byte writes and is supported with SSE). With 80 MB on his system MOVNTPS (move non-temporal) is 2.5 times faster. As the size of the buffer drops, the speed difference between the two algorithms drops. In the example below MOVNTPS is slower up till about 0.614 MB. So that's why I tend to use 1 MB as a guideline for when to try it. As always, benchmark your code. Notebook systems and older systems will have slower memory bandwidth.


;mem write
;-------------
;movaps  [esi+n],xmm0,  80MB, LU=4 :     1.75 GB/s
;movntps [esi+n],xmm0,  80MB, LU=4 :     4.25 GB/s  :-)
;
;movaps  [esi+n],xmm0,  0.2MB, LU=4 :     8.47 GB/s  <= Should make sense
;movntps [esi+n],xmm0,  0.2MB, LU=4 :     4.22 GB/s
;
;movaps  [esi+n],xmm0,  0.4MB, LU=4 :     6.84 GB/s
;movntps [esi+n],xmm0,  0.4MB, LU=4 :     4.26 GB/s
;
;movaps  [esi+n],xmm0,  0.614MB, LU=4 :   4.38 GB/s  (640x480x16 Buffer in ram)
;movntps [esi+n],xmm0,  0.614MB, LU=4 :   4.23 GB/s  (640x480x16 Buffer in ram)
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

hutch--

Here is a quick play. It's an unrolled DWORD copy timed against the REP MOVSD algo in the masm32 library. The main loop is two blocks of 8 movs; the design was to separate reads from writes using the same registers to avoid read after write stalls. The two blocks are an unroll by 2.

On the Prescott PIV I am using, aligning the label for the main block to 16 made no difference, so I left it at 4.

Algo assumes at least 4 byte alignment for both source and destination buffers.

The timings I am getting on a 200 meg block of memory aligned by at least 4 bytes are 437 ms for REP MOVSD and 375 ms for the version posted below.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

dcopy proc src:DWORD,dst:DWORD,cnt:DWORD

    LOCAL dcnt  :DWORD

    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst
    mov ecx, cnt
    test ecx, ecx   ; guard against a zero count
    jz bcend
    cmp ecx, 32
    jl tail

    shr ecx, 5      ; div by 32
    mov dcnt, ecx

  align 4
  body:
    mov eax, [esi]
    mov ebx, [esi+4]
    mov ecx, [esi+8]
    mov edx, [esi+12]
    mov [edi],    eax
    mov [edi+4],  ebx
    mov [edi+8],  ecx
    mov [edi+12], edx

    mov eax, [esi+16]
    mov ebx, [esi+20]
    mov ecx, [esi+24]
    mov edx, [esi+28]
    mov [edi+16], eax
    mov [edi+20], ebx
    mov [edi+24], ecx
    mov [edi+28], edx

    add esi, 32
    add edi, 32
    sub dcnt, 1
    jnz body

    mov ecx, cnt
    and ecx, 31
    jz bcend

  tail:
    mov al, [esi]
    add esi, 1
    mov [edi], al
    add edi, 1
    sub ecx, 1
    jnz tail

  bcend:

    pop edi
    pop esi
    pop ebx

    ret

dcopy endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
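For a rough cross-check outside MASM, the same style of measurement can be sketched in C (an illustrative harness, not the test piece used in the thread; the thread used ~200 MB, but the buffer size is the caller's choice):

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time one large memcpy of n bytes and return milliseconds, or -1.0
   on allocation failure or a failed copy. Resolution is limited by
   clock(), so small buffers may report 0 ms. */
double time_copy_ms(size_t n)
{
    char *src = malloc(n), *dst = malloc(n);
    double ms = -1.0;

    if (src && dst) {
        memset(src, 0xA5, n);           /* touch the source pages */
        clock_t t0 = clock();
        memcpy(dst, src, n);
        clock_t t1 = clock();
        if (memcmp(src, dst, n) == 0)   /* sanity-check the copy */
            ms = 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
    }
    free(src);
    free(dst);
    return ms;
}
```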

MichaelW

I couldn't resist playing with this, so I created a test app that compares the MASM32 MemCopy to Hutch's dcopy, along with two MMX versions and two SSE versions. I coded the MMX and SSE versions without bothering to learn the details, so try not to laugh. And as you might expect, Hutch's dcopy is the fastest, but only by a small margin. Why exactly are the MMX and SSE versions slower than the ALU versions?

These results are for my P3 and a 100MB buffer size. The "1"s along the top are the return values for the MASM32 cmpmem procedure that I used as part of a function test for each of the procedures. I had a lot of other stuff running while I was running the tests, but each time after the test app terminated System Information showed ~390000 KB available.

111111
MemCopy - rep movsd                         : 273834438 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 269703703 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 412743712 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 274738762 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 270069732 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 312140267 cycles



[attachment deleted by admin]
eschew obfuscation

hutch--

This is a test piece; it does not handle odd numbers of bytes at the tail. I timed it on the same 200 meg sample and the time drops to about 300 ms, but there are some unusual anomalies in how it runs. It can be padded with a large number of nops and not run any slower on this PIV. I have changed the order, and the only thing that slows it down is to place the last non-temporal write at the beginning instead of in order. Interleaving the reads and writes did not affect the timing at all, and running prefetchnta and with other hint types every 4k down to 128 bytes did not affect the timing at all.

It is faster than either REP MOVSD or the version posted above, but not by as much as you would expect. Change the MOVNTQ to MOVQ and it runs at about the same speed as the one posted above.



      nops MACRO number
        REPEAT number
          nop
        ENDM
      ENDM


mmxcopy proc src:DWORD,dst:DWORD,cnt:DWORD

    LOCAL lcnt  :DWORD

    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst
    mov ecx, cnt

    xor ebx, ebx                ; zero ebx

    shr ecx, 6                  ; div by 64
    jz quit                     ; fewer than 64 bytes, nothing to do
    mov lcnt, ecx

    mov ecx, esi
    mov edx, edi

  align 4
  stlp:
    movq mm(0), [esi]
    movq mm(1), [ecx+8]
    movq mm(2), [esi+16]
    movq mm(3), [ecx+24]
    movq mm(4), [esi+32]
    movq mm(5), [ecx+40]
    movq mm(7), [esi+56]
    movq mm(6), [ecx+48]

    add esi, 64
    add ecx, 64

    movntq [edi],    mm(0)
    movntq [edx+8],  mm(1)
    movntq [edi+16], mm(2)
    movntq [edx+24], mm(3)
    movntq [edi+32], mm(4)
    movntq [edx+40], mm(5)
    movntq [edi+48], mm(6)
    movntq [edx+56], mm(7)

    add edi, 64
    add edx, 64

    ;; nops 96

    sub lcnt, 1
    jnz stlp

    sfence                      ; make the non-temporal writes globally visible
    emms                        ; clear MMX state before returning

  quit:

    pop edi
    pop esi
    pop ebx

    ret

mmxcopy endp

James Ladd

I have my answer but I suspect this thread will continue for some time :)

roticv

Hutch,

Unaligned data will slow your code. I suggest that you include another loop so that your data is aligned before going to the main loop.
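The pre-loop roticv suggests looks roughly like this in portable C (an illustrative sketch, not code from the thread): copy single bytes until the destination is aligned, then run the wide-move main loop, then mop up the tail.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-copy until the destination is 8-byte aligned, then move
   8 bytes at a time, then finish the remaining bytes. */
void aligned_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n && ((uintptr_t)d & 7)) {   /* head: align the destination */
        *d++ = *s++;
        n--;
    }
    while (n >= 8) {                    /* body: aligned 8-byte moves */
        memcpy(d, s, 8);                /* one 64-bit load/store once optimised */
        d += 8; s += 8; n -= 8;
    }
    while (n) {                         /* tail: leftover bytes */
        *d++ = *s++;
        n--;
    }
}
```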

hutch--

Victor,

The MMX version needs at least 8-byte alignment, but that's not what I am testing with it; it's the absolute data transfer rate, which does not seem to be all that much faster than the integer versions.

roticv

I think you are missing the prefetch code.

Mark_Larson

Quote from: hutch-- on May 18, 2005, 05:17:41 AM
This is a test piece; it does not handle odd numbers of bytes at the tail. I timed it on the same 200 meg sample and the time drops to about 300 ms, but there are some unusual anomalies in how it runs. It can be padded with a large number of nops and not run any slower on this PIV. I have changed the order, and the only thing that slows it down is to place the last non-temporal write at the beginning instead of in order. Interleaving the reads and writes did not affect the timing at all, and running prefetchnta and with other hint types every 4k down to 128 bytes did not affect the timing at all.

It is faster than either REP MOVSD or the version posted above, but not by as much as you would expect. Change the MOVNTQ to MOVQ and it runs at about the same speed as the one posted above.

   Also try moving where the "prefetchnta" instruction is. I wrote a program to automatically try all combinations of offsets and locations of the instruction in a loop and find the "optimum" one. I'll see if I can find it and post it. You only want to prefetch the source, not the destination, since you are writing directly to memory. I'll take a try at your code and see if I can speed it up any, when I get a chance.

EDIT: Memory copies are heavily dependent on the maximum peak memory bandwidth, so some systems might run your code really slow and others really fast. Currently it is running in 828 ms for 200 MB on mine with no modifications.
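The source-only prefetch described above can be sketched with the GCC/Clang `__builtin_prefetch` intrinsic (the 256-byte distance here is an arbitrary placeholder, not a tuned offset; the optimum is machine-specific, which is exactly what the search program hunts for):

```c
#include <stddef.h>
#include <string.h>

/* Distance in bytes to prefetch ahead of the read pointer.
   256 is a guess; the best value varies per machine. */
#define PREFETCH_DIST 256

/* Copy that prefetches the source (only) a fixed distance ahead.
   Prefetch of an invalid address does not fault, so running past
   the end of the buffer is harmless. */
void copy_with_prefetch(char *dst, const char *src, size_t n)
{
    size_t i = 0;
    for (; i + 64 <= n; i += 64) {
        __builtin_prefetch(src + i + PREFETCH_DIST, 0, 0); /* read, low locality */
        memcpy(dst + i, src + i, 64);   /* one cache line per iteration */
    }
    memcpy(dst + i, src + i, n - i);    /* tail */
}
```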


hutch--

This is the SSE version of the same algo, and on my PIV it runs at exactly the same timing as the MMX version. The box is a 2.8 GHz Prescott PIV with an 800 MHz FSB Intel board and 2 GB of DDR400 memory. I have done this style of testing on a number of different generations of hardware, and they all seem to exhibit the same characteristics, which suggests to me that Mark's comment that memory bandwidth is the limiting factor is correct.

The factor that fascinated me is the amount of spare time floating around in the loops in both the MMX and XMM versions. I tried a hybrid that did both MMX and normal integer copy, but it was really slow, so the memory access times seem to be the problem. The only gain I can so far get from MMX or SSE code is the non-temporal writes, which reduce cache pollution.

Being able to pad the loop with a large number of NOPs shows that there is processing time being wasted, which says the processor is still a lot faster than the DDR400 memory.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

xmmcopy proc src:DWORD,dst:DWORD,cnt:DWORD

    LOCAL lcnt  :DWORD

    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst
    mov ecx, cnt
    shr ecx, 6
    mov lcnt, ecx

  align 4
  stlp:

    movdqa xmm(0), [esi]
    movdqa xmm(1), [esi+16]
    movdqa xmm(2), [esi+32]
    movdqa xmm(3), [esi+48]

    add esi, 64

    movntdq [edi], xmm(0)
    movntdq [edi+16], xmm(1)
    movntdq [edi+32], xmm(2)
    movntdq [edi+48], xmm(3)

    add edi, 64

    sub lcnt, 1
    jnz stlp

    sfence                      ; make the non-temporal writes globally visible
    pop edi
    pop esi
    pop ebx

    ret

xmmcopy endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««