The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: James Ladd on May 15, 2005, 12:36:49 AM

Title: memory copy ...
Post by: James Ladd on May 15, 2005, 12:36:49 AM
Is there a simple memory copy routine out there for MASM?
Title: Re: memory copy ...
Post by: mnemonic on May 15, 2005, 12:45:33 AM
Hi,

Yes, there is one shipped with the masm32 package:
MemCopy proc public uses esi edi Source:PTR BYTE,Dest:PTR BYTE,ln:DWORD
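
Called like any other masm32 procedure, e.g. (a minimal sketch; the buffer names are made up):

.data?
  srcbuf db 128 dup(?)
  dstbuf db 128 dup(?)
.code
  invoke MemCopy, ADDR srcbuf, ADDR dstbuf, 128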

Regards
Title: Re: memory copy ...
Post by: Tedd on May 16, 2005, 12:08:36 PM
/me waits for the torrent of memcopy implementations by members..  :bdg
Title: Re: memory copy ...
Post by: AeroASM on May 16, 2005, 01:35:03 PM
Can we move this to the Lab so we can indeed have competitions about who can copy memory the fastest?
Title: Re: memory copy ...
Post by: hutch-- on May 16, 2005, 11:45:23 PM
For what it's worth, the REP MOVSD version in the masm32 library works OK as a general purpose memory copy, but last time I played with a 4-DWORD version that paid attention to avoiding read-after-write stalls by using more registers, it was a bit faster on both the PIVs I work on. The REP MOVSD versions all suffer the same problem in that they are slow for the first 64 bytes or so until the special case circuitry kicks in.

If you are repeatedly hammering memory copies of under about 250 bytes, an incremented pointer version is a lot faster, but the REP MOVSD method catches up fast after that. MOVQ versions are faster when done properly and, if the hardware support is there, an XMM version distinguishing between temporal and non-temporal reads and writes is faster again.
Title: Re: memory copy ...
Post by: Mark_Larson on May 17, 2005, 09:53:30 PM

  Striker, this issue comes up all the time.  The general answer is that it depends on how much data you are copying (a dispatch sketch follows the list):

  1) less than 64 bytes: use MOV, MOVQ or MOVDQA
  2) 64 bytes and up: use REP MOVSD
  3) 1KB and up: use MOVQ or MOVDQA
  4) 2MB and up: use MOVNTDQ or other non-temporal stores.
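
  Something like this dispatch skeleton captures those cut-offs (a sketch only; the four target labels are placeholders for real copy loops):

    mov ecx, cnt                ; byte count
    cmp ecx, 64
    jb  small_copy              ; 1) under 64 bytes: plain MOV loop
    cmp ecx, 1024
    jb  rep_copy                ; 2) 64 bytes to ~1KB: REP MOVSD
    cmp ecx, 2097152            ; 2MB
    jb  simd_copy               ; 3) ~1KB to ~2MB: MOVQ/MOVDQA loop
    jmp nt_copy                 ; 4) 2MB and up: MOVNTDQ non-temporal stores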

  This is a rough estimate, and might vary from processor to processor.  So look at your data size, and then try a few different ways, to see which works best for your specific application.  A lot of it is dependent on memory speed.  The biggest gain is with non-temporal stores in the multi-megabyte range.  I was showing someone on win32 how to do that for their graphics program.  They did different types of copies for different sized sets of data, and have all the results tabulated, but for really large sizes it was running slow, so I showed them how to speed it up using non-temporal stores (scroll down):


http://board.win32asmcommunity.net/index.php?topic=20846.0

Here's a bit of a cut and paste, showing different sized buffers and how much faster MOVNTPS is (that's what I showed him to use, since it does 16-byte writes and is supported with SSE).  With 80MB on his system MOVNTPS (move non-temporal) is 2.5 times faster.  As the size of the buffer drops, the speed difference between the two algorithms drops.  In the example below MOVNTPS is slower up till about 0.614MB.  So that's why I tend to use 1MB as a guideline for when to use it.  As always, benchmark your code.  Notebook systems and older systems will have slower memory bandwidth.


;mem write
;-------------
;movaps  [esi+n],xmm0,  80MB, LU=4 :     1.75 GB/s
;movntps [esi+n],xmm0,  80MB, LU=4 :     4.25 GB/s  :-)
;
;movaps  [esi+n],xmm0,  0.2MB, LU=4 :     8.47 GB/s  <= Should make sense
;movntps [esi+n],xmm0,  0.2MB, LU=4 :     4.22 GB/s
;
;movaps  [esi+n],xmm0,  0.4MB, LU=4 :     6.84 GB/s
;movntps [esi+n],xmm0,  0.4MB, LU=4 :     4.26 GB/s
;
;movaps  [esi+n],xmm0,  0.614MB, LU=4 :   4.38 GB/s  (640x480x16 Buffer in ram)
;movntps [esi+n],xmm0,  0.614MB, LU=4 :   4.23 GB/s  (640x480x16 Buffer in ram)
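
For reference, an inner loop built around MOVNTPS could look like this (a sketch; it assumes esi/edi are 16-byte aligned and ecx is a byte count that is a multiple of 64):

align 16
ntloop:
    movaps  xmm0, [esi]         ; four aligned 16-byte reads
    movaps  xmm1, [esi+16]
    movaps  xmm2, [esi+32]
    movaps  xmm3, [esi+48]
    movntps [edi], xmm0         ; four non-temporal 16-byte writes
    movntps [edi+16], xmm1
    movntps [edi+32], xmm2
    movntps [edi+48], xmm3
    add     esi, 64
    add     edi, 64
    sub     ecx, 64
    jnz     ntloop
    sfence                      ; flush the write-combining buffers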
Title: Re: memory copy ...
Post by: hutch-- on May 18, 2005, 12:53:14 AM
Here is a quick play. It's an unrolled DWORD copy timed against the REP MOVSD algo in the masm32 library. The main loop is two blocks of 8 movs; the design was to separate reads from writes using the same registers to avoid read-after-write stalls. The two blocks are an unroll by 2.

On the Prescott PIV I am using, aligning the label for the main block to 16 made no difference, so I left it at 4.

The algo assumes at least 4-byte alignment for both source and destination buffers.

The timings I am getting on a 200 meg block of memory aligned to at least 4 bytes are 437 ms for REP MOVSD and 375 ms for the version posted below.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

dcopy proc src:DWORD,dst:DWORD,cnt:DWORD

    LOCAL dcnt  :DWORD

    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst
    mov ecx, cnt
    test ecx, ecx   ; nothing to do on a zero count
    jz bcend
    cmp ecx, 32
    jl tail

    shr ecx, 5      ; div by 32 = number of 32-byte blocks
    mov dcnt, ecx

  align 4
  body:             ; 32 bytes per pass, reads grouped ahead of writes
    mov eax, [esi]
    mov ebx, [esi+4]
    mov ecx, [esi+8]
    mov edx, [esi+12]
    mov [edi],    eax
    mov [edi+4],  ebx
    mov [edi+8],  ecx
    mov [edi+12], edx

    mov eax, [esi+16]
    mov ebx, [esi+20]
    mov ecx, [esi+24]
    mov edx, [esi+28]
    mov [edi+16], eax
    mov [edi+20], ebx
    mov [edi+24], ecx
    mov [edi+28], edx

    add esi, 32
    add edi, 32
    sub dcnt, 1
    jnz body

    mov ecx, cnt
    and ecx, 31     ; bytes left over after the 32-byte blocks
    jz bcend

  tail:             ; byte-at-a-time tail copy
    mov al, [esi]
    add esi, 1
    mov [edi], al
    add edi, 1
    sub ecx, 1
    jnz tail

  bcend:

    pop edi
    pop esi
    pop ebx

    ret

dcopy endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Title: Re: memory copy ...
Post by: MichaelW on May 18, 2005, 05:14:14 AM
I couldn't resist playing with this, so I created a test app that compares the MASM32 MemCopy to Hutch's dcopy, along with two MMX versions and two SSE versions. I coded the MMX and SSE versions without bothering to learn the details, so try not to laugh. And as you might expect, Hutch's dcopy is the fastest, but only by a small margin. Why exactly are the MMX and SSE versions slower than the ALU versions?

These results are for my P3 and a 100MB buffer size. The "1"s along the top are the return values from the MASM32 cmpmem procedure that I used as part of a function test for each of the procedures. I had a lot of other stuff running during the tests, but each time, after the test app terminated, System Information showed ~390000 KB available.

111111
MemCopy - rep movsd                         : 273834438 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 269703703 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 412743712 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 274738762 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 270069732 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 312140267 cycles



[attachment deleted by admin]
Title: Re: memory copy ...
Post by: hutch-- on May 18, 2005, 05:17:41 AM
This is a test piece; it does not handle odd numbers of bytes at the tail. I timed it on the same 200 meg sample and the time drops to about 300 ms, but there are some unusual anomalies in how it runs. It can be padded with a large number of nops and not run any slower on this PIV. I have changed the order, and the only thing that slows it down is to place the last non-temporal write at the beginning instead of in order. Interleaving the reads and writes did not affect the timing at all, and running prefetchnta and the other hint types every 4k down to 128 bytes did not affect the timing at all.

It is faster than either REP MOVSD or the version posted above, but not by as much as you would expect. Change the MOVNTQ to MOVQ and it runs at about the same speed as the one posted above.



      nops MACRO number
        REPEAT number
          nop
        ENDM
      ENDM


mmxcopy proc src:DWORD,dst:DWORD,cnt:DWORD

    LOCAL lcnt  :DWORD

    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst
    mov ecx, cnt

    xor ebx, ebx                ; zero ebx

    shr ecx, 6                  ; div by 64
    mov lcnt, ecx

    mov ecx, esi
    mov edx, edi

  align 4
  stlp:
    movq mm(0), [esi]
    movq mm(1), [ecx+8]
    movq mm(2), [esi+16]
    movq mm(3), [ecx+24]
    movq mm(4), [esi+32]
    movq mm(5), [ecx+40]
    movq mm(7), [esi+56]
    movq mm(6), [ecx+48]

    add esi, 64
    add ecx, 64

    movntq [edi],    mm(0)
    movntq [edx+8],  mm(1)
    movntq [edi+16], mm(2)
    movntq [edx+24], mm(3)
    movntq [edi+32], mm(4)
    movntq [edx+40], mm(5)
    movntq [edi+48], mm(6)
    movntq [edx+56], mm(7)

    add edi, 64
    add edx, 64

    ;; nops 96

    sub lcnt, 1
    jnz stlp

  quit:

    sfence                      ; flush the write-combining buffers after the MOVNTQ stores
    emms                        ; clear MMX state for any FPU code that follows

    pop edi
    pop esi
    pop ebx

    ret

mmxcopy endp
Title: Re: memory copy ...
Post by: James Ladd on May 18, 2005, 07:30:49 AM
I have my answer but I suspect this thread will continue for some time :)
Title: Re: memory copy ...
Post by: roticv on May 18, 2005, 11:54:56 AM
Hutch,

Unaligned data will slow your code down. I suggest that you include another loop so that your data is aligned before entering the main loop.
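
Something along these lines ahead of the main loop would do it (a sketch; it assumes the count in ecx is larger than the at most 7 alignment bytes):

  head:
    test edi, 7         ; destination on an 8-byte boundary yet?
    jz   aligned
    mov  al, [esi]      ; move a single byte and retest
    mov  [edi], al
    add  esi, 1
    add  edi, 1
    sub  ecx, 1
    jmp  head
  aligned: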
Title: Re: memory copy ...
Post by: hutch-- on May 18, 2005, 02:18:22 PM
Victor,

The MMX version needs at least 8-byte alignment, but that's not what I am testing with it; it's the absolute data transfer rate, which does not seem to be all that much faster than the integer versions.
Title: Re: memory copy ...
Post by: roticv on May 18, 2005, 04:18:57 PM
I think you are missing the prefetch code.
Title: Re: memory copy ...
Post by: Mark_Larson on May 18, 2005, 04:27:00 PM
Quote from: hutch-- on May 18, 2005, 05:17:41 AM
This is a test piece; it does not handle odd numbers of bytes at the tail. I timed it on the same 200 meg sample and the time drops to about 300 ms, but there are some unusual anomalies in how it runs. It can be padded with a large number of nops and not run any slower on this PIV. I have changed the order, and the only thing that slows it down is to place the last non-temporal write at the beginning instead of in order. Interleaving the reads and writes did not affect the timing at all, and running prefetchnta and the other hint types every 4k down to 128 bytes did not affect the timing at all.

It is faster than either REP MOVSD or the version posted above, but not by as much as you would expect. Change the MOVNTQ to MOVQ and it runs at about the same speed as the one posted above.

   Also try moving where the "prefetchnta" instruction is.  I wrote a program to automatically try all combinations of offsets and locations of the instruction in a loop and find the "optimum one".  I'll see if I can find it and post it.  You only want to prefetch the source, not the destination, since you are writing directly to memory.  I'll take a try at your code and see if I can speed it up any, when I get a chance.
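
Concretely, the experiment starts from something like this in a MOVNTQ loop (a sketch; the 320-byte look-ahead and 16-byte unroll are arbitrary starting points to tune, not recommendations):

  stlp:
    prefetchnta [esi+320]       ; touch the read stream a few cache lines ahead
    movq mm0, [esi]
    movq mm1, [esi+8]
    add  esi, 16
    movntq [edi], mm0
    movntq [edi+8], mm1
    add  edi, 16
    sub  ecx, 16                ; ecx = byte count, assumed a multiple of 16
    jnz  stlp
    sfence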

EDIT:  Memory copies are heavily dependent on the maximum peak memory bandwidth.  So some systems might run your code really slow and others really fast.  Currently it is running in 828 ms for 200MB on mine with no modifications.

Title: Re: memory copy ...
Post by: hutch-- on May 19, 2005, 12:58:00 AM
This is the SSE version of the same algo, and on my PIV it runs at exactly the same timing as the MMX version. The box is a 2.8 gig Prescott PIV with an 800MHz FSB Intel board and 2 gig of DDR400 memory. I have done this style of testing on a number of different generations of hardware, and they all seem to exhibit the same characteristics, which suggests to me that Mark's comment that memory bandwidth is the limiting factor is correct.

The factor that fascinated me is the amount of spare time floating around in the loops in both the MMX and XMM versions. I tried a hybrid that did both MMX and normal integer copy, but it was really slow, so the memory access times seem to be the problem. The only gain I can so far get from MMX or SSE code is the non-temporal writes, which reduce cache pollution.

Being able to pad the loop with a large number of NOPs shows that there is processing time being wasted, which says the processor is still a lot faster than the DDR400 memory.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

xmmcopy proc src:DWORD,dst:DWORD,cnt:DWORD

    LOCAL lcnt  :DWORD

    push ebx
    push esi
    push edi

    mov esi, src
    mov edi, dst
    mov ecx, cnt
    shr ecx, 6
    mov lcnt, ecx

  align 4
  stlp:

    movdqa xmm(0), [esi]
    movdqa xmm(1), [esi+16]
    movdqa xmm(2), [esi+32]
    movdqa xmm(3), [esi+48]

    add esi, 64

    movntdq [edi], xmm(0)
    movntdq [edi+16], xmm(1)
    movntdq [edi+32], xmm(2)
    movntdq [edi+48], xmm(3)

    add edi, 64

    sub lcnt, 1
    jnz stlp

    sfence              ; flush the write-combining buffers after the MOVNTDQ stores

    pop edi
    pop esi
    pop ebx

    ret

xmmcopy endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Title: Re: memory copy ...
Post by: u on May 19, 2005, 04:58:22 AM
Seems like I'm the only AMD fan :)

Taken from:
"Using Block Prefetch for Optimized Memory Performance" - Advanced Micro Devices, author Mike Wall

Quote from PDF:
These code samples were run on an AMD AthlonXP Processor 1800+ with CAS2 DDR2100 memory, and VIA KT266A chipset. Data sizes were several megabytes, i.e. much larger than the cache.

To compare, "rep movsd" has bandwidth of ~640 MB/sec


; Note: copies qwords; to copy the remaining 0..7 bytes we would add a few more lines.
; Also we would want 8-byte alignment of the copied range.

;bandwidth: ~1976 MB/sec  (up 300% vs. baseline) .


CACHEBLOCK equ 400h ; number of QWORDs in a chunk
mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len] ; total number of QWORDS (8 bytes)
; (assumes len / CACHEBLOCK = integer)
lea esi, [esi+ecx*8]
lea edi, [edi+ecx*8]
neg ecx

mainloop:

mov eax, CACHEBLOCK / 16 ; note: prefetch loop is unrolled 2X
add ecx, CACHEBLOCK ; move up to end of block

prefetchloop:

mov ebx, [esi+ecx*8-64] ; read one address in this cache line...
mov ebx, [esi+ecx*8-128] ; ... and one in the previous line
sub ecx, 16 ; 16 QWORDS = 2 64-byte cache lines
dec eax
jnz prefetchloop
mov eax, CACHEBLOCK / 8
;-----------[ Block copy ]-------------------\
writeloop:
movq mm0, qword ptr [esi+ecx*8]
movq mm1, qword ptr [esi+ecx*8+8]
movq mm2, qword ptr [esi+ecx*8+16]
movq mm3, qword ptr [esi+ecx*8+24]
movq mm4, qword ptr [esi+ecx*8+32]
movq mm5, qword ptr [esi+ecx*8+40]
movq mm6, qword ptr [esi+ecx*8+48]
movq mm7, qword ptr [esi+ecx*8+56]
movntq qword ptr [edi+ecx*8], mm0
movntq qword ptr [edi+ecx*8+8], mm1
movntq qword ptr [edi+ecx*8+16], mm2
movntq qword ptr [edi+ecx*8+24], mm3
movntq qword ptr [edi+ecx*8+32], mm4
movntq qword ptr [edi+ecx*8+40], mm5
movntq qword ptr [edi+ecx*8+48], mm6
movntq qword ptr [edi+ecx*8+56], mm7
;-------------------------------------------/
add ecx, 8
dec eax
jnz writeloop
or ecx, ecx
jnz mainloop
sfence
emms

This code, using block prefetch and the MOVNTQ streaming store, achieves an overall memory bandwidth of 1976 MB/sec, which is over 90% of the theoretical maximum possible with DDR2100 memory.


Note on some instructions:
Now that the MMX registers are being used, the code can employ a very special instruction: MOVNTQ. This is a streaming store instruction for writing data to memory. This instruction bypasses the on-chip cache and sends data directly into a write combining buffer. And because MOVNTQ allows the CPU to avoid reading the old data from the memory destination address, MOVNTQ can effectively double the total write bandwidth. (Note that an SFENCE is required after the data is written, to flush the write buffer.)
Title: Re: memory copy ...
Post by: hutch-- on May 19, 2005, 05:01:40 AM
Thanks Ultrano,

I think the technique is called software pretouch and it appears to be faster than using the Intel "prefetchxxx" series of instructions.
Title: Re: memory copy ...
Post by: Mark_Larson on May 19, 2005, 07:35:58 PM
Quote from: hutch-- on May 19, 2005, 05:01:40 AM
Thanks Ultrano,

I think the technique is called software pretouch and it appears to be faster than using the Intel "prefetchxxx" series of instructions.

  Yep, software prefetch.  I first came across it in Abrash's book (Zen of Assembly Language).  Personally I thought it was really funny that the author used SSE code to do the MOVNTQ but then didn't try a prefetch instruction.  I wanted to try that and see what the speed difference is.  Technically a prefetch instruction just grabs the data over the bus and into the cache.  However you don't get any register dependency stalls (yea!), and you can force it to only fetch into one cache level, which usually makes it faster.  Right now the code is dying the big dog in my program.  I need to double check it.
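
  For reference, the four SSE prefetch hints (roughly; what each one does varies by processor):

    prefetcht0  [esi]           ; fetch into all cache levels
    prefetcht1  [esi]           ; fetch into L2 and outward, skipping L1
    prefetcht2  [esi]           ; like t1, one level further out where supported
    prefetchnta [esi]           ; non-temporal: minimize cache pollution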
Title: Re: memory copy ...
Post by: Mark_Larson on May 19, 2005, 08:30:09 PM

  I tried prefetchnta and it ran at the same speed.  No difference.  Which indicates to me that it's heavily write I/O bound, since technically the prefetch should have sped it up (the author's code prefetches into the L1/L2 caches and I only prefetched into the L1, which is a faster way to do it).

  Ultrano, you might want to try running it on a 100MB data buffer, to test those bandwidth numbers.  I tried 100MB, 10MB, and 1MB and got some varied results.  So I think 100MB will probably give you more accurate info.

Algorithm from the book:
100MB = 0.280 seconds = 357 MB/s
10MB   = 0.025 seconds = 400 MB/s
1MB     = 0.002 seconds = 500 MB/s 

REP MOVSD
100MB = 0.445 seconds = 225 MB/s
10MB   = 0.042 seconds = 238 MB/s
1MB     = 0.003 seconds = 333 MB/s 

 
  From your comments it sounded like you tried 1 or 2 MB, so your benchmarking is going to be off.  The MB/s between 10MB and 100MB didn't change as much as it did between 1MB and 10MB.  So you probably need to run at least a 10MB buffer or bigger to get accurate results.

  As a comparison, hutch's code runs in .318 seconds for a 100MB buffer on my machine.  I have a slow system at work.  My system at home has a 3.2 GB/s peak memory bandwidth (dual channel Rambus PC800).
Title: Re: memory copy ...
Post by: u on May 19, 2005, 08:57:08 PM
I just copy/pasted from that PDF and fixed up the code from inline (for C/C++ compilers) to normal asm. While this is completely useless for me, I guessed some of you might find it useful (and I see AMD docs are something rarely discussed here, so I'm showing a peek at them).
What are the specs of the PC you measured "1MB = 0.002 seconds = 500 MB/s" on?
I haven't tested it yet though ^^" - but my PC must beat the pdf's benchmark results twofold:  AthlonXP2000+, 512MB DDR2@400MHz (PC2-3200).
Title: Re: memory copy ...
Post by: Mark_Larson on May 19, 2005, 09:34:29 PM
  I don't know since I was given the system at work.  It has a 1.7 GHz P4.  I went to the control panel and it has an ICH0 (82801AB I/O controller), which dates it as really old.  The MCH (memory controller hub, which would be responsible for the bandwidth to the memory) is an 82850 (850 is the chipset).  I am trying to dig up the peak memory bandwidth for the chipset.  But for a P4 system it seems low.  Most P4 systems have a lot of memory bandwidth.

EDIT: The PDF for the 82850 says that it supports a peak bandwidth of 3.2 GB/s, which leaves the memory installed as the slow part.  I'll see if I can find out what kind.  I work for a hardware company, so there are a lot of "loose parts" floating around.  When new people start, they generally grab parts from all over to build their system.  So it wouldn't surprise me if I had some slow memory in this system.

EDIT2: Found the memory on Samsung's website.  It's ECC, which is probably the problem.  ECC memory is always slower than non-ECC.  The website doesn't say it's ECC, but the side of the RIMM has "ECC" written on it.  So they probably had both types (ECC and non-ECC).

http://www.samsung.com/Products/Semiconductor/DRAM/RDRAM/RDRAMmodule/NormalRIMM/MR18R082GBN1/MR18R082GBN1.htm
Title: Re: memory copy ...
Post by: Momoass on May 27, 2005, 11:03:24 PM
Quote from: MichaelW on May 18, 2005, 05:14:14 AM
I couldn't resist playing with this, so I created a test app that compares the MASM32 MemCopy to Hutch's dcopy, along with two MMX versions and two SSE versions. I coded the MMX and SSE versions without bothering to learn the details, so try not to laugh. And as you might expect, Hutch's dcopy is the fastest, but only by a small margin. Why exactly are the MMX and SSE versions slower than the ALU versions?

These results are for my P3 and a 100MB buffer size. The "1"s along the top are the return values from the MASM32 cmpmem procedure that I used as part of a function test for each of the procedures. I had a lot of other stuff running during the tests, but each time, after the test app terminated, System Information showed ~390000 KB available.

111111
MemCopy - rep movsd                         : 273834438 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 269703703 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 412743712 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 274738762 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 270069732 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 312140267 cycles



111111
MemCopy - rep movsd                         : 267229419 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 241144161 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 257438941 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 274631556 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 231717334 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 233434611 cycles


These results are for my SP2600+; why is xcopy the fastest?
Title: Re: memory copy ...
Post by: daydreamer on June 25, 2005, 05:35:07 PM
So you copy while the CPU's ALU units sit idle and the FPU sits idle for x millis?
I am not that good with clock cycles per operand, so why couldn't we experiment with doing y math operations while copying, if that is possible without slowing down the memory copy?
Interleave the copy with a few math ops, and you get z math operations done for later use.
Should that be possible to research and test?
It won't make the memory copy faster, but the following calculation proc is sped up by the z math operations being precalculated.
Title: Re: memory copy ...
Post by: daydreamer on June 25, 2005, 05:42:11 PM
What I mean is: research it so you know, for each MB copied, how much you can interleave without affecting copy speed,
for example 8000 fsincos or 40000 muls,
or 1 meg of add/sub per execution unit,
or 1 meg of booleans per execution unit (and, or, xor),
etc.
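
As a rough (untested) illustration of the idea, independent integer work slipped between the streaming reads and writes might look like this; the eax/ebx math is just a stand-in for real work:

  stlp:
    movq mm0, [esi]
    movq mm1, [esi+8]
    add  eax, ebx               ; "free" ALU work while the loads complete
    movntq [edi], mm0
    movntq [edi+8], mm1
    xor  ebx, eax               ; more stand-in work
    add  esi, 16
    add  edi, 16
    sub  ecx, 16
    jnz  stlp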

Title: Re: memory copy ...
Post by: Farabi on July 01, 2005, 05:35:18 PM
Great, I'm just going to test it, and this discussion has already come this far. But the optimized one cannot run on a P3 machine, right? SSE2 is not available?
Title: Re: memory copy ...
Post by: Webring on July 01, 2005, 08:02:51 PM
If you want a really fast mem copy, try:

; make DS:ESI point to the source code
; make ES:EDI point to the destination code
; ECX = length of code to be moved
; the code length is calculated in 16 bytes chunks

   mov_loop:
   fild qword ptr [esi]         ; load two qwords through the FPU stack
   fild qword ptr [esi+8]
   fxch                         ; restore original order: st(0) = first qword
   fistp qword ptr es:[edi]
   fistp qword ptr es:[edi+8]
   add edi, 16
   add esi, 16
   sub ecx, 16
   jg mov_loop                  ; stop once the 16-byte chunks are used up

I got this from a legendary site that shared with those interested what has happened in the last decade, as far as the technical aspect goes. It featured some of the most legendary asm coders in the world, and it was recently shut down thanks to my WONDERFUL government, and fear of what it didn't understand...  :tdown
Title: Re: memory copy ...
Post by: Farabi on July 02, 2005, 05:52:49 AM
The FPU is slow, but I don't know whether fistp and fild will be fast.
Title: Re: memory copy ...
Post by: u on July 02, 2005, 07:27:18 AM
fxch on older AMD CPUs will be a major slowdown. On my K6-2 it took ... 30-60 cycles.
Title: Re: memory copy ...
Post by: meeku on July 19, 2005, 12:15:53 PM
Hey all,

I've been running similar bandwidth tests on my various machines at work and at home, and so far I haven't really been that impressed with the performance.
I put together about 15 different memory copy routines and tested them on various sized buffers (512KB, 1MB, 4MB, 100MB and 200MB).

So far the maximum sustained transfer rate I can achieve on the 200MB buffer is 1.2 GB/s ... Clearly the larger the buffer and the longer the test runs, the more accurate the mean rate becomes. For small buffers you might get considerably higher bandwidth due to measurement inaccuracy or having small chunks already in cache etc.
 
The results I've had so far have confirmed that the streaming stores only show an advantage somewhere over the 1MB range.
I've found that the prefetchnta[...] instructions haven't offered any noticeable performance gain. I suspect this instruction is a bit like the Java GC: it's rumoured to do something, but when and how is a bit of a mystery. Software prefetching as described in that AMD PDF definitely does yield an improvement.

The machines I'm testing with are: an IBM Pentium M notebook at 1.70GHz, and a P4 3.2GHz HT, 2 gig of dual DDR400, ATI Radeon 9800pro.

Oddly though, the MMX versions seem to be faster than the SSE ones... on both machines... even using movdqa/movntdq, SSE is still slower than the older movq/movntq...

I really have to say, based on the machine specs and the "theoretical" peak transfer of memory, I would've expected to see something more like 2-3 GB/s .... anyways.
Title: Re: memory copy ...
Post by: hutch-- on July 19, 2005, 12:50:28 PM
Hi meeku,

Welcome on board, you have done some interesting tests. I have much the same comment on the difference between MM and XMM data transfer: I can get about the same data transfer rate if I separate the read and write streams of data using the non-temporal writes, but I suggest that the real limiting factor is memory bandwidth. This also says that the processor is still a lot faster than memory. With each generation of hardware I have written a set of test pieces, and the only improvement I can get in raw data transfer is with non-temporal writes.

This may change as the x86-64 architecture starts to take over, as the technical data says they internally transfer 64-bit chunks and can pair these to perform 128-bit data transfers, but the true 32-bit hardware seems to handle data transfer in 32-bit chunks internally, so it will always have the memory bandwidth problem.
Title: Re: memory copy ...
Post by: meeku on July 19, 2005, 01:02:32 PM
Hey Hutch  :wink
Good to be on board!

Something I'd actually be interested to try is using the same asm routines on a Linux box. I did quite a bit of work on putting together my own OS and kernel based on an old protected mode extender I wrote for DOS, and I found that using software in the kernel to manage virtual address spaces, memory management and task switching was considerably faster and more reliable than using the built-in task switching/v86 mode stuff on the CPU. Basically you leave the CPU running in a flat memory model with no paging and all at ring 0, and let the kernel manage the rest... anyhow, I'm digressing; the point is that memory access was at least 20% faster, so perhaps the results would vary on Linux.

Anyone doing asm coding under linux want to give us some performance results for the same tests?  :bg
Title: Re: memory copy ...
Post by: meeku on July 20, 2005, 11:49:29 AM
Hey,

Managed to marginally improve the performance by adding one more read prior to the prefetch loop, to prime the TLB (as per Intel's optimisation recommendations).
Sitting at 1.1 GB/s tested with 100 and 200MB buffers and various numbers of iterations on the timing loop.

This was the performance from my 1.7GHz Pentium M IBM laptop... in theory this result should be much better on a decent spec desktop.

Here are the two fastest versions so far:



    mov esi, data1ptr
    mov edi, data2ptr

    mov ecx, DATASIZE

    lea esi, [esi+ecx*8]
    lea edi, [edi+ecx*8]

    neg ecx

align 16
mainloop:

    mov eax, (CACHEBLOCK / 16)
    add ecx, CACHEBLOCK

    mov edx, [esi+ecx*8-128]            ; Prime TLB
prefetchloop:                           ; Software Prefetch (touch) loop.
    mov ebx, [esi+ecx*8-64]
    mov ebx, [esi+ecx*8-128]
    sub ecx, 16
    dec eax
    jnz short prefetchloop


    mov eax, (CACHEBLOCK / 8)

writeloop:
    movdqa xmm0, [esi+ecx*8]
    movdqa xmm1, [esi+ecx*8+16]
    movdqa xmm2, [esi+ecx*8+32]
    movdqa xmm3, [esi+ecx*8+48]

    movntdq [edi+ecx*8], xmm0
    movntdq [edi+ecx*8+16], xmm1
    movntdq [edi+ecx*8+32], xmm2
    movntdq [edi+ecx*8+48], xmm3


    add ecx, 8
    dec eax
    jnz  writeloop

    or ecx, ecx
    jnz  mainloop

    sfence                              ; flush the write-combining buffers

AND...

    mov esi, data1ptr
    mov edi, data2ptr

    mov ecx, DATASIZE

    lea esi, [esi+ecx*8]
    lea edi, [edi+ecx*8]

    neg ecx

align 16
mainloop:

    mov eax, (CACHEBLOCK / 16)
    add ecx, CACHEBLOCK

    mov edx, [esi+ecx*8-128]            ; Prime TLB
prefetchloop:                           ; Software Prefetch (touch) loop.
    mov ebx, [esi+ecx*8-64]
    mov ebx, [esi+ecx*8-128]
    sub ecx, 16
    dec eax
    jnz short prefetchloop


    mov eax, (CACHEBLOCK / 8)

writeloop:
    movq mm0, qword ptr [esi+ecx*8]
    movq mm1, qword ptr [esi+ecx*8+8]
    movq mm2, qword ptr [esi+ecx*8+16]
    movq mm3, qword ptr [esi+ecx*8+24]
    movq mm4, qword ptr [esi+ecx*8+32]
    movq mm5, qword ptr [esi+ecx*8+40]
    movq mm6, qword ptr [esi+ecx*8+48]
    movq mm7, qword ptr [esi+ecx*8+56]

    movntq qword ptr [edi+ecx*8], mm0
    movntq qword ptr [edi+ecx*8+8], mm1
    movntq qword ptr [edi+ecx*8+16], mm2
    movntq qword ptr [edi+ecx*8+24], mm3
    movntq qword ptr [edi+ecx*8+32], mm4
    movntq qword ptr [edi+ecx*8+40], mm5
    movntq qword ptr [edi+ecx*8+48], mm6
    movntq qword ptr [edi+ecx*8+56], mm7

    add ecx, 8
    dec eax
    jnz short writeloop

    or ecx, ecx
    jnz short mainloop

    sfence                              ; flush the write-combining buffers
    emms                                ; clear MMX state



It's basically almost identical to the AMD reference one, with a few minor changes.
I think this is about as good as it's going to get, IMO.
Title: Re: memory copy ...
Post by: Human on January 29, 2006, 06:11:26 PM
Webring, fild is a slow instruction; even when I tried it on a 486 it didn't give better performance, and fld was faster.
But one problem from the DOS days remained, and I didn't know why the copied data wasn't the same. The reason was the FPU and its precision bits: if some game left the FPU doing all calculations in 24-bit precision, then we had problems.
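
A guard against that is to force full precision around the copy and restore the caller's control word afterward, something like this (a sketch; cw_old/cw_new are made-up names):

.data?
  cw_old dw ?
  cw_new dw ?
.code
    fstcw cw_old                ; save the current control word
    mov   ax, cw_old
    or    ax, 0300h             ; PC = 11b -> 64-bit precision
    mov   cw_new, ax
    fldcw cw_new
    ; ... fild/fistp copy loop here ...
    fldcw cw_old                ; restore the caller's setting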

For my SDR133 RAM the best solution is movq and movntq: compared to rep movsd it raises the speed from 272 to 512 MB/s when software prefetch is used to copy 64MB, on my Athlon XP 1700+ at 1467MHz with its 64KB L1 cache (a P4 can only do 8KB :P).
I was already using MMX to copy data, and that alone gives 220 million ticks compared to the 350 million baseline; with movntq and prefetch it dropped to 194 million ticks. Using SSE movaps doesn't boost performance; I'd have to buy a new AMD64 and DDR2 memory.
Title: Re: memory copy ...
Post by: Seb on March 02, 2007, 10:45:48 PM
Sorry for bumping this thread, but I implemented two of my own approaches to memory copying (using MMX) and got interesting results on my AMD Athlon 64 X2 4400+, so I thought I'd share them with the rest of the community:

11111111

MemCopy - rep movsd                         : 176845573 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 164048385 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 172751465 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 151329157 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 154956402 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 148054639 cycles
mmxcopy  - movq x 8 : 147378815 cycles
mmxcopy2  - movq x 8 : 26129470 cycles

Press enter to exit...


mmxcopy and mmxcopy2 are my own functions and the second one is in all cases (and I've tried a LOT) much, much faster.

Edit: I sent it to a friend and he got results in a similar "order" on his laptop.

[attachment deleted by admin]
Title: Re: memory copy ...
Post by: ecube on March 03, 2007, 12:13:02 AM
amd 64 3800+

11111111

MemCopy - rep movsd                         : 282149047 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 279833356 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 280034282 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 277556675 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 277835608 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 276572415 cycles
mmxcopy  - movq x 8 : 276869072 cycles
mmxcopy2  - movq x 8 : 41037911 cycles

Press enter to exit...

nice job!  :U
Title: Re: memory copy ...
Post by: Seb on March 03, 2007, 12:18:58 AM
Thanks. :bg I'd appreciate it if anyone else, in particular anyone with an Intel CPU, could test it out.
Title: Re: memory copy ...
Post by: ic2 on March 03, 2007, 02:24:27 PM
INTEL P3  846  512RAM


11111111

MemCopy - rep movsd                         : 522935521 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 523243442 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 524413281 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 523513785 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 511182237 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 521955504 cycles
mmxcopy  - movq x 8 : 524072757 cycles
mmxcopy2  - movq x 8 : 83558428 cycles

Press enter to exit...
Title: Re: memory copy ...
Post by: TNick on March 03, 2007, 02:56:29 PM
On Intel Celeron 2,53 GHz:

Quote
11111111

MemCopy - rep movsd                         : 396270757 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 373467620 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 375417875 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 402793093 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 398304671 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 404616784 cycles
mmxcopy  - movq x 8 : 406699997 cycles
mmxcopy2  - movq x 8 : 47871561 cycles


Nick
Title: Re: memory copy ...
Post by: Seb on March 03, 2007, 03:27:02 PM
Thanks, guys, for testing. :U
Title: Re: memory copy ...
Post by: u on March 03, 2007, 03:37:57 PM
Sempron 3000+ (64-bit)  [wow it beats an amd 64 3800+]

11111111

MemCopy - rep movsd                         : 176175708 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 177890633 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 173873833 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 167199286 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 169962053 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 169044058 cycles
mmxcopy  - movq x 8 : 167693297 cycles
mmxcopy2  - movq x 8 : 36151686 cycles

But strangely, on subsequent runs, the results are always around

MemCopy - rep movsd                         : 218944142 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 219306860 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 197413306 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 197132039 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 196587334 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 195489834 cycles
mmxcopy  - movq x 8 : 196248279 cycles
mmxcopy2  - movq x 8 : 36618720 cycles
  :eek
Title: Re: memory copy ...
Post by: Seb on March 03, 2007, 03:46:36 PM
Interesting, mmxcopy2 seems to be the fastest method on both AMD and Intel CPUs. Does anyone have a theory on why it's so much faster than the rest?
Title: Re: memory copy ...
Post by: u on March 03, 2007, 04:03:21 PM
It is because of the expression [esi+edx*8+0], which is wrong: it makes the code stomp on the same data in 7/8 of the cases.
Title: Re: memory copy ...
Post by: Seb on March 03, 2007, 04:12:13 PM
Quote from: Ultrano on March 03, 2007, 04:03:21 PM
It is because of the expression [esi+edx*8+0], which is wrong. It makes the code stomp at the same data. in 7/8 of the cases.

I expected it to be a "hidden" bug that caused the "magic" result; thanks for letting me know. So what should it be?
Title: Re: memory copy ...
Post by: u on March 03, 2007, 04:45:35 PM
change "shr ecx,6" into "shr ecx,3",
change "add eax,1" into "add eax,8"
But even better:

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

mmxcopy2 proc src:DWORD,dst:DWORD,lenx:DWORD
mov ecx,[esp+3*4] ; len
add esp,-3*4 ; reserve 3 DWORDs
mov [esp],ecx ; saved len
mov [esp+4],esi ; keep the saves inside the reserved space
mov [esp+8],edi ;
mov esi,[esp+1*4+(3*4)] ; src
mov edi,[esp+2*4+(3*4)] ; dst
cmp ecx,64 ;
jb @tail ; under one 64-byte block
shr ecx,3 ; byte count -> qword count
and ecx,-8 ; whole 64-byte blocks only, so the loop can hit zero
mov edx,8 ; qwords per block

;
align 16 ;
@@:
sub ecx,edx ; step back one block; ZF set on the last one
movq mm0,[esi+ecx*8+0] ;
movq mm1,[esi+ecx*8+8] ;
movq mm2,[esi+ecx*8+16] ;
movq mm3,[esi+ecx*8+24] ;
movq mm4,[esi+ecx*8+32] ;
movq mm5,[esi+ecx*8+40] ;
movq mm6,[esi+ecx*8+48] ;
movq mm7,[esi+ecx*8+56] ;
;
movq [edi+ecx*8+0],mm0 ;
movq [edi+ecx*8+8],mm1 ;
movq [edi+ecx*8+16],mm2 ;
movq [edi+ecx*8+24],mm3 ;
movq [edi+ecx*8+32],mm4 ;
movq [edi+ecx*8+40],mm5 ;
movq [edi+ecx*8+48],mm6 ;
movq [edi+ecx*8+56],mm7 ;

jnz @B ; movq leaves the flags from the sub intact
emms ; clear MMX state
;
mov ecx,[esp] ; reload len
mov eax,ecx ;
and eax,-64 ; bytes the block loop handled
add esi,eax ; advance past the block-copied region
add edi,eax ;
and ecx,63 ; tail bytes
jz @exit ;
;
@tail: ;
;cld ;
rep movsb ; copy the remainder a byte at a time
@exit: ;
mov edi,[esp+8] ;
mov esi,[esp+4] ;
add esp,3*4 ;
ret 3*4 ;
mmxcopy2 endp

OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF

And still, this is not faster than mmxcopy()
Title: Re: memory copy ...
Post by: daydreamer on March 04, 2007, 12:09:12 PM
Quote from: Seb on March 03, 2007, 12:18:58 AM
Thanks. :bg I'd appreciate if anyone else, in particular if you've got an Intel CPU, could test it out.
I think the mobo/memory stick configuration is more relevant to knowing what's fastest.
I mean it would be most interesting to know how dual-channel DDR2 compares to a single DDR memory stick.
Mobos that are hyped to give dual-channel performance even though the CPU doesn't support dual channel would be interesting to check, to see whether it's true or not.