
memory copy ...

Started by James Ladd, May 15, 2005, 12:36:49 AM


u

Seems like I'm the only AMD fan :)

Taken from:
"Using Block Prefetch for Optimized Memory Performance" - Advanced Micro Devices, author Mike Wall

Quote from PDF:
These code samples were run on an AMD AthlonXP Processor 1800+ with CAS2 DDR2100 memory, and VIA KT266A chipset. Data sizes were several megabytes, i.e. much larger than the cache.

To compare, "rep movsd" has bandwidth of ~640 MB/sec
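
For reference, the baseline being measured there is just this (assuming DF is clear and a byte length that is a multiple of 4; [len_bytes] is a hypothetical variable):

mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len_bytes] ; total length in bytes
shr ecx, 2 ; convert to DWORD count
rep movsd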


; Note: this copies whole QWORDs; to copy the remaining 0..7 bytes we
; would add a few more lines (a sketch follows the code below). We would
; also want to 8-byte align the copied range.

;bandwidth: ~1976 MB/sec (roughly 3x the rep movsd baseline)


CACHEBLOCK equ 400h ; number of QWORDs in a chunk
mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len] ; total number of QWORDs (8 bytes each)
; (assumes len is an exact multiple of CACHEBLOCK)
lea esi, [esi+ecx*8] ; point both pointers past the end of their arrays...
lea edi, [edi+ecx*8]
neg ecx ; ...and walk a negative index up to zero

mainloop:

mov eax, CACHEBLOCK / 16 ; note: prefetch loop is unrolled 2X
add ecx, CACHEBLOCK ; move up to end of block

prefetchloop:

mov ebx, [esi+ecx*8-64] ; read one address in this cache line...
mov ebx, [esi+ecx*8-128] ; ... and one in the previous line
sub ecx, 16 ; 16 QWORDS = 2 64-byte cache lines
dec eax
jnz prefetchloop
mov eax, CACHEBLOCK / 8 ; the copy loop below moves 8 QWORDs per pass
;-----------[ Block copy ]-------------------\
writeloop:
movq mm0, qword ptr [esi+ecx*8]
movq mm1, qword ptr [esi+ecx*8+8]
movq mm2, qword ptr [esi+ecx*8+16]
movq mm3, qword ptr [esi+ecx*8+24]
movq mm4, qword ptr [esi+ecx*8+32]
movq mm5, qword ptr [esi+ecx*8+40]
movq mm6, qword ptr [esi+ecx*8+48]
movq mm7, qword ptr [esi+ecx*8+56]
movntq qword ptr [edi+ecx*8], mm0
movntq qword ptr [edi+ecx*8+8], mm1
movntq qword ptr [edi+ecx*8+16], mm2
movntq qword ptr [edi+ecx*8+24], mm3
movntq qword ptr [edi+ecx*8+32], mm4
movntq qword ptr [edi+ecx*8+40], mm5
movntq qword ptr [edi+ecx*8+48], mm6
movntq qword ptr [edi+ecx*8+56], mm7
;-------------------------------------------/
add ecx, 8
dec eax
jnz writeloop
or ecx, ecx ; the negative index reaches zero when done
jnz mainloop
sfence
emms
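
The tail handling mentioned in the note above only takes a few lines. A sketch, assuming a hypothetical [len_bytes] variable holding the total byte count, DF clear, and ESI/EDI left pointing just past the last QWORD by the loop:

mov ecx, [len_bytes] ; total length in bytes (hypothetical variable)
and ecx, 7 ; 0..7 leftover bytes past the last QWORD boundary
rep movsb ; copies nothing when ECX is 0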

This code, using block prefetch and the MOVNTQ streaming store, achieves an overall memory bandwidth of 1976 MB/sec, which is over 90% of the theoretical maximum possible with DDR2100 memory.


Note on some instructions:
Now that the MMX registers are being used, the code can employ a very special instruction: MOVNTQ. This is a streaming store instruction for writing data to memory. It bypasses the on-chip cache and sends the data directly into a write-combining buffer. And because MOVNTQ allows the CPU to avoid reading the old data from the memory destination address, it can effectively double the total write bandwidth: a normal cached store of N bytes costs roughly 2N bytes of bus traffic (the destination lines are read in before being written back), while the non-temporal store costs only N. (Note that an SFENCE is required after the data is written, to flush the write-combining buffers.)
Please use a smaller graphic in your signature.

hutch--

Thanks Ultrano,

I think the technique is called software pretouch and it appears to be faster than using the Intel "prefetchxxx" series of instructions.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Mark_Larson

Quote from: hutch-- on May 19, 2005, 05:01:40 AM
Thanks Ultrano,

I think the technique is called software pretouch and it appears to be faster than using the Intel "prefetchxxx" series of instructions.

  Yep, software prefetch.  First came across it in Abrash's book (Zen of Assembly Language).  Personally I thought it was really funny that the author used the SSE MOVNTQ store but then didn't try a prefetch instruction.  I wanted to try that and see what the speed difference is.  Technically a prefetch instruction just grabs the data over the bus and into the cache.  However you don't get any register dependency stalls (yea!), and you can force it to fetch into only one cache level, which usually makes it faster.  Right now the code is dying the big dog in my program.  I need to double-check it.
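
Something like this is what I have in mind (an untested sketch, reusing the negative-index setup from the AMD code above; the 512-byte prefetch distance is a guess that would need tuning):

prefloop:
prefetchnta byte ptr [esi+ecx*8+512] ; hint a line ~512 bytes ahead into cache
movq mm0, qword ptr [esi+ecx*8]
movq mm1, qword ptr [esi+ecx*8+8]
movntq qword ptr [edi+ecx*8], mm0
movntq qword ptr [edi+ecx*8+8], mm1
add ecx, 2 ; assumes the QWORD count is a multiple of 2
jnz prefloop
sfence
emms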
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Mark_Larson


  I tried prefetchnta and it ran at the same speed.  No difference.  Which indicates to me that it's heavily write-I/O bound, since technically the prefetch should have sped it up (the author's code prefetches into the L1/L2 caches, and I prefetched into the L1 only, which is the faster way to do it).

  Ultrano, you might want to try running it on a 100 MB data buffer to test those bandwidth numbers.  I tried 100MB, 10MB, and 1MB and got some varied results, so I think 100MB will probably give you more accurate info.

Algorithm from the book:
100MB = 0.280 seconds = 357 MB/s
10MB   = 0.025 seconds = 400 MB/s
1MB     = 0.002 seconds = 500 MB/s 

REP MOVSD
100MB = 0.445 seconds = 225 MB/s
10MB   = 0.042 seconds = 238 MB/s
1MB     = 0.003 seconds = 333 MB/s 

 
  From your comments it sounded like you tried 1 or 2 MB.  The MB/s changed much less between 10MB and 100MB than it did between 1MB and 10MB, so you need to run at least a 10MB buffer to get accurate results; with anything smaller your benchmarking is going to be off.
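
  The way I time these is roughly like this (a sketch; docopy, src, dst and BUFSIZE are placeholders for your own routine and buffers, and the cycle count still has to be divided by your clock rate to get seconds):

.data
t0 dq 0 ; start timestamp

.code
rdtsc ; read timestamp counter
mov dword ptr [t0], eax
mov dword ptr [t0+4], edx
invoke docopy, ADDR dst, ADDR src, BUFSIZE
rdtsc
sub eax, dword ptr [t0]
sbb edx, dword ptr [t0+4] ; EDX:EAX = elapsed cycles
; MB/s = (BUFSIZE / 1048576) / (cycles / clock_Hz)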

  As a comparison, hutch's code runs in 0.318 seconds for a 100MB buffer on my machine.  I have a slow system at work.  My system at home has a 3.2 GB/s peak memory bandwidth (dual-channel Rambus PC800).

u

I just copy/pasted from that PDF, and fixed up the code from inline asm (for C/C++ compilers) to normal asm. While this is completely useless for me, I guessed some of you might find it useful (and since AMD docs are something hardly anyone discusses here, I'm showing a peek of them).
What are the specs of the PC you measured "1MB = 0.002 seconds = 500 MB/s" on?
I haven't tested it yet though ^^" - but my PC should beat the PDF's benchmark results twofold: AthlonXP2000+, 512MB DDR2@400MHz (PC2-3200).

Mark_Larson

  I don't know, since I was given the system at work.  It has a 1.7 GHz P4.  I went to the control panel and it has an ICH0 (82801AB I/O controller), which dates it as really old.  The MCH (memory controller hub, which is responsible for the bandwidth to the memory) is an 82850 (850 is the chipset).  I am trying to dig up the peak memory bandwidth for the chipset, but for a P4 system it seems low.  Most P4 systems have a lot of memory bandwidth.

EDIT: The PDF for the 82850 says that it supports a peak bandwidth of 3.2 GB/s, which leaves the installed memory as the slow part.  I'll see if I can find out what kind.  I work for a hardware company, so there are a lot of "loose parts" floating around.  When new people start, they generally grab parts from all over to build their system.  So it wouldn't surprise me if I had some slow memory in this system.

EDIT2: Found the memory on Samsung's website.  It's ECC, which is probably the problem.  ECC memory is always slower than non-ECC.  The website doesn't say it's ECC, but the side of the RIMM has "ECC" written on it.  So they probably made both types (ECC and non-ECC).

http://www.samsung.com/Products/Semiconductor/DRAM/RDRAM/RDRAMmodule/NormalRIMM/MR18R082GBN1/MR18R082GBN1.htm

Momoass

Quote from: MichaelW on May 18, 2005, 05:14:14 AM
I couldn't resist playing with this, so I created a test app that compares the MASM32 MemCopy to Hutch's dcopy, along with two MMX versions and two SSE versions. I coded the MMX and SSE versions without bothering to learn the details, so try not to laugh. And as you might expect, Hutch's dcopy is the fastest, but only by a small margin. Why exactly are the MMX and SSE versions slower than the ALU versions?

These results are for my P3 and a 100MB buffer size. The "1"s along the top are the return values for the MASM32 cmpmem procedure that I used as part of a function test for each of the procedures. I had a lot of other stuff running while I was running the tests, but each time after the test app terminated System Information showed ~390000 KB available.

111111
MemCopy - rep movsd                         : 273834438 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 269703703 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 412743712 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 274738762 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 270069732 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 312140267 cycles



111111
MemCopy - rep movsd                         : 267229419 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 241144161 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 257438941 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 274631556 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 231717334 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 233434611 cycles


These results are for my SP2600+; why is xcopy the fastest?

daydreamer

So you copy while the CPU's ALU units sit idle, and the FPU sits idle, for x millis?
I am not that good with knowing clock cycles per operand, so why couldn't we experiment with doing y math operations while copying x millis' worth, if that is possible without slowing down the memory copy?
Interleave the copy with a few math ops? Then you have z math operations done for later use.
Should that be possible to research and test?
It won't make the memory copy faster, but the following calculation proc is sped up by having those z math operations precalculated.

daydreamer

What I mean is: research it, so you know how much you are able to interleave per MB copied without affecting copy speed (see the sketch below).
For example 8000 fsincos or 40000 muls,
or 1 meg of add/sub per execution unit,
or 1 meg of booleans per execution unit (and, or, xor),
etc.
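
As a rough illustration (an untested sketch; EBX/EDX stand in for whatever independent side calculation you want, and the negative-index convention from the AMD code applies):

interleaved:
movq mm0, qword ptr [esi+ecx*8]
movq mm1, qword ptr [esi+ecx*8+8]
add ebx, edx ; independent integer work that can
xor edx, 0DEADBEEFh ; hide under the memory stalls
movntq qword ptr [edi+ecx*8], mm0
movntq qword ptr [edi+ecx*8+8], mm1
add ecx, 2 ; assumes the QWORD count is a multiple of 2
jnz interleaved
sfence
emms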


Farabi

Great, I'm just going to test it, and the discussion has already come this far. But the optimized one cannot run on a P3 machine, right? SSE2 is not available?
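
Ah wait, MOVNTQ and SFENCE are SSE instructions, not SSE2, so a P3 should still run it. A minimal CPUID check might look like this (an untested sketch; no_fast_copy is a placeholder for whatever fallback you use):

mov eax, 1
cpuid
test edx, 800000h ; CPUID.1:EDX bit 23 = MMX
jz no_fast_copy
test edx, 2000000h ; CPUID.1:EDX bit 25 = SSE (covers movntq/sfence)
jz no_fast_copy
; safe to use the movq/movntq copy here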
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

Webring

if you want a really fast mem copy try

; make DS:ESI point to the source
; make ES:EDI point to the destination
; ECX = length in bytes (assumed a nonzero multiple of 16;
; the copy works in 16-byte chunks)

   mov_loop:
   fild qword ptr [esi]        ; load bytes 0..7 of the chunk as a 64-bit int
   fild qword ptr [esi+8]      ; load bytes 8..15 (64-bit ints fit the 80-bit
   fxch                        ;   format exactly, so the round-trip is lossless)
   fistp qword ptr es:[edi]    ; store bytes 0..7
   fistp qword ptr es:[edi+8]  ; store bytes 8..15
   add edi, 16
   add esi, 16
   sub ecx, 16
   jnz mov_loop

I got this from a legendary site that shared with those interested what has happened in the last decade, as far as the technical aspect goes. It features some of the most legendary asm coders in the world, and it was recently shut down thanks to my WONDERFUL government, and its fear of what it doesn't understand...  :tdown

Farabi

The FPU is slow, but I don't know whether fild and fistp will be fast.

u

fxch on older AMD CPUs will be a major slowdown. On my K6-2 it took... 30-60 cycles.

meeku

Hey all,

I've been running similar bandwidth tests on my various machines at work and at home, and so far I haven't really been that impressed with the performance.
I put together about 15 different memory copy routines and tested them on various sized buffers (512KB, 1MB, 4MB, 100MB and 200MB).

So far the maximum sustained transfer rate I can achieve on the 200MB buffer is 1.2 GB/s... Clearly, the larger the buffer and the longer the test runs, the more accurate the mean rate becomes. For small buffers you might get considerably higher bandwidth due to measurement inaccuracy, or to small chunks already being in cache, etc.

The results I've had so far have confirmed that the streaming stores only show an advantage somewhere over the 1MB range.
I've found that the prefetchnta[...] instructions haven't offered any noticeable performance gain. I suspect this instruction is a bit like the Java GC: it's rumoured to do something, but when and how is a bit of a mystery. Software prefetching as described in that AMD PDF definitely does yield an improvement.

The machines I'm testing with are: an IBM Pentium M notebook at 1.70 GHz, and a P4 3.2 GHz HT with 2 GB of dual-channel DDR400 and an ATI Radeon 9800 Pro.

Oddly, though, the MMX versions seem to be faster than the SSE ones... on both machines. Even using movdqa/movntdq, SSE is still slower than the older movq/movntq...

I really have to say that, based on the machine specs and the "theoretical" peak memory transfer rate, I would've expected to see something more like 2-3 GB/s... anyways

hutch--

Hi meeku,

Welcome on board, you have done some interesting tests. I have much the same comment on the difference between MMX and XMM data transfer: I can get about the same data transfer rate if I separate the read and write streams using non-temporal writes, but I suggest that the real limiting factor is memory bandwidth. This also says that the processor is still a lot faster than memory. With each generation of hardware I have written a set of test pieces, and the only improvement I can get in raw data transfer is with non-temporal writes.

This may change as the x86-64 architecture starts to take over, since the technical data say it transfers 64-bit chunks internally and can pair these to perform 128-bit data transfers; true 32-bit hardware seems to handle data transfer in 32-bit chunks internally, so it will always have the memory bandwidth problem.
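
For comparison, the 128-bit pairing would look something like this with SSE2 (a sketch; assumes 16-byte aligned buffers, ECX holding a negative QWORD count that is a multiple of 4, and ESI/EDI biased past the end as in the MMX code earlier in the thread):

xloop:
movdqa xmm0, [esi+ecx*8] ; 16-byte aligned loads
movdqa xmm1, [esi+ecx*8+16]
movntdq [edi+ecx*8], xmm0 ; non-temporal 16-byte stores
movntdq [edi+ecx*8+16], xmm1
add ecx, 4 ; 4 QWORDs = 32 bytes per pass
jnz xloop
sfence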