Is there a simple memory copy routine out there for MASM?
Hi,
Yes, there is one shipped with the masm32 package:
MemCopy proc public uses esi edi Source:PTR BYTE,Dest:PTR BYTE,ln:DWORD
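For reference, a minimal REP MOVSD routine with that interface might look something like the following. This is only a sketch of the general shape (with a made-up name), not the actual masm32 library source:

align 4
MemCopySketch proc uses esi edi Source:PTR BYTE, Dest:PTR BYTE, ln:DWORD
    cld                     ; copy forward
    mov esi, Source
    mov edi, Dest
    mov ecx, ln
    shr ecx, 2              ; copy as many DWORDs as possible
    rep movsd
    mov ecx, ln
    and ecx, 3              ; then the remaining 0..3 bytes
    rep movsb
    ret
MemCopySketch endp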
Regards
/me waits for the torrent of memcopy implementations by members.. :bdg
Can we move this to the Lab so we can indeed have competitions about who can copy memory the fastest?
For what it's worth, the REP MOVSD version in the masm32 library works OK as a general purpose memory copy, but last time I played with a 4-DWORD version that paid attention to avoiding read-after-write stalls by using more registers, it was a bit faster on both the PIVs I work on. The REP MOVSD versions all suffer the same problem in that they are slow for the first 64 bytes or so until the special-case circuitry kicks in.
If you are repeatedly hammering memory copies of under about 250 bytes, an incremented-pointer version is a lot faster, but the REP MOVSD method catches up fast after that. MOVQ versions are faster when done properly and, if the hardware support is there, an XMM version distinguishing between temporal and non-temporal reads and writes is faster again.
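As a rough illustration of the incremented-pointer idea for small copies (a sketch only, assuming ESI/EDI/ECX are already loaded and the byte count is a multiple of 4):

small_copy:
    mov eax, [esi]          ; read one DWORD
    mov [edi], eax          ; write it
    add esi, 4
    add edi, 4
    sub ecx, 4
    jnz small_copy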
Striker, this issue comes up all the time. The general answer is that it depends on how much data you are copying (a rough dispatcher sketch follows the list):
1) less than 64 bytes: use MOV, MOVQ or MOVDQA
2) 64+ bytes: use REP MOVSD
3) 1 KB plus: use MOVQ or MOVDQA
4) 2 MB plus: use MOVNTDQ or other non-temporal stores
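A sketch of how such a dispatcher might look; the thresholds are the rough ones above and the target labels are hypothetical stand-ins for the four routines:

    ; ecx = byte count, esi/edi already loaded with source/destination
    cmp ecx, 64
    jb  use_mov_loop        ; 1) small: plain MOV/MOVQ loop
    cmp ecx, 1024
    jb  use_rep_movsd       ; 2) medium: REP MOVSD
    cmp ecx, 2*1024*1024
    jb  use_movdqa          ; 3) large: MOVQ/MOVDQA block copy
    jmp use_movntdq         ; 4) very large: non-temporal stores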
This is a rough estimate and might vary from processor to processor. So look at your data size, and then try a few different ways to see which works best for your specific application. A lot of it is dependent on memory speed. The biggest gain is with non-temporal stores in the multi-megabyte range. I was showing someone on win32 how to do that for their graphics program. They did different types of copies for different sized sets of data and had all the results tabulated, but for really large sizes it was running slow, so I showed them how to speed it up using non-temporal stores (scroll down):
http://board.win32asmcommunity.net/index.php?topic=20846.0
Here's a bit of a cut and paste showing different sized buffers and how much faster MOVNTPS is (that's what I showed him to use, since it does 16-byte writes and is supported with SSE). With 80 MB on his system MOVNTPS (move non-temporal) is 2.5 times faster. As the size of the buffer drops, the speed difference between the two algorithms drops. In the example below MOVNTPS is slower up until about 0.614 MB, so that's why I tend to use 1 MB as a guideline for when to try it. As always, benchmark your code. Notebook systems and older systems will have slower memory bandwidth.
;mem write
;-------------
;movaps [esi+n],xmm0, 80MB, LU=4 : 1.75 GB/s
;movntps [esi+n],xmm0, 80MB, LU=4 : 4.25 GB/s :-)
;
;movaps [esi+n],xmm0, 0.2MB, LU=4 : 8.47 GB/s <= Should make sense
;movntps [esi+n],xmm0, 0.2MB, LU=4 : 4.22 GB/s
;
;movaps [esi+n],xmm0, 0.4MB, LU=4 : 6.84 GB/s
;movntps [esi+n],xmm0, 0.4MB, LU=4 : 4.26 GB/s
;
;movaps [esi+n],xmm0, 0.614MB, LU=4 : 4.38 GB/s (640x480x16 Buffer in ram)
;movntps [esi+n],xmm0, 0.614MB, LU=4 : 4.23 GB/s (640x480x16 Buffer in ram)
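For reference, the inner loop of a MOVNTPS copy like the one discussed above generally looks something like this (a sketch only, assuming 16-byte aligned buffers and a byte count in ECX that is a multiple of 64):

ntps_loop:
    movaps xmm0, [esi]           ; four 16-byte reads
    movaps xmm1, [esi+16]
    movaps xmm2, [esi+32]
    movaps xmm3, [esi+48]
    movntps [edi], xmm0          ; four non-temporal 16-byte writes
    movntps [edi+16], xmm1
    movntps [edi+32], xmm2
    movntps [edi+48], xmm3
    add esi, 64
    add edi, 64
    sub ecx, 64
    jnz ntps_loop
    sfence                       ; make the streaming stores globally visible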
Here is a quick play. It's an unrolled DWORD copy timed against the REP MOVSD algo in the masm32 library. The main loop is two blocks of 8 MOVs; the design was to separate reads from writes using the same registers to avoid read-after-write stalls. The two blocks are an unroll by 2.
On the Prescott PIV I am using, aligning the label for the main block to 16 made no difference so I left it at 4.
The algo assumes at least 4-byte alignment for both source and destination buffers.
The timings I am getting on a 200 meg block of memory aligned by at least 4 bytes are 437 ms for REP MOVSD and 375 ms for the version posted below.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
dcopy proc src:DWORD,dst:DWORD,cnt:DWORD
LOCAL dcnt :DWORD
push ebx
push esi
push edi
mov esi, src
mov edi, dst
mov ecx, cnt
cmp ecx, 32
jl tail
shr ecx, 5 ; div by 32
mov dcnt, ecx
align 4
body:
mov eax, [esi]
mov ebx, [esi+4]
mov ecx, [esi+8]
mov edx, [esi+12]
mov [edi], eax
mov [edi+4], ebx
mov [edi+8], ecx
mov [edi+12], edx
mov eax, [esi+16]
mov ebx, [esi+20]
mov ecx, [esi+24]
mov edx, [esi+28]
mov [edi+16], eax
mov [edi+20], ebx
mov [edi+24], ecx
mov [edi+28], edx
add esi, 32
add edi, 32
sub dcnt, 1
jnz body
mov ecx, cnt
and ecx, 31
jz bcend
tail:
mov al, [esi]
add esi, 1
mov [edi], al
add edi, 1
sub ecx, 1
jnz tail
bcend:
pop edi
pop esi
pop ebx
ret
dcopy endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
I couldn't resist playing with this, so I created a test app that compares the MASM32 MemCopy to Hutch's dcopy, along with two MMX versions and two SSE versions. I coded the MMX and SSE versions without bothering to learn the details, so try not to laugh. And as you might expect, Hutch's dcopy is the fastest, but only by a small margin. Why exactly are the MMX and SSE versions slower than the ALU versions?
These results are for my P3 and a 100 MB buffer size. The "1"s along the top are the return values from the MASM32 cmpmem procedure that I used as part of a function test for each of the procedures. I had a lot of other stuff running while I ran the tests, but each time after the test app terminated, System Information showed ~390000 KB available.
111111
MemCopy - rep movsd : 273834438 cycles
dcopy - mov reg,mem/mov mem,reg x 8 : 269703703 cycles
qcopy - movq mmx,mem/movq mem,mmx x 1 : 412743712 cycles
_qcopy - movq mmx,mem/movq mem,mmx x 8 : 274738762 cycles
xcopy - movaps xmm,mem/movaps mem/xmm x 1 : 270069732 cycles
_xcopy - movaps xmm,mem/movaps mem/xmm x 8 : 312140267 cycles
[attachment deleted by admin]
This is a test piece; it does not handle odd numbers of bytes at the tail. I timed it on the same 200 meg sample and the time drops to about 300 ms, but there are some unusual anomalies in how it runs. It can be padded with a large number of NOPs and not run any slower on this PIV. I have changed the order, and the only thing that slows it down is to place the last non-temporal write at the beginning instead of in order. Interleaving the reads and writes did not affect the timing at all, and running prefetchnta and other hint types every 4k down to 128 bytes did not affect the timing at all.
It is faster than either REP MOVSD or the version posted above, but not by as much as you would expect. Change the MOVNTQ to MOVQ and it runs at about the same speed as the one posted above.
nops MACRO number
REPEAT number
nop
ENDM
ENDM
mmxcopy proc src:DWORD,dst:DWORD,cnt:DWORD
LOCAL lcnt :DWORD
push ebx
push esi
push edi
mov esi, src
mov edi, dst
mov ecx, cnt
xor ebx, ebx ; zero ebx
shr ecx, 6 ; div by 64
mov lcnt, ecx
mov ecx, esi
mov edx, edi
align 4
stlp:
movq mm(0), [esi]
movq mm(1), [ecx+8]
movq mm(2), [esi+16]
movq mm(3), [ecx+24]
movq mm(4), [esi+32]
movq mm(5), [ecx+40]
movq mm(7), [esi+56]
movq mm(6), [ecx+48]
add esi, 64
add ecx, 64
movntq [edi], mm(0)
movntq [edx+8], mm(1)
movntq [edi+16], mm(2)
movntq [edx+24], mm(3)
movntq [edi+32], mm(4)
movntq [edx+40], mm(5)
movntq [edi+48], mm(6)
movntq [edx+56], mm(7)
add edi, 64
add edx, 64
;; nops 96
sub lcnt, 1
jnz stlp
quit:
pop edi
pop esi
pop ebx
ret
mmxcopy endp
I have my answer but I suspect this thread will continue for some time :)
Hutch,
Unaligned data will slow your code. I suggest that you include another loop so that your data is aligned before going to the main loop.
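A sketch of the kind of head loop being suggested: copy single bytes until the destination is DWORD aligned, then fall into the main loop (assumes the remaining count in ECX is larger than the misalignment):

head_loop:
    test edi, 3              ; destination already 4-byte aligned?
    jz   head_done
    mov  al, [esi]
    mov  [edi], al
    add  esi, 1
    add  edi, 1
    sub  ecx, 1
    jmp  head_loop
head_done: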
Victor,
The MMX version needs at least 8-byte alignment, but that's not what I am testing with it; it's the absolute data transfer rate, which does not seem to be all that much faster than the integer versions.
I think you are missing out the prefetch instructions.
Quote from: hutch-- on May 18, 2005, 05:17:41 AM
This is a test piece; it does not handle odd numbers of bytes at the tail. I timed it on the same 200 meg sample and the time drops to about 300 ms, but there are some unusual anomalies in how it runs. It can be padded with a large number of NOPs and not run any slower on this PIV. I have changed the order, and the only thing that slows it down is to place the last non-temporal write at the beginning instead of in order. Interleaving the reads and writes did not affect the timing at all, and running prefetchnta and other hint types every 4k down to 128 bytes did not affect the timing at all.
It is faster than either REP MOVSD or the version posted above, but not by as much as you would expect. Change the MOVNTQ to MOVQ and it runs at about the same speed as the one posted above.
Also try moving where the "prefetchnta" instruction is. I wrote a program to automatically try all combinations of offsets and locations of the instruction in a loop and find the "optimum" one. I'll see if I can find it and post it. You only want to prefetch the source, not the destination, since you are writing directly to memory. I'll take a try at your code and see if I can speed it up any when I get a chance.
EDIT: Memory copies are heavily dependent on the maximum peak memory bandwidth, so some systems might run your code really slow and others really fast. Currently it is running in 828 ms for 200 MB on mine with no modifications.
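A sketch of the kind of experiment being described, based on the MOVNTQ loop above. The 320-byte look-ahead and the position of the hint are made-up starting points to tune from (ESI = source, EDI = destination, ECX = byte count, assumed a multiple of 32):

align 4
nt_loop:
    prefetchnta [esi+320]        ; prefetch the source only, a few cache lines ahead
    movq mm0, [esi]
    movq mm1, [esi+8]
    movq mm2, [esi+16]
    movq mm3, [esi+24]
    movntq [edi], mm0
    movntq [edi+8], mm1
    movntq [edi+16], mm2
    movntq [edi+24], mm3
    add esi, 32
    add edi, 32
    sub ecx, 32
    jnz nt_loop
    sfence
    emms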
This is the SSE version of the same algo, and on my PIV it runs at exactly the same timing as the MMX version. The box is a 2.8 GHz Prescott PIV with an 800 MHz FSB Intel board and 2 gig of DDR400 memory. I have done this style of testing on a number of different generations of hardware and they all seem to exhibit the same characteristics, which suggests to me that Mark's comment that memory bandwidth is the limiting factor is correct.
The factor that fascinated me is the amount of spare time floating around in the loops in both the MMX and XMM versions. I tried a hybrid that did both MMX and normal integer copy, but it was really slow, so the memory access times seem to be the problem. The only gain I can so far get from MMX or SSE code is the non-temporal writes, which reduce cache pollution.
Being able to pad the loop with a large number of NOPs shows that there is processing time being wasted, which says the processor is still a lot faster than the DDR400 memory.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
xmmcopy proc src:DWORD,dst:DWORD,cnt:DWORD
LOCAL lcnt :DWORD
push ebx
push esi
push edi
mov esi, src
mov edi, dst
mov ecx, cnt
shr ecx, 6
mov lcnt, ecx
align 4
stlp:
movdqa xmm(0), [esi]
movdqa xmm(1), [esi+16]
movdqa xmm(2), [esi+32]
movdqa xmm(3), [esi+48]
add esi, 64
movntdq [edi], xmm(0)
movntdq [edi+16], xmm(1)
movntdq [edi+32], xmm(2)
movntdq [edi+48], xmm(3)
add edi, 64
sub lcnt, 1
jnz stlp
pop edi
pop esi
pop ebx
ret
xmmcopy endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Seems like I'm the only AMD fan :)
Taken from:
"Using Block Prefetch for Optimized Memory Performance" - Advanced Micro Devices, author Mike Wall
Quote from PDF:
These code samples were run on an AMD AthlonXP Processor 1800+ with CAS2 DDR2100 memory, and VIA KT266A chipset. Data sizes were several megabytes, i.e. much larger than the cache.
To compare, "rep movsd" has bandwidth of ~640 MB/sec
; Note: copies QWORDs; to copy the remaining 0..7 bytes we would need to add a few more lines.
; Also we would want to ensure 8-byte alignment of the copied range.
; bandwidth: ~1976 MB/sec (up 300% vs. baseline)
CACHEBLOCK equ 400h ; number of QWORDs in a chunk
mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len] ; total number of QWORDS (8 bytes)
; (assumes len / CACHEBLOCK = integer)
lea esi, [esi+ecx*8]
lea edi, [edi+ecx*8]
neg ecx
mainloop:
mov eax, CACHEBLOCK / 16 ; note: prefetch loop is unrolled 2X
add ecx, CACHEBLOCK ; move up to end of block
prefetchloop:
mov ebx, [esi+ecx*8-64] ; read one address in this cache line...
mov ebx, [esi+ecx*8-128] ; ... and one in the previous line
sub ecx, 16 ; 16 QWORDS = 2 64-byte cache lines
dec eax
jnz prefetchloop
mov eax, CACHEBLOCK / 8
;-----------[ Block copy ]-------------------\
writeloop:
movq mm0, qword ptr [esi+ecx*8]
movq mm1, qword ptr [esi+ecx*8+8]
movq mm2, qword ptr [esi+ecx*8+16]
movq mm3, qword ptr [esi+ecx*8+24]
movq mm4, qword ptr [esi+ecx*8+32]
movq mm5, qword ptr [esi+ecx*8+40]
movq mm6, qword ptr [esi+ecx*8+48]
movq mm7, qword ptr [esi+ecx*8+56]
movntq qword ptr [edi+ecx*8], mm0
movntq qword ptr [edi+ecx*8+8], mm1
movntq qword ptr [edi+ecx*8+16], mm2
movntq qword ptr [edi+ecx*8+24], mm3
movntq qword ptr [edi+ecx*8+32], mm4
movntq qword ptr [edi+ecx*8+40], mm5
movntq qword ptr [edi+ecx*8+48], mm6
movntq qword ptr [edi+ecx*8+56], mm7
;-------------------------------------------/
add ecx, 8
dec eax
jnz writeloop
or ecx, ecx
jnz mainloop
sfence
emms
This code, using block prefetch and the MOVNTQ streaming store, achieves an overall memory bandwidth of 1976 MB/sec, which is over 90% of the theoretical maximum possible with DDR2100 memory.
Note on some instructions:
Now that the MMX registers are being used, the code can employ a very special instruction: MOVNTQ. This is a streaming store instruction, for writing data to memory. This instruction bypasses the on-chip cache, and sends data directly into a write combining buffer. And because the MOVNTQ allows the CPU to avoid reading the old data from the memory destination address, MOVNTQ can effectively double the total write bandwidth. (note that an SFENCE is required after the data is written, to flush the write buffer)
Thanks Ultrano,
I think the technique is called software pretouch and it appears to be faster than using the Intel "prefetchxxx" series of instructions.
Quote from: hutch-- on May 19, 2005, 05:01:40 AM
Thanks Ultrano,
I think the technique is called software pretouch and it appears to be faster than using the Intel "prefetchxxx" series of instructions.
Yep, software prefetch. I first came across it in Abrash's book (Zen of Assembly Language). Personally I thought it was really funny that the author used SSE code to do the MOVNTQ but then didn't try a prefetch instruction. I wanted to try that and see what the speed difference is. Technically a prefetch instruction just grabs the data over the bus and into the cache. However, you don't get any register dependency stalls (yea!), and you can force it to only fetch into one cache level, which usually makes it faster. Right now the code is dying the big dog in my program. I need to double-check it.
I tried prefetchnta and it ran at the same speed, no difference, which indicates to me that it's heavily write-I/O bound, since technically the prefetch should have sped it up (the author's code prefetches into the L1/L2 caches and I only prefetched into the L1, which is a faster way to do it).
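For what it's worth, the variation being described amounts to replacing the dummy-read block-prefetch loop in the AMD code with explicit hints. A sketch, keeping the same register layout as the code above:

prefetchloop:
    prefetchnta [esi+ecx*8-64]   ; hint: touch one cache line of this block...
    prefetchnta [esi+ecx*8-128]  ; ...and one of the previous line, with minimal cache pollution
    sub ecx, 16                  ; 16 QWORDs = 2 64-byte cache lines
    dec eax
    jnz prefetchloop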
Ultrano, you might want to try running it on a 100 MB data buffer to test those bandwidth numbers. I tried 100 MB, 10 MB, and 1 MB and got some varied results, so I think 100 MB will probably give you more accurate info.
Algorithm from the book:
100MB = 0.280 seconds = 357 MB/s
10MB = 0.025 seconds = 400 MB/s
1MB = 0.002 seconds = 500 MB/s
REP MOVSD
100MB = 0.445 seconds = 225 MB/s
10MB = 0.042 seconds = 238 MB/s
1MB = 0.003 seconds = 333 MB/s
From your comments it sounded like you tried 1 or 2 MB. The MB/s between 10 MB and 100 MB didn't change as much as it did between 1 MB and 10 MB, so you probably need to run at least a 10 MB buffer or bigger to get accurate results; otherwise your benchmarking is going to be off.
As a comparison, hutch's code runs in 0.318 seconds for a 100 MB buffer on my machine. I have a slow system at work. My system at home has a 3.2 GB/s peak memory bandwidth (dual-channel Rambus PC800).
I just copy/pasted from that PDF and fixed up the code from inline (for C/C++ compilers) to normal asm. While this is completely useless for me, I guessed some of you might find it useful (and I see AMD docs are something hardly anyone discusses here, so I'm showing a peek at them).
What are the specs of the PC you measured "1MB = 0.002 seconds = 500 MB/s" on?
I haven't tested it yet though ^^" - but my PC should beat the PDF's benchmark results twofold: AthlonXP2000+, 512MB DDR2@400MHz (PC2-3200).
I don't know, since I was given the system at work. It has a 1.7 GHz P4. I went to the control panel and it has an ICH0 (82801AB I/O controller), which dates it as really old. The MCH (memory controller hub, which would be responsible for the bandwidth to the memory) is an 82850 (850 is the chipset). I am trying to dig up the peak memory bandwidth for the chipset, but for a P4 system it seems low. Most P4 systems have a lot of memory bandwidth.
EDIT: The PDF for the 82850 says that it supports a peak bandwidth of 3.2 GB/s, which leaves the installed memory as the slow part. I'll see if I can find out what kind. I work for a hardware company, so there are a lot of "loose parts" floating around. When new people start they generally grab parts from all over to build their system, so it wouldn't surprise me if I had some slow memory in this system.
EDIT2: Found the memory on Samsung's website. It's ECC, which is probably the problem; ECC memory is always slower than non-ECC. The website doesn't say it's ECC, but the side of the RIMM has "ECC" written on it, so they probably had both types (ECC and non-ECC).
http://www.samsung.com/Products/Semiconductor/DRAM/RDRAM/RDRAMmodule/NormalRIMM/MR18R082GBN1/MR18R082GBN1.htm
Quote from: MichaelW on May 18, 2005, 05:14:14 AM
I couldn't resist playing with this, so I created a test app that compares the MASM32 MemCopy to Hutch's dcopy, along with two MMX versions and two SSE versions. I coded the MMX and SSE versions without bothering to learn the details, so try not to laugh. And as you might expect, Hutch's dcopy is the fastest, but only by a small margin. Why exactly are the MMX and SSE versions slower than the ALU versions?
These results are for my P3 and a 100 MB buffer size. The "1"s along the top are the return values from the MASM32 cmpmem procedure that I used as part of a function test for each of the procedures. I had a lot of other stuff running while I ran the tests, but each time after the test app terminated, System Information showed ~390000 KB available.
111111
MemCopy - rep movsd : 273834438 cycles
dcopy - mov reg,mem/mov mem,reg x 8 : 269703703 cycles
qcopy - movq mmx,mem/movq mem,mmx x 1 : 412743712 cycles
_qcopy - movq mmx,mem/movq mem,mmx x 8 : 274738762 cycles
xcopy - movaps xmm,mem/movaps mem/xmm x 1 : 270069732 cycles
_xcopy - movaps xmm,mem/movaps mem/xmm x 8 : 312140267 cycles
111111
MemCopy - rep movsd : 267229419 cycles
dcopy - mov reg,mem/mov mem,reg x 8 : 241144161 cycles
qcopy - movq mmx,mem/movq mem,mmx x 1 : 257438941 cycles
_qcopy - movq mmx,mem/movq mem,mmx x 8 : 274631556 cycles
xcopy - movaps xmm,mem/movaps mem/xmm x 1 : 231717334 cycles
_xcopy - movaps xmm,mem/movaps mem/xmm x 8 : 233434611 cycles
These results are for my SP2600+; why is xcopy the fastest?
So you copy while the CPU's ALU units sit idle and the FPU sits idle for x milliseconds?
I am not that good with clock cycles per operation, but why couldn't we experiment with doing y math operations while copying x milliseconds' worth of data at the same time, to see whether it is possible without slowing down the memory copy?
Interleave the copy with a few math ops, so you do z math operations for later use.
Should that be possible to research and test?
It won't make the memory copy faster, but a following calculation proc would be sped up by the z math operations being precalculated.
What I mean is research so you know, for each MB copied, how much you are able to interleave without affecting copy speed.
For example 8000 fsincos or 40000 muls,
or 1 meg add/sub per execution unit,
or 1 meg boolean per execution unit (and, or, xor),
etc.
Great, I'm just going to test it now that this discussion has got this far. But the optimized one cannot run on a P3 machine, right? SSE2 is not available?
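If it helps: SSE2 (movdqa/movntdq) is indeed not available on a P3, though SSE (movaps/movntps) and movntq are. A quick runtime check might look like this sketch (CPUID function 1 reports SSE in EDX bit 25 and SSE2 in EDX bit 26; the fallback labels are hypothetical):

    mov  eax, 1
    cpuid                        ; note: clobbers eax, ebx, ecx, edx
    test edx, 1 SHL 25           ; SSE available?
    jz   no_sse
    test edx, 1 SHL 26           ; SSE2 available?
    jz   sse_only                ; e.g. fall back to the movaps/movntq version here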
if you want a really fast mem copy try
; make DS:ESI point to the source code
; make ES:EDI point to the destination code
; ECX = length of code to be moved
; the code length is calculated in 16-byte chunks
mov_loop:
fild qword ptr [esi] ; load the two QWORDs of this 16-byte chunk
fild qword ptr [esi+8]
fxch ; swap so they are stored back in source order
fistp qword ptr es:[edi]
fistp qword ptr es:[edi+8]
add edi, 16
add esi, 16
sub ecx, 16
jg mov_loop ; loop while bytes remain
I got this from a legendary site that shared with those interested what has happened in the last decade, as far as the technical aspect goes. It featured some of the most legendary asm coders in the world, and it was recently shut down thanks to my WONDERFUL government and its fear of what it doesn't understand... :tdown
The FPU is slow, but I don't know whether FILD and FISTP will be fast.
fxch on older AMD CPUs will be a major slowdown. On my K6-2 it took ... 30-60 cycles.
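For what it's worth, the FXCH can be avoided entirely by loading the two QWORDs in reverse order, so the plain pops store them back in source order. A sketch of the same 16-byte chunk without FXCH:

fild qword ptr [esi+8] ; load the second QWORD first...
fild qword ptr [esi] ; ...then the first, which ends up on top of the stack
fistp qword ptr es:[edi] ; pops now store them in source order
fistp qword ptr es:[edi+8]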
Hey all,
I've been running similar bandwidth tests on my various machines at work and at home, and so far I haven't really been that impressed with the performance.
I put together about 15 different memory copy routines and tested them on various sized buffers (512 KB, 1 MB, 4 MB, 100 MB and 200 MB).
So far the maximum sustained transfer rate I can achieve on the 200 MB buffer is 1.2 GB/s... Clearly, the larger the buffer and the longer the test runs, the more accurate the mean rate becomes. For small buffers you might get considerably higher bandwidth due to measurement inaccuracy or having small chunks already in cache, etc.
The results I've had so far have confirmed that the streaming stores only show an advantage somewhere over the 1 MB range.
I've found that the prefetchnta[...] instructions haven't offered any noticeable performance gain. I suspect this instruction is a bit like the Java GC: it's rumoured to do something, but when and how is a bit of a mystery. Software pre-fetching as described in that AMD PDF definitely does yield an improvement.
The machines I'm testing with are an IBM Pentium M notebook at 1.70 GHz and a P4 3.2 GHz HT with 2 gig of dual-channel DDR400 and an ATI Radeon 9800 Pro.
Oddly, though, the MMX versions seem to be faster than the SSE ones on both machines; even using movdqa/movntdq with SSE is still slower than the older movq/movntq...
I really have to say that based on the machine specs and the "theoretical" peak transfer rate of memory, I would've expected to see something more like 2-3 GB/s... anyways.
Hi meeku,
Welcome on board; you have done some interesting tests. I have much the same comment on the difference between MM and XMM data transfer: I can get about the same data transfer rate if I separate the read and write streams of data using the non-temporal writes, but I suggest that the real limiting factor is memory bandwidth. This also says that the processor is still a lot faster than memory. With each generation of hardware I have written a set of test pieces, and the only improvement I can get in raw data transfer is with non-temporal writes.
This may change as the x86-64 architecture starts to take over, as the technical data say they internally transfer 64-bit chunks and can pair these to perform 128-bit data transfers, but the true 32-bit hardware seems to handle data transfer in 32-bit chunks internally, so it will always have the memory bandwidth problem.
Hey Hutch :wink
Good to be on board!
Something I'd actually be interested to try is using the same asm routines on a Linux box. I did quite a bit of work on putting together my own OS and kernel, based on an old protected-mode extender I wrote for DOS, and I found that using software in the kernel to manage virtual address spaces, memory management and task switching was considerably faster and more reliable than using the built-in task switching/V86 mode stuff on the CPU. Basically you leave the CPU running in a flat memory model with no paging, all at ring 0, and let the kernel manage the rest... anyhow I'm digressing; the point is that memory access was at least 20% faster, so perhaps the results would vary on Linux.
Anyone doing asm coding under Linux want to give us some performance results for the same tests? :bg
Hey,
Managed to marginally improve the performance by adding one more read prior to the prefetch loop to prime the TLB (as per Intel's optimisation recommendations).
Sitting at 1.1 GB/s, tested with 100 and 200 MB buffers and various numbers of iterations of the timing loop.
This was the performance from my 1.7 GHz Pentium M IBM laptop; in theory this result should be much better on a decent-spec desktop.
Here are the two fastest versions so far:
mov esi, data1ptr
mov edi, data2ptr
mov ecx, DATASIZE
lea esi, [esi+ecx*8]
lea edi, [edi+ecx*8]
neg ecx
align 16
mainloop:
mov eax, (CACHEBLOCK / 16)
add ecx, CACHEBLOCK
mov edx, [esi+ecx*8-128]
prefetchloop:
mov ebx, [esi+ecx*8-64]
mov ebx, [esi+ecx*8-128]
sub ecx, 16
dec eax
jnz short prefetchloop
mov eax, (CACHEBLOCK / 8)
writeloop:
movdqa xmm0, [esi+ecx*8]
movdqa xmm1, [esi+ecx*8+16]
movdqa xmm2, [esi+ecx*8+32]
movdqa xmm3, [esi+ecx*8+48]
movntdq [edi+ecx*8], xmm0
movntdq [edi+ecx*8+16], xmm1
movntdq [edi+ecx*8+32], xmm2
movntdq [edi+ecx*8+48], xmm3
add ecx, 8
dec eax
jnz writeloop
or ecx, ecx
jnz mainloop
AND...
mov esi, data1ptr
mov edi, data2ptr
mov ecx, DATASIZE
lea esi, [esi+ecx*8]
lea edi, [edi+ecx*8]
neg ecx
align 16
mainloop:
mov eax, (CACHEBLOCK / 16)
add ecx, CACHEBLOCK
mov edx, [esi+ecx*8-128] ; Prime TLB
prefetchloop: ; Software Prefetch (touch) loop.
mov ebx, [esi+ecx*8-64]
mov ebx, [esi+ecx*8-128]
sub ecx, 16
dec eax
jnz short prefetchloop
mov eax, (CACHEBLOCK / 8)
writeloop:
movq mm0, qword ptr [esi+ecx*8]
movq mm1, qword ptr [esi+ecx*8+8]
movq mm2, qword ptr [esi+ecx*8+16]
movq mm3, qword ptr [esi+ecx*8+24]
movq mm4, qword ptr [esi+ecx*8+32]
movq mm5, qword ptr [esi+ecx*8+40]
movq mm6, qword ptr [esi+ecx*8+48]
movq mm7, qword ptr [esi+ecx*8+56]
movntq qword ptr [edi+ecx*8], mm0
movntq qword ptr [edi+ecx*8+8], mm1
movntq qword ptr [edi+ecx*8+16], mm2
movntq qword ptr [edi+ecx*8+24], mm3
movntq qword ptr [edi+ecx*8+32], mm4
movntq qword ptr [edi+ecx*8+40], mm5
movntq qword ptr [edi+ecx*8+48], mm6
movntq qword ptr [edi+ecx*8+56], mm7
add ecx, 8
dec eax
jnz short writeloop
or ecx, ecx
jnz short mainloop
It's basically almost identical to the AMD reference one, with a few minor changes.
I think this is about as good as it's going to get, IMO.
webring, FILD is a slow instruction, and even when I tried it on a 486 it didn't give better performance; FLD was faster.
But one problem from the DOS days remained, and I didn't know why the copied data wasn't the same. The reason was the FPU and its precision bits: if some game left the FPU doing all its calculations in 24-bit precision, then we had problems.
For my SDR133 RAM the best solution is MOVQ and MOVNTQ; compared to REP MOVSD it increases the speed from 272 to 512 MB/s when software prefetch is used to copy 64 MB on my Athlon XP 1700+ at 1467 MHz with its 64 KB L1 cache (a P4 can only do 8 KB :P).
I was already using MMX to copy data, and that alone gives 220 million ticks compared to the baseline 350 million; with MOVNTQ and prefetch it dropped to 194 million ticks. Using SSE MOVAPS doesn't boost performance; I'd have to buy a new AMD64 and DDR2 memory.
Sorry for bumping this thread, but I implemented two of my own approaches to memory copying (using MMX), got interesting results on my AMD Athlon 64 X2 4400+, and thought I'd share them with the rest of the community:
11111111
MemCopy - rep movsd : 176845573 cycles
dcopy - mov reg,mem/mov mem,reg x 8 : 164048385 cycles
qcopy - movq mmx,mem/movq mem,mmx x 1 : 172751465 cycles
_qcopy - movq mmx,mem/movq mem,mmx x 8 : 151329157 cycles
xcopy - movaps xmm,mem/movaps mem/xmm x 1 : 154956402 cycles
_xcopy - movaps xmm,mem/movaps mem/xmm x 8 : 148054639 cycles
mmxcopy - movq x 8 : 147378815 cycles
mmxcopy2 - movq x 8 : 26129470 cycles
Press enter to exit...
mmxcopy and mmxcopy2 are my own functions and the second one is in all cases (and I've tried a LOT) much, much faster.
Edit: I sent it to a friend and he got results in a similar "order" on his laptop.
[attachment deleted by admin]
amd 64 3800+
11111111
MemCopy - rep movsd : 282149047 cycles
dcopy - mov reg,mem/mov mem,reg x 8 : 279833356 cycles
qcopy - movq mmx,mem/movq mem,mmx x 1 : 280034282 cycles
_qcopy - movq mmx,mem/movq mem,mmx x 8 : 277556675 cycles
xcopy - movaps xmm,mem/movaps mem/xmm x 1 : 277835608 cycles
_xcopy - movaps xmm,mem/movaps mem/xmm x 8 : 276572415 cycles
mmxcopy - movq x 8 : 276869072 cycles
mmxcopy2 - movq x 8 : 41037911 cycles
Press enter to exit...
nice job! :U
Thanks. :bg I'd appreciate if anyone else, in particular if you've got an Intel CPU, could test it out.
INTEL P3 846 512RAM
11111111
MemCopy - rep movsd : 522935521 cycles
dcopy - mov reg,mem/mov mem,reg x 8 : 523243442 cycles
qcopy - movq mmx,mem/movq mem,mmx x 1 : 524413281 cycles
_qcopy - movq mmx,mem/movq mem,mmx x 8 : 523513785 cycles
xcopy - movaps xmm,mem/movaps mem/xmm x 1 : 511182237 cycles
_xcopy - movaps xmm,mem/movaps mem/xmm x 8 : 521955504 cycles
mmxcopy - movq x 8 : 524072757 cycles
mmxcopy2 - movq x 8 : 83558428 cycles
Press enter to exit...
On an Intel Celeron 2.53 GHz:
Quote
11111111
MemCopy - rep movsd : 396270757 cycles
dcopy - mov reg,mem/mov mem,reg x 8 : 373467620 cycles
qcopy - movq mmx,mem/movq mem,mmx x 1 : 375417875 cycles
_qcopy - movq mmx,mem/movq mem,mmx x 8 : 402793093 cycles
xcopy - movaps xmm,mem/movaps mem/xmm x 1 : 398304671 cycles
_xcopy - movaps xmm,mem/movaps mem/xmm x 8 : 404616784 cycles
mmxcopy - movq x 8 : 406699997 cycles
mmxcopy2 - movq x 8 : 47871561 cycles
Nick
Thanks, guys, for testing. :U
Sempron 3000+ (64-bit) [wow it beats an amd 64 3800+]
11111111
MemCopy - rep movsd : 176175708 cycles
dcopy - mov reg,mem/mov mem,reg x 8 : 177890633 cycles
qcopy - movq mmx,mem/movq mem,mmx x 1 : 173873833 cycles
_qcopy - movq mmx,mem/movq mem,mmx x 8 : 167199286 cycles
xcopy - movaps xmm,mem/movaps mem/xmm x 1 : 169962053 cycles
_xcopy - movaps xmm,mem/movaps mem/xmm x 8 : 169044058 cycles
mmxcopy - movq x 8 : 167693297 cycles
mmxcopy2 - movq x 8 : 36151686 cycles
But strangely, on subsequent runs, the results are always around
MemCopy - rep movsd : 218944142 cycles
dcopy - mov reg,mem/mov mem,reg x 8 : 219306860 cycles
qcopy - movq mmx,mem/movq mem,mmx x 1 : 197413306 cycles
_qcopy - movq mmx,mem/movq mem,mmx x 8 : 197132039 cycles
xcopy - movaps xmm,mem/movaps mem/xmm x 1 : 196587334 cycles
_xcopy - movaps xmm,mem/movaps mem/xmm x 8 : 195489834 cycles
mmxcopy - movq x 8 : 196248279 cycles
mmxcopy2 - movq x 8 : 36618720 cycles
:eek
Interesting, mmxcopy2 seems to be the fastest method on both AMD and Intel CPUs. Does anyone have a theory on why it's so much faster than the rest?
It is because of the expression [esi+edx*8+0], which is wrong. It makes the code stomp on the same data in 7/8 of the cases.
Quote from: Ultrano on March 03, 2007, 04:03:21 PM
It is because of the expression [esi+edx*8+0], which is wrong. It makes the code stomp on the same data in 7/8 of the cases.
I expected it to be a "hidden" bug that caused the "magic" result; thanks for letting me know. So what should it be?
change "shr ecx,6" into "shr ecx,3",
change "add eax,1" into "add eax,8"
But even better:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
mmxcopy2 proc src:DWORD,dst:DWORD,lenx:DWORD
mov ecx,[esp+3*4] ; len
add esp,-3*4 ; 3 DWORDs
mov [esp],ecx ;
mov [esp-4],esi ;
mov [esp-8],edi ;
mov esi,[esp+1*4+(3*4)] ; src
mov edi,[esp+2*4+(3*4)] ; dst
cmp ecx,64 ;
;jz @exit ;
jb @tail ;
shr ecx,3 ;
mov edx,8
;
align 16 ;
@@:
sub ecx,edx ;
movq mm0,[esi+ecx*8+0] ;
movq mm1,[esi+ecx*8+8] ;
movq mm2,[esi+ecx*8+16] ;
movq mm3,[esi+ecx*8+24] ;
movq mm4,[esi+ecx*8+32] ;
movq mm5,[esi+ecx*8+40] ;
movq mm6,[esi+ecx*8+48] ;
movq mm7,[esi+ecx*8+56] ;
;
movq [edi+ecx*8+0],mm0 ;
movq [edi+ecx*8+8],mm1 ;
movq [edi+ecx*8+16],mm2 ;
movq [edi+ecx*8+24],mm3 ;
movq [edi+ecx*8+32],mm4 ;
movq [edi+ecx*8+40],mm5 ;
movq [edi+ecx*8+48],mm6 ;
movq [edi+ecx*8+56],mm7 ;
jz @F
;
jmp @B
@@:
;
and dword ptr [esp],63 ;
jz @exit ;
mov ecx,[esp] ;
;
@tail: ;
;cld ;
rep movsb ;
@exit: ;
mov edi,[esp-8] ;
mov esi,[esp-4] ;
add esp,3*4 ;
ret 3*4 ;
mmxcopy2 endp
OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF
And still, this is not faster than mmxcopy()
Quote from: Seb on March 03, 2007, 12:18:58 AM
Thanks. :bg I'd appreciate if anyone else, in particular if you've got an Intel CPU, could test it out.
I think which mobo/memory stick configuration you have is more relevant to knowing what's fastest.
I mean it would be most interesting to know how dual-channel DDR2 compares to a single DDR memory stick.
Mobos that are hyped to give dual-channel performance even though the CPU doesn't support dual channel would be interesting to check, to see whether it's true or not.