Hi,
I copy 256 MB (268,435,456 bytes) to another 256 MB buffer.
It takes 0.116 seconds.
Can it be done faster?
That's 2.15 GB/s read and 2.15 GB/s write on an AMD, yet AMD says it can do 8 GB/s?
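A quick check of the 2.15 figure, sketched in C. The function name is made up for illustration, and the sketch assumes binary units (1 GiB = 2^30 bytes). It counts one direction only, so combined read-plus-write traffic is twice the result.

```c
#include <assert.h>
#include <stdint.h>

/* Copy throughput in GiB/s (1 GiB = 2^30 bytes) for `bytes` moved in `seconds`. */
double copy_rate_gib_per_s(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

Calling copy_rate_gib_per_s(268435456, 0.116) gives roughly 2.155, which matches the number in the post.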
The MASM32 memcopy procedure should be able to do reasonably well copying 256MB, but there are better methods available.
http://www.masmforum.com/simple/index.php?topic=1637.0
OK, so 2 GB/s is good.
I'm using AMD's code.
I built it into a DLL:
.686 ; Pentium Pro (P6) instruction set or better
.model flat, stdcall ; 32-bit memory and standard call
option casemap:none
.xmm
.code ; the beginning of the code section
LibMain proc h:DWORD, r:DWORD, u:DWORD ; the dll entry point
mov eax, 1 ; if eax is 0, the dll won't start
ret ; return
LibMain Endp ; end of the dll entry
;******************************************************************
mul5 proc nb:dword,nb2:dword,p:dword
mov edx,nb ; source pointer
mov esi,nb2 ; destination pointer
mov ecx,p ; number of bytes
add edx, ecx ; Add to source address the number of bytes to copy.
add esi, ecx ; Add to destination address the number of bytes to copy.
shr ecx, 3 ; Convert number of bytes to number of QWORDs.
neg ecx ; Make number of QWORDs negative.
chunkloop:
mov ebx, ecx ; Save number of QWORDs.
mov eax,128 ;128 Initialize number of cache-line pairs to prefetch.
prefetchloop:
prefetchnta [edx+ecx*8] ; 64 Load a line (64 bytes) into the L1 data cache;
add ecx, 8 ; 8*8 64 Select next cache-line pair.
dec eax ; 64*128=8192 Decrement number of cache-line pairs.
jnz prefetchloop ; If cache-line pairs remain, then jump.
mov ecx, ebx ; Restore number of QWORDs.
mov eax, 256 ; 256 Initialize number of sub-blocks to copy.
copyloop:
movapd xmm0,[edx+ecx*8] ;
movapd xmm1,[edx+ecx*8+16] ;
movntdq [esi+ecx*8], xmm0 ;
movntdq [esi+ecx*8+16], xmm1 ;
add ecx, 4 ;
dec eax ;
jnz copyloop ; If another sub-block remains, then jump.
or ecx, ecx ; Test whether chunk count is 0.
jnz chunkloop ; If another chunk remains, then jump.
sfence ; Flush the write-combining buffer.
ret
mul5 endp
The combination you are using looks correct: temporal reads and non-temporal writes. You are probably already hitting the memory speed limit, which is the major restricting factor for block memory copy on current computers. You may be able to tweak a little more here and there, but you will not get a fourfold speed increase, as you are already using SSE instructions.
You don't need the line:
SHR ECX, 3
If you keep ECX as the number of bytes, you can remove the *8 multiplier everywhere it appears, and this will speed up the address calculation (plus produce smaller code). Just add 64 instead of 8 to ECX everywhere you do that to compensate, and 32 near the end of the code where you have:
ADD ECX, 4
When you prefetch the data, you have only considered the size of the cache, you haven't limited that to the amount of data you are transferring, which might be smaller! You should probably add a test/JMP in the prefetch loop:
ADD ECX, 8 ; 8*8 64 Select next cache-line pair;
JNS @F <-- ADD THIS to clear loop early if size of data is less than cache size
DEC EAX ; 64*128=8192 Decrement number of cache-line pairs.
JNZ prefetchloop ; If cache-line pairs remain, then jump.
@@: <-- AND THIS
MOV ECX, EBX
You need an identical early-out test/JMP in the copyloop too.
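The early-out amounts to never prefetching more cache lines than the data still to be copied. A rough C sketch of that clamp follows. The function and parameter names are invented for illustration, and the 64-byte line and 128-line chunk are simply the values used in the posted code.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines to prefetch for the next chunk:
   enough to cover the remaining bytes, but never more than max_lines. */
size_t lines_to_prefetch(size_t bytes_remaining, size_t line_size, size_t max_lines)
{
    assert(line_size != 0);
    /* Round up without risking overflow on very large sizes. */
    size_t lines = bytes_remaining / line_size
                 + (bytes_remaining % line_size != 0);
    return lines < max_lines ? lines : max_lines;
}
```

With 64-byte lines and a 128-line chunk, a 100-byte copy would prefetch 2 lines rather than 128.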
You might also consider that you should probably check whether the pointers you are giving the routine are 16-byte aligned, which I believe the SSE instructions require, and obviously also that the number of bytes is a multiple of 32, as you are currently transferring 32 bytes in the copyloop regardless of the actual amount left, unless you add code to handle copying any remainder. (If you don't do the SHR ECX, 3 as I suggested, you still have the remainder available). You'd really need to return some error value in EAX if the parameters were wrong. A good subroutine should sanity-check its input parameters, not rely on the calling routine to have done so.
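As a sketch of the kind of sanity check meant here, written in C rather than MASM for brevity. The status codes and function names are hypothetical. The 16-byte alignment and the multiple-of-32 size are the two conditions named above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes a copy routine could return to its caller. */
enum copy_status { COPY_OK = 0, COPY_BAD_ALIGNMENT = 1, COPY_BAD_SIZE = 2 };

/* Non-zero if p is a multiple of alignment (alignment must be a power of two). */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Reject arguments the 32-bytes-per-pass SSE loop cannot handle. */
enum copy_status check_copy_args(const void *src, const void *dst, size_t n)
{
    if (!is_aligned(src, 16) || !is_aligned(dst, 16))
        return COPY_BAD_ALIGNMENT;
    if (n % 32 != 0)
        return COPY_BAD_SIZE;
    return COPY_OK;
}
```

The caller would test the result before starting the copy, much as the DLL routine could return a non-zero value in EAX.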
And don't forget to add a PUSH ESI at the beginning and POP ESI at the end to avoid the whole thing crashing, as you must restore that register for the calling routine. :wink
IanB
Yes, if the size is not a multiple of 8192 it's a big problem.
Is ESI important? I crash if I use EDI or EBP, but not with ESI.
Yes, I must align to 16, and Visual Basic does not always align to 16 :( I'll have to add code that copies by 8 first and then by 16.
Quote from: elcricri on January 16, 2006, 11:43:02 PM
Is ESI important? I crash if I use EDI or EBP, but not with ESI.
You've been lucky! That's only because whatever you are calling it from hasn't been using ESI, fortunately. If you reuse the routine with another piece of code you can't rely on that. Always save/restore EBP/EBX/EDI/ESI if you use them in a proc.
Quote
Yes, I must align to 16, and Visual Basic does not always align to 16 :(
If you need an aligned memory block with VB, and it obviously needs to be a significant size for your purposes, you could try adding another routine into your DLL that calls VirtualAlloc, which reserves and allocates page-aligned memory blocks. There's a fudge using sneaky non-supported VB statements which allows you to use the memory pointer you can pass back as the return value so that you can access the memory block. My memory on it is a little rusty, but do some Googling on AddressOf, VarPtr, StrPtr, and ObjPtr and you'll certainly find some code that'll show you how to use the allocated and now aligned memory block as a normal byte array in your VB code.
Try http://www.thevbzone.com/secrets.htm for starters. :wink
Don't forget to add another routine in the DLL that will DE-allocate the block with VirtualFree though, when you pass the initial pointer back.
IanB
Hmm.. :'(
Re-reading this, I think it's going to be difficult to get a chunk of memory allocated from within the DLL recognised by the calling VB program, as you can't write direct to an arbitrary memory address in VB, only to a variable reference. And I don't think you can change or create a reference to use that allocation.
So I think you have two options. The first is to write code that keeps all the data you are moving around within the DLL, do all your processing in ASM by DLL calls and use the VB side as merely a GUI wrapper to that code. But that will then make it difficult to get any information out of the memory block if your VB prog needs to process it other than merely moving it from place to place.
The second option is more fudge. Use the VarPtr function to get the memory address of a byte array that you create within VB. Make it at least 16 bytes more than you need in size. Because the result of that function is a real pointer to the first element of the byte array, that you can pass to an API call or to your memory copy proc, all you need to know is the alignment of that pointer, which you can test the usual way (AND with 0fH = 15 to see if it's 16-byte aligned). If it's not aligned, just save a padding value that you can use as an offset on the VB side, and send your DLL the first aligned location in the array block as a pointer. That way, your DLL can use the memory allocated by VB and your VB code can access the results, always indexing into the byte array with the extra padding offset.
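The padding value described in this option is plain address arithmetic. A minimal C sketch of it, with a made-up function name (the same calculation could be done on the VB side on the value VarPtr returns):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes to skip from addr to reach the next multiple of alignment
   (0 if addr is already aligned). alignment must be a power of two. */
size_t padding_to_align(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t mask = alignment - 1;
    return (alignment - (size_t)(addr & mask)) & mask;
}
```

The padding for 16-byte alignment is at most 15, which is why an array oversized by 16 bytes always has room. The DLL receives base + padding, and the VB code indexes the array starting at that same padding offset.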
IanB
Quote from: Ian_B on January 17, 2006, 05:04:52 PM
AND with 0fH = 15 to see if it's 16-byte aligned
Shouldn't that be 'AND with 0Fh = 0 if it's 16-byte aligned'?
Alternatively, OR the value with 0Fh then add 1 and it is guaranteed to be 16-byte aligned (but test first in case it already is) :U
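The OR-then-add-1 trick, sketched in C with an invented function name. The test comes first because an already aligned value would otherwise be pushed a full 16 bytes further on.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next multiple of 16 using the OR-then-increment trick. */
uintptr_t align16_up(uintptr_t v)
{
    uintptr_t r = v;
    if ((v & 0x0Fu) != 0)       /* test first: an aligned value must stay put */
        r = (v | 0x0Fu) + 1;    /* set the low four bits, then carry into bit 4 */
    assert((r & 0x0Fu) == 0);   /* the result is always 16-byte aligned */
    return r;
}
```

For example, 0x1001 and 0x100F both round up to 0x1010, while 0x1000 is returned unchanged.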
Here is a quick play, mainly to format it so I could read it and add the stack code for EBX and ESI. If you can make it fit with your block copy size, I would unroll the SSE instructions so you work on a bigger block at a time; there is no reason not to use all 8 SSE registers. SSE instructions can be noticeably laggy when mixed with normal integer instructions, and one way you can tell is to pad different parts of the code with NOPs to see if the operation slows down. At times, unrolling the SSE code can help here.
Separating temporal reads from non-temporal writes is a good idea, and is faster than feeding the same data in and out of the cache.
What you will tend to find is that memory speed is the limiting factor in how fast you can make this code work.
; ******************************************************************
align 16
mul5 proc nb:dword,nb2:dword,p:dword
push ebx ; << added
push esi
mov edx,nb ; source pointer
mov esi,nb2 ; destination pointer
mov ecx,p ; number of bytes
add edx, ecx ; Add to source address the number of bytes to copy.;
add esi, ecx ; Add to destination address the number of bytes to copy.
shr ecx, 3 ; Convert number of bytes to number of QWORDs.
neg ecx ; Make number of QWORDs negative.
chunkloop:
mov ebx, ecx ; Save number of QWORDs.
mov eax,128 ; 128 Initialize number of cache-line pairs to prefetch.
prefetchloop:
prefetchnta [edx+ecx*8] ; 64 Load a line (64 bytes) into the L1 data cache;
add ecx, 8 ; 8*8 64 Select next cache-line pair.
sub eax, 1 ; 64*128=8192 Decrement number of cache-line pairs.
jnz prefetchloop ; If cache-line pairs remain, then jump.
mov ecx, ebx ; Restore number of QWORDs.
mov eax, 256 ; 256 Initialize number of sub-blocks to copy.
copyloop:
movapd xmm0,[edx+ecx*8]
movapd xmm1,[edx+ecx*8+16]
movntdq [esi+ecx*8], xmm0
movntdq [esi+ecx*8+16], xmm1
add ecx, 4
sub eax, 1
jnz copyloop ; If another sub-block remains, then jump.
test ecx, ecx ; Test whether chunk count is 0.
jnz chunkloop ; If another chunk remains, then jump.
sfence ; Flush the write-combining buffer.
pop esi ; << added
pop ebx
ret
mul5 endp
; ******************************************************************
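Try an unroll something like this. Each pass now moves 128 bytes (16 QWORDs), so ADD ECX, 4 becomes ADD ECX, 16 and the sub-block count drops from 256 to 64.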
movapd xmm0,[edx+ecx*8]
movapd xmm1,[edx+ecx*8+16]
movapd xmm2,[edx+ecx*8+32]
movapd xmm3,[edx+ecx*8+48]
movapd xmm4,[edx+ecx*8+64]
movapd xmm5,[edx+ecx*8+80]
movapd xmm6,[edx+ecx*8+96]
movapd xmm7,[edx+ecx*8+112]
movntdq [esi+ecx*8], xmm0
movntdq [esi+ecx*8+16], xmm1
movntdq [esi+ecx*8+32], xmm2
movntdq [esi+ecx*8+48], xmm3
movntdq [esi+ecx*8+64], xmm4
movntdq [esi+ecx*8+80], xmm5
movntdq [esi+ecx*8+96], xmm6
movntdq [esi+ecx*8+112], xmm7
Quote from: zooba on January 18, 2006, 02:57:40 AM
Quote from: Ian_B on January 17, 2006, 05:04:52 PM
AND with 0fH = 15 to see if it's 16-byte aligned
Shouldn't that be 'AND with 0Fh = 0 if it's 16-byte aligned'?
No, I was adding, perhaps redundantly, the info that 0fH = 15. I guess I could have added the relevant JNZ instruction too to really be obtuse, but I thought that would be perfectly obvious.
IanB
Ah, okay. Since the topic was VB I interpreted the '=' symbol as meaning 'is equal to' for the whole expression (ie. '==' in almost everything else) :U
If ESI and EDI are aligned on 16 or on 8, I use SSE2; otherwise just a REP MOVSB. Now I can copy anything from 1 byte to 4 GB.
mul5 proc nb:dword,nb2:dword,p:dword
clc
push esi
push edi
rdtsc
push eax
push edx
mov esi,nb
and esi,0fh
mov edi,nb2
and edi,0fh
cmp esi,edi
jz ok
mov esi,nb ;not alligned
mov edi,nb2
mov ecx,p
rep movsb
pop edx
pop eax
mov ecx,eax
mov ebx,edx
rdtsc
sub eax,ecx ; subtract the low dwords first
sbb edx,ebx ; then the high dwords, with the borrow
pop edi
pop esi
ret
ok:
mov ecx,p
mov eax,0
cmp esi,0
jz suit
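mov eax,16d
sub eax,esi ; bytes needed to reach 16-byte alignment
cmp eax,ecx ; ECX still holds p here
cmova eax,ecx ; never copy more than p bytes in the head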
mov esi,nb
mov edi,nb2
mov ecx,eax
rep movsb
suit:
mov ecx,p
sub ecx,eax
cmp ecx,0
jbe end1
mov esi,nb
mov edi,nb2
add esi,eax
add edi,eax
add edi,ecx
add esi,ecx
neg ecx
chunkloop:
mov ebx,ecx
mov eax,ecx
neg eax
shr eax,6
mov edx,128
cmp eax,128
cmovge eax,edx
mov edx,eax
or eax,eax
jnz prefetchloop
add edi,ecx
add esi,ecx
neg ecx
rep movsb
jmp end1
prefetchloop:
prefetchnta [esi+ecx]
add ecx, 64
dec eax
jnz prefetchloop
mov ecx,ebx
mov eax,edx
shl eax,2
copyloop:
movapd xmm0,[esi+ecx]
movntdq [edi+ecx], xmm0
add ecx, 16
dec eax
jnz copyloop
or ecx, ecx
jnz chunkloop
end1:
sfence
pop edx
pop eax
mov ecx,eax
mov ebx,edx
rdtsc
sub eax,ecx ; subtract the low dwords first
sbb edx,ebx ; then the high dwords, with the borrow
pop edi
pop esi
ret
mul5 endp
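The way this routine splits a copy can be written out as plain arithmetic. What follows is only a C sketch of my reading of the code above, not the poster's own description, and the struct and function names are made up. When source and destination have the same offset within 16 bytes, a byte-wise head brings both to alignment, the SSE2 loop moves whole 64-byte blocks, and a byte-wise tail takes the rest. Otherwise everything is copied byte by byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct copy_plan {
    size_t head;  /* bytes copied singly until both pointers are 16-byte aligned */
    size_t body;  /* bytes copied by the SSE2 loop, a multiple of 64 */
    size_t tail;  /* leftover bytes copied singly */
};

struct copy_plan plan_copy(uintptr_t src, uintptr_t dst, size_t n)
{
    struct copy_plan p = { 0, 0, 0 };
    if ((src & 15u) != (dst & 15u)) {
        p.tail = n;                      /* offsets differ: byte copy only */
    } else {
        p.head = (size_t)((16u - (src & 15u)) & 15u);
        if (p.head > n)
            p.head = n;                  /* tiny copies end inside the head */
        p.body = (n - p.head) & ~(size_t)63;
        p.tail = n - p.head - p.body;
    }
    assert(p.head + p.body + p.tail == n);
    return p;
}
```

For instance, 100 bytes starting at offsets 3 and 3 would split into a 13-byte head, a 64-byte body and a 23-byte tail.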