copy memory to memory speed?

Started by elcricri, January 16, 2006, 10:51:55 AM


elcricri

Hi,
I copy 256 MB (268,435,456 bytes) to another 256 MB buffer.
It takes 0.116 seconds.
Can it be done faster?
That is 2.15 GB/s read and 2.15 GB/s write on an AMD, yet AMD says it can do 8 GB/s?
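The 2.15 figure above can be checked with a little arithmetic. This is only a minimal C sketch: `throughput_gib_per_s` is a hypothetical helper name, and it assumes the poster counted in binary gigabytes (2^30 bytes), which is the reading that reproduces the number.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes moved in one direction per second,
   expressed in binary gigabytes (1 GiB = 2^30 bytes). */
double throughput_gib_per_s(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);  /* guard against division by zero */
    return (double)bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

Calling `throughput_gib_per_s(268435456, 0.116)` gives a little over 2.15. The copy reads and writes that many bytes, so the combined memory traffic is about twice the one-way figure.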

MichaelW

The MASM32 memcopy procedure should be able to do reasonably well copying 256MB, but there are better methods available.

http://www.masmforum.com/simple/index.php?topic=1637.0

eschew obfuscation

elcricri

OK, so 2 GB/s is good.
I use AMD's code.
I made a DLL:

.686                                     ; for 686 (Pentium Pro) processor or better
.model flat, stdcall                     ; 32-bit memory and standard call
option casemap:none
.xmm

.code                                    ; the beginning of the code section
LibMain proc h:DWORD, r:DWORD, u:DWORD   ; the dll entry point
        mov eax, 1                       ; if eax is 0, the dll won't start
        ret                              ; return
LibMain Endp                             ; end of the dll entry
;******************************************************************
mul5  proc nb:dword,nb2:dword,p:dword

    mov edx,nb                  ; pointer
    mov esi,nb2                 ; pointer
    mov ecx,p                   ; number of octet
    add edx, ecx                ; Add to source address the number of bytes to copy.
    add esi, ecx                ; Add to destination address the number of bytes to copy.
    shr ecx, 3                  ; Convert number of bytes to number of QWORDs.
    neg ecx                     ; Make number of QWORDs negative.

chunkloop:
    mov ebx, ecx                ; Save number of QWORDs.
    mov eax,128                 ; 128  Initialize number of cache-line pairs to prefetch.

prefetchloop:
    prefetchnta [edx+ecx*8]     ; 64 Load a line (64 bytes) into the L1 data cache.
    add ecx, 8                  ; 8*8 64  Select next cache-line pair.
    dec eax                     ; 64*128=8192         Decrement number of cache-line pairs.
    jnz prefetchloop            ; If cache-line pairs remain, then jump.

    mov ecx, ebx                ; Restore number of QWORDs.
    mov eax, 256                ; 256 Initialize number of sub-blocks to copy.

copyloop:
    movapd xmm0,[edx+ecx*8]
    movapd xmm1,[edx+ecx*8+16]
    movntdq [esi+ecx*8], xmm0
    movntdq [esi+ecx*8+16], xmm1
    add ecx, 4
    dec eax
    jnz copyloop                ; If another sub-block remains, then jump.
    or ecx, ecx                 ; Test whether chunk count is 0.
    jnz chunkloop               ; If another chunk remains, then jump.
    sfence                      ; Flush the write-combining buffer.

    ret
mul5 endp

hutch--

The combination you are using looks correct: temporal reads and non-temporal writes. You are probably already hitting the memory speed limit, which is the major restricting factor for block memory copies on current computers. You may be able to tweak a little more here and there, but you will not get a fourfold speed increase, as you are already using SSE instructions.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Ian_B

You don't need the line:
SHR ECX, 3
If you keep ECX as the number of bytes, you can remove the *8 multiplier everywhere it appears, and this will speed up the address calculation (plus produce smaller code). Just add 64 instead of 8 to ECX everywhere you do that to compensate, and 32 near the end of the code where you have:
ADD ECX, 4
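
The indexing idea discussed here (point both registers at the end of the buffers and let a negative byte offset count up to zero) can be sketched in C. This is a byte-at-a-time illustration only, not the SSE routine; `copy_negative_index` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: both pointers are moved to the END of their buffers and a
   negative byte offset counts up to zero, so a single register serves
   as index and loop counter, with no *8 scaling needed. */
void copy_negative_index(unsigned char *dst, const unsigned char *src,
                         ptrdiff_t n)
{
    assert(n >= 0);
    const unsigned char *src_end = src + n;
    unsigned char *dst_end = dst + n;
    for (ptrdiff_t i = -n; i != 0; i++) {
        dst_end[i] = src_end[i];   /* [base_end + negative offset] */
    }
}
```

Because the offset is already in bytes, the loop ends exactly when it reaches zero, which is the same test the assembly performs with `OR ECX, ECX` / `JNZ`.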

When you prefetch the data, you have only considered the size of the cache, you haven't limited that to the amount of data you are transferring, which might be smaller! You should probably add a test/JMP in the prefetch loop:
ADD ECX, 8                   ; 8*8 64  Select next cache-line pair;
JNS @F             <-- ADD THIS to clear loop early if size of data is less than cache size
DEC EAX                      ; 64*128=8192         Decrement number of cache-line pairs.
JNZ prefetchloop           ; If cache-line pairs remain, then jump.
@@:                <-- AND THIS
MOV ECX, EBX


You need an identical early-out test/JMP in the copyloop too.
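
The early-out amounts to clamping each chunk to the bytes that actually remain. A minimal C sketch of that control flow follows, with `memcpy` standing in for the prefetch and SSE copy loops; `chunked_copy` and `CHUNK_BYTES` are hypothetical names, and 8192 is the chunk size used in the code above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_BYTES 8192u   /* 128 cache lines of 64 bytes */

/* Copies n bytes chunk by chunk and returns the number of chunks used.
   The last chunk is shortened instead of running past the data. */
size_t chunked_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t chunks = 0;

    assert(dst != NULL && src != NULL);
    while (n > 0) {
        /* Early-out: never process more than what is left. */
        size_t this_chunk = n < CHUNK_BYTES ? n : CHUNK_BYTES;
        /* A prefetch phase would touch only this_chunk bytes here. */
        memcpy(d, s, this_chunk);   /* stand-in for the SSE copy loop */
        d += this_chunk;
        s += this_chunk;
        n -= this_chunk;
        chunks++;
    }
    return chunks;
}
```

With 20000 bytes this runs two full chunks and one short chunk of 3616 bytes, rather than three full ones.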

You might also consider that you should probably check whether the pointers you are giving the routine are 16-byte aligned, which I believe the SSE instructions require, and obviously also that the number of bytes is a multiple of 32, as you are currently transferring 32 bytes in the copyloop regardless of the actual amount left, unless you add code to handle copying any remainder. (If you don't do the SHR ECX, 3 as I suggested, you still have the remainder available). You'd really need to return some error value in EAX if the parameters were wrong. A good subroutine should sanity-check its input parameters, not rely on the calling routine to have done so.
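
The sanity checks described above could look roughly like this in C. It is a sketch under the thread's assumptions (16-byte aligned pointers, length a multiple of 32); the status names and `check_copy_args` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes a copy routine could return to its caller. */
enum copy_status { COPY_OK = 0, COPY_BAD_ALIGNMENT = 1, COPY_BAD_SIZE = 2 };

/* Validates the parameters before any aligned SSE load or store is issued. */
enum copy_status check_copy_args(const void *dst, const void *src, size_t n)
{
    /* Both pointers must have their low four bits clear. */
    if ((((uintptr_t)dst | (uintptr_t)src) & 0x0F) != 0)
        return COPY_BAD_ALIGNMENT;
    /* The copy loop moves 32 bytes per iteration. */
    if ((n & 31) != 0)
        return COPY_BAD_SIZE;
    return COPY_OK;
}
```

ORing the two pointers first means a single mask test covers both of them.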

And don't forget to add a PUSH ESI at the beginning and POP ESI at the end to avoid the whole thing crashing, as you must restore that register for the calling routine.  :wink

IanB

elcricri

Yes, if the size is not a multiple of 8192 there is a big problem.

Is ESI important? I crash if I use EDI or EBP, but not with ESI.
Yes, I must align on 16, and Visual Basic does not always align on 16  :( I must add code to copy by 8 first and then by 16.

Ian_B

Quote from: elcricri on January 16, 2006, 11:43:02 PM
Is ESI important? I crash if I use EDI or EBP, but not with ESI.
You've been lucky! That's only because whatever you are calling it from hasn't been using ESI, fortunately. If you reuse the routine with another piece of code you can't rely on that. Always save/restore EBP/EBX/EDI/ESI if you use them in a proc.

Quote
Yes, I must align on 16, and Visual Basic does not always align on 16  :(
If you need an aligned memory block with VB, and it obviously needs to be a significant size for your purposes, you could try adding another routine into your DLL that calls VirtualAlloc, which reserves and allocates page-aligned memory blocks. There's a fudge using sneaky non-supported VB statements which allows you to use the memory pointer you can pass back as the return value so that you can access the memory block. My memory on it is a little rusty, but do some Googling on AddressOf, VarPtr, StrPtr, and ObjPtr and you'll certainly find some code that'll show you how to use the allocated and now aligned memory block as a normal byte array in your VB code.

Try http://www.thevbzone.com/secrets.htm for starters.  :wink

Don't forget to add another routine in the DLL that will DE-allocate the block with VirtualFree though, when you pass the initial pointer back.

IanB

Ian_B

Hmm..  :'(

Re-reading this, I think it's going to be difficult to get a chunk of memory allocated from within the DLL recognised by the calling VB program, as you can't write direct to an arbitrary memory address in VB, only to a variable reference. And I don't think you can change or create a reference to use that allocation.

So I think you have two options. The first is to write code that keeps all the data you are moving around within the DLL, do all your processing in ASM by DLL calls and use the VB side as merely a GUI wrapper to that code. But that will then make it difficult to get any information out of the memory block if your VB prog needs to process it other than merely moving it from place to place.

The second option is more fudge. Use the VarPtr function to get the memory address of a byte array that you create within VB. Make it at least 16 bytes more than you need in size. Because the result of that function is a real pointer to the first element of the byte array, that you can pass to an API call or to your memory copy proc, all you need to know is the alignment of that pointer, which you can test the usual way (AND with 0fH = 15 to see if it's 16-byte aligned). If it's not aligned, just save a padding value that you can use as an offset on the VB side, and send your DLL the first aligned location in the array block as a pointer. That way, your DLL can use the memory allocated by VB and your VB code can access the results, always indexing into the byte array with the extra padding offset.
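
The padding value mentioned above is simple to compute from the raw address. A minimal C sketch, with `padding_to_16` as a hypothetical name; the caller would add the result to the array base before handing the pointer to the DLL, and index with the same offset afterwards.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes to skip from `address` to reach the next 16-byte
   boundary; 0 if the address is already aligned. The result is always
   in the range 0..15, which is why 16 spare bytes are enough. */
size_t padding_to_16(uintptr_t address)
{
    return (size_t)((16 - (address & 0x0F)) & 0x0F);
}
```

The final mask turns the value 16 (which the subtraction yields for an already aligned address) back into 0.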

IanB

zooba

Quote from: Ian_B on January 17, 2006, 05:04:52 PM
AND with 0fH = 15 to see if it's 16-byte aligned

Shouldn't that be 'AND with 0Fh = 0 if it's 16-byte aligned'?

Alternatively, OR the value with 0Fh then add 1 and it is guaranteed to be 16-byte aligned (but test first in case it already is) :U
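
Both roundings can be compared side by side in C. This is only a sketch; the function names are hypothetical, and the second one is the common add-and-mask idiom rather than anything posted in this thread.

```c
#include <stdint.h>

/* OR with 0Fh, then add 1: always lands on a 16-byte boundary, but an
   address that is already aligned is pushed to the NEXT boundary,
   hence the advice to test first. */
uintptr_t align_up_16_or(uintptr_t p)
{
    return (p | 0x0F) + 1;
}

/* Add 15, then clear the low four bits: leaves an aligned address
   unchanged, so no separate test is needed. */
uintptr_t align_up_16(uintptr_t p)
{
    return (p + 15) & ~(uintptr_t)15;
}
```

For 0x1001 both return 0x1010; for 0x1000 the first returns 0x1010 and the second returns 0x1000.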

hutch--

Here is a quick play, mainly to format it so I could read it and to add the stack code for EBX and ESI. If you can make it fit with your block copy size, I would unroll the SSE instructions so you work on a bigger block at a time; there is no reason not to use all 8 SSE registers. SSE instructions can be noticeably laggy when mixed with normal integer instructions, and one way you can tell is to pad different parts of the code with NOPs to see if the operation slows down. At times, unrolling the SSE code can help here.

Separating the temporal read from the non-temporal write is a good idea, and it is faster than feeding the same data in and out of the cache.

What you will tend to find is that memory speed is the limiting factor in how fast you can make this code work.


; ******************************************************************

align 16

mul5  proc nb:dword,nb2:dword,p:dword

    push ebx                    ; << added
    push esi

    mov edx,nb                  ; pointer
    mov esi,nb2                 ; pointer
    mov ecx,p                   ; number of octet
    add edx, ecx                ; Add to source address the number of bytes to copy.;
    add esi, ecx                ; Add to destination address the number of bytes to copy.
    shr ecx, 3                  ; Convert number of bytes to number of QWORDs.
    neg ecx                     ; Make number of QWORDs negative.

  chunkloop:
    mov ebx, ecx                ; Save number of QWORDs.
    mov eax,128                 ; 128  Initialize number of cache-line pairs to prefetch.

  prefetchloop:
    prefetchnta [edx+ecx*8]     ; 64 Load a line (64 bytes) into the L1 data cache;
    add ecx, 8                  ; 8*8 64  Select next cache-line pair.
    sub eax, 1                  ; 64*128=8192         Decrement number of cache-line pairs.
    jnz prefetchloop            ; If cache-line pairs remain, then jump.

    mov ecx, ebx                ; Restore number of QWORDs.
    mov eax, 256                ; 256 Initialize number of sub-blocks to copy.

  copyloop:
    movapd xmm0,[edx+ecx*8]
    movapd xmm1,[edx+ecx*8+16]
    movntdq [esi+ecx*8], xmm0
    movntdq [esi+ecx*8+16], xmm1
    add ecx, 4
    sub eax, 1
    jnz copyloop                ; If another sub-block remains, then jump.
    test ecx, ecx               ; Test whether chunk count is 0.
    jnz chunkloop               ; If another chunk remains, then jump.
    sfence                      ; Flush the write-combining buffer.

    pop esi                     ; << added
    pop ebx

    ret

mul5 endp

; ******************************************************************

Try an unroll something like this.

    movapd xmm0,[edx+ecx*8]
    movapd xmm1,[edx+ecx*8+16]
    movapd xmm2,[edx+ecx*8+32]
    movapd xmm3,[edx+ecx*8+48]
    movapd xmm4,[edx+ecx*8+64]
    movapd xmm5,[edx+ecx*8+80]
    movapd xmm6,[edx+ecx*8+96]
    movapd xmm7,[edx+ecx*8+112]

    movntdq [esi+ecx*8], xmm0
    movntdq [esi+ecx*8+16], xmm1
    movntdq [esi+ecx*8+32], xmm2
    movntdq [esi+ecx*8+48], xmm3
    movntdq [esi+ecx*8+64], xmm4
    movntdq [esi+ecx*8+80], xmm5
    movntdq [esi+ecx*8+96], xmm6
    movntdq [esi+ecx*8+112], xmm7

Ian_B

Quote from: zooba on January 18, 2006, 02:57:40 AM
Quote from: Ian_B on January 17, 2006, 05:04:52 PM
AND with 0fH = 15 to see if it's 16-byte aligned

Shouldn't that be 'AND with 0Fh = 0 if it's 16-byte aligned'?
No, I was adding, perhaps redundantly, the info that 0fH = 15. I guess I could have added the relevant JNZ instruction too to really be obtuse, but I thought that would be perfectly obvious.

IanB

zooba

Ah, okay. Since the topic was VB I interpreted the '=' symbol as meaning 'is equal to' for the whole expression (ie. '==' in almost everything else) :U

elcricri

If ESI and EDI are aligned on 16 or on 8, I use SSE2; otherwise only a rep movsb. Now I can copy anything from 1 byte to 4 GB.

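mul5  proc nb:dword,nb2:dword,p:dword
    clc
    push esi
    push edi
    push ebx                    ; ebx is used below, so it must be preserved too
    rdtsc                       ; start time stamp in edx:eax
    push eax
    push edx
    mov esi,nb
    and esi,0fh                 ; source misalignment within 16 bytes
    mov edi,nb2
    and edi,0fh                 ; destination misalignment within 16 bytes
    cmp esi,edi
    jz ok                       ; same misalignment: the SSE2 path is usable
    mov esi,nb                  ; misalignments differ: plain byte copy
    mov edi,nb2
    mov ecx,p
    rep movsb
    pop edx
    pop eax
    mov ecx,eax                 ; ecx = start count, low dword
    mov ebx,edx                 ; ebx = start count, high dword
    rdtsc
    sub eax,ecx                 ; subtract low dwords first
    sbb edx,ebx                 ; then high dwords with borrow: edx:eax = elapsed cycles
    pop ebx
    pop edi
    pop esi
    ret
ok:
    mov ecx,p
    mov eax,0
    cmp esi,0
    jz suit                     ; already 16-byte aligned
    mov eax,16d
    sub eax,esi                 ; bytes needed to reach a 16-byte boundary
    cmp eax,ecx
    cmova eax,ecx               ; never copy more than p bytes
    mov esi,nb
    mov edi,nb2
    mov ecx,eax
    rep movsb
suit:
    mov ecx,p
    sub ecx,eax                 ; bytes left after the alignment prologue
    cmp ecx,0
    jbe end1
    mov esi,nb
    mov edi,nb2
    add esi,eax
    add edi,eax
    add edi,ecx                 ; point both registers at the end of the buffers
    add esi,ecx
    neg ecx                     ; ecx = negative byte count, counts up to 0
chunkloop:
    mov ebx,ecx                 ; save the chunk start offset
    mov eax,ecx
    neg eax
    shr eax,6                   ; remaining whole 64-byte lines
    mov edx,128
    cmp eax,128
    cmovge eax,edx              ; at most 128 lines (8192 bytes) per chunk
    mov edx,eax
    or  eax,eax
    jnz prefetchloop
    add edi,ecx                 ; fewer than 64 bytes left: finish with a byte copy
    add esi,ecx
    neg ecx
    rep movsb
    jmp end1
prefetchloop:
    prefetchnta [esi+ecx]
    add ecx, 64
    dec eax
    jnz prefetchloop
    mov ecx,ebx                 ; back to the chunk start
    mov eax,edx
    shl eax,2                   ; four 16-byte moves per line
copyloop:
    movapd xmm0,[esi+ecx]
    movntdq [edi+ecx], xmm0
    add ecx, 16
    dec eax
    jnz copyloop
    or ecx, ecx
    jnz chunkloop
end1:
    sfence

    pop edx
    pop eax
    mov ecx,eax                 ; ecx = start count, low dword
    mov ebx,edx                 ; ebx = start count, high dword
    rdtsc
    sub eax,ecx                 ; subtract low dwords first
    sbb edx,ebx                 ; then high dwords with borrow: edx:eax = elapsed cycles
    pop ebx
    pop edi
    pop esi
    ret
mul5 endp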
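
The way the routine above divides the work can be summarised in C. This is a sketch of the bookkeeping only, under the assumption that the split is: byte copy up to a 16-byte boundary, whole 64-byte lines through the SSE2 loop, and a byte copy for whatever is left. `copy_plan` and `plan_copy` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* How many bytes go through each of the three phases. */
struct copy_plan { size_t head; size_t bulk; size_t tail; };

struct copy_plan plan_copy(uintptr_t src, uintptr_t dst, size_t n)
{
    struct copy_plan plan = { 0, 0, 0 };

    /* Different low four bits: both pointers can never be aligned at
       the same time, so everything is copied byte by byte. */
    if (((src ^ dst) & 0x0F) != 0) {
        plan.head = n;
        return plan;
    }

    /* Bytes needed to reach a 16-byte boundary, clamped to n. */
    size_t head = (16 - (src & 0x0F)) & 0x0F;
    if (head > n)
        head = n;

    plan.head = head;
    plan.bulk = (n - head) & ~(size_t)63;   /* whole 64-byte lines */
    plan.tail = (n - head) - plan.bulk;     /* fewer than 64 bytes */
    return plan;
}
```

For two pointers that both end in 5 and a length of 1000, the plan is 11 head bytes, 960 bulk bytes and 29 tail bytes.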