memory copy ...

Started by James Ladd, May 15, 2005, 12:36:49 AM


meeku

Hey Hutch  :wink
Good to be on board!

Something I'd actually be interested to try is using the same asm routines on a Linux box. I did quite a bit of work putting together my own OS and kernel, based on an old protected-mode extender I wrote for DOS, and I found that using software in the kernel to manage virtual address spaces, memory management and task switching was considerably faster and more reliable than using the CPU's built-in task-switching/V86-mode features. Basically, you leave the CPU running in a flat memory model with no paging, all at ring 0, and let the kernel manage the rest. Anyhow, I'm digressing; the point is that memory access was at least 20% faster, so perhaps the results would vary on Linux.

Anyone doing asm coding under linux want to give us some performance results for the same tests?  :bg

meeku

Hey,

Managed to marginally improve the performance by adding one more read before the prefetch loop to prime the TLB (as per Intel's optimisation recommendations).
Now sitting at 1.1 GB/s, tested with 100 MB and 200 MB buffers and various numbers of iterations on the timing loop.

This was the performance on my 1.7 GHz Pentium M IBM laptop; in theory this result should be much better on a decent-spec desktop.

Here are the two fastest versions so far:



    mov esi, data1ptr
    mov edi, data2ptr

    mov ecx, DATASIZE

    lea esi, [esi+ecx*8]
    lea edi, [edi+ecx*8]

    neg ecx

align 16
mainloop:

    mov eax, (CACHEBLOCK / 16)
    add ecx, CACHEBLOCK

    mov edx, [esi+ecx*8-128]            ; Prime TLB
prefetchloop:                           ; Software prefetch (touch) loop.
    mov ebx, [esi+ecx*8-64]
    mov ebx, [esi+ecx*8-128]
    sub ecx, 16
    dec eax
    jnz short prefetchloop


    mov eax, (CACHEBLOCK / 8)

writeloop:
    movdqa xmm0, [esi+ecx*8]
    movdqa xmm1, [esi+ecx*8+16]
    movdqa xmm2, [esi+ecx*8+32]
    movdqa xmm3, [esi+ecx*8+48]

    movntdq [edi+ecx*8], xmm0
    movntdq [edi+ecx*8+16], xmm1
    movntdq [edi+ecx*8+32], xmm2
    movntdq [edi+ecx*8+48], xmm3


    add ecx, 8
    dec eax
    jnz  writeloop

    or ecx, ecx
    jnz  mainloop

AND...

    mov esi, data1ptr
    mov edi, data2ptr

    mov ecx, DATASIZE

    lea esi, [esi+ecx*8]
    lea edi, [edi+ecx*8]

    neg ecx

align 16
mainloop:

    mov eax, (CACHEBLOCK / 16)
    add ecx, CACHEBLOCK

    mov edx, [esi+ecx*8-128]            ; Prime TLB
prefetchloop:                           ; Software Prefetch (touch) loop.
    mov ebx, [esi+ecx*8-64]
    mov ebx, [esi+ecx*8-128]
    sub ecx, 16
    dec eax
    jnz short prefetchloop


    mov eax, (CACHEBLOCK / 8)

writeloop:
    movq mm0, qword ptr [esi+ecx*8]
    movq mm1, qword ptr [esi+ecx*8+8]
    movq mm2, qword ptr [esi+ecx*8+16]
    movq mm3, qword ptr [esi+ecx*8+24]
    movq mm4, qword ptr [esi+ecx*8+32]
    movq mm5, qword ptr [esi+ecx*8+40]
    movq mm6, qword ptr [esi+ecx*8+48]
    movq mm7, qword ptr [esi+ecx*8+56]

    movntq qword ptr [edi+ecx*8], mm0
    movntq qword ptr [edi+ecx*8+8], mm1
    movntq qword ptr [edi+ecx*8+16], mm2
    movntq qword ptr [edi+ecx*8+24], mm3
    movntq qword ptr [edi+ecx*8+32], mm4
    movntq qword ptr [edi+ecx*8+40], mm5
    movntq qword ptr [edi+ecx*8+48], mm6
    movntq qword ptr [edi+ecx*8+56], mm7

    add ecx, 8
    dec eax
    jnz short writeloop

    or ecx, ecx
    jnz short mainloop



It's basically almost identical to the AMD reference implementation, with a few minor changes.
I think this is about as good as it's going to get.

Human

webring: fild is a slow instruction; even when I tried it on a 486 it didn't give better performance, fld was faster.
But one problem from the DOS days remained, and I didn't know why the copied data wasn't the same. The reason was the FPU and its precision bits: if some game left the FPU doing all its calculations at 24-bit precision, we had problems.

For my SDR133 RAM the best solution is movq and movntq: compared to rep movsd it increases the speed from 272 to 512 MB/s when software prefetch is used, copying 64 MB on my Athlon XP 1700+ (1467 MHz) with its 64 KB L1 cache; a P4 can only do 8 KB :P
I was already using MMX to copy data, and that alone gives 220 million ticks compared to the 350 million baseline; with movntq and prefetch it dropped to 194 million ticks. Using SSE movaps doesn't boost performance; I'd have to buy a new AMD64 and DDR2 memory.

Seb

Sorry for bumping this thread, but I implemented two of my own approaches to memory copying (using MMX) and got interesting results on my AMD Athlon 64 X2 4400+, so I thought I'd share them with the rest of the community:

11111111

MemCopy - rep movsd                         : 176845573 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 164048385 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 172751465 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 151329157 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 154956402 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 148054639 cycles
mmxcopy  - movq x 8 : 147378815 cycles
mmxcopy2  - movq x 8 : 26129470 cycles

Press enter to exit...


mmxcopy and mmxcopy2 are my own functions, and the second one is, in all cases (and I've tried a LOT), much, much faster.

Edit: I sent it to a friend and he got results in a similar "order" on his laptop.

[attachment deleted by admin]

ecube

amd 64 3800+

11111111

MemCopy - rep movsd                         : 282149047 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 279833356 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 280034282 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 277556675 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 277835608 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 276572415 cycles
mmxcopy  - movq x 8 : 276869072 cycles
mmxcopy2  - movq x 8 : 41037911 cycles

Press enter to exit...

nice job!  :U

Seb

Thanks. :bg I'd appreciate if anyone else, in particular if you've got an Intel CPU, could test it out.

ic2

Intel P3 846 MHz, 512 MB RAM


11111111

MemCopy - rep movsd                         : 522935521 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 523243442 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 524413281 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 523513785 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 511182237 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 521955504 cycles
mmxcopy  - movq x 8 : 524072757 cycles
mmxcopy2  - movq x 8 : 83558428 cycles

Press enter to exit...

TNick

On an Intel Celeron 2.53 GHz:

Quote
11111111

MemCopy - rep movsd                         : 396270757 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 373467620 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 375417875 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 402793093 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 398304671 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 404616784 cycles
mmxcopy  - movq x 8 : 406699997 cycles
mmxcopy2  - movq x 8 : 47871561 cycles


Nick

Seb

Thanks, guys, for testing. :U

u

Sempron 3000+ (64-bit)  [wow, it beats an AMD 64 3800+]

11111111

MemCopy - rep movsd                         : 176175708 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 177890633 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 173873833 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 167199286 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 169962053 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 169044058 cycles
mmxcopy  - movq x 8 : 167693297 cycles
mmxcopy2  - movq x 8 : 36151686 cycles

But strangely, on subsequent runs, the results are always around

MemCopy - rep movsd                         : 218944142 cycles
dcopy   - mov reg,mem/mov mem,reg x 8       : 219306860 cycles
qcopy   - movq mmx,mem/movq mem,mmx x 1     : 197413306 cycles
_qcopy  - movq mmx,mem/movq mem,mmx x 8     : 197132039 cycles
xcopy   - movaps xmm,mem/movaps mem/xmm x 1 : 196587334 cycles
_xcopy  - movaps xmm,mem/movaps mem/xmm x 8 : 195489834 cycles
mmxcopy  - movq x 8 : 196248279 cycles
mmxcopy2  - movq x 8 : 36618720 cycles
  :eek

Seb

Interesting: mmxcopy2 seems to be the fastest method on both AMD and Intel CPUs. Does anyone have a theory as to why it's so much faster than the rest?

u

It is because of the expression [esi+edx*8+0], which is wrong. It makes the code stomp on the same data in 7/8 of the cases.

Seb

Quote from: Ultrano on March 03, 2007, 04:03:21 PM
It is because of the expression [esi+edx*8+0], which is wrong. It makes the code stomp at the same data. in 7/8 of the cases.

I suspected a "hidden" bug was causing the "magic" result; thanks for letting me know. So what should it be?

u

Change "shr ecx,6" into "shr ecx,3", and change "add eax,1" into "add eax,8".
But even better:

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 16

mmxcopy2 proc src:DWORD,dst:DWORD,lenx:DWORD
    mov ecx,[esp+3*4]           ; len
    add esp,-3*4                ; frame: len, saved esi, saved edi
    mov [esp],ecx               ; (store inside the frame, not below esp,
    mov [esp+4],esi             ;  where an interrupt/APC could clobber it)
    mov [esp+8],edi
    mov esi,[esp+1*4+(3*4)]     ; src
    mov edi,[esp+2*4+(3*4)]     ; dst
    cmp ecx,64
    jb @tail
    and ecx,-64                 ; whole 64-byte blocks only, counted in qwords,
    shr ecx,3                   ; so the sub below reaches exactly zero

align 16
@@:
    sub ecx,8                   ; back up one 64-byte block
    movq mm0,[esi+ecx*8+0]
    movq mm1,[esi+ecx*8+8]
    movq mm2,[esi+ecx*8+16]
    movq mm3,[esi+ecx*8+24]
    movq mm4,[esi+ecx*8+32]
    movq mm5,[esi+ecx*8+40]
    movq mm6,[esi+ecx*8+48]
    movq mm7,[esi+ecx*8+56]

    movq [edi+ecx*8+0],mm0
    movq [edi+ecx*8+8],mm1
    movq [edi+ecx*8+16],mm2
    movq [edi+ecx*8+24],mm3
    movq [edi+ecx*8+32],mm4
    movq [edi+ecx*8+40],mm5
    movq [edi+ecx*8+48],mm6
    movq [edi+ecx*8+56],mm7

    jnz @B                      ; movq leaves flags alone; ZF is still from the sub

    mov ecx,[esp]               ; advance esi/edi past the block-copied part
    mov edx,ecx
    and edx,-64
    add esi,edx
    add edi,edx
    and ecx,63                  ; tail byte count
    jz @exit
@tail:
    rep movsb
@exit:
    emms                        ; leave the FPU/MMX state clean for the caller
    mov edi,[esp+8]
    mov esi,[esp+4]
    add esp,3*4
    ret 3*4
mmxcopy2 endp

OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF

And still, this is not faster than mmxcopy()

daydreamer

Quote from: Seb on March 03, 2007, 12:18:58 AM
Thanks. :bg I'd appreciate if anyone else, in particular if you've got an Intel CPU, could test it out.
I think the mobo/memory-stick configuration is more relevant to knowing what's fastest.
What I mean is, the most interesting comparison would be dual-channel DDR2 against a single DDR stick.
Mobos that are hyped as giving dual-channel performance even though the CPU doesn't support dual channel would also be interesting to check, to see whether it's true or not.