What is the fastest way to fill a chunk of memory - exactly 1.6 GByte - completely with FF in hexadecimal?
My best idea was moving it all into the mem via mov, but knowing little about RAM architecture I thought there must be a faster way.
fill it with 0's, then invert :lol
or, you could try this code
mov edi,offset MemBuff        ; destination pointer
mov ecx,(sizeof MemBuff)/4    ; number of DWORDs to store
mov eax,0FFFFFFFFh            ; fill pattern, all bytes FF
rep stosd                     ; store EAX at [EDI], ECX times
it will be reasonably fast, as long as the base address of MemBuff is 4-aligned
notice that the buffer size should be divisible by 4, as well
if it isn't, add a few pad bytes to the end so that it is - or handle the leftover bytes with REP STOSB, as sketched below
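If padding isn't an option, here is a minimal sketch (my addition, untested) that fills the odd tail bytes with REP STOSB after the DWORD fill:

mov edi,offset MemBuff
mov ecx,sizeof MemBuff
mov eax,0FFFFFFFFh
push ecx            ; keep the full byte count
shr ecx,2           ; count of whole DWORDs
rep stosd           ; fill the 4-byte part
pop ecx
and ecx,3           ; 0..3 leftover bytes
rep stosb           ; AL is already 0FFh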
On memory of that size if it has to be done regularly you would be better to use SSE 128 bit fills. Think of instructions like MOVNTDQA if the memory is aligned correctly.
.686
.MODEL FLAT,C
.MMX
.XMM
.CODE
; assumes DataSize is a multiple of 64 (the loop stores four 16-byte blocks per pass)
FastFill PROC DataSize:DWORD, Buffer:PTR BYTE
push esi
mov esi, Buffer
mov ecx, DataSize
shr ecx, 6                        ; number of 64-byte blocks
movups xmm0,xmmword ptr AllFF     ; load the all-FF pattern (AllFF is declared as DWORDs, so the explicit xmmword ptr keeps MASM's operand-size check happy)
@@:
movups [esi + 0], xmm0
movups [esi + 16], xmm0
movups [esi + 32], xmm0
movups [esi + 48], xmm0
add esi, 64
add ecx, -1
jnz @B
pop esi
ret
FastFill ENDP
.DATA
AllFF dd -1,-1,-1,-1
END
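A hypothetical call, with MemBuff as my placeholder buffer (its size kept a multiple of 64):

.DATA?
MemBuff db 4096 dup(?)
.CODE
invoke FastFill, SIZEOF MemBuff, ADDR MemBuff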
the AllFF define should be 16 aligned ?
Quote from: dedndave, November 21, 2010, at 03:20:51 AM
the AllFF define should be 16 aligned ?
No, not in that case, because Clive is using MOVUPS (move unaligned packed single).
Gunther
Quote from: hutch-- on November 21, 2010, 02:49:12 AM
On memory of that size if it has to be done regularly you would be better to use SSE 128 bit fills. Think of instructions like MOVNTDQA if the memory is aligned correctly.
Hutch has the fastest solution. Align the memory first (but most probably it is already aligned), then use MOVNTDQA. You can unroll it a little bit to save some cycles.
The point about MOVNTDQA is that it does not write to the data cache.
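For the fill itself you would use the store form; a minimal sketch (my own, not Gunther's code), assuming EDI is 16-byte aligned and ECX holds the count of 16-byte blocks:

pcmpeqd xmm0, xmm0      ; sets all 128 bits - sixteen 0FFh bytes, no memory load needed
@@:
movntdq [edi], xmm0     ; non-temporal store, bypasses the data cache
add edi, 16
dec ecx
jnz @B
sfence                  ; drain the write-combining buffers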
Isn't MOVNTDQA sse4?
I thought for more than 256 meg 'rep stosd' was pretty speedy.
Quote from: sinsi on November 21, 2010, 09:31:37 AM
Isn't MOVNTDQA sse4?
I thought for more than 256 meg 'rep stosd' was pretty speedy.
Yes, correct - it's SSE4. But there is an 'ordinary' variant, movntdq. Note that in standard timing benchmarks it looks pretty bad because it writes without caching; you would have to change the testbed to gigabyte-sized buffers to see the difference:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1405 cycles for 100*movdqa
???? cycles for 100*movntdq
EDIT: There is something weird here. See attachment, third loop.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1554 cycles for 100*movdqa
1924 cycles for 100*movntdq
1571 cycles for 100*MOVNTPD
1549 cycles for 100*movdqa
24888 cycles for 100*movntdq ; without 'speedup'
More detail on performance here (http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2006-12/msg00071.html).
Quote from: jj2007 on November 21, 2010, 10:13:56 AM
Yes, correct - it's SSE4. But there is an 'ordinary' variant, movntdq. Note that in standard timing benchmarks it looks pretty bad because it writes without caching; you would have to change the testbed for Gigabyte size to see the difference:
Go to "http://www.masm32.com/board/index.php?topic=14685.msg119904#msg119904" and follow the whole thread.
For a buffer bigger than the L2 cache, MOVNTDQ would often be the best choice.
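To make that concrete, a rough dispatch sketch (my addition, not Alex's code; L2_BYTES, cached_fill and streaming_fill are assumed names):

mov ecx, DataSize
cmp ecx, L2_BYTES       ; assumed constant, e.g. taken from CPUID cache info
jbe cached_fill         ; fits in L2: movdqa or rep stosd is fine
jmp streaming_fill      ; bigger than L2: movntdq avoids cache pollution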
Alex
Thanks for all the replies, great forum.
I'm going to figure out which solution is fastest in my case while keeping enough compatibility (not every PC has SSE4).
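A minimal sketch of such a feature check (my addition; MOVNTDQ only needs SSE2, reported in bit 26 of EDX by CPUID leaf 1, and no_sse2 is a hypothetical fallback label):

mov eax, 1              ; CPUID leaf 1: feature flags
cpuid
test edx, 1 SHL 26      ; EDX bit 26 = SSE2 (enough for movntdq)
jz no_sse2              ; no SSE2: fall back to rep stosd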
@hutch-- :
I was not quite sure where to post my question, so thanks for moving it to the right subforum.
i still say my original idea sounds best :P
Quote
fill it with 0's, then invert
Quote from: dedndave on November 21, 2010, 09:06:13 PM
i still say my original idea sounds best :P
Quote
fill it with 0's, then invert...
... and all with non-temporal writes for big buffers :P