While chasing the fastest memcopy algo ever, I stumbled over 2 memfill oddities:
1. It fails for sizes under 4 bytes (not a big deal, but the docu might mention that)
2. It fails for certain sizes, such as 546.
Example is attached.
[attachment deleted by admin]
JJ,
From the reference,
Quote
Description
memfill is a fast memory filling procedure that works in DWORD unit sizes. It is usually good practice to allocate memory in intervals of four for alignment purposes and this procedure is designed to work with memory allocated this way.
From memory we did a LAB topic on memfill, and over about 500 bytes REP STOSD starts to become faster, as it appears to handle non-temporal writes automatically where an unrolled DWORD version like memfill does not. Michael Webster did a hybrid that handled both techniques.
Quote from: hutch-- on September 13, 2008, 05:13:06 PM
memfill is a fast memory filling procedure that works in DWORD unit sizes. ... this procedure is designed to work with memory allocated this way.
You are right, Hutch. I was tempted to adopt the "solution" below, but it would suppress the "healthy" crashes that flag the misuse.
Maybe you should add a bold warning saying it will crash for size = n*32 + 1, 2 or 3?
rmndr:
and ln, 31 ; get remainder
cmp ln, 0
je mfQuit
mov ecx, ln
shr ecx, 2 ; divide by 4
je mfQuit ; <<<< would help to avoid crashes!
@@:
mov [edx], eax
add edx, 4
dec ecx
jnz @B
mfQuit:
JJ,
It already states that it is to be used on buffers sized in 4-byte increments. It is an unrolled algo to add some pace to memory fills, up to a certain size where REP STOSD becomes faster.
If you don't mind losing the old 486 compatibility, there are some fast MMX versions around, but make sure you use a version that properly handles non-temporal writeback (stores that bypass the cache), as it is measurably faster.
As I already wrote above, you are right and the docu is correct. The problem is that bloody beginners like myself either don't read or don't understand the docu. What about the compromise listed below? It makes the routine general purpose, so you don't have to control the size any more (by the way, it's a frequent case in coding that you don't know the size a priori - a situation that forces you to check each time whether to use memfill or another routine...).
The adjustment costs roughly 2 (two) cycles and an increase of 8 bytes in size.
EDIT:
mov edx, [esp+4] ; lpmem, buffer address
instead of
mov edx, lpmem ; buffer address
... etc., needed since there is no PROLOGUE
EDIT (2): See later post (http://www.masm32.com/board/index.php?topic=9895.msg72610#msg72610). The no prologue option is tricky:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
memfill2 proc lpmem:DWORD, ln:DWORD, fill:DWORD
Warning: mov edx, lpmem looks plausible but is a trap - Masm will ruthlessly translate that to mov edx, dword ptr [ebp+8], and with PROLOGUE:NONE there is no frame in ebp...
:bg
Come on JJ, you are no beginner. The algo looks fine, but you will understand my wariness at changing a documented algo that has been around for years, as it may break other people's code. I would be inclined to write a later one with a different name and probably use the trick Michael did with the switchover to REP STOSD over about 500 bytes, as you then get the best of both worlds. Having a tail-end trimmer for non-4-byte increment sizes is trivial to do if that was needed.
Interestingly enough, most people who need to perform memory fills need the opposite end, with often large 4-byte-divisible sizes being filled, and they benefit from MMX or SSE code in critical applications. Most people who need to fill non-4-byte-divisible sizes generally do it with a tiny byte-fill routine. This is the one I use when I am handling small uneven zero-fill sizes.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
charfill proc buff:DWORD,bcnt:DWORD,char:DWORD
mov ecx, [esp+4] ; data address
mov edx, [esp+8] ; fill count
mov eax, [esp+12] ; fill character
sub ecx, 1
@@:
add ecx, 1
mov [ecx], al
sub edx, 1
jnz @B
ret 12
charfill endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
I had to rewrite the memfill "general purpose" version; it was a bit buggy. Here is a working version, suitable for all sizes. The stackoffset option is just for playing around.
Timings on a Core 2 Celeron M:
1024 bytes
279 cycles for new memfill 96 bytes
281 cycles for mlib memfill 78 bytes
8192 bytes
2136 cycles for new memfill 96 bytes
2123 cycles for mlib memfill 78 bytes
memfill2 PROTO :DWORD, :DWORD, :DWORD
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
memfill2 proc lpmem:DWORD, ln:DWORD, fill:DWORD
stackoffset = -8 ; plus or minus 8
if stackoffset eq -8
pop edx ; ret address trashed
pop edx ; lpmem, buffer address
pop ecx ; ln, byte length
pop eax ; fill chars
else
mov edx, [esp+4] ; lpmem, buffer address
mov ecx, [esp+8] ; ln, byte length
mov eax, [esp+12] ; fill chars
endif
shr ecx, 5 ; divide by 32
test ecx, ecx
jz rmndr
; ------------ unroll by 8 ------------
align 4
@@:
mov [edx], eax ; put fill chars at address in edx
mov [edx+4], eax
mov [edx+8], eax
mov [edx+12], eax
mov [edx+16], eax
mov [edx+20], eax
mov [edx+24], eax
mov [edx+28], eax
add edx, 32
dec ecx
jnz @B
rmndr:
mov ecx, [esp+stackoffset]
and ecx, 31 ; get remainder
je mfQuit ; we are done
shr ecx, 2 ; divide by 4
je Below4a ; about 2 cycles extra, but this way it's general purpose
align 4
@@:
mov [edx], eax ; some more dwords
add edx, 4
dec ecx
jnz @B
mov ecx, [esp+stackoffset]
and ecx, 3 ; test for last 3 bytes
jne Below4b
mfQuit:
if stackoffset eq -8
sub esp, 4*4
endif
ret 3*4
Below4a:
mov ecx, [esp+stackoffset]
and ecx, 3 ; test for last 3 bytes
je mfQuit
Below4b:
mov [edx], al
inc edx
dec ecx
jnz Below4b
jmp mfQuit
memfill2 endp
[attachment deleted by admin]
Quote from: hutch-- on September 14, 2008, 08:36:36 AM
use the trick Michael did with the switchover to REP STOSD over about 500 bytes, as you then get the best of both worlds. Having a tail-end trimmer for non-4-byte increment sizes is trivial to do if that was needed.
EDIT: Bug fixed & new version attached. The breakeven point for switching to stosd seems to be around 1536 bytes.
rep stosd
pop edi
; WRONG: jmp rmndr
jmp Below4a ; we moved dwords above
Done. Core 2 Celeron M timings:
256 bytes
86 cycles for new memfill 106 bytes
85 cycles for mlib memfill 78 bytes
512 bytes
146 cycles for new memfill 106 bytes
151 cycles for mlib memfill 78 bytes
1024 bytes (still mov version, stosd 345 cycles)
282 cycles for new memfill 106 bytes
282 cycles for mlib memfill 78 bytes
2048 bytes
481 cycles for new memfill 106 bytes
544 cycles for mlib memfill 78 bytes
4096 bytes
750 cycles for new memfill 106 bytes
1074 cycles for mlib memfill 78 bytes
8192 bytes
1295 cycles for new memfill 106 bytes
2123 cycles for mlib memfill 78 bytes
Full code attached, timings welcome. Look for memfill2 if you want to play. The *.asc can be opened with WordPad.
[attachment deleted by admin]
AMD x2 4000 x64/WinXPx32
Testing:
256 bytes
57 cycles for new memfill 106 bytes
51 cycles for mlib memfill 78 bytes
512 bytes
100 cycles for new memfill 106 bytes
102 cycles for mlib memfill 78 bytes
1024 bytes
182 cycles for new memfill 106 bytes
188 cycles for mlib memfill 78 bytes
1536 bytes
259 cycles for new memfill 106 bytes
261 cycles for mlib memfill 78 bytes
2048 bytes
284 cycles for new memfill 106 bytes
341 cycles for mlib memfill 78 bytes
4096 bytes
540 cycles for new memfill 106 bytes
662 cycles for mlib memfill 78 bytes
8192 bytes
1060 cycles for new memfill 106 bytes
1301 cycles for mlib memfill 78 bytes
16384 bytes
2083 cycles for new memfill 106 bytes
2584 cycles for mlib memfill 78 bytes
32768 bytes
4137 cycles for new memfill 106 bytes
5148 cycles for mlib memfill 78 bytes
---