
MasmLib memfill bug?

Started by jj2007, September 13, 2008, 04:46:03 PM


jj2007

While chasing the fastest memcopy algo ever, I stumbled across two memfill oddities:
1. It fails for sizes under 4 bytes (not a big deal, but the docu might mention that).
2. It fails for certain sizes, such as 546.
Example is attached.

[attachment deleted by admin]
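
Since the attachment is gone, here is a minimal sketch of the kind of call that shows the second oddity (the buffer name and fill value are just illustrative):

include \masm32\include\masm32rt.inc

.data?
  buffer db 1024 dup(?)

.code
start:
    invoke memfill, ADDR buffer, 546, 20202020h   ; 546 mod 32 = 2 - one of the sizes that fails
    invoke ExitProcess, 0
end start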

hutch--

JJ,

From the reference,

Quote
Description

memfill is a fast memory filling procedure that works in DWORD unit sizes. It is usually good practice to allocate memory in intervals of four for alignment purposes and this procedure is designed to work with memory allocated this way.

From memory we did a LAB topic on memfill, and above about 500 bytes REP STOSD starts to become faster, as it appears to handle non-temporal writes automatically where an unrolled DWORD version like memfill does not. Michael Webster did a hybrid that handled both techniques.
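
Just to illustrate the 4-byte contract mentioned in the description above (reqlen and buffer are placeholder names, not part of the library): round the requested length up to the next multiple of 4 before calling memfill, and allocate the buffer in 4-byte multiples as well.

    mov eax, reqlen            ; requested byte count
    add eax, 3
    and eax, -4                ; round up to the next multiple of 4
    invoke memfill, ADDR buffer, eax, 0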

jj2007

Quote from: hutch-- on September 13, 2008, 05:13:06 PM
memfill is a fast memory filling procedure that works in DWORD unit sizes. ... this procedure is designed to work with memory allocated this way.

You are right, Hutch. I was tempted to adopt the "solution" below, but it would suppress "healthy" crashes.
Maybe you should add a bold warning saying it will crash for sizes of n*32 + 1, 2 or 3? (546 = 17*32 + 2, for example.)

  rmndr:

    and ln, 31          ; get remainder
    cmp ln, 0
    je mfQuit
    mov ecx, ln
    shr ecx, 2          ; divide by 4
    je mfQuit      ; <<<< would help to avoid crashes!
  @@:
    mov [edx], eax
    add edx, 4
    dec ecx
    jnz @B

  mfQuit:

hutch--

JJ,

It already states that it is to be used on buffers sized in 4-byte increments. It is an unrolled algo to add some pace to memory fills up to a certain size, after which REP STOSD becomes faster.

If you don't mind losing the old 486 compatibility, there are some fast MMX versions around, but make sure you use a version that writes back to memory without going through the cache (non-temporal stores), as that is measurably faster.
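
As an illustration of the idea only (not the actual library code), here is a minimal MMX sketch using non-temporal stores; it assumes edx holds an 8-byte aligned buffer, ecx the number of 8-byte blocks, and eax the DWORD fill pattern (movntq needs SSE support, i.e. Pentium III class or later):

    movd mm0, eax              ; fill pattern into the low DWORD of mm0
    punpckldq mm0, mm0         ; duplicate it into both halves
  @@:
    movntq [edx], mm0          ; non-temporal store, bypasses the cache
    add edx, 8
    dec ecx
    jnz @B
    sfence                     ; make the non-temporal writes globally visible
    emms                       ; clear MMX state before any FPU code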

jj2007

#4
As I already wrote above, you are right and the docu is correct. The problem is that bloody beginners like myself either don't read or don't understand the docu. What about the compromise listed below? It makes the routine general purpose, so you no longer have to check the size (by the way, it is a frequent case in coding that you don't know the size a priori - a situation that forces you to check each time whether to use memfill or another routine...).

The adjustment costs roughly 2 (two) cycles and an increase of 8 bytes in size.

EDIT:
    mov edx, [esp+4]   ; lpmem, buffer address
instead of
    mov edx, lpmem      ; buffer address

... etc., needed since there is no PROLOGUE

EDIT (2): See later post. The no prologue option is tricky:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
memfill2 proc lpmem:DWORD, ln:DWORD, fill:DWORD

Warning: mov edx, lpmem looks plausible but is a trap - Masm will ruthlessly translate that to mov edx, dword ptr [ebp+8]...

hutch--

 :bg

Come on JJ, you are no beginner. The algo looks fine, but you will understand my wariness at changing a documented algo that has been around for years, as it may break other people's code. I would be inclined to write a later one with a different name and probably use the trick Michael did with the switchover to REP STOSD over about 500 bytes, as you then get the best of both worlds. Having a tail-end trimmer for non-4-byte increment sizes is trivial to do if that were needed.

Interestingly enough, most people who need to perform a memory fill need the opposite end, with often large 4-byte-divisible sizes being filled, and they benefit from MMX or SSE code in critical applications. Most people who need to fill non-4-byte-divisible sizes generally do it with a tiny byte fill routine. This is the one I use when I am handling small uneven zero fill sizes.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

    align 16

charfill proc buff:DWORD,bcnt:DWORD,char:DWORD

    mov ecx, [esp+4]        ; data address
    mov edx, [esp+8]        ; fill count
    mov eax, [esp+12]       ; fill character
    sub ecx, 1

  @@:
    add ecx, 1
    mov [ecx], al
    sub edx, 1
    jnz @B

    ret 12

charfill endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
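
A minimal usage sketch (buffer name and count are just illustrative); note that the loop above expects a count of at least 1:

    invoke charfill, ADDR buffer, 13, 0    ; zero-fill the first 13 bytes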

jj2007

I had to rewrite the memfill "general purpose" version; it was a bit buggy. Here is a working version, suitable for all sizes. The stackoffset option is for cheerful playing.

Timings on a Core 2 Celeron M:

1024 bytes
279 cycles for new memfill      96 bytes
281 cycles for mlib memfill     78 bytes

8192 bytes
2136 cycles for new memfill     96 bytes
2123 cycles for mlib memfill    78 bytes


memfill2 PROTO :DWORD, :DWORD, :DWORD

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4
memfill2 proc lpmem:DWORD, ln:DWORD, fill:DWORD
    stackoffset = -8                ; plus or minus 8
    if stackoffset eq -8
      pop edx                       ; ret address trashed
      pop edx                       ; lpmem, buffer address
      pop ecx                       ; ln, byte length
      pop eax                       ; fill chars
    else
      mov edx, [esp+4]              ; lpmem, buffer address
      mov ecx, [esp+8]              ; ln, byte length
      mov eax, [esp+12]             ; fill chars
    endif
    shr ecx, 5                      ; divide by 32
    test ecx, ecx
    jz rmndr

; ------------ unroll by 8 ------------
align 4
  @@:
    mov [edx],    eax               ; put fill chars at address in edx
    mov [edx+4],  eax
    mov [edx+8],  eax
    mov [edx+12], eax
    mov [edx+16], eax
    mov [edx+20], eax
    mov [edx+24], eax
    mov [edx+28], eax
    add edx, 32
    dec ecx
    jnz @B

  rmndr:
    mov ecx, [esp+stackoffset]
    and ecx, 31                     ; get remainder
    je mfQuit                       ; we are done
    shr ecx, 2                      ; divide by 4
    je Below4a                      ; about 2 cycles extra, but this way it's general purpose
align 4

  @@:
    mov [edx], eax                  ; some more dwords
    add edx, 4
    dec ecx
    jnz @B

    mov ecx, [esp+stackoffset]
    and ecx, 3                      ; test for last 3 bytes
    jne Below4b

  mfQuit:
    if stackoffset eq -8
      sub esp, 4*4
    endif
    ret 3*4

  Below4a:
    mov ecx, [esp+stackoffset]
    and ecx, 3                      ; test for last 3 bytes
    je mfQuit

  Below4b:
    mov [edx], al
    inc edx
    dec ecx
    jnz Below4b
    jmp mfQuit

memfill2 endp

[attachment deleted by admin]

jj2007

#7
Quote from: hutch-- on September 14, 2008, 08:36:36 AM
use the trick Michael did with the switchover to REP STOSD over about 500 bytes, as you then get the best of both worlds. Having a tail-end trimmer for non-4-byte increment sizes is trivial to do if that were needed.

EDIT: Bug fixed & new version attached. The break-even point for switching to stosd seems to be around 1536 bytes.

rep stosd
pop edi
; WRONG: jmp rmndr
jmp Below4a ; we moved dwords above
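
For readers without the attachment, here is a rough sketch of how such a switchover branch could be wired into memfill2. This is only a guess at the plumbing, not the attached code; it assumes edx still holds the buffer address, eax the fill pattern, and that the full length has been reloaded into ecx:

    push edi
    mov edi, edx            ; rep stosd writes through edi
    shr ecx, 2              ; number of whole DWORDs in the buffer
    rep stosd               ; fill them in one go
    mov edx, edi            ; edx now points at the byte tail
    pop edi
    jmp Below4a             ; last 0-3 bytes handled as before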


Done. Core 2 Celeron M timings:

256 bytes
86 cycles for new memfill       106 bytes
85 cycles for mlib memfill      78 bytes

512 bytes
146 cycles for new memfill      106 bytes
151 cycles for mlib memfill     78 bytes

1024 bytes (still mov version, stosd 345 cycles)
282 cycles for new memfill      106 bytes
282 cycles for mlib memfill     78 bytes

2048 bytes
481 cycles for new memfill      106 bytes
544 cycles for mlib memfill     78 bytes

4096 bytes
750 cycles for new memfill      106 bytes
1074 cycles for mlib memfill    78 bytes

8192 bytes
1295 cycles for new memfill     106 bytes
2123 cycles for mlib memfill    78 bytes


Full code attached, timings welcome. Look for memfill2 if you want to play. The *.asc can be opened with WordPad.
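
If you only want to call it, a minimal sketch (buffer is a placeholder name): memfill2 takes the same three arguments as memfill but also copes with lengths that are not a multiple of 4.

    invoke memfill2, ADDR buffer, 546, 20202020h   ; the size that trips the original memfill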

[attachment deleted by admin]

Mark Jones

AMD x2 4000 x64/WinXPx32

Testing:
256 bytes
57 cycles for new memfill       106 bytes
51 cycles for mlib memfill      78 bytes

512 bytes
100 cycles for new memfill      106 bytes
102 cycles for mlib memfill     78 bytes

1024 bytes
182 cycles for new memfill      106 bytes
188 cycles for mlib memfill     78 bytes

1536 bytes
259 cycles for new memfill      106 bytes
261 cycles for mlib memfill     78 bytes

2048 bytes
284 cycles for new memfill      106 bytes
341 cycles for mlib memfill     78 bytes

4096 bytes
540 cycles for new memfill      106 bytes
662 cycles for mlib memfill     78 bytes

8192 bytes
1060 cycles for new memfill     106 bytes
1301 cycles for mlib memfill    78 bytes

16384 bytes
2083 cycles for new memfill     106 bytes
2584 cycles for mlib memfill    78 bytes

32768 bytes
4137 cycles for new memfill     106 bytes
5148 cycles for mlib memfill    78 bytes
---
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08