
MasmLib memfill bug?

Started by jj2007, September 13, 2008, 04:46:03 PM


jj2007

While chasing the fastest memcopy algo ever, I stumbled across two memfill oddities:
1. It fails for sizes under 4 bytes (not a big deal, but the docu might mention that).
2. It fails for certain sizes, such as 546.
Example is attached.

[attachment deleted by admin]
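
Since the attachment is gone, here is a minimal sketch of the kind of call that shows the second oddity (the buffer name and fill value are just illustrative):

include \masm32\include\masm32rt.inc

.data?
  buffer db 1024 dup(?)

.code
start:
    invoke memfill, ADDR buffer, 546, 20202020h   ; 546 mod 32 = 2 - one of the sizes that fails
    invoke ExitProcess, 0
end start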

hutch--

JJ,

From the reference,

Quote
Description

memfill is a fast memory filling procedure that works in DWORD unit sizes. It is usually good practice to allocate memory in intervals of four for alignment purposes and this procedure is designed to work with memory allocated this way.

From memory we did a LAB topic on memfill, and above about 500 bytes REP STOSD starts to become faster, as it appears to handle non-temporal writes automatically where an unrolled DWORD version like memfill does not. Michael Webster did a hybrid that handled both techniques.
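
Just to illustrate the 4-byte contract mentioned in the description above (reqlen and buffer are placeholder names, not part of the library): round the requested length up to the next multiple of 4 before calling memfill, and allocate the buffer in 4-byte multiples as well.

    mov eax, reqlen            ; requested byte count
    add eax, 3
    and eax, -4                ; round up to the next multiple of 4
    invoke memfill, ADDR buffer, eax, 0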

jj2007

Quote from: hutch-- on September 13, 2008, 05:13:06 PM
memfill is a fast memory filling procedure that works in DWORD unit sizes. ... this procedure is designed to work with memory allocated this way.

You are right, Hutch. I was tempted to adopt the "solution" below, but it would suppress "healthy" crashes.
Maybe you should add a bold warning saying it will crash for sizes of n*32 + 1, 2 or 3? (546 = 17*32 + 2, for example.)

  rmndr:

    and ln, 31          ; get remainder
    cmp ln, 0
    je mfQuit
    mov ecx, ln
    shr ecx, 2          ; divide by 4
    je mfQuit      ; <<<< would help to avoid crashes!
  @@:
    mov [edx], eax
    add edx, 4
    dec ecx
    jnz @B

  mfQuit:

hutch--

JJ,

It already states that it is to be used on buffers sized in 4-byte increments. It is an unrolled algo to add some pace to memory fills up to a certain size, after which REP STOSD becomes faster.

If you don't mind losing the old 486 compatibility, there are some fast MMX versions around, but make sure you use a version that writes back to memory without going through the cache (non-temporal stores), as that is measurably faster.
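
As an illustration of the idea only (not the actual library code), here is a minimal MMX sketch using non-temporal stores; it assumes edx holds an 8-byte aligned buffer, ecx the number of 8-byte blocks, and eax the DWORD fill pattern (movntq needs SSE support, i.e. Pentium III class or later):

    movd mm0, eax              ; fill pattern into the low DWORD of mm0
    punpckldq mm0, mm0         ; duplicate it into both halves
  @@:
    movntq [edx], mm0          ; non-temporal store, bypasses the cache
    add edx, 8
    dec ecx
    jnz @B
    sfence                     ; make the non-temporal writes globally visible
    emms                       ; clear MMX state before any FPU code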

jj2007

#4
As I already wrote above, you are right and the docu is correct. The problem is that bloody beginners like myself either don't read or don't understand the docu. What about the compromise listed below? It makes the routine general purpose, so you no longer have to check the size (by the way, it is a frequent case in coding that you don't know the size a priori - a situation that forces you to check each time whether to use memfill or another routine...).

The adjustment costs roughly 2 (two) cycles and an increase of 8 bytes in size.

EDIT:
    mov edx, [esp+4]   ; lpmem, buffer address
instead of
    mov edx, lpmem      ; buffer address

... etc., needed since there is no PROLOGUE

EDIT (2): See later post. The no prologue option is tricky:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
memfill2 proc lpmem:DWORD, ln:DWORD, fill:DWORD

Warning: mov edx, lpmem looks plausible but is a trap - Masm will ruthlessly translate that to mov edx, dword ptr [ebp+8]...

hutch--

 :bg

Come on JJ, you are no beginner. The algo looks fine, but you will understand my wariness at changing a documented algo that has been around for years, as it may break other people's code. I would be inclined to write a later one with a different name and probably use the trick Michael did with the switchover to REP STOSD over about 500 bytes, as you then get the best of both worlds. Having a tail-end trimmer for non-4-byte increment sizes is trivial to do if that were needed.

Interestingly enough, most people who need to perform a memory fill need the opposite end, with often large 4-byte-divisible sizes being filled, and they benefit from MMX or SSE code in critical applications. Most people who need to fill non-4-byte-divisible sizes generally do it with a tiny byte fill routine. This is the one I use when I am handling small uneven zero fill sizes.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

    align 16

charfill proc buff:DWORD,bcnt:DWORD,char:DWORD

    mov ecx, [esp+4]        ; data address
    mov edx, [esp+8]        ; fill count
    mov eax, [esp+12]       ; fill character
    sub ecx, 1

  @@:
    add ecx, 1
    mov [ecx], al
    sub edx, 1
    jnz @B

    ret 12

charfill endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
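
A minimal usage sketch (buffer name and count are just illustrative); note that the loop above expects a count of at least 1:

    invoke charfill, ADDR buffer, 13, 0    ; zero-fill the first 13 bytes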

jj2007

I had to rewrite the memfill "general purpose" version; it was a bit buggy. Here is a working version, suitable for all sizes. The stackoffset option is for cheerful playing.

Timings on a Core 2 Celeron M:

1024 bytes
279 cycles for new memfill      96 bytes
281 cycles for mlib memfill     78 bytes

8192 bytes
2136 cycles for new memfill     96 bytes
2123 cycles for mlib memfill    78 bytes


memfill2 PROTO :DWORD, :DWORD, :DWORD

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4
memfill2 proc lpmem:DWORD, ln:DWORD, fill:DWORD
    stackoffset = -8                ; plus or minus 8
    if stackoffset eq -8
      pop edx                       ; ret address trashed
      pop edx                       ; lpmem, buffer address
      pop ecx                       ; ln, byte length
      pop eax                       ; fill chars
    else
      mov edx, [esp+4]              ; lpmem, buffer address
      mov ecx, [esp+8]              ; ln, byte length
      mov eax, [esp+12]             ; fill chars
    endif
    shr ecx, 5                      ; divide by 32
    test ecx, ecx
    jz rmndr

; ------------ unroll by 8 ------------
align 4
  @@:
    mov [edx],    eax               ; put fill chars at address in edx
    mov [edx+4],  eax
    mov [edx+8],  eax
    mov [edx+12], eax
    mov [edx+16], eax
    mov [edx+20], eax
    mov [edx+24], eax
    mov [edx+28], eax
    add edx, 32
    dec ecx
    jnz @B

  rmndr:
    mov ecx, [esp+stackoffset]
    and ecx, 31                     ; get remainder
    je mfQuit                       ; we are done
    shr ecx, 2                      ; divide by 4
    je Below4a                      ; about 2 cycles extra, but this way it's general purpose
align 4

  @@:
    mov [edx], eax                  ; some more dwords
    add edx, 4
    dec ecx
    jnz @B

    mov ecx, [esp+stackoffset]
    and ecx, 3                      ; test for last 3 bytes
    jne Below4b

  mfQuit:
    if stackoffset eq -8
      sub esp, 4*4
    endif
    ret 3*4

  Below4a:
    mov ecx, [esp+stackoffset]
    and ecx, 3                      ; test for last 3 bytes
    je mfQuit

  Below4b:
    mov [edx], al
    inc edx
    dec ecx
    jnz Below4b
    jmp mfQuit

memfill2 endp

[attachment deleted by admin]

jj2007

#7
Quote from: hutch-- on September 14, 2008, 08:36:36 AM
use the trick Michael did with the switchover to REP STOSD over about 500 bytes, as you then get the best of both worlds. Having a tail-end trimmer for non-4-byte increment sizes is trivial to do if that were needed.

EDIT: Bug fixed & new version attached. The break-even point for switching to stosd seems to be around 1536 bytes.

rep stosd
pop edi
; WRONG: jmp rmndr
jmp Below4a ; we moved dwords above
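
For readers without the attachment, here is a rough sketch of how such a switchover branch could be wired into memfill2. This is only a guess at the plumbing, not the attached code; it assumes edx still holds the buffer address, eax the fill pattern, and that the full length has been reloaded into ecx:

    push edi
    mov edi, edx            ; rep stosd writes through edi
    shr ecx, 2              ; number of whole DWORDs in the buffer
    rep stosd               ; fill them in one go
    mov edx, edi            ; edx now points at the byte tail
    pop edi
    jmp Below4a             ; last 0-3 bytes handled as before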


Done. Core 2 Celeron M timings:

256 bytes
86 cycles for new memfill       106 bytes
85 cycles for mlib memfill      78 bytes

512 bytes
146 cycles for new memfill      106 bytes
151 cycles for mlib memfill     78 bytes

1024 bytes (still mov version, stosd 345 cycles)
282 cycles for new memfill      106 bytes
282 cycles for mlib memfill     78 bytes

2048 bytes
481 cycles for new memfill      106 bytes
544 cycles for mlib memfill     78 bytes

4096 bytes
750 cycles for new memfill      106 bytes
1074 cycles for mlib memfill    78 bytes

8192 bytes
1295 cycles for new memfill     106 bytes
2123 cycles for mlib memfill    78 bytes


Full code attached, timings welcome. Look for memfill2 if you want to play. The *.asc can be opened with WordPad.
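
If you only want to call it, a minimal sketch (buffer is a placeholder name): memfill2 takes the same three arguments as memfill but also copes with lengths that are not a multiple of 4.

    invoke memfill2, ADDR buffer, 546, 20202020h   ; the size that trips the original memfill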

[attachment deleted by admin]

Mark Jones

AMD x2 4000 x64/WinXPx32

Testing:
256 bytes
57 cycles for new memfill       106 bytes
51 cycles for mlib memfill      78 bytes

512 bytes
100 cycles for new memfill      106 bytes
102 cycles for mlib memfill     78 bytes

1024 bytes
182 cycles for new memfill      106 bytes
188 cycles for mlib memfill     78 bytes

1536 bytes
259 cycles for new memfill      106 bytes
261 cycles for mlib memfill     78 bytes

2048 bytes
284 cycles for new memfill      106 bytes
341 cycles for mlib memfill     78 bytes

4096 bytes
540 cycles for new memfill      106 bytes
662 cycles for mlib memfill     78 bytes

8192 bytes
1060 cycles for new memfill     106 bytes
1301 cycles for mlib memfill    78 bytes

16384 bytes
2083 cycles for new memfill     106 bytes
2584 cycles for mlib memfill    78 bytes

32768 bytes
4137 cycles for new memfill     106 bytes
5148 cycles for mlib memfill    78 bytes
---
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08