
preferred method of setting memory ...

Started by James Ladd, March 29, 2005, 08:33:32 AM


James Ladd

What is the preferred way of setting memory to a given value like zero (0)?
In 'C' I would use memset(ptr, 0, size).
I looked for an example of using memset in the masm folders but had no luck.

Phoenix

Hi striker,

i mostly use

Invoke GlobalAlloc,GPTR,numBytes
mov hMem, eax

From MSDN:

Quote: GPTR (0x0040) combines GMEM_FIXED and GMEM_ZEROINIT.

The return value is a pointer to allocated memory.

Regards, Phoenix


donkey

There are a number of ways to do it; the most efficient depends on the situation. I generally use a DWORD memfill and step through the memory that way, though it is not terribly efficient. For extremely large blocks you might look at MMX: zero the MMX registers, then use MOVQ to fill the memory in large chunks.

mov edi,[lpDest]
mov ecx,[nBytes]
xor eax,eax ; Set EAX to the value you want to fill with (0 in this case)

; do the evenly divisible ones
shr ecx,2
rep stosd

; do the remainder
mov ecx,[nBytes]
and ecx,3
rep stosb
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

Tedd

The equivalent is "FillMemory" - I think this is actually what "memset" is replaced with.
(You could use "ZeroMemory", but this just calls FillMemory with 0 anyway.)

Of course, these aren't 'preferred' methods because we're asmers and we do things our own way :bg
I tend to use a function similar to Donkey's, but with the 'rep stosd/stosb' written out as primitive instructions - though I gather that for small fills, stosd is faster (?)
No snowflake in an avalanche feels responsible.

donkey

Hi Tedd,

Actually I was lazy :) I just used it because I couldn't find a generic example in my sources. I generally write them on the fly, swapping between MMX and mov [edi],eax/add edi,4 depending on the situation. But for the most part the REP STOSD instruction is only efficient when dealing with more than 64 bytes of aligned data.

Mark_Larson

  Yea, rep stosd is great for blocks bigger than 64 bytes, up to some size.  Then you switch to MMX or SSE, which is good up to 1 or 2MB, and after that you switch to non-temporal writes via MMX or SSE.

  I usually write all 3 versions for my app and time which one is fastest for my particular program.
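
The size bands described above could be sketched as a small dispatcher in C. This is only an illustration: the function and enum names are hypothetical, the 64-byte and 1MB limits are the rough figures quoted in this thread, and REP_STOSD_LIMIT is a pure placeholder because the post only says "up to some size". Real crossover points have to be timed on the target machine.

```c
#include <stddef.h>

/* Hypothetical names, not part of any library. */
enum fill_strategy {
    FILL_SIMPLE_STORES,  /* plain mov [edi],eax loop for small blocks   */
    FILL_REP_STOSD,      /* rep stosd                                    */
    FILL_SIMD,           /* MMX or SSE stores that go through the cache  */
    FILL_NON_TEMPORAL    /* movntq / movntdq, bypassing the cache        */
};

#define SMALL_LIMIT       64u              /* from the thread: rep stosd pays off above 64 bytes */
#define REP_STOSD_LIMIT   4096u            /* ASSUMPTION: placeholder, the thread gives no number */
#define NON_TEMPORAL_MIN  (1024u * 1024u)  /* from the thread: "around 1-2MB"                     */

/* Pick a fill strategy from the block size alone. */
enum fill_strategy choose_fill_strategy(size_t num_bytes)
{
    if (num_bytes <= SMALL_LIMIT)
        return FILL_SIMPLE_STORES;
    if (num_bytes <= REP_STOSD_LIMIT)
        return FILL_REP_STOSD;
    if (num_bytes < NON_TEMPORAL_MIN)
        return FILL_SIMD;
    return FILL_NON_TEMPORAL;
}
```

Keeping the limits as named constants makes it easy to adjust them after timing each version.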


;let's assume the buffer is divisible by 16 so we don't have to do fixups for MMX , SSE, and REP STOSD


;rep stosd
mov edi,offset memory_to_fill                          ;address to write to
mov eax,05050505h                                       ; pattern
mov ecx, num_bytes                                       ;number of bytes
shr ecx,2                                                     ;divide by 4 because we are doing dword writes
rep stosd                                                     ;blast it out.


;MMX
mov edi,offset memory_to_fill                          ;address to write to
mov eax,05050505h                                       ; pattern
movd mm0,eax                                             ;move dword EAX to mm0, and zero extend the rest
pshufw mm0,mm0,01000100b                         ;requires SSE, blasts the low dword to all dwords.
mov ecx, num_bytes                                       ;number of bytes
shr ecx,3                                                     ;divide by 8 because we are doing qword writes
@@:
movq [edi],mm0                                            ;MOV cannot store an MMX register, use MOVQ
add edi,8
sub ecx,1
jnz @B
emms                                                          ;clear the MMX state before any FPU code runs

;SSE
mov edi,offset memory_to_fill                          ;address to write to
mov eax,05050505h                                       ; pattern
; if you don't have SSE2 to use the following instruction you can simply read the pattern from a memory variable.
movd xmm0,eax                                            ;requires SSE2, move dword EAX to xmm0, and zero extend the rest
shufps xmm0,xmm0,00000000b                       ;blasts the low dword to all dwords.
mov ecx, num_bytes                                     ;number of bytes
shr ecx,4                                                     ;divide by 16 because we are doing 16-byte writes
@@:
movaps [edi],xmm0                                        ;aligned 16-byte store, MOV cannot store an XMM register
add edi,16
sub ecx,1
jnz @B

;non-temporal MMX - requires SSE.  Writes directly to memory bypassing the cache.  Fastest way to do it when you have buffers close to the size of the L2 cache.  I generally start to do this around 1-2MB, but as always time your code to find out which way is fastest for you for your buffer size.

mov edi,offset memory_to_fill                          ;address to write to
mov eax,05050505h                                       ; pattern
movd mm0,eax                                             ;move dword EAX to mm0, and zero extend the rest
pshufw mm0,mm0,01000100b                         ;requires SSE, blasts the low dword to all dwords.
mov ecx, num_bytes                                       ;number of bytes
shr ecx,3                                                     ;divide by 8 because we are doing qword writes
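@@:
movntq [edi],mm0
add edi,8
sub ecx,1
jnz @B
sfence                                                        ;make the non-temporal stores visible before continuing
emms                                                          ;clear the MMX state before any FPU code runs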

;non-temporal SSE - requires SSE2.  Writes directly to memory bypassing the cache.  Fastest way to do it when you have buffers close to the size of the L2 cache.  I generally start to do this around 1-2MB, but as always time your code to find out which way is fastest for you for your buffer size.

mov edi,offset memory_to_fill                          ;address to write to
mov eax,05050505h                                       ; pattern
movd xmm0,eax                                            ;requires SSE2, move dword EAX to xmm0, and zero extend the rest
shufps xmm0,xmm0,00000000b                       ;blasts the low dword to all dwords.
mov ecx, num_bytes                                     ;number of bytes
shr ecx,4                                                     ;divide by 16 because we are doing 16-byte writes
@@:
movntdq [edi],xmm0
add edi,16
sub ecx,1
jnz @B
sfence                                                        ;make the non-temporal stores visible before continuing





As far as unrolling the loops, your mileage will vary.  For really large buffers I have found that even unrolling the loop once makes no difference in how fast you can write to memory.  So try unrolling it yourself, time it, and see if you can speed it up.

One additional point.  Both the SSE and non-temporal SSE versions require the buffer to be aligned on a 16-byte boundary or you will get an exception.  People who write DLLs or libraries where they can't control the alignment of the data go with MMX or non-temporal MMX, since the buffer does not have to be aligned on a boundary.
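
A caller that cannot control alignment could test the pointer before taking the SSE path. The following is a minimal C sketch of that check. The helper names are hypothetical and it relies on uintptr_t from stdint.h being available.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p sits on a 16-byte boundary, which the aligned SSE stores need. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Number of leading bytes to fill with ordinary stores
   before p reaches the next 16-byte boundary (0 if already aligned). */
size_t bytes_to_align16(const void *p)
{
    return (size_t)((16u - ((uintptr_t)p & 15u)) & 15u);
}
```

One option is to fill the first bytes_to_align16(p) bytes with byte stores and then run the aligned loop on the rest. Another is to fall back to the MMX version whenever is_aligned16(p) is false.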



BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Relvinian

Along with the examples provided, if you get memory from HeapAlloc, you can specify HEAP_ZERO_MEMORY to have it initialized to zero for you.  VirtualAlloc already sets the memory to zero when you get it.

Once you have it, the examples posted are great for "clearing it again".

Relvinian

Mark_Larson

Quote from: Relvinian on March 29, 2005, 04:59:56 PM
Along with the examples provided, if you get memory from HeapAlloc, you can specify HEAP_ZERO_MEMORY to have it initialized to zero for you.  VirtualAlloc already sets the memory to zero when you get it.

Once you have it, the examples posted are great for "clearing it again".

Relvinian


  You should really do your own version, unless you only allocate memory once at the start of your program.  Why?  Well, the allocators have to support all sizes of allocated memory, so they use a generic routine to zero it out.  That means if you do a lot of allocations at run time, it can be faster to zero the memory yourself with your own custom code.

  Case in point.  I used memset() in my C program to zero a region of memory that was a dword array.  Turns out because of the generic way they have to support everything, if you simply do a FOR loop with the dword array and set the array elements to 0, it is faster than memset().  I've seen that optimization for C on different webpages.  My guess is they are using REP STOSB for the code in memset().
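
For reference, the two C variants being compared would look roughly like this. It is a sketch with hypothetical function names. It only shows that both produce the same result and makes no claim about which is faster. Modern compilers often turn the loop into a memset call themselves, so any difference has to be measured with your own compiler and settings.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Zero a dword array through the generic library routine. */
void zero_with_memset(uint32_t *a, size_t count)
{
    memset(a, 0, count * sizeof a[0]);
}

/* Zero the same array one dword at a time. */
void zero_with_loop(uint32_t *a, size_t count)
{
    for (size_t i = 0; i < count; i++)
        a[i] = 0;
}
```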

Phoenix

Hi striker,

i guess that i did not understand your question by the first read  :red, sorry for my mild mental deficiency (i'm getting old, you know).

Well, what i came up with is very similar to donkey's and Mark_Larson's suggestions using rep stos:


SetUserValue Proc USES edi lpMem:DWORD, numBytes:DWORD, bUserVal:BYTE

    mov edi,lpMem       ; Pointer to memory

    mov  dl,bUserVal    ; Value to initialize memory with
    xor  eax,eax

    .if dl != 0         ; Build dword
        mov  al,dl
        shl  eax,8
        mov  al,dl
        mov  cx,ax
        shl  eax,16
        mov  ax,cx
    .endif

    mov  ecx,numBytes   ; number of bytes to write
    push ecx         
    shr  ecx,2          ; numbers of dwords to write
    rep  stosd
    pop  ecx
    and  ecx,3          ; number of remaining bytes to write
    rep  stosb

    xor  eax,eax
    ret

SetUserValue endp


It works, so far, but does not test for bad writes.
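
For comparison with memset, the same idea can be written in C. This is a sketch only and set_user_value is a hypothetical name. The byte is replicated into a dword by multiplying with 01010101h, which gives the same pattern as the shift sequence in the proc. The dwords are then written first and the 0 to 3 leftover bytes last.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill num_bytes at mem with value: dword stores first, byte stores for the tail. */
void set_user_value(void *mem, size_t num_bytes, uint8_t value)
{
    unsigned char *p = mem;
    uint32_t pattern = (uint32_t)value * 0x01010101u; /* value copied into all 4 bytes */
    size_t dwords = num_bytes >> 2;                   /* number of dwords to write     */
    size_t tail = num_bytes & 3;                      /* remaining bytes to write      */

    while (dwords--) {
        memcpy(p, &pattern, sizeof pattern);          /* store that is safe on unaligned addresses */
        p += sizeof pattern;
    }
    while (tail--)
        *p++ = value;
}
```

Like the proc, this does not check that the destination is writable.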

Regards, phoenix


James Ladd

ok, I guess for new memory I'll ask the allocator to do the zeroing, and for
memory I have already used I'll use a combination of approaches.
Most likely Donkey's/MarkL's approach.
Ta.

btw - thanks for putting your example into a proc for me Phoenix.