stack vs memory

Started by loki_dre, April 21, 2008, 07:13:17 PM


jj2007

Quote from: donkey on April 27, 2008, 09:27:56 PM
all you are doing with it is allocating memory and copying some bytes, I mean why do you need system strings for that ???
Good question. For example, I will need thousands of variable-length little strings, and I wonder about the implications of heap fragmentation:
"Memory allocated by HeapAlloc is not movable. Since the memory is not movable, it is possible for the heap to become fragmented."
The Masm32 alloc$ macro uses SysAllocStringByteLen, not HeapAlloc. Am I just entering an ideological war on the best memory allocation option??

donkey

Quote from: jj2007 on April 27, 2008, 09:54:39 PM
Quote from: donkey on April 27, 2008, 09:27:56 PM
all you are doing with it is allocating memory and copying some bytes, I mean why do you need system strings for that ???
Good question. For example, I will need thousands of variable-length little strings, and I wonder about the implications of heap fragmentation:
"Memory allocated by HeapAlloc is not movable. Since the memory is not movable, it is possible for the heap to become fragmented."
The Masm32 alloc$ macro uses SysAllocStringByteLen, not HeapAlloc. Am I just entering an ideological war on the best memory allocation option??

This is not an "ideological war", there are quantifiable reasons to use one scheme over the other and they are all well documented at MS and by Intel. I'm not sure how many times I will have to type this before it is understood, There is no moveable memory in Win32 in the sense Hutch was using it. The best allocation scheme for what you are doing is to use a linked list and HeapAlloc creating a private heap.

The MASM32 library waits for a process to end using a loop instead of the Wait functions, which is wrong, but it's in there all the same... Please, no arguments about the value of a loop to do this, I'm tired of that one.

Donkey
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

NightWare

hi donkey,
i'm not very familiar with the win api (i avoid it as much as possible...), but our maximum memory need is usually quite predictable, so why not allocate just one memory area, just once, and do our own manipulations inside it... we alloc an area with GlobalAlloc GMEM_FIXED/GMEM_ZEROINIT or HeapAlloc HEAP_ZERO_MEMORY for something like TotalArea = UsedArea+PotentiallyUsableArea

when we need to remove an area we just move (my movemem function has approximately the same speed as copymem) OFFSET DataToRemove+SIZEOF DataToRemove to OFFSET DataToRemove, where size=(OFFSET PotentiallyUsableArea-(OFFSET DataToRemove+SIZEOF DataToRemove)), or size=(OFFSET PotentiallyUsableArea-OFFSET DataToRemove) if we need zero memory

then loop over all the pointers (yep, forgot to say you need a LUT for the pointers and the corresponding sizes...) after OFFSET DataToRemove, where you just subtract SIZEOF DataToRemove

and when you need to add an area you just place it at the new OFFSET PotentiallyUsableArea... by doing this, there is no memory fragmentation, no need to constantly alloc/free mem, and (depending on the number/size of the areas) it will be quite fast, no ?
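
A rough sketch of the removal/compaction step NightWare describes, assuming a flat pointer/size LUT rather than a linked list; the pool setup and the add side are only hinted at, and every name here is invented for illustration:

include \masm32\include\masm32rt.inc

MAXBLOCKS equ 1000

.data?
pPool   dd ?                          ; base of the single big allocation
pFree   dd ?                          ; first byte of the unused tail of the pool
BlkPtr  dd MAXBLOCKS dup(?)           ; LUT: start address of each block, 0 = unused slot
BlkLen  dd MAXBLOCKS dup(?)           ; LUT: size of each block in bytes

.code
start:
    invoke GetProcessHeap
    invoke HeapAlloc, eax, 0, 65536   ; one fixed 64 KB pool for the whole app
    mov pPool, eax
    mov pFree, eax                    ; nothing carved out of the pool yet
    ; ... blocks would be carved out of the pool and entered in the LUT here ...
    invoke ExitProcess, 0

; slide everything above block <index> down over it, shrink the used area,
; then fix up every LUT pointer that sat above the removed block
RemoveBlock proc uses esi edi ebx index:DWORD
    mov ebx, index
    mov edi, [BlkPtr+ebx*4]           ; destination = start of the removed block
    mov edx, [BlkLen+ebx*4]           ; edx = number of bytes being removed
    lea esi, [edi+edx]                ; source = first byte after the removed block
    mov ecx, pFree
    sub ecx, esi                      ; bytes to slide down
    rep movsb                         ; compact the pool (dest < source, forward copy is safe)
    sub pFree, edx                    ; the used area shrank by the removed size

    mov eax, [BlkPtr+ebx*4]           ; old address of the removed block
    xor ecx, ecx                      ; walk the LUT and adjust the pointers above it
  @@:
    cmp ecx, ebx
    je next_slot                      ; skip the removed slot itself
    cmp [BlkPtr+ecx*4], eax
    jbe next_slot                     ; blocks at or below it did not move
    sub [BlkPtr+ecx*4], edx           ; this block slid down by edx bytes
  next_slot:
    inc ecx
    cmp ecx, MAXBLOCKS
    jb @B

    mov [BlkPtr+ebx*4], 0             ; mark the removed slot as free for reuse
    mov [BlkLen+ebx*4], 0
    ret
RemoveBlock endp

end start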

donkey

Quote from: NightWare on April 28, 2008, 12:51:11 AM
hi donkey,
i'm not very familiar with the win api (i avoid it as much as possible...), but our maximum memory need is usually quite predictable, so why not allocate just one memory area, just once, and do our own manipulations inside it... we alloc an area with GlobalAlloc GMEM_FIXED/GMEM_ZEROINIT or HeapAlloc HEAP_ZERO_MEMORY for something like TotalArea = UsedArea+PotentiallyUsableArea

when we need to remove an area we just move (my movemem function has approximately the same speed as copymem) OFFSET DataToRemove+SIZEOF DataToRemove to OFFSET DataToRemove, where size=(OFFSET PotentiallyUsableArea-(OFFSET DataToRemove+SIZEOF DataToRemove)), or size=(OFFSET PotentiallyUsableArea-OFFSET DataToRemove) if we need zero memory

then loop over all the pointers (yep, forgot to say you need a LUT for the pointers and the corresponding sizes...) after OFFSET DataToRemove, where you just subtract SIZEOF DataToRemove

and when you need to add an area you just place it at the new OFFSET PotentiallyUsableArea... by doing this, there is no memory fragmentation, no need to constantly alloc/free mem, and (depending on the number/size of the areas) it will be quite fast, no ?


Hi NightWare,

That's pretty much how I would do it. Using a linked list makes ordering simple, and by writing your own memory management you can optimize it for your particular application.

Donkey
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

hutch--

There appears to be some confusion between the old 1990 win3.0 movable memory and modern garbage collection. Here is one of many links on how some of it works.

http://support.microsoft.com/kb/171414
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

donkey

Quote from: hutch-- on April 28, 2008, 02:43:55 AM
There appears to be some confusion between the old 1990 win3.0 movable memory and modern garbage collection. Here is one of many links on how some of it works.

http://support.microsoft.com/kb/171414

COM uses IUnknown.Release when a program terminates in order to allow the reference count to be decremented if the COM thread for an application is no longer running (i.e. pinging the interface). The garbage collection here is simply a de-allocation of memory that is no longer referenced by any running process, not a compaction of in-use memory or even a simple garbage collect of in-use memory. This is similar to the way the API releases handles and the process heap even if your program does not explicitly do so on termination; that is another form of garbage collection, but like the one described in the article it has nothing to do with garbage collection in a large array that is still in use, which is the actual topic under discussion.

Also, it is wrong to refer to memory that is being marshalled as "moveable" memory; they are not even close to the same thing, and a closer parallel would be "shared" memory. Nowhere in the article did they refer to the memory as "moveable", since that concept has no meaning in a flat memory model.

Donkey
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

hutch--

Edgar,

> since that concept has no meaning in a flat memory model.

Memory management schemes are independent of the memory model they are used under; you are confusing meaning with implementation. There is of course a bit more to the choice of allocation method than just memory recovery (garbage collection): an OLE string stores its length when it is allocated, whereas fixed memory schemes like GlobalAlloc() and HeapAlloc() don't, and this saves either scanning the string each time you need its length or having to store it somewhere else. An OLE string is marginally slower to allocate than fixed memory strategies, but that is predictable as it does slightly more.
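
A minimal sketch of that length storage, using only the OLE string calls already seen in this thread (the byte length sits in the dword just below the pointer SysAllocStringByteLen returns):

include \masm32\include\masm32rt.inc

.code
start:
    invoke SysAllocStringByteLen, chr$("twelve bytes"), 12
    mov esi, eax                      ; esi -> the string data itself
    mov eax, [esi-4]                  ; the byte length lives just below the data
    print str$(eax), 13, 10           ; prints 12, no string scan needed
    invoke SysStringByteLen, esi      ; the documented way to read the same value
    print str$(eax), 13, 10           ; prints 12 again
    invoke SysFreeString, esi
    invoke ExitProcess, 0
end start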

Nonetheless, I give you the point that automatic compaction does not occur without a mechanism written to do that job, in much the same manner as Java or BASIC garbage collection. Unfortunately, fragmentation testing is not a simple task, as sequential allocation in simple test pieces gives you linear addressing for each block. To test this capacity between different memory allocation strategies you need a reproducible pattern of deallocations and reallocations to see if you get any difference from one method to another.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on April 28, 2008, 08:22:11 AM
To test this capacity between different memory allocation strategies you need a reproducible pattern of deallocations and reallocations to see if you get any difference from one method to another.

OK, let's try:

include \masm32\include\masm32rt.inc

ShowWinErr PROTO:DWORD

.data?
ShowErrBuf dd 100 dup(?) ; 400 bytes for Windows error messages
StringTable dd 2000 dup (?) ; pointers to 2000 string variables

.data
ShowErr db "Windows error:", 13, 10, 0

s10 db "1234567890", 0 ; ten bytes
s2k db 200 dup ("x234567890"), 0 ; 2000
s64k db 6400 dup ("A234567890"), 0 ; 64000

.code

start: pushad
xor esi, esi ; outer loop counter ->1000*2000*66000=132 000 000 000
.While esi<1000
xor ebx, ebx ; inner loop counter
.While ebx<2000
lea edi, [StringTable+4*ebx]
mov eax, [edi]
.if eax
invoke SysFreeString, eax
.endif
mov edx, offset s10 ; default: a short string
mov eax, ebx
add eax, esi
bt eax, 0
.if CARRY? ; odd
mov edx, offset s2k
.else
bt eax, 1
.if CARRY?
mov edx, offset s64k
.endif
.endif
invoke SysAllocStringByteLen, edx, len(edx)
.if eax
mov [edi], eax
.else
invoke ShowWinErr, 1 ; show formatted error
mov esi, 999999 ; get out of the outer loop, too
.break
.endif
; invoke MessageBox, 0, eax, chr$("The String:"), MB_OK
inc ebx ; inner loop counter
.Endw
inc esi ; outer loop counter
; give us some feedback on where you are...
print chr$(13, 10, "Loop ")
print str$(esi)
print chr$(13, 10, "Length of first string is ")
lea edi, [StringTable]
invoke SysStringByteLen, [edi]
print str$(eax)
; invoke StdOut, [edi] ; the string itself
.Endw
.While ebx ; free the strings allocated in the last inner loop
dec ebx
lea edi, [StringTable+4*ebx]
mov eax, [edi]
.if eax
invoke SysFreeString, eax
.endif
.Endw
; invoke ShowWinErr, 1
if 0 ; try the same with The Heap? (note: ProcHeap, MyText and MyHeap$ would need to be declared before enabling this block)
invoke GetProcessHeap ; for FileOpen, Save, Menus
mov ProcHeap, eax
invoke HeapAlloc, ProcHeap,
HEAP_GENERATE_EXCEPTIONS or HEAP_ZERO_MEMORY, sizeof MyText
mov MyHeap$, eax
invoke lstrcpy, eax, addr MyText
invoke MessageBox, 0, MyHeap$, chr$("The Heap String:"), MB_OK
invoke HeapFree, ProcHeap, 0, MyHeap$
invoke ShowWinErr, chr$("HeapFree")
invoke HeapCompact, ProcHeap, 0
invoke MessageBox, 0, str$(eax), chr$("HeapCompact:"), MB_OK
endif
popad
invoke ExitProcess, 0

ShowWinErr proc ShowFormatted:DWORD
pushad
invoke GetLastError
.if ShowFormatted
invoke FormatMessage,FORMAT_MESSAGE_FROM_SYSTEM,
0, ; GetItFromSystem
eax, ; ErrNum
0, ; Default language
addr ShowErrBuf, ; where to send the string from system
400, 0          ; size 400, no arguments
.if ShowFormatted!=1
invoke lstrcat, addr ShowErrBuf, ShowFormatted
.endif
invoke MessageBox, 0, addr ShowErrBuf, addr ShowErr, MB_OK
.else
invoke MessageBox, 0, str$(eax), addr ShowErr, MB_OK
.endif
popad
ret
ShowWinErr endp

end start

jj2007

Quote from: hutch-- on April 28, 2008, 08:22:11 AM
To test this capacity between different memory allocation strategies you need a reproducible pattern of deallocations and reallocations to see if you get any difference from one method to another.

Here is a testbed, hopefully correct. Set UseHeap EQU 0 or 1 to assemble either version.

Neither version crashed, but HeapAlloc takes roughly three times as long for the same task.

[attachment deleted by admin]
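
Since the attachment is gone, here is a guess at what the UseHeap switch might have looked like as conditional assembly; this is a sketch only, not jj2007's actual testbed, and the timed loops are left out:

include \masm32\include\masm32rt.inc

UseHeap EQU 0                         ; 0 = OLE string, 1 = process heap

.data
sDemo db "hello world", 0

.data?
hHeap dd ?

.code
start:
    invoke lstrlen, addr sDemo
    mov ebx, eax                      ; ebx = byte length of the test string

  IF UseHeap
    invoke GetProcessHeap
    mov hHeap, eax
    lea eax, [ebx+1]                  ; room for the terminator
    invoke HeapAlloc, hHeap, HEAP_ZERO_MEMORY, eax
    mov esi, eax
    invoke lstrcpy, esi, addr sDemo   ; HeapAlloc gives raw memory, copy by hand
  ELSE
    invoke SysAllocStringByteLen, addr sDemo, ebx   ; allocates and copies in one call
    mov esi, eax
  ENDIF

    print esi, 13, 10                 ; show the copy

  IF UseHeap
    invoke HeapFree, hHeap, 0, esi
  ELSE
    invoke SysFreeString, esi
  ENDIF

    invoke ExitProcess, 0
end start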

hutch--

jj,

Is it possible to use a faster string copy algo?


mov [edi], eax ; pointer into stringtable
mov eax, len(edx)
invoke lstrcpyn, [edi], edx, eax


lstrcpyn is particularly slow; I wondered whether another copy algo would make the HeapAlloc code faster.

LATER: I got the speed of the HeapAlloc() code up by substituting lstrcpyn with the following. It dropped its time by about 30%.


invoke MemCopy,edx,[edi],eax
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on April 28, 2008, 11:10:41 AM
LATER: I got the speed of the HeapAlloc() code up by substituting lstrcpyn with the following. It dropped its time by about 30%.


invoke MemCopy,edx,[edi],eax

It works, yes, but HeapAlloc is still considerably slower, by about 17%. Here are some stats on ticks & mem usage:

         Ticks   Mem   Handles   GDI
Heap     12000   87    12        4
SysAll   10000   87    14        4

with these settings:
Loops EQU 8000
Strings EQU 5000


Note that SysAlloc uses two handles more. HeapAlloc, on the other hand, is somewhat clumsier to use.

8000*5000*66000=2.64E+012 bytes shoved around, no crash. Apparently both versions know how to reuse the freed slots. A weak point of the test app might be that although the three strings are different in size, there is an exact old slot available for future use every time one of them is released.

New version attached, will first use SysAlloc, then HeapAlloc.

[attachment deleted by admin]

Jimg

I played around with your code just for fun.  By removing HEAP_ZERO_MEMORY (SysAllocStringByteLen doesn't zero its memory either) and putting the code from MemCopy inline, I get virtually identical times.  It certainly takes a lot more code, however.  Also, I only did 99 loops because I didn't want to sit through 8000, so that may make some difference.  This was done on an AMD; I'm sure P4s are much different.

[attachment deleted by admin]
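
Roughly what those two changes might look like in the allocation path; again a sketch rather than Jimg's attachment (all names invented), with the HEAP_ZERO_MEMORY flag dropped because the copy overwrites the block anyway, and the copy done with rep movsd/movsb instead of a call to MemCopy:

include \masm32\include\masm32rt.inc

.data
sDemo  db "hello world", 0

.data?
hHeap  dd ?
pBlock dd ?

.code
start:
    invoke GetProcessHeap
    mov hHeap, eax
    invoke lstrlen, addr sDemo
    lea ebx, [eax+1]                  ; ebx = bytes to copy, terminator included
    invoke HeapAlloc, hHeap, 0, ebx   ; plain allocation, no HEAP_ZERO_MEMORY
    mov pBlock, eax

    ; the copy done inline rather than through lstrcpyn or MemCopy
    mov edi, eax                      ; destination = the fresh block
    mov esi, offset sDemo             ; source
    mov ecx, ebx
    shr ecx, 2                        ; whole dwords first...
    rep movsd
    mov ecx, ebx
    and ecx, 3                        ; ...then the 0-3 leftover bytes
    rep movsb

    print pBlock, 13, 10              ; show the copy
    invoke HeapFree, hHeap, 0, pBlock
    invoke ExitProcess, 0
end start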

jj2007

Quote from: Jimg on April 28, 2008, 02:45:07 PM
I played around with your code just for fun.  By removing HEAP_ZERO_MEMORY (SysAllocStringByteLen doesn't zero its memory either) and putting the code from MemCopy inline, I get virtually identical times.  It certainly takes a lot more code, however.  Also, I only did 99 loops because I didn't want to sit through 8000, so that may make some difference.  This was done on an AMD; I'm sure P4s are much different.
I see the old 17% difference when running your exe (P4, 2.4 GHz, 512 MB RAM, XP SP1):
Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 19599

Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 23554

Jimg

I figured it was my flaky AMD.  I just ran it four times in a row from the command prompt:

R:\>tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17516
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17313

Press any key to continue ...

R:\>tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17406
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17296

Press any key to continue ...

R:\>tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17438
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17438

Press any key to continue ...

R:\>tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17484
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17390

Press any key to continue ...

R:\>

hutch--

Here are my results using Jim's test code. This is on my old 2.8 gig Northwood core PIV, 2 gig of DDR400.


Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 11672
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 11906

Press any key to continue ...


I am certainly not seeing the so-called advantage of HeapAlloc() here, and the OLE string still has the added advantage of being designed for large counts of small allocations, complete with length storage 4 bytes below its start address. For what it is worth, I have never seen HeapAlloc outperform GlobalAlloc when used with the GMEM_FIXED flag, which I use for the main pointer array.

As a humorous aside, when I was doing some testing recently on different memory strategies I forgot to change the deallocation from GlobalFree() to HeapFree(), yet GlobalFree() still successfully deallocated the memory allocated with HeapAlloc().
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php