stack vs memory

Started by loki_dre, April 21, 2008, 07:13:17 PM


jj2007

Quote from: donkey on April 27, 2008, 09:27:56 PM
all you are doing with it is allocating memory and copying some bytes, I mean why do you need system strings for that ???
Good question. For example, I will need thousands of variable-length little strings, and I wonder about the implications of heap fragmentation:
"Memory allocated by HeapAlloc is not movable. Since the memory is not movable, it is possible for the heap to become fragmented."
The Masm32 alloc$ macro uses SysAllocStringByteLen, not HeapAlloc. Am I just entering an ideological war on the best memory allocation option??

donkey

Quote from: jj2007 on April 27, 2008, 09:54:39 PM
Quote from: donkey on April 27, 2008, 09:27:56 PM
all you are doing with it is allocating memory and copying some bytes, I mean why do you need system strings for that ???
Good question. For example, I will need thousands of variable-length little strings, and I wonder about the implications of heap fragmentation:
"Memory allocated by HeapAlloc is not movable. Since the memory is not movable, it is possible for the heap to become fragmented."
The Masm32 alloc$ macro uses SysAllocStringByteLen, not HeapAlloc. Am I just entering an ideological war on the best memory allocation option??

This is not an "ideological war", there are quantifiable reasons to use one scheme over the other and they are all well documented at MS and by Intel. I'm not sure how many times I will have to type this before it is understood, There is no moveable memory in Win32 in the sense Hutch was using it. The best allocation scheme for what you are doing is to use a linked list and HeapAlloc creating a private heap.

The MASM32 library waits for a process to end using a loop instead of the Wait functions, which is wrong, but it's in there all the same... Please, no arguments about the value of a loop to do this, I'm tired of that one.

Donkey
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

NightWare

hi donkey,
i'm not very familiar with the win api (i avoid it as much as possible...), but our maximum memory need is usually quite predictable, so why not allocate just one memory area, just once, and do our own manipulations inside it... we alloc an area with GlobalAlloc GMEM_FIXED/GMEM_ZEROINIT or HeapAlloc HEAP_ZERO_MEMORY for something like TotalArea = UsedArea+PotentiallyUsableArea

when we need to remove an area we just move (my movemem function has approximately the same speed as copymem) OFFSET DataToRemove+SIZEOF DataToRemove to OFFSET DataToRemove, where size=(OFFSET PotentiallyUsableArea-(OFFSET DataToRemove+SIZEOF DataToRemove)), or size=(OFFSET PotentiallyUsableArea-OFFSET DataToRemove) if we need zero memory

then loop over all the pointers (yep, forgot to say you need a LUT for the pointers and the corresponding sizes...) after OFFSET DataToRemove, where you just subtract SIZEOF DataToRemove

and when you need to add an area you just place it at the new OFFSET PotentiallyUsableArea... by doing this, there is no memory fragmentation, no need to constantly alloc/free mem, and (depending on the number/size of the areas) it will be quite fast, no ?
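
A rough sketch of the removal/compaction step NightWare describes, assuming a flat pointer/size LUT rather than a linked list; the pool setup and the add side are only hinted at, and every name here is invented for illustration:

include \masm32\include\masm32rt.inc

MAXBLOCKS equ 1000

.data?
pPool   dd ?                          ; base of the single big allocation
pFree   dd ?                          ; first byte of the unused tail of the pool
BlkPtr  dd MAXBLOCKS dup(?)           ; LUT: start address of each block, 0 = unused slot
BlkLen  dd MAXBLOCKS dup(?)           ; LUT: size of each block in bytes

.code
start:
    invoke GetProcessHeap
    invoke HeapAlloc, eax, 0, 65536   ; one fixed 64 KB pool for the whole app
    mov pPool, eax
    mov pFree, eax                    ; nothing carved out of the pool yet
    ; ... blocks would be carved out of the pool and entered in the LUT here ...
    invoke ExitProcess, 0

; slide everything above block <index> down over it, shrink the used area,
; then fix up every LUT pointer that sat above the removed block
RemoveBlock proc uses esi edi ebx index:DWORD
    mov ebx, index
    mov edi, [BlkPtr+ebx*4]           ; destination = start of the removed block
    mov edx, [BlkLen+ebx*4]           ; edx = number of bytes being removed
    lea esi, [edi+edx]                ; source = first byte after the removed block
    mov ecx, pFree
    sub ecx, esi                      ; bytes to slide down
    rep movsb                         ; compact the pool (dest < source, forward copy is safe)
    sub pFree, edx                    ; the used area shrank by the removed size

    mov eax, [BlkPtr+ebx*4]           ; old address of the removed block
    xor ecx, ecx                      ; walk the LUT and adjust the pointers above it
  @@:
    cmp ecx, ebx
    je next_slot                      ; skip the removed slot itself
    cmp [BlkPtr+ecx*4], eax
    jbe next_slot                     ; blocks at or below it did not move
    sub [BlkPtr+ecx*4], edx           ; this block slid down by edx bytes
  next_slot:
    inc ecx
    cmp ecx, MAXBLOCKS
    jb @B

    mov [BlkPtr+ebx*4], 0             ; mark the removed slot as free for reuse
    mov [BlkLen+ebx*4], 0
    ret
RemoveBlock endp

end start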

donkey

Quote from: NightWare on April 28, 2008, 12:51:11 AM
hi donkey,
i'm not very familiar with the win api (i avoid it as much as possible...), but our maximum memory need is usually quite predictable, so why not allocate just one memory area, just once, and do our own manipulations inside it... we alloc an area with GlobalAlloc GMEM_FIXED/GMEM_ZEROINIT or HeapAlloc HEAP_ZERO_MEMORY for something like TotalArea = UsedArea+PotentiallyUsableArea

when we need to remove an area we just move (my movemem function has approximately the same speed as copymem) OFFSET DataToRemove+SIZEOF DataToRemove to OFFSET DataToRemove, where size=(OFFSET PotentiallyUsableArea-(OFFSET DataToRemove+SIZEOF DataToRemove)), or size=(OFFSET PotentiallyUsableArea-OFFSET DataToRemove) if we need zero memory

then loop over all the pointers (yep, forgot to say you need a LUT for the pointers and the corresponding sizes...) after OFFSET DataToRemove, where you just subtract SIZEOF DataToRemove

and when you need to add an area you just place it at the new OFFSET PotentiallyUsableArea... by doing this, there is no memory fragmentation, no need to constantly alloc/free mem, and (depending on the number/size of the areas) it will be quite fast, no ?


Hi NightWare,

That's pretty much how I would do it. Using a linked list makes ordering simple, and by writing your own memory management you can optimize it for your particular application.

Donkey
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

hutch--

There appears to be some confusion between the old 1990 win3.0 movable memory and modern garbage collection. Here is one of many links on how some of it works.

http://support.microsoft.com/kb/171414
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

donkey

Quote from: hutch-- on April 28, 2008, 02:43:55 AM
There appears to be some confusion between the old 1990 win3.0 movable memory and modern garbage collection. Here is one of many links on how some of it works.

http://support.microsoft.com/kb/171414

COM uses IUnknown.Release when a program terminates in order to allow the reference count to be decremented if the COM thread for an application is no longer running (i.e. pinging the interface). The garbage collection here is simply a de-allocation of memory that is no longer referenced by any running process, not a compaction of in-use memory or even a simple garbage collect of in-use memory. This is similar to the way the API releases handles and the process heap even if your program does not explicitly do so on termination; that is another form of garbage collection, but like the one described in the article it has nothing to do with garbage collection in a large array that is still in use, which is the actual topic under discussion.

Also, it is wrong to refer to memory that is being marshalled as "moveable" memory; they are not even close to the same thing, and a closer parallel would be "shared" memory. Nowhere in the article did they refer to the memory as "moveable", since that concept has no meaning in a flat memory model.

Donkey
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

hutch--

Edgar,

> since that concept has no meaning in a flat memory model.

Memory management schemes are independent of the memory model they are used under; you are confusing meaning with implementation. There is of course a bit more to the choice of allocation method than just memory recovery (garbage collection): an OLE string stores its length when it is allocated, whereas fixed memory schemes like GlobalAlloc() and HeapAlloc() don't, and this saves either scanning the string each time you need its length or having to store it somewhere else. An OLE string is marginally slower to allocate than fixed memory strategies, but that is predictable as it does slightly more.
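
A minimal sketch of that length storage, using only the OLE string calls already seen in this thread (the byte length sits in the dword just below the pointer SysAllocStringByteLen returns):

include \masm32\include\masm32rt.inc

.code
start:
    invoke SysAllocStringByteLen, chr$("twelve bytes"), 12
    mov esi, eax                      ; esi -> the string data itself
    mov eax, [esi-4]                  ; the byte length lives just below the data
    print str$(eax), 13, 10           ; prints 12, no string scan needed
    invoke SysStringByteLen, esi      ; the documented way to read the same value
    print str$(eax), 13, 10           ; prints 12 again
    invoke SysFreeString, esi
    invoke ExitProcess, 0
end start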

Nonetheless, I give you the point that automatic compaction does not occur without a mechanism written to do that job, in much the same manner as Java or BASIC garbage collection. Unfortunately, fragmentation testing is not a simple task, as sequential allocation in simple test pieces gives you linear addressing for each block. To test this capacity between different memory allocation strategies you need a reproducible pattern of deallocations and reallocations to see if you get any difference from one method to another.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on April 28, 2008, 08:22:11 AM
To test this capacity between different memory allocation strategies you need a reproducible pattern of deallocations and reallocations to see if you get any difference from one method to another.

OK, let's try:

include \masm32\include\masm32rt.inc

ShowWinErr PROTO:DWORD

.data?
ShowErrBuf dd 100 dup(?) ; 400 bytes for Windows error messages
StringTable dd 2000 dup (?) ; pointers to 2000 string variables

.data
ShowErr db "Windows error:", 13, 10, 0

s10 db "1234567890", 0 ; ten bytes
s2k db 200 dup ("x234567890"), 0 ; 2000
s64k db 6400 dup ("A234567890"), 0 ; 64000

.code

start: pushad
xor esi, esi ; outer loop counter ->1000*2000*66000=132 000 000 000
.While esi<1000
xor ebx, ebx ; inner loop counter
.While ebx<2000
lea edi, [StringTable+4*ebx]
mov eax, [edi]
.if eax
invoke SysFreeString, eax
.endif
mov edx, offset s10 ; default: a short string
mov eax, ebx
add eax, esi
bt eax, 0
.if CARRY? ; odd
mov edx, offset s2k
.else
bt eax, 1
.if CARRY?
mov edx, offset s64k
.endif
.endif
invoke SysAllocStringByteLen, edx, len(edx)
.if eax
mov [edi], eax
.else
invoke ShowWinErr, 1 ; show formatted error
mov esi, 999999 ; get out of the outer loop, too
.break
.endif
; invoke MessageBox, 0, eax, chr$("The String:"), MB_OK
inc ebx ; inner loop counter
.Endw
inc esi ; outer loop counter
; give us some feedback on where you are...
print chr$(13, 10, "Loop ")
print str$(esi)
print chr$(13, 10, "Length of first string is ")
lea edi, [StringTable]
invoke SysStringByteLen, [edi]
print str$(eax)
; invoke StdOut, [edi] ; the string itself
.Endw
.While ebx ; free the strings allocated in the last inner loop
dec ebx
lea edi, [StringTable+4*ebx]
mov eax, [edi]
.if eax
invoke SysFreeString, eax
.endif
.Endw
; invoke ShowWinErr, 1
if 0 ; try the same with The Heap? (note: ProcHeap, MyText and MyHeap$ would need to be declared before enabling this block)
invoke GetProcessHeap ; for FileOpen, Save, Menus
mov ProcHeap, eax
invoke HeapAlloc, ProcHeap,
HEAP_GENERATE_EXCEPTIONS or HEAP_ZERO_MEMORY, sizeof MyText
mov MyHeap$, eax
invoke lstrcpy, eax, addr MyText
invoke MessageBox, 0, MyHeap$, chr$("The Heap String:"), MB_OK
invoke HeapFree, ProcHeap, 0, MyHeap$
invoke ShowWinErr, chr$("HeapFree")
invoke HeapCompact, ProcHeap, 0
invoke MessageBox, 0, str$(eax), chr$("HeapCompact:"), MB_OK
endif
popad
invoke ExitProcess, 0

ShowWinErr proc ShowFormatted:DWORD
pushad
invoke GetLastError
.if ShowFormatted
invoke FormatMessage,FORMAT_MESSAGE_FROM_SYSTEM,
0, ; GetItFromSystem
eax, ; ErrNum
0, ; Default language
addr ShowErrBuf, ; where to send the string from system
400, 0          ; size 400, no arguments
.if ShowFormatted!=1
invoke lstrcat, addr ShowErrBuf, ShowFormatted
.endif
invoke MessageBox, 0, addr ShowErrBuf, addr ShowErr, MB_OK
.else
invoke MessageBox, 0, str$(eax), addr ShowErr, MB_OK
.endif
popad
ret
ShowWinErr endp

end start

jj2007

Quote from: hutch-- on April 28, 2008, 08:22:11 AM
To test this capacity between different memory allocation strategies you need a reproducible pattern of deallocations and reallocations to see if you get any difference from one method to another.

Here is a testbed, hopefully correct. Set UseHeap EQU 0 or 1 to assemble either version.

Neither version crashed, but HeapAlloc takes roughly three times as long for the same task.

[attachment deleted by admin]
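
Since the attachment is gone, here is a guess at what the UseHeap switch might have looked like as conditional assembly; this is a sketch only, not jj2007's actual testbed, and the timed loops are left out:

include \masm32\include\masm32rt.inc

UseHeap EQU 0                         ; 0 = OLE string, 1 = process heap

.data
sDemo db "hello world", 0

.data?
hHeap dd ?

.code
start:
    invoke lstrlen, addr sDemo
    mov ebx, eax                      ; ebx = byte length of the test string

  IF UseHeap
    invoke GetProcessHeap
    mov hHeap, eax
    lea eax, [ebx+1]                  ; room for the terminator
    invoke HeapAlloc, hHeap, HEAP_ZERO_MEMORY, eax
    mov esi, eax
    invoke lstrcpy, esi, addr sDemo   ; HeapAlloc gives raw memory, copy by hand
  ELSE
    invoke SysAllocStringByteLen, addr sDemo, ebx   ; allocates and copies in one call
    mov esi, eax
  ENDIF

    print esi, 13, 10                 ; show the copy

  IF UseHeap
    invoke HeapFree, hHeap, 0, esi
  ELSE
    invoke SysFreeString, esi
  ENDIF

    invoke ExitProcess, 0
end start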

hutch--

jj,

Is it possible to use a faster string copy algo?


mov [edi], eax ; pointer into stringtable
mov eax, len(edx)
invoke lstrcpyn, [edi], edx, eax


lstrcpyn is particularly slow; I wondered whether another copy algo would make the HeapAlloc code faster.

LATER: I got the speed of the HeapAlloc() code up by substituting lstrcpyn with the following. It dropped its time by about 30%.


invoke MemCopy,edx,[edi],eax
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on April 28, 2008, 11:10:41 AM
LATER: I got the speed of the HeapAlloc() code up by substituting lstrcpyn with the following. It dropped its time by about 30%.


invoke MemCopy,edx,[edi],eax

It works, yes, but HeapAlloc is still considerably slower, by about 17%. Here are some stats on ticks & mem usage:

         Ticks   Mem   Handles   GDI
Heap     12000   87    12        4
SysAll   10000   87    14        4

with these settings:
Loops EQU 8000
Strings EQU 5000


Note that SysAlloc uses two handles more. HeapAlloc, on the other hand, is somewhat clumsier to use.

8000*5000*66000=2.64E+012 bytes shoved around, no crash. Apparently both versions know how to reuse the freed slots. A weak point of the test app might be that although the three strings are different in size, there is an exact old slot available for future use every time one of them is released.

New version attached, will first use SysAlloc, then HeapAlloc.

[attachment deleted by admin]

Jimg

I played around with your code just for fun.  By removing HEAP_ZERO_MEMORY (SysAllocStringByteLen doesn't zero its memory either) and putting the code from MemCopy inline, I get virtually identical times.  It certainly takes a lot more code, however.  Also, I only did 99 loops because I didn't want to sit through 8000, so that may make some difference.  This was done on an AMD; I'm sure P4s are much different.

[attachment deleted by admin]
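
Roughly what those two changes might look like in the allocation path; again a sketch rather than Jimg's attachment (all names invented), with the HEAP_ZERO_MEMORY flag dropped because the copy overwrites the block anyway, and the copy done with rep movsd/movsb instead of a call to MemCopy:

include \masm32\include\masm32rt.inc

.data
sDemo  db "hello world", 0

.data?
hHeap  dd ?
pBlock dd ?

.code
start:
    invoke GetProcessHeap
    mov hHeap, eax
    invoke lstrlen, addr sDemo
    lea ebx, [eax+1]                  ; ebx = bytes to copy, terminator included
    invoke HeapAlloc, hHeap, 0, ebx   ; plain allocation, no HEAP_ZERO_MEMORY
    mov pBlock, eax

    ; the copy done inline rather than through lstrcpyn or MemCopy
    mov edi, eax                      ; destination = the fresh block
    mov esi, offset sDemo             ; source
    mov ecx, ebx
    shr ecx, 2                        ; whole dwords first...
    rep movsd
    mov ecx, ebx
    and ecx, 3                        ; ...then the 0-3 leftover bytes
    rep movsb

    print pBlock, 13, 10              ; show the copy
    invoke HeapFree, hHeap, 0, pBlock
    invoke ExitProcess, 0
end start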

jj2007

Quote from: Jimg on April 28, 2008, 02:45:07 PM
I played around with your code just for fun.  By removing HEAP_ZERO_MEMORY (SysAllocStringByteLen doesn't zero its memory either) and putting the code from MemCopy inline, I get virtually identical times.  It certainly takes a lot more code, however.  Also, I only did 99 loops because I didn't want to sit through 8000, so that may make some difference.  This was done on an AMD; I'm sure P4s are much different.
I see the old 17% difference when running your exe (P4, 2.4 GHz, 512 MB RAM, XP SP1):
Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 19599

Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 23554

Jimg

I figured it was my flaky AMD.  I just ran it four times in a row from the command prompt:

R:\>tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17516
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17313

Press any key to continue ...

R:\>tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17406
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17296

Press any key to continue ...

R:\>tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17438
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17438

Press any key to continue ...

R:\>tst

Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 17484
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 17390

Press any key to continue ...

R:\>

hutch--

Here are my results using Jim's test code. This is on my old 2.8 gig Northwood core PIV, 2 gig of DDR400.


Loop 99  Length of first string is 64000
Ticks, UseHeap=0 : 11672
Loop 99  Length of first string is 64000
Ticks, UseHeap=1 : 11906

Press any key to continue ...


I am certainly not seeing the so-called advantage of HeapAlloc() here, and the OLE string still has the added advantage of being designed for large counts of small allocations, complete with length storage 4 bytes below its start address. For what it is worth, I have never seen HeapAlloc outperform GlobalAlloc when used with the GMEM_FIXED flag, which I use for the main pointer array.

As a humorous aside, when I was doing some testing recently on different memory strategies I forgot to change the deallocation from GlobalFree() to HeapFree(), yet GlobalFree() still successfully deallocated the memory allocated with HeapAlloc().
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php