Well, search no more. Here is the source code:
.686
.MODEL FLAT,STDCALL
OPTION CASEMAP:NONE
INCLUDE windows.inc
INCLUDE kernel32.inc
INCLUDELIB kernel32.lib
INCLUDE masm32.inc
INCLUDELIB masm32.lib
INCLUDE msvcrt.inc
INCLUDELIB msvcrt.lib
INCLUDE c:\masm32\macros\macros.asm
INCLUDE c:\masm32\macros\timers.asm
TEST_LIMIT1 = 100000h
TEST_LIMIT2 = 10000000h
INCREASE_STEP_1 = 4000h
INCREASE_STEP_2 = 80000h
LOOP_COUNT = 1
.DATA
hHeapApp DWORD 0
pMem DWORD 0
MemoryCounter DWORD 1000h
Rez1 DWORD 0
Rez2 DWORD 0
.CODE
DoATest PROC BytesCount:DWORD
print hex$(BytesCount), 9,"|",9
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
INVOKE HeapAlloc, hHeapApp,0,BytesCount
mov pMem, eax
counter_end
mov Rez1, eax
print str$(eax),9
.IF pMem!=0
INVOKE HeapFree, hHeapApp,0,pMem
write <"OK",9,"|",9>
mov pMem, 0
.ELSE
write <"FAILED",9,"|",9>
.ENDIF
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
INVOKE VirtualAlloc, NULL,BytesCount,MEM_COMMIT or MEM_RESERVE,PAGE_READWRITE
mov pMem, eax
counter_end
mov Rez2, eax
print str$(eax)
.IF pMem!=0
INVOKE VirtualFree, pMem,0,MEM_RELEASE
write <9,"OK",9,"|",9>
mov pMem, 0
.ELSE
write <9,"FAILED",9,"|",9>
.ENDIF
mov eax, Rez1
sub eax, Rez2
print str$(eax),9,"|",13,10
ret
DoATest ENDP
App_Entry_Point:
mov esi, INCREASE_STEP_1
mov edi, TEST_LIMIT1
INVOKE GetProcessHeap
mov hHeapApp, eax
write 13,10,"Indepth test...",13,10
write "=================================================================================",13,10
write "BYTES",9, 9,"|",9,"HA",9,"STS",9,"|",9,"VA",9,"STS",9,"|",9,"HA-VA",9,"|",13,10
write "=================================================================================",13,10
LoopTest_Heap:
INVOKE DoATest, MemoryCounter
add MemoryCounter, esi
cmp MemoryCounter, edi
jbe LoopTest_Heap
cmp edi, TEST_LIMIT2
je @F
mov esi, INCREASE_STEP_2
mov edi, TEST_LIMIT2
jmp LoopTest_Heap
@@:
write "=================================================================================",13,10
write "=================================================================================",13,10,13,10
write "Here is a small sample that you can post back...",13,10
write "=================================================================================",13,10
write "BYTES",9, 9,"|",9,"HA",9,"STS",9,"|",9,"VA",9,"STS",9,"|",9,"HA-VA",9,"|",13,10
write "=================================================================================",13,10
INVOKE DoATest, 1000h
INVOKE DoATest, 10000h
INVOKE DoATest, 100000h
INVOKE DoATest, 1000000h
INVOKE DoATest, 10000000h
write "=================================================================================",13,10
write "=================================================================================",13,10,13,10
inkey "Print any key to exit ..."
INVOKE ExitProcess, NULL
END App_Entry_Point
Here are my timings:
Quote
=================================================================================
BYTES | HA STS | VA STS | HA-VA |
=================================================================================
00001000 | 42601 OK | 15185 OK | 27416 |
00010000 | 39728 OK | 12582 OK | 27146 |
00100000 | 47115 OK | 6856 OK | 40259 |
01000000 | 47301 OK | 5986 OK | 41315 |
10000000 | 47303 OK | 10719 OK | 36584 |
=================================================================================
And here is the exe.
Quote from: hutch-- on December 13, 2009, 12:00:56 AM
Apart from people quoting out of date Microsoft data without keeping up on what the functions actually do...
:bg
While I can't speak for all MSDN data, this one is current. I just traced through it, and GlobalAlloc (with a GM_FIXED flag) indeed calls the same function as HeapAlloc, on a Vista system. The OP had it right all along: GlobalAlloc calls HeapAlloc, which (when needed) calls VirtualAlloc. Granted I haven't tested for speed myself, but to say doing three functions is faster than one seems bananas. Who's out of date now? :toothy
HeapAlloc() jumps right to here:
ntdll!RtlAllocateHeap:
76ed6570 8bff mov edi,edi
76ed6572 55 push ebp
blah blah
while GlobalAlloc eventually calls the same (line 76ae7d8a)
kernel32!GlobalAlloc:
76ae7d34 6a18 push 18h
76ae7d36 68a87dae76 push offset kernel32!GlobalAlloc+0x74 (76ae7da8)
76ae7d3b e8a01b0000 call kernel32!WaitForSingleObjectEx+0xe0 (76ae98e0)
76ae7d40 8b4508 mov eax,dword ptr [ebp+8]
76ae7d43 a98d80ffff test eax,0FFFF808Dh
76ae7d48 0f85f8650100 jne kernel32!PulseEvent+0x6845 (76afe346)
<blah blah>
76ae7d84 ff3550e2b676 push dword ptr [kernel32!WakeConditionVariable+0x4274 (76b6e250)]
76ae7d8a ff152010aa76 call dword ptr [kernel32+0x1020 (76aa1020)] ds:0023:76aa1020={ntdll!RtlAllocateHeap (76ed6570)}
r,
The clock says it all, one function call to retrieve the pointer and one function call to deallocate it.
mov hMem(1024*1024)
; just do_it
free hMem
Below all of the DLL layering memory is memory, pick your packaging to suit your purpose. GlobalAlloc() is simpler to use and with the depth of layering, I seriously doubt you will find a viable speed difference. The GlobalAlloc() of Win3 no longer exists and the warnings of it being deprecated are simply nonsense.
Quote from: z941998 on December 10, 2009, 05:25:43 AM
I use this technique for Power Ball, I have all combination in a table with stats, a 4 level search takes less than a second to get the stats I need for a specific combination. Using sequential search access method and something like BASIC, this might take a day or two to complete.
Going off the topic a bit.. I wrote a whole bit of Lotto code in VB6 (yes I know- but the idea was to put it into masm after I'd finished the structures - No time yet), but anyway with all the formulae and stats I threw into it, it could produce any stat within a second, without having a database, except a result table which takes less than 16KB of mem.
It was just the simulation/prediction side of the code which took a long time (I didn't optimise it) - about 3 hours :red I'm sure with masm I could get this down to a few minutes or so.. must do it, must do it... Xmas is coming :P
Quote from: TNick on December 13, 2009, 12:04:16 AM
Well, search no more. Here is the source code:
... Here are my timings:
BYTES | HA STS | VA STS | HA-VA |
=================================================================================
00001000 | 42601 OK | 15185 OK | 27416 |
00010000 | 39728 OK | 12582 OK | 27146 |
00100000 | 47115 OK | 6856 OK | 40259 |
01000000 | 47301 OK | 5986 OK | 41315 |
10000000 | 47303 OK | 10719 OK | 36584 |
Sure?
LOOP_COUNT = 1
My timings with LOOP_COUNT=10000, in cycles per kByte. Calling the behaviour erratic would be an understatement :wink
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
97 cycles for HeapAlloc 00001000 bytes
1484 cycles for VirtualAlloc 00001000 bytes
151 cycles for GlobalAlloc 00001000 bytes
48 cycles for HeapAlloc 00002000 bytes
1041 cycles for VirtualAlloc 00002000 bytes
76 cycles for GlobalAlloc 00002000 bytes
24 cycles for HeapAlloc 00004000 bytes
818 cycles for VirtualAlloc 00004000 bytes
38 cycles for GlobalAlloc 00004000 bytes
12 cycles for HeapAlloc 00008000 bytes
1050 cycles for VirtualAlloc 00008000 bytes
19 cycles for GlobalAlloc 00008000 bytes
47 cycles for HeapAlloc 00010000 bytes
1412 cycles for VirtualAlloc 00010000 bytes
16 cycles for GlobalAlloc 00010000 bytes
8 cycles for HeapAlloc 00020000 bytes
1371 cycles for VirtualAlloc 00020000 bytes
10 cycles for GlobalAlloc 00020000 bytes
5 cycles for HeapAlloc 00040000 bytes
1354 cycles for VirtualAlloc 00040000 bytes
6 cycles for GlobalAlloc 00040000 bytes
1347 cycles for HeapAlloc 00080000 bytes
1344 cycles for VirtualAlloc 00080000 bytes
1348 cycles for GlobalAlloc 00080000 bytes
UseMem=1, cycles per kByte=1
NumBytes= 4096
LOOP_COUNT= 10000
UseMemOn= 1
ShowPerK= 1
UseTheMem MACRO
if UseMemOn
mov edx, eax
mov ecx, 0
.Repeat
mov byte ptr [edx+ecx+123], 0 ; poke a byte in each page
add ecx, 4096
.Until ecx>=NumBytes
endif
ENDM
TestH MACRO
invoke HeapAlloc, ProHeap, HEAP_GENERATE_EXCEPTIONS or HEAP_NO_SERIALIZE, NumBytes
.if eax
UseTheMem
invoke HeapFree, ProHeap, HEAP_NO_SERIALIZE, eax
.else
invoke MessageBox, 0, chr$("HeapAlloc failed"), 0, MB_OK
exit
.endif
ENDM
You simply cannot time memory allocations accurately. It all depends on the amount of ram you have, how much is free, the size
of the pagefile, how much of that is free, what services are running, what your antivirus is doing, the phase of the moon...you get the idea?
Personally, to allocate a chunk (e.g. in a LoadFile proc) I use GlobalAlloc with GMEM_FIXED - just do_it indeed.
To allocate/deallocate over and over, I use the heap functions.
Just write your code in 64-bit, then you can allocate huge amounts at once. :bdg
Quote from: jj2007 on December 13, 2009, 09:20:45 AM
Sure?
LOOP_COUNT = 1
Yes. No loop between counter_begin and counter_end. Just do it once for heap, get the timing, free what we've just got, do it once for Virtual, get the timing, free.
To have a loop, Free function should be inserted in test, which I did avoid on purpose.
In your test,
after first Alloc-Free pair in heap, the Heap has the memory right there for it's use. It only needs to search it's free chunks list and will find a chunk in at least
9999 cases. The test only proves that, if you need to allocate and free again and again same ammount of memory, it's best to use the heap.
EDIT:
@sinsi:
No doubt about it. Still, by using the test program, you may see a trend, I think. I do not argue that is a big gain, I simply say that:
IF you need a big chunk of memory (above a page or two), VirtualAlloc may be faster than Heap or Global alloc. Nick
Quote from: TNick on December 13, 2009, 11:09:55 AM
Quote from: jj2007 on December 13, 2009, 09:20:45 AM
Sure?
LOOP_COUNT = 1
Yes. No loop between counter_begin and counter_end.
Please have a look at timers.asm
@Sinsi: I wouldn't be so pessimistic regarding the accuracy of timings. Of couse, if you need the swapfile, things change, but with today's RAM that would require really large requests. But for a fat 128MB request, for example, the timings are almost identical. The only lesson you can draw from the timings for the smaller amounts is "use HeapAlloc".
180559386 cycles for HeapAlloc 08000000 bytes
181753646 cycles for VirtualAlloc 08000000 bytes
180141043 cycles for GlobalAlloc 08000000 bytes
UseMem=1, cycles per kByte=0
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
168 cycles for HeapAlloc 00001000 bytes
1966 cycles for VirtualAlloc 00001000 bytes
156 cycles for GlobalAlloc 00001000 bytes
78 cycles for HeapAlloc 00002000 bytes
1587 cycles for VirtualAlloc 00002000 bytes
73 cycles for GlobalAlloc 00002000 bytes
36 cycles for HeapAlloc 00004000 bytes
1278 cycles for VirtualAlloc 00004000 bytes
56 cycles for GlobalAlloc 00004000 bytes
21 cycles for HeapAlloc 00008000 bytes
1061 cycles for VirtualAlloc 00008000 bytes
31 cycles for GlobalAlloc 00008000 bytes
12 cycles for HeapAlloc 00010000 bytes
1021 cycles for VirtualAlloc 00010000 bytes
16 cycles for GlobalAlloc 00010000 bytes
7 cycles for HeapAlloc 00020000 bytes
852 cycles for VirtualAlloc 00020000 bytes
10 cycles for GlobalAlloc 00020000 bytes
4 cycles for HeapAlloc 00040000 bytes
803 cycles for VirtualAlloc 00040000 bytes
6 cycles for GlobalAlloc 00040000 bytes
776 cycles for HeapAlloc 00080000 bytes
798 cycles for VirtualAlloc 00080000 bytes
801 cycles for GlobalAlloc 00080000 bytes
UseMem=1, cycles per kByte=1
Where did you get your timings for '08000000 bytes'?
Quote from: jj2007 on December 13, 2009, 11:15:09 AM
Please have a look at timers.asm
I did. What now? :)
As I said, the code between macro calls will be executed once. That is because you get strange results like yours when you repeat the request for memory for same amount. The user will - probably - not repeat it's request over and over again. Not the timings are important here,I think, but the difference, which almost always say Virtual is faster.
That being said, I would say we have allocated enough time from our short existence for this matter.
I wish everyone a good Sunday! :)
Nick
QuoteThat being said, I would say we have allocated enough time from our short existence for this matter.
lol - you don't know us very well, huh
i see no harm in using an allocate/free cycle for timing measurements
to be fair, if one method is faster at allocation but slower at freeing it, where is the advantage ?
if you pair them up, you can run them in a loop
the way i normally use large pieces of allocated memory, i am not likely to be that concerned about speed, anyways
i can see some instances where you may want to allocate and free blocks
but, i think it is better if you grab what you intend to use, and manage it from there with code
when you are done with it, then free it
that seems much more efficient that allocating several small blocks, no matter how you slice it ( ::) )
The trouble with an alloc/touch/free test regime is things that happen over and over tend to get cached and reused.
Allocate 4k gets a page table entry, but de-allocate it and windows might just leave it for a couple of seconds, just in case.
Then windows is justified because we allocate it again in a few milliseconds.
>i think it is better if you grab what you intend to use, and manage it from there with code
My point exactly. Hell, grab the whole 2 gig and parcel it up how you want.
Use the old /3gb and /largeaddressaware and you get 3 gig.
well - it is good to use what you need, when you need it, then release it for other processes to use
otherwise, you write memory-hog code, like yahoo messenger - lol
but, there is no need to make the OS micro-manage the blocks for you
i think that is a result of lazy programming :bg
I moved this topic to the Lab where timings and comments are more appropriate.
>it is good to use what you need, when you need it, then release it for other processes to use
heh, you can VirtualAlloc memory and reserve it - it's not actually mapped into your address space, just backed up by the pagefile (i.e. non-existent).
I don't see why windows doesn't do what dos did - allocate all memory.
Quote from: sinsi on December 13, 2009, 11:58:26 AM
The trouble with an alloc/touch/free test regime is things that happen over and over tend to get cached and reused.
That is valid within the size of the cache.
22460992 cycles for HeapAlloc 01000000 bytes
22609943 cycles for VirtualAlloc 01000000 bytes
22480093 cycles for GlobalAlloc 01000000 bytes
1000000h = 16 MB
Quote from: hutch-- on December 13, 2009, 06:29:32 AM
The clock says it all, one function call to retrieve the pointer and one function call to deallocate it.
I still don't think I understand; are the heap functions not also a single call to allocate (HeapAlloc) and one to free (HeapFree), during which the pointer you receive from HeapAlloc a direct pointer to a fixed memory area of that specific size? I was under the impression that moveable memory itself (reallocation notwithstanding) was fundamentally depreciated, if not the GlobalAlloc function. It was my understanding that on NT, memory, once allocated, was fixed, even if you used GlobalAlloc with the moveable option. Is that not so?
-r
r,
You have hit the distinction on the head, old Win3 movable memory is finished (fortunately) and the only viable allocation techniques are those provided by ntoskrnl.exe and the layering chain above it. This is why once memory is allocated by whatever package, there is no realistic speed difference between them as they all come from the same place.
There are a couple of techniques specific to rapid allocation and deallocation of small amounts of memory, OLE string is one, HeapAlloc() with the more recent low fragmentation heap is another but they both pay a price in speed terms for this level of memory management. Serious memory fragmentation problems are usually addressed by custom code, a web server that must stay online for long periods is one task that cannot just rely on the OS, they tend to use a large count circular buffer and keep track of each one as its no longer used by one connection so it can be set up for the next.
Then in what way is GlobalAlloc() "simpler" to use, if both calls use fixed memory, direct pointers, and the same underlying functions? Is there something I'm missing? It seems the only real difference would be that GlobalAlloc() essentially calls GetProcessHeap() for you, at the expense of some additional, yet mostly negligable, overhead. That hardly seems like a good enough reason to stay with an (ostensibly) depreciated API call; especially considering it's only a single argument which stays constant through the life of the process. Also, if that's the case, how do you back your claim of GlobalAlloc being fastest? There's quite a divide, especially on this fourm (and subforum), between "nothing is faster" and "there is no realistic speed difference".
-r
Timings once more, for alloc/touch/free. From what I see, there are some important observations:
- HeapAlloc is a lot faster for small allocations (we all knew that)
- VirtualAlloc suffers from overhead for the small ones
- up to ca. 1M, i.e. the cache size, roughly 600 cycles per kByte are "normal"
- above that, 1320 cycles per kByte become standard, and there is virtually no difference between the three candidates.
There is a REPEAT 8 before the counter_begin. You may use REP 9 to get 256 MB (ok), and REP 10 for 1 Giga. The latter is not really advisable unless you have a lot of RAM. On my notebook, 1 GB Ram, it started swapping, got really slow, and said byebye in the GlobalAlloc loop. Afterwards, the whole system remained very slow, and I had to reboot.
Now one last question: Why do these bloody function need over a cycle per byte if they just earmark memory as usable? There is no zeroinit code involved as far as I can see...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
104 cycles per kByte for HeapAlloc 00001000h bytes (4 kB)
1501 cycles per kByte for VirtualAlloc 00001000h bytes (4 kB)
154 cycles per kByte for GlobalAlloc 00001000h bytes (4 kB)
28 cycles per kByte for HeapAlloc 00004000h bytes (16 kB)
819 cycles per kByte for VirtualAlloc 00004000h bytes (16 kB)
38 cycles per kByte for GlobalAlloc 00004000h bytes (16 kB)
615 cycles per kByte for HeapAlloc 00010000h bytes (64 kB)
621 cycles per kByte for VirtualAlloc 00010000h bytes (64 kB)
191 cycles per kByte for GlobalAlloc 00010000h bytes (64 kB)
8 cycles per kByte for HeapAlloc 00040000h bytes (256 kB)
568 cycles per kByte for VirtualAlloc 00040000h bytes (256 kB)
7 cycles per kByte for GlobalAlloc 00040000h bytes (256 kB)
554 cycles per kByte for HeapAlloc 00100000h bytes (1 MB)
694 cycles per kByte for VirtualAlloc 00100000h bytes (1 MB)
1311 cycles per kByte for GlobalAlloc 00100000h bytes (1 MB)
1330 cycles per kByte for HeapAlloc 00400000h bytes (4 MB)
1324 cycles per kByte for VirtualAlloc 00400000h bytes (4 MB)
1322 cycles per kByte for GlobalAlloc 00400000h bytes (4 MB)
1312 cycles per kByte for HeapAlloc 01000000h bytes (16 MB)
1284 cycles per kByte for VirtualAlloc 01000000h bytes (16 MB)
1326 cycles per kByte for GlobalAlloc 01000000h bytes (16 MB)
1321 cycles per kByte for HeapAlloc 04000000h bytes (64 MB)
1330 cycles per kByte for VirtualAlloc 04000000h bytes (64 MB)
1328 cycles per kByte for GlobalAlloc 04000000h bytes (64 MB)
IIRC, *all* pages are initialized to zero as part of the security compliance; that way, another process can't accidently get a hold of another processess old data. "Free pages" are zeroed out to be "zeroed pages" in a low-priority background thread that only runs while no other threads are active, but if there are no zeroed pages availble it has to do it right then and there. Doing this many allocations would almost certainly deplete any zeroed pages existing before the test was run, which means later allocations would be stuck waiting; reversing the order of the tests could test that theory. Also, allocating memory is an expensive operation anyway; for a 64MB allocation, that's 16,000 different page table entries that have to be constructed, which ironically requires 16 *more* pages allocated to hold them, and PDE's created to point to them, etc, etc.-r
Quote from: redskull on December 13, 2009, 10:36:00 PM
IIRC, *all* pages are initialized to zero as part of the security compliance
A dump for the pointer returned by HeapAlloc starts with 0D F0 AD BA, reversed BAAD F00D.
Same for GlobalAlloc, while VirtualAlloc yields zeroes only.
If you add the HEAP_ZERO_MEMORY flag, allocation slows down considerably. So zeroing is slower than writing bad food???
I remember seeing that BAADFOOD code mentioned elsewhere before; I guess it's an internal check that the HeapManager does for itself. It seems to only be on pages initialized by the heap manager at the start of the process (preexisting before the call) When I did a HeapAlloc() for 4096, the page was "tacked on" to the end of the heap, and was BAADFOOD; when I did one for 1024*1024*1024, I got back zeroed pages way up high in memory, and they were all zeroed. Either way, I'm guessing they come back from the memory manager as zeroed.
-r
If anyone is serious about test memory ALLOCATION times, you do it by reallocating the memory size larger each time so that the test cannot keep using the same memory hole.
None of you guys have mentioned MemoryMapped files, which is the recommended method to messing file large files in memory instead of VirtualAlloc the whole thing, or reading it in smaller alloc's with global/heapallow, anyone test with memmapfiles? Also what about the SysAlloc function and the standard C Malloc function?
Quote from: redskull on December 13, 2009, 11:40:54 PM
I remember seeing that BAADFOOD code mentioned elsewhere before; I guess it's an internal check that the HeapManager does for itself. It seems to only be on pages initialized by the heap manager at the start of the process (preexisting before the call) When I did a HeapAlloc() for 4096, the page was "tacked on" to the end of the heap, and was BAADFOOD; when I did one for 1024*1024*1024, I got back zeroed pages way up high in memory, and they were all zeroed. Either way, I'm guessing they come back from the memory manager as zeroed.
Thanks, I could not test that yet but it seems plausible that on startup, the OS
- zeroes the .data? section
- writes BAADFOOD to the entire heap
So when you allocate "ordinary" heap without the HEAP_ZERO_MEMORY flag, you get bad food; if, however, the allocation exceeds your available heap, the OS switches internally to VirtualAlloc and gives you zeroed memory.
Any experts around who could confirm that based on more or less official documentation...?
Its not hard to tell you guys are driven by your stomach, in the days of YAW you made reference to 0xBADC0DE.
vote for 0xCAFECAFE
ABAD1DEA
1DEADF15h
0ADEADF15h (lol - 4 words!)
I always liked 0xDEADBEEF. :bg
Pete Townshend:
I don't want it grilled
I don't want it cooked
I don't even want it killed
If it's dead I heave
It makes me sick
So check that it can breathe
And bring it to me quick
I want food fast
I want fast food
lol - "preferably tattooed"
Hi,
The one I kept seeing was DEADFACE. I suppose
FEEDFACE would work.
Cheers,
Steve
:bg
I knew I got it right, stomach driven programming.
Hi,
Merry Christmas Hutch. Hope you attend some good
meals.
Cheers,
Steve N.
we go by our "gut instinct", Hutch :P
Quote from: FORTRANS on December 23, 2009, 03:27:59 PM
Merry Christmas Hutch. Hope you attend some good meals.
Yeah, Merry Christmas to Hutch and all the other folks. And don't forget, after the good meals, to pass by CAFEBABE :wink
Dinner AND a show :eek
(http://www.freakingnews.com/pictures/33500/HOOTERS--33655.jpg)