XP memory alloc / Low Fragmentation Heap

Started by chep, June 15, 2005, 04:31:57 AM

chep

Hi,

Due to Eóin's, Mark Jones' & hutch--'s posts in the "XP big memory alloc" thread I started playing with memory allocation functions.

I also wanted to test the new Windows XP & Server 2k3 Low Fragmentation Heap (see HeapSetInformation in the Platform SDK) as I thought intuitively that it would provide better performance for small objects.
Quote from: MSDN
Because the system cannot compact a private heap, it can become fragmented.
Applications that allocate large amounts of memory in various allocation sizes can use the low-fragmentation heap to reduce heap fragmentation.
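
For readers who want to try the LFH outside the benchmark, here is a minimal C sketch of switching the default process heap over to it. It assumes the documented `HeapSetInformation` call with `HeapCompatibilityInformation` and the value 2; `enable_lfh` is a hypothetical helper name, and the non-Windows stub is there only so the snippet builds elsewhere.

```c
#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: ask Windows to use the low-fragmentation heap
   for the default process heap. Returns 1 on success, 0 on failure. */
int enable_lfh(void)
{
    HANDLE heap = GetProcessHeap();
    ULONG mode = 2;  /* 2 selects the low-fragmentation heap */
    return HeapSetInformation(heap, HeapCompatibilityInformation,
                              &mode, sizeof mode) ? 1 : 0;
}
#else
/* Stub so the sketch compiles on other systems: there is no LFH to enable. */
int enable_lfh(void)
{
    return 0;
}
#endif
```

Call it once at startup, before the allocations you want to measure, and check the result: according to MSDN the request is refused for heaps created with HEAP_NO_SERIALIZE or a fixed size, and when heap debugging tools are active.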


Routines:
    - GlobalAlloc
    - Ultrano's SmallAlloc
    - HeapAlloc
    - SysAllocStringByteLen (OLE Strings)

NOTE: I used Ultrano's SmallAlloc "out of the box" so maybe there's a way to tweak it a bit.


Patterns:
  8192 successive object allocations (followed by 8192 deallocations), 1024 iterations:
    - with 16 byte blocks, with & without  LFH (malloc-16b-*.exe)
    - with 64 kb blocks, with & without LFH (malloc-64kb-*.exe)

Sadly I couldn't test the 64kb versions, as my current machine has very limited resources and the system goes wild after a few seconds. :(

The results are quite consistent over different runs, but when I moved the OLE Strings test around I found that it slows down subsequent allocations :eek, at least for the 16 byte version (so I put it at the end).

Quite obviously, the LFH-enabled versions will only run under Windows XP. :toothy
Here are the results for 16-byte blocks on my AMD Duron 650 MHz:

H:\Dev\Asm\MAlloc>MAlloc-16b-noLFH
(De)Allocating 8192 objects (16 bytes) 1024 times, LFH disabled...
GlobalAlloc  : 11742 ms
SmallAlloc   : 10921 ms
HeapAlloc    : 10506 ms
OLEString    : 17164 ms
-------------------------------------------------------------------
H:\Dev\Asm\MAlloc>MAlloc-16b-LFH
(De)Allocating 8192 objects (16 bytes) 1024 times, LFH enabled...
GlobalAlloc  : 6267 ms
SmallAlloc   : 10832 ms
HeapAlloc    : 5317 ms
OLEString    : 8802 ms


As I expected, the Low Fragmentation Heap is far faster than the other routines.
I also found that LFH not only influences HeapAlloc but also GlobalAlloc & OLE Strings, so I guess those 2 routines internally use HeapAlloc...

The big surprise is that OLE Strings are really slow compared to the other functions! (contrary to what hutch said in the other thread)
And obviously, SmallAlloc doesn't benefit from the LFH since it doesn't use HeapAlloc for allocating small blocks...


So, as a conclusion: always prefer HeapAlloc!

I'd be glad if someone could test the 64kb versions, as I'm quite curious about the LFH influence on big memory blocks, as it is designed for managing small blocks.
<edit>Of course, I know that VirtualAlloc will perform better for big blocks, but well, I'm just experimenting :P</edit>

Awaiting your comments :wink

<edit>Removed OLE Strings (clearly too slow), added VirtualAlloc, & a flag to dirty the allocated memory.</edit>

[attachment deleted by admin]

hutch--

Thanks for writing the benchmark.

Here is a modified version with different loop counts and buffer sizes.


(De)Allocating 65536 objects (16 bytes) 1024 times, LFH disabled...
GlobalAlloc  : 40131 ms
SmallAlloc   : 129216 ms
HeapAlloc    : 37692 ms
OLEString    : 47808 ms


Next.


(De)Allocating 8192 objects (512 bytes) 1024 times, LFH disabled...
GlobalAlloc  : 10964 ms
SmallAlloc   : 65420 ms
HeapAlloc    : 10550 ms
OLEString    : 11862 ms


Next.


(De)Allocating 1024 objects (16384 bytes) 1024 times, LFH disabled...
GlobalAlloc  : 7493 ms
SmallAlloc   : 28762 ms
HeapAlloc    : 7596 ms
OLEString    : 7795 ms


Next.


(De)Allocating 128 objects (1048576 bytes) 1024 times, LFH disabled...
GlobalAlloc  : 1168 ms
SmallAlloc   : 1185 ms
HeapAlloc    : 1170 ms
OLEString    : 1722 ms


Next.


(De)Allocating 16384 objects (8 bytes) 1024 times, LFH disabled...
GlobalAlloc  : 9573 ms
SmallAlloc   : 15733 ms
HeapAlloc    : 9172 ms
OLEString    : 10807 ms


I have tested this on win2k sp4 and the results show that HeapAlloc is clearly faster on small repeated allocations where GlobalAlloc is faster on larger allocations. An unusual result as it is poorly suited for repeated allocations and is best used for large single allocations where fixed memory is required. The OLE string memory allocation is slower in relation to what it used to be under win9x.

I think Ultrano's technique is dedicated to small overhead but its average performance is reasonable enough, even though it is slower under these test conditions.

Have you got the time to plug in a VirtualAlloc() test as well ?

Just one suggestion on the benchmark, see if there is a way to fully flush memory between tests or set a delay after deallocation so that a previous operation does not leave the memory as a mess for the next one.
Download site for MASM32: https://masm32.com
New MASM Forum: https://masm32.com/board/index.php

hutch--

Now here is an example where Ultrano's version is kicking arse. It seems to have an advantage where the block size is not a simple aligned size.


(De)Allocating 8192 objects (55 bytes) 1024 times, LFH disabled...
GlobalAlloc  : 5201 ms
SmallAlloc   : 4331 ms
HeapAlloc    : 4901 ms
OLEString    : 6291 ms

chep

Quote from: hutch-- on June 15, 2005, 05:47:36 AM
Just one suggestion on the benchmark, see if there is a way to fully flush memory between tests or set a delay after deallocation so that a previous operation does not leave the memory as a mess for the next one.
I have absolutely no idea how to really flush the memory, but adding a Sleep of at least 500ms between each test improved the performance of all tests but the first (adding a Sleep before the first test didn't change its performance).

Quote from: hutch-- on June 15, 2005, 05:47:36 AM
...and the results show that HeapAlloc is clearly faster on small repeated allocations where GlobalAlloc is faster on larger allocations. An unusual result as it is poorly suited for repeated allocations and is best used for large single allocations where fixed memory is required.
Quite surprising indeed! Anyone got a theory?

Quote from: hutch-- on June 15, 2005, 05:59:57 AM
Now here is an example where Ultrano's version is kicking arse. It seems to have an advantage where the block size is not a simple aligned size.
Good point! :U I didn't even bother to test with unaligned data sizes as I thought that it was *obvious* it would be slower than with aligned sizes.

Quote from: hutch-- on June 15, 2005, 05:47:36 AM
Have you got the time to plug in a VirtualAlloc() test as well ?
Here's a new version.
I removed the OLEString test, as clearly it is far slower than the others in every case.

I also added VirtualAlloc.
When using its MEM_COMMIT flag, it is really fast :
(De)Allocating 128 objects (4096 bytes) 1024 times, LFH disabled...
GlobalAlloc  : 2459 ms
SmallAlloc   : 2855 ms
HeapAlloc    : 2436 ms
VirtualAlloc : 514 ms


So I added a mov instruction (for every benchmark in order to be fair) to "dirty" the first byte of the allocated area:
(De)Allocating 128 objects (4096 bytes, dirty) 1024 times, LFH disabled...
GlobalAlloc  : 2468 ms
SmallAlloc   : 2867 ms
HeapAlloc    : 2421 ms
VirtualAlloc : 2878 ms

(De)Allocating 128 objects (8192 bytes, dirty) 1024 times, LFH disabled...
GlobalAlloc  : 3271 ms
SmallAlloc   : 5949 ms
HeapAlloc    : 3238 ms
VirtualAlloc : 2870 ms


As you can see, in the 4kb case VirtualAlloc is then slower than the HeapAlloc function. I think the 8kb case is quicker than the 4kb one because only the first page is dirtied (not tested though), so let's focus on the 4kb case.

When enabling the LFH, it is even worse !!
(De)Allocating 128 objects (4096 bytes, dirty) 1024 times, LFH enabled...
GlobalAlloc  : 128 ms
SmallAlloc   : 2889 ms
HeapAlloc    : 112 ms
VirtualAlloc : 2998 ms


I guess it becomes slower because VirtualAlloc has to zero-fill the memory. :(
Does anyone know how (if it is possible at all) to prevent the memory from being zero-filled when using VirtualAlloc (I tried different flag combinations but I always got access violations) ??

So how come HeapAlloc is faster than VirtualAlloc once the area is fully committed ?
The only explanation I can see is that:
- the memory allocated by HeapAlloc is not zero-filled (how ? it really doesn't seem to call any allocation function, nor any syscall ! -- I stepped into the kernel :P)
- this leads me to think that the overhead we see when dirtying a page allocated by VirtualAlloc has already taken place in the case of HeapAlloc.

To test the second theory, I added a "New Heap" benchmark that relies on HeapAlloc, but first creates a new heap with HeapCreate. The new heap is created with an initial size of 0, so any overhead can be measured. Also, if the LFH flag is enabled, this new heap is configured to support it.
The HeapCreate call is placed out of the timing loop, but putting it inside the timing loop doesn't really change the results...

And here are the results :
(De)Allocating 128 objects (4096 bytes, dirty) 1024 times, LFH disabled...
GlobalAlloc  : 2364 ms
SmallAlloc   : 2860 ms
HeapAlloc    : 2353 ms
VirtualAlloc : 2771 ms
New Heap     : 2314 ms

(De)Allocating 128 objects (4096 bytes, dirty) 1024 times, LFH enabled...
GlobalAlloc  : 128 ms
SmallAlloc   : 2860 ms
HeapAlloc    : 111 ms
VirtualAlloc : 2771 ms
New Heap     : 2320 ms


In the second case (LFH enabled) it is obvious that the new heap has a great overhead, but it is still faster than VirtualAlloc!

At this point, I'm getting a headache trying to understand what really happens, I'm hungry (it's 1:00 PM here) and I can only say one last thing : sorry for the long post... :toothy

PS: attached is the latest version of the awful thing. <edit>the attachment in the first post has been updated</edit>

Mark Jones

Read about the LFH on MSDN. WinXP/Server 2003 only.

How are we expected to learn this stuff without stumbling onto it? ::)

LFH can handle blocks from 8B-16kB. Another interesting test might be to allocate random-sized blocks within that range.
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

RuiLoureiro

Here are my XP results:

; Running from Qeditor
; Assembled in Quick Editor in Project -> Console Assemble & Link

(De)Allocating 128 objects (4096 bytes, dirty) 1024 times, LFH disabled...

GlobalAlloc   : 1 283 ms   ; (1)           ; *3 = 3 849
SmallAlloc    : 3 421 ms                   ; *3 =10 263
HeapAlloc     : 1 331 ms   ; (3)           ; *3 = 3 993
VirtualAlloc  : 1 571 ms                   ; *3 = 4 713
New Heap      : 1 314 ms   ; (2)           ; *3 = 3 942


(De)Allocating 428 objects (4096 bytes, dirty) 1024 times, LFH disabled...

GlobalAlloc  :  5 718 ms  ; (2)
SmallAlloc   : 15 446 ms
HeapAlloc    :  5 719 ms  ; (3)
VirtualAlloc :  6 501 ms
New Heap     :  5 511 ms  ; (1)


(De)Allocating 384 objects (4096 bytes, dirty) 1024 times, LFH disabled...
                  = 3* 128

GlobalAlloc  :  5 107 ms  ; (3)
SmallAlloc   : 13 620 ms
HeapAlloc    :  5 058 ms  ; (2)
VirtualAlloc :  5 788 ms
New Heap     :  5 024 ms  ; (1)


How to enable LFH ?

chep

Mark,
Thanks for posting the MSDN link, it is definitely more relevant than the one I posted. :U


RuiLoureiro,
You can enable the LFH by setting the USE_LOWFRAGHEAP constant to 1 instead of 0 in the source code.
If you have enough free RAM, I think it would be interesting to compare timings for big blocks (OBJECT_SIZE = 64*1024 at least) with & without LFH, to see how LFH performs with such big blocks although it has been designed to handle small blocks.


PS:
- I just realized that my previous post is a bit off topic from the original LFH subject, but I got trapped into the "VirtualAlloc is slower than HeapAlloc" thing. Sorry about that...
- I modified the "dirty" method to be able to dirty every allocated page in case of blocks >4kb. The attachment in the first post has been updated.

RuiLoureiro

chep,
        Here are the results

Using the last Malloc.zip AS BEFORE

(De)Allocating 128 objects (65 536 bytes, dirty) 1024 times, LFH disabled...

GlobalAlloc  :  3 207 ms    ;(2)          *
SmallAlloc   : 57 135 ms
HeapAlloc    :  3 176 ms    ;(3)    ** better
VirtualAlloc :  1 723 ms    ;(1)               »»
New Heap     :  3 228 ms   


(De)Allocating 128 objects (65 536 bytes, dirty) 1024 times, LFH enabled...

GlobalAlloc  :  3 203 ms    ;(2)          * better
SmallAlloc   : 57 046 ms
HeapAlloc    :  3 209 ms    ;(3)    **
VirtualAlloc :  1 709 ms    ;(1)               »» better   =>   First
New Heap     :  3 237 ms   

;-----------------------------------------------------------------------------
(De)Allocating 128 objects (131 072 bytes, dirty) 1024 times, LFH disabled...

GlobalAlloc  :  1 155 ms    ;(2)          * better
SmallAlloc   : 74 291 ms
HeapAlloc    :  1 133 ms    ;(1)                »» better  >>>> BETTER
VirtualAlloc :  1 966 ms    ;(3)    **
New Heap     :  3 468 ms


(De)Allocating 128 objects (131 072 bytes, dirty) 1024 times, LFH enabled...

GlobalAlloc  :  1 161 ms    ;(2)          *
SmallAlloc   : 74 089 ms
HeapAlloc    :  1 134 ms    ;(1)                »»
VirtualAlloc :  1 951 ms    ;(3)    ** better
New Heap     :  3 459 ms

;+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Modified "dirty" method ( with NEW Malloc.zip from first post )

(De)Allocating 128 objects (131 072 bytes, dirty) 1024 times, LFH disabled...

GlobalAlloc  : 13 180 ms    ;(2)
SmallAlloc   : 74 105 ms
HeapAlloc    : 13 136 ms    ;(1)
VirtualAlloc : 46 689 ms
New Heap     : 45 963 ms


(De)Allocating 128 objects (131 072 bytes, dirty) 1024 times, LFH enabled...

GlobalAlloc  : 13 279 ms    ;(2)
SmallAlloc   : 74 010 ms
HeapAlloc    : 13 019 ms    ;(1)
VirtualAlloc : 46 799 ms
New Heap     : 45 963 ms

;+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(De)Allocating 128 objects (524 288 bytes, dirty) 1024 times, LFH disabled...

GlobalAlloc  : 186 180 ms    ;(3)
SmallAlloc   : 186 429 ms
HeapAlloc    : 185 985 ms    ;(1)
VirtualAlloc : 186 135 ms    ;(2)
New Heap     : **********  Malloc found a problem and will be quit


(De)Allocating 128 objects (524 288 bytes, dirty) 1024 times, LFH enabled...

GlobalAlloc  : 186 242 ms    ;(4)
SmallAlloc   : 186 157 ms    ;(3)
HeapAlloc    : 185 384 ms    ;(2)
VirtualAlloc : 185 354 ms    ;(1)
New Heap     : **********  Malloc found a problem and will be quit


Good work!


chep

Hi,

RuiLoureiro thanks for testing this.


First I want to make a comment on the original benchmarks : until I introduced the latest "dirty" method, which touches every allocated page, I believe results for big allocations were not fully correct, due to Windows' paging mechanism. We can see that in RuiLoureiro's last tests :

Quote from: RuiLoureiro on June 15, 2005, 09:52:38 PM
Using the last Malloc.zip AS BEFORE
(De)Allocating 128 objects (131 072 bytes, dirty) 1024 times, LFH disabled...

GlobalAlloc  :  1 155 ms    ;(2)          * better
SmallAlloc   : 74 291 ms
HeapAlloc    :  1 133 ms    ;(1)                »» better  >>>> BETTER
VirtualAlloc :  1 966 ms    ;(3)    **
New Heap     :  3 468 ms

;+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Modified "dirty" method ( with NEW Malloc.zip from first post )
(De)Allocating 128 objects (131 072 bytes, dirty) 1024 times, LFH disabled...

GlobalAlloc  : 13 180 ms    ;(2)
SmallAlloc   : 74 105 ms
HeapAlloc    : 13 136 ms    ;(1)
VirtualAlloc : 46 689 ms
New Heap     : 45 963 ms


The only difference between the two sets of tests is the new "dirty" method.
The original one was :

mov     BYTE PTR [eax], 0 ; where eax = newly allocated block

while the new method is :

DIRTY_Offset = 0
WHILE DIRTY_Offset LT OBJECT_SIZE
  mov     BYTE PTR [eax+DIRTY_Offset], 0 ; where eax = newly allocated block
  DIRTY_Offset = DIRTY_Offset + 4096
ENDM


so for this example (128kb * 128 objects * 1024 iterations), with the old method we have the following overhead :
    1 mov operation * 128 objects * 1024 iterations = 128k additional mov operations
while in the latest method we have :
    (128kb/4kb = 32) mov operations * 128 objects * 1024 iterations = 4096k additional operations

While the "dirtying" overhead is multiplied by 32, I honestly believe that it doesn't explain completely the difference between the two methods. I think this difference is mainly due to Windows now being forced to commit the memory.
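
To make that arithmetic easy to check, here is a small C sketch of the per-page touch and the resulting write counts. The 4096-byte page size is the one assumed throughout the thread; `touch_pages` and `total_dirty_writes` are hypothetical names and are not taken from the benchmark source.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u  /* assumed x86 page size, as in the thread */

/* Write one byte at the start of every page of a block, like the new
   "dirty" loop, and return how many writes were made. */
size_t touch_pages(unsigned char *block, size_t size)
{
    size_t writes = 0;
    for (size_t off = 0; off < size; off += PAGE_SIZE) {
        block[off] = 0;
        writes++;
    }
    return writes;
}

/* Total extra writes for a whole benchmark run. */
size_t total_dirty_writes(size_t writes_per_object, size_t objects,
                          size_t iterations)
{
    return writes_per_object * objects * iterations;
}
```

For a 131072-byte block `touch_pages` makes 32 writes, so `total_dirty_writes(32, 128, 1024)` gives 4194304 (4096k), against `total_dirty_writes(1, 128, 1024)` = 131072 (128k) for the old single-byte method.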


I was also a little surprised by hutch's finding :
Quote from: hutch-- on June 15, 2005, 05:47:36 AM
I have tested this on win2k sp4 and the results show that HeapAlloc is clearly faster on small repeated allocations where GlobalAlloc is faster on larger allocations.

So I stepped into GlobalAlloc & HeapAlloc...
On XP SP2 (dunno if it applies to W2k), HeapAlloc is directly forwarded to Ntdll's RtlAllocateHeap, while GlobalAlloc is *a wrapper* around RtlAllocateHeap.
So I think that it's better to use HeapAlloc directly --at least on XP-- to avoid GlobalAlloc's overhead.


Also, RuiLoureiro's tests show that, although LFH has been designed to handle small memory blocks (8b to 16kb as Mark pointed out), allocating big memory blocks doesn't seem to degrade HeapAlloc's performance when LFH is enabled.
Thus I believe Low Fragmentation Heaps should be used whenever possible (ie. if running on Windows XP/2k3) because I see only advantages, no drawbacks.


Concerning HeapAlloc being slower when using a newly created heap (w/ HeapCreate) rather than the process' default heap, I have no relevant explanation.
My first thought would be that the overhead observed when using a new heap has already taken place as part of the program loading time, but I seriously doubt this :
Quote from: RuiLoureiro on June 15, 2005, 09:52:38 PM
Modified "dirty" method ( with NEW Malloc.zip from first post )

(De)Allocating 128 objects (131 072 bytes, dirty) 1024 times, LFH disabled...

GlobalAlloc  : 13 180 ms    ;(2)
SmallAlloc   : 74 105 ms
HeapAlloc    : 13 136 ms    ;(1)
VirtualAlloc : 46 689 ms
New Heap     : 45 963 ms


(De)Allocating 128 objects (131 072 bytes, dirty) 1024 times, LFH enabled...

GlobalAlloc  : 13 279 ms    ;(2)
SmallAlloc   : 74 010 ms
HeapAlloc    : 13 019 ms    ;(1)
VirtualAlloc : 46 799 ms
New Heap     : 45 963 ms

I'd be very surprised if the program took about 33 seconds simply to *initialize* before starting to execute the code!
Moreover, LFH doesn't seem to influence the performance of custom heaps (created with HeapCreate).
Yet Another Mystery I'm Not Smart Enough To Understand...TM :( (but well, I'm getting used to that :bg)


Concerning the tests where VirtualAlloc is slower than HeapAlloc, I think that, again, the paging system delaying memory committing is a great part of the problem :

(De)Allocating 1 objects (262144 bytes, dirty) 512 times, LFH disabled...
GlobalAlloc  : 63 ms
SmallAlloc   : 453 ms
HeapAlloc    : 11 ms
VirtualAlloc : 140 ms
New Heap     : 133 ms

(De)Allocating 512 objects (262144 bytes, dirty) 1 times, LFH disabled...
GlobalAlloc  : 123 ms
SmallAlloc   : 578 ms
HeapAlloc    : 121 ms
VirtualAlloc : 128 ms
New Heap     : 129 ms


In the first case, a single object is (de)allocated 512 times. For HeapAlloc, memory committing occurs only during the first iteration, in the 511 subsequent iterations the heap is already committed. But for VirtualAlloc the committing occurs at each iteration!

The second case shows that when we make only ONE iteration, HeapAlloc & VirtualAlloc results are more similar.
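
The hypothesis above can be turned into a rough page-count model. The C sketch below only counts pages that would need committing under the two assumptions (fresh memory on every allocation versus reuse of already committed memory). It is a model of the explanation, not a measurement of what Windows actually does, and both function names are hypothetical.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u  /* assumed page size */

/* Pages committed when every allocation gets fresh memory
   (the VirtualAlloc case in the explanation above). */
size_t commits_fresh_each_time(size_t object_size, size_t objects,
                               size_t iterations)
{
    size_t pages = (object_size + PAGE_SIZE - 1) / PAGE_SIZE;
    return pages * objects * iterations;
}

/* Pages committed when freed memory stays committed and is reused on
   later iterations (the HeapAlloc case in the explanation above). */
size_t commits_with_reuse(size_t object_size, size_t objects,
                          size_t iterations)
{
    size_t pages = (object_size + PAGE_SIZE - 1) / PAGE_SIZE;
    return iterations == 0 ? 0 : pages * objects;
}
```

With 1 object of 262144 bytes and 512 iterations the model gives 32768 page commits for the fresh case against 64 with reuse. With 512 objects and 1 iteration both come out at 32768, which is in line with the two runs looking alike.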

And RuiLoureiro's tests show that when the memory blocks get bigger, HeapAlloc's performance really degrades compared to VirtualAlloc's performance :
Quote from: RuiLoureiro on June 15, 2005, 09:52:38 PM
(De)Allocating 128 objects (131 072 bytes, dirty) 1024 times, LFH disabled...

GlobalAlloc  : 13 180 ms    ;(2)
SmallAlloc   : 74 105 ms
HeapAlloc    : 13 136 ms    ;(1)
VirtualAlloc : 46 689 ms
New Heap     : 45 963 ms


(De)Allocating 128 objects (524 288 bytes, dirty) 1024 times, LFH disabled...

GlobalAlloc  : 186 180 ms    ;(3)
SmallAlloc   : 186 429 ms
HeapAlloc    : 185 985 ms    ;(1)
VirtualAlloc : 186 135 ms    ;(2)
New Heap     : **********  Malloc found a problem and will be quit

And again, HeapAlloc benefits here from the 1024 iterations (the heap memory is committed at the first iteration, while memory allocated with VirtualAlloc has to be committed each and every iteration).


Too bad my main machine is dead, I really can't test big memory allocations on this 256mb thing I'm currently using.
I really have to buy a new mobo quickly... :'(


Well, I hope those explanations are clear enough, and overall that they are correct!
Of course, more test results are welcome so we can confirm or refute this. Two things left to do for now : focusing on SmallAlloc performance (which I almost didn't discuss) and, as Mark & hutch suggested, running more benchmarks with "unaligned" block sizes. I may take care of that in a few days if no one does it before me... :toothy

Anyway, sorry for this other long post & thanks for your attention. :wink


Codewarp

Great discussion of memory allocation options and optimization ideas.  But when you are using the multi-threaded libraries under Windows, the heap allocator has to protect its global data structures from corruption caused by contention between threads.  This means that each allocation and release call acquires and releases a lock to keep competing threads from entering at the wrong time.

But "we don't have any other threads in our application" I hear you say.  It doesn't matter :boohoo:, the lock is still used, and the last time I checked, that lock chews up something like 80% of the time spent in what you think is memory allocation activities, even with no contention at all.  Use the single-threaded library when possible, but if you have to use the multi-thread library, large numbers of small allocations can never be fast without getting rid of the locking.

You can have extreme alloc/free speed and multiple threads, by using per-thread heap allocators, i.e. separate allocation pool per thread.  I discovered these limitations for myself, when trying to understand why my memory allocator ran 15 times faster (!) than the malloc/free calls from C.  My allocator was thread-specific, so no locking was needed.  But put the locking in and the brakes go on, and this is with no other threads active.  Add a  lot of threads banging on the allocator at the same time, and you will wonder why everything is so s-l-o-w.
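
As an illustration of the per-thread idea, here is a minimal C sketch of a fixed-size block pool that is meant to be owned by exactly one thread, so neither allocation nor release takes a lock. It is not Codewarp's allocator: every name in it is hypothetical, it handles a single block size, and it never grows.

```c
#include <stddef.h>
#include <stdlib.h>

/* A fixed-size block pool owned by one thread only (no locking). */
typedef struct Pool {
    unsigned char *arena;   /* one big chunk obtained from malloc */
    void *free_list;        /* singly linked list of free blocks */
    size_t block_size;
} Pool;

/* Returns 1 on success, 0 if the arena could not be allocated. */
int pool_init(Pool *p, size_t block_size, size_t count)
{
    /* Round up so each block can hold the free-list link, aligned. */
    size_t align = sizeof(void *);
    if (block_size < align) block_size = align;
    block_size = (block_size + align - 1) / align * align;

    p->arena = malloc(block_size * count);
    if (p->arena == NULL) return 0;
    p->block_size = block_size;
    p->free_list = NULL;
    /* Thread every block onto the free list, last block first. */
    for (size_t i = count; i-- > 0; ) {
        void *block = p->arena + i * block_size;
        *(void **)block = p->free_list;
        p->free_list = block;
    }
    return 1;
}

void *pool_alloc(Pool *p)
{
    void *block = p->free_list;
    if (block != NULL) p->free_list = *(void **)block;
    return block;  /* NULL when the pool is exhausted */
}

void pool_free(Pool *p, void *block)
{
    *(void **)block = p->free_list;
    p->free_list = block;
}

void pool_destroy(Pool *p)
{
    free(p->arena);
    p->arena = NULL;
    p->free_list = NULL;
}
```

Each thread would create its own `Pool` and never hand blocks to another thread; as soon as blocks cross threads, some form of synchronization is needed again.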

chep

You're right!

So single-threaded applications should use the HEAP_NO_SERIALIZE flag when calling HeapAlloc.
And multithreaded applications should create a separate heap for each thread, together with the HEAP_NO_SERIALIZE flag.

MSDN doesn't mention a similar flag for GlobalAlloc or LocalAlloc, so that's another good reason to prefer HeapAlloc.

Obviously, VirtualAlloc can't have such a flag as this is really a system-wide function.

Good point Codewarp :U

hutch--

As a consequence of the memory allocation and deallocation speed testing, I have added these macros to the macro file for MASM32.


comment * ---------------------------------------------------------
        Heap allocation and deallocation macros. On later versions
        of Windows HeapAlloc() appears to be faster on small
        allocations than GlobalAlloc() using the GMEM_FIXED flag.
        --------------------------------------------------------- *

      halloc MACRO bytecount
        EXITM <rv(HeapAlloc,rv(GetProcessHeap),0,bytecount)>
      ENDM

      hsize MACRO hmem
        invoke HeapSize,rv(GetProcessHeap),0,hmem
        EXITM <eax>
      ENDM

      hfree MACRO memory
        invoke HeapFree,rv(GetProcessHeap),0,memory
      ENDM


Thanks again for designing the tests, they were very useful.

chep

Maybe we should add an optional argument, to be able to specify custom flags :


comment * ---------------------------------------------------------
        Heap allocation and deallocation macros. On later versions
        of Windows HeapAlloc() appears to be faster on small
        allocations than GlobalAlloc() using the GMEM_FIXED flag.
        --------------------------------------------------------- *

      halloc MACRO bytecount:REQ, flags
        LOCAL hflags
        IFNB <flags>
          hflags = flags
        ELSE
          hflags = 0
        ENDIF
        EXITM <rv(HeapAlloc,rv(GetProcessHeap),hflags,bytecount)>
      ENDM

      hsize MACRO hmem:REQ, flags
        LOCAL hflags
        IFNB <flags>
          hflags = flags
        ELSE
          hflags = 0
        ENDIF
        invoke HeapSize,rv(GetProcessHeap),hflags,hmem
        EXITM <eax>
      ENDM

      hfree MACRO memory:REQ, flags
        LOCAL hflags
        IFNB <flags>
          hflags = flags
        ELSE
          hflags = 0
        ENDIF
        invoke HeapFree,rv(GetProcessHeap),hflags,memory
      ENDM


Thus we can use
    mov  hMem, halloc(block_size)
as well as
    mov  hMem, halloc(block_size,HEAP_ZERO_MEMORY OR HEAP_NO_SERIALIZE)
and
    hfree hMem, HEAP_NO_SERIALIZE

Quote from: hutch-- on June 24, 2005, 03:26:23 AM
Thanks again for designing the tests, they were very useful.
You're welcome :wink

Codewarp

chep,

My favorite memory allocator benchmark test is as follows:

  (1)  Allocate a ptr array of some size, say 10000 entries
  (2)  Fill the array with ptrs to random sized char arrays from malloc( )  or whatever
  (3)  For N iterations: select an entry at random, release its memory block, then alloc a new one of a different random size.

Let this thing run a while, and it will tell you better the true speed of the allocator in question under more realistic conditions.
Better yet, have multiple threads do this to the same heap...
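
A rough C sketch of the three steps above, using plain `malloc`/`free` as the allocator under test. The function name `churn`, its parameters and the use of `rand()` are assumptions made for illustration, and timing is left to the caller.

```c
#include <stddef.h>
#include <stdlib.h>

/* Random-churn allocator test following the three steps above.
   Returns the number of replacements completed, or -1 on bad
   arguments or allocation failure. */
long churn(size_t slots, long iterations, size_t max_size, unsigned seed)
{
    if (slots == 0 || max_size < 2) return -1;

    char **ptrs = calloc(slots, sizeof *ptrs);       /* step 1 */
    size_t *sizes = calloc(slots, sizeof *sizes);
    long done = -1;
    if (ptrs == NULL || sizes == NULL) goto cleanup;

    srand(seed);
    for (size_t i = 0; i < slots; i++) {             /* step 2 */
        sizes[i] = 1 + (size_t)rand() % max_size;
        ptrs[i] = malloc(sizes[i]);
        if (ptrs[i] == NULL) goto cleanup;
    }

    for (done = 0; done < iterations; done++) {      /* step 3 */
        size_t i = (size_t)rand() % slots;
        size_t n;
        do {
            n = 1 + (size_t)rand() % max_size;
        } while (n == sizes[i]);                     /* force a different size */
        free(ptrs[i]);
        ptrs[i] = malloc(n);
        if (ptrs[i] == NULL) { done = -1; break; }
        ptrs[i][0] = 0;                              /* touch the new block */
        sizes[i] = n;
    }

cleanup:
    if (ptrs != NULL)
        for (size_t i = 0; i < slots; i++) free(ptrs[i]);
    free(ptrs);
    free(sizes);
    return done;
}
```

Something like `churn(10000, 1000000, 16384, 1)` wrapped in a timer would cover the 10000-entry array from step (1); to compare the allocators from this thread, the `malloc`/`free` pair would be swapped for the calls being measured.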