News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Re: huge table access

Started by TNick, December 13, 2009, 12:04:16 AM

Previous topic - Next topic

TNick

Well, search no more. Here is the source code:

.686
.MODEL FLAT,STDCALL
OPTION CASEMAP:NONE

INCLUDE windows.inc

INCLUDE kernel32.inc
INCLUDELIB kernel32.lib

INCLUDE masm32.inc
INCLUDELIB masm32.lib

INCLUDE msvcrt.inc
INCLUDELIB msvcrt.lib

INCLUDE c:\masm32\macros\macros.asm
INCLUDE c:\masm32\macros\timers.asm



TEST_LIMIT1 =  100000h
TEST_LIMIT2 = 10000000h

INCREASE_STEP_1 =    4000h
INCREASE_STEP_2 =   80000h

LOOP_COUNT = 1
.DATA
hHeapApp DWORD 0
pMem DWORD 0
MemoryCounter DWORD 1000h
Rez1 DWORD 0
Rez2 DWORD 0
.CODE

DoATest PROC BytesCount:DWORD

print hex$(BytesCount), 9,"|",9

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
INVOKE HeapAlloc, hHeapApp,0,BytesCount
mov pMem, eax
counter_end
mov Rez1, eax
print str$(eax),9
.IF pMem!=0
INVOKE HeapFree, hHeapApp,0,pMem
write <"OK",9,"|",9>
mov pMem, 0
.ELSE
write <"FAILED",9,"|",9>
.ENDIF


counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
INVOKE VirtualAlloc, NULL,BytesCount,MEM_COMMIT or MEM_RESERVE,PAGE_READWRITE
mov pMem, eax
counter_end
mov Rez2, eax
print str$(eax)
.IF pMem!=0
INVOKE VirtualFree, pMem,0,MEM_RELEASE
write <9,"OK",9,"|",9>
mov pMem, 0
.ELSE
write <9,"FAILED",9,"|",9>
.ENDIF
mov eax, Rez1
sub eax, Rez2
print str$(eax),9,"|",13,10

ret

DoATest ENDP



App_Entry_Point:

mov esi, INCREASE_STEP_1
mov edi, TEST_LIMIT1
INVOKE GetProcessHeap
mov hHeapApp, eax
write 13,10,"Indepth test...",13,10
write "=================================================================================",13,10
write "BYTES",9, 9,"|",9,"HA",9,"STS",9,"|",9,"VA",9,"STS",9,"|",9,"HA-VA",9,"|",13,10
write "=================================================================================",13,10

LoopTest_Heap:
INVOKE DoATest, MemoryCounter
add MemoryCounter, esi

cmp MemoryCounter, edi
jbe LoopTest_Heap
cmp edi, TEST_LIMIT2
je @F
mov esi, INCREASE_STEP_2
mov edi, TEST_LIMIT2
jmp LoopTest_Heap
@@:
write "=================================================================================",13,10
write "=================================================================================",13,10,13,10


write "Here is a small sample that you can post back...",13,10
write "=================================================================================",13,10
write "BYTES",9, 9,"|",9,"HA",9,"STS",9,"|",9,"VA",9,"STS",9,"|",9,"HA-VA",9,"|",13,10
write "=================================================================================",13,10
INVOKE DoATest,    1000h
INVOKE DoATest,   10000h
INVOKE DoATest,  100000h
INVOKE DoATest, 1000000h
INVOKE DoATest, 10000000h

write "=================================================================================",13,10
write "=================================================================================",13,10,13,10

inkey "Print any key to exit ..."




INVOKE ExitProcess, NULL
END App_Entry_Point


Here are my timings:
Quote
=================================================================================
BYTES           |       HA      STS     |       VA      STS     |       HA-VA   |
=================================================================================
00001000        |       42601   OK      |       15185   OK      |       27416   |
00010000        |       39728   OK      |       12582   OK      |       27146   |
00100000        |       47115   OK      |       6856    OK      |       40259   |
01000000        |       47301   OK      |       5986    OK      |       41315   |
10000000        |       47303   OK      |       10719   OK      |       36584   |
=================================================================================

And here is the exe.

redskull

Quote from: hutch-- on December 13, 2009, 12:00:56 AM
Apart from people quoting out of date Microsoft data without keeping up on what the functions actually do...

:bg

While I can't speak for all MSDN data, this one is current.  I just traced through it, and GlobalAlloc (with a GM_FIXED flag) indeed calls the same function as HeapAlloc, on a Vista system.  The OP had it right all along: GlobalAlloc calls HeapAlloc, which (when needed) calls VirtualAlloc.  Granted I haven't tested for speed myself, but to say doing three functions is faster than one seems bananas.  Who's out of date now?   :toothy 

HeapAlloc() jumps right to here:
ntdll!RtlAllocateHeap:
76ed6570 8bff            mov     edi,edi
76ed6572 55              push    ebp
blah blah


while GlobalAlloc eventually calls the same (line 76ae7d8a)


kernel32!GlobalAlloc:
76ae7d34 6a18            push    18h
76ae7d36 68a87dae76      push    offset kernel32!GlobalAlloc+0x74 (76ae7da8)
76ae7d3b e8a01b0000      call    kernel32!WaitForSingleObjectEx+0xe0 (76ae98e0)
76ae7d40 8b4508          mov     eax,dword ptr [ebp+8]
76ae7d43 a98d80ffff      test    eax,0FFFF808Dh
76ae7d48 0f85f8650100    jne     kernel32!PulseEvent+0x6845 (76afe346)
<blah blah>
76ae7d84 ff3550e2b676    push    dword ptr [kernel32!WakeConditionVariable+0x4274 (76b6e250)]
76ae7d8a ff152010aa76    call    dword ptr [kernel32+0x1020 (76aa1020)] ds:0023:76aa1020={ntdll!RtlAllocateHeap (76ed6570)}
Strange women, lying in ponds, distributing swords, is no basis for a system of government

hutch--

r,

The clock says it all, one function call to retrieve the pointer and one function call to deallocate it.


mov hMem(1024*1024)
; just do_it
free hMem


Below all of the DLL layering memory is memory, pick your packaging to suit your purpose. GlobalAlloc() is simpler to use and with the depth of layering, I seriously doubt you will find a viable speed difference. The GlobalAlloc() of Win3 no longer exists and the warnings of it being deprecated are simply nonsense.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

vanjast

Quote from: z941998 on December 10, 2009, 05:25:43 AM
I use this technique for Power Ball, I have all combination in a table with stats,  a 4 level search takes less than a second to get the stats I need for a specific combination.  Using sequential search access method and something like BASIC, this might take a day or two to complete.
Going off the topic a bit.. I wrote a whole bit of Lotto code in VB6 (yes I know- but the idea was to put it into masm after I'd finished the structures - No time yet), but anyway with all the formulae and stats I threw into it, it could produce any stat within a second, without having a database, except a result table which takes less than 16KB of mem.

It was just the simulation/prediction side of the code which took a long time (I didn't optimise it) - about 3 hours  :red I'm sure with masm I could get this down to a few minutes or so.. must do it, must do it... Xmas is coming  :P

jj2007

Quote from: TNick on December 13, 2009, 12:04:16 AM
Well, search no more. Here is the source code:
... Here are my timings:


BYTES           |       HA      STS     |       VA      STS     |       HA-VA   |
=================================================================================
00001000        |       42601   OK      |       15185   OK      |       27416   |
00010000        |       39728   OK      |       12582   OK      |       27146   |
00100000        |       47115   OK      |       6856    OK      |       40259   |
01000000        |       47301   OK      |       5986    OK      |       41315   |
10000000        |       47303   OK      |       10719   OK      |       36584   |



Sure?
LOOP_COUNT = 1

jj2007

My timings with LOOP_COUNT=10000, in cycles per kByte. Calling the behaviour erratic would be an understatement :wink

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
97      cycles for HeapAlloc            00001000 bytes
1484    cycles for VirtualAlloc         00001000 bytes
151     cycles for GlobalAlloc          00001000 bytes

48      cycles for HeapAlloc            00002000 bytes
1041    cycles for VirtualAlloc         00002000 bytes
76      cycles for GlobalAlloc          00002000 bytes

24      cycles for HeapAlloc            00004000 bytes
818     cycles for VirtualAlloc         00004000 bytes
38      cycles for GlobalAlloc          00004000 bytes

12      cycles for HeapAlloc            00008000 bytes
1050    cycles for VirtualAlloc         00008000 bytes
19      cycles for GlobalAlloc          00008000 bytes

47      cycles for HeapAlloc            00010000 bytes
1412    cycles for VirtualAlloc         00010000 bytes
16      cycles for GlobalAlloc          00010000 bytes

8       cycles for HeapAlloc            00020000 bytes
1371    cycles for VirtualAlloc         00020000 bytes
10      cycles for GlobalAlloc          00020000 bytes

5       cycles for HeapAlloc            00040000 bytes
1354    cycles for VirtualAlloc         00040000 bytes
6       cycles for GlobalAlloc          00040000 bytes

1347    cycles for HeapAlloc            00080000 bytes
1344    cycles for VirtualAlloc         00080000 bytes
1348    cycles for GlobalAlloc          00080000 bytes

UseMem=1, cycles per kByte=1


NumBytes= 4096
LOOP_COUNT= 10000
UseMemOn= 1
ShowPerK= 1

UseTheMem MACRO
if UseMemOn
  mov edx, eax
  mov ecx, 0
  .Repeat
mov byte ptr [edx+ecx+123], 0 ; poke a byte in each page
add ecx, 4096
  .Until ecx>=NumBytes
endif
ENDM

TestH MACRO
  invoke HeapAlloc, ProHeap, HEAP_GENERATE_EXCEPTIONS or HEAP_NO_SERIALIZE, NumBytes
  .if eax
UseTheMem
invoke HeapFree, ProHeap, HEAP_NO_SERIALIZE, eax
  .else
invoke MessageBox, 0, chr$("HeapAlloc failed"), 0, MB_OK
exit
  .endif
ENDM


sinsi

You simply cannot time memory allocations accurately. It all depends on the amount of ram you have, how much is free, the size
of the pagefile, how much of that is free, what services are running, what your antivirus is doing, the phase of the moon...you get the idea?

Personally, to allocate a chunk (e.g. in a LoadFile proc) I use GlobalAlloc with GMEM_FIXED - just do_it indeed.
To allocate/deallocate over and over, I use the heap functions.

Just write your code in 64-bit, then you can allocate huge amounts at once.  :bdg
Light travels faster than sound, that's why some people seem bright until you hear them.

TNick

Quote from: jj2007 on December 13, 2009, 09:20:45 AM
Sure?
LOOP_COUNT = 1

Yes. No loop between counter_begin and counter_end. Just do it once for heap, get the timing, free what we've just got, do it once for Virtual, get the timing, free.
To have a loop, Free function should be inserted in test, which I did avoid on purpose.

In your test, after first Alloc-Free pair in heap, the Heap has the memory right there for it's use. It only needs to search it's free chunks list and will find a chunk in at least 9999 cases. The test only proves that, if you need to allocate and free again and again same ammount of memory, it's best to use the heap.

EDIT:
@sinsi:
No doubt about it. Still, by using the test program, you may see a trend, I think. I do not argue that is a big gain, I simply say that:
IF you need a big chunk of memory (above a page or two), VirtualAlloc may be faster than Heap or Global alloc.

Nick

jj2007

Quote from: TNick on December 13, 2009, 11:09:55 AM
Quote from: jj2007 on December 13, 2009, 09:20:45 AM
Sure?
LOOP_COUNT = 1

Yes. No loop between counter_begin and counter_end.

Please have a look at timers.asm

@Sinsi: I wouldn't be so pessimistic regarding the accuracy of timings. Of couse, if you need the swapfile, things change, but with today's RAM that would require really large requests. But for a fat 128MB request, for example, the timings are almost identical. The only lesson you can draw from the timings for the smaller amounts is "use HeapAlloc".

180559386       cycles for HeapAlloc            08000000 bytes
181753646       cycles for VirtualAlloc         08000000 bytes
180141043       cycles for GlobalAlloc          08000000 bytes


UseMem=1, cycles per kByte=0

sinsi


Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
168     cycles for HeapAlloc            00001000 bytes
1966    cycles for VirtualAlloc         00001000 bytes
156     cycles for GlobalAlloc          00001000 bytes

78      cycles for HeapAlloc            00002000 bytes
1587    cycles for VirtualAlloc         00002000 bytes
73      cycles for GlobalAlloc          00002000 bytes

36      cycles for HeapAlloc            00004000 bytes
1278    cycles for VirtualAlloc         00004000 bytes
56      cycles for GlobalAlloc          00004000 bytes

21      cycles for HeapAlloc            00008000 bytes
1061    cycles for VirtualAlloc         00008000 bytes
31      cycles for GlobalAlloc          00008000 bytes

12      cycles for HeapAlloc            00010000 bytes
1021    cycles for VirtualAlloc         00010000 bytes
16      cycles for GlobalAlloc          00010000 bytes

7       cycles for HeapAlloc            00020000 bytes
852     cycles for VirtualAlloc         00020000 bytes
10      cycles for GlobalAlloc          00020000 bytes

4       cycles for HeapAlloc            00040000 bytes
803     cycles for VirtualAlloc         00040000 bytes
6       cycles for GlobalAlloc          00040000 bytes

776     cycles for HeapAlloc            00080000 bytes
798     cycles for VirtualAlloc         00080000 bytes
801     cycles for GlobalAlloc          00080000 bytes

UseMem=1, cycles per kByte=1

Where did you get your timings for '08000000 bytes'?
Light travels faster than sound, that's why some people seem bright until you hear them.

TNick

Quote from: jj2007 on December 13, 2009, 11:15:09 AM
Please have a look at timers.asm

I did. What now? :)
As I said, the code between macro calls will be executed once. That is because you get strange results like yours when you repeat the request for memory for same amount. The user will - probably - not repeat it's request over and over again. Not the timings are important here,I think, but the difference, which almost always say Virtual is faster.

That being said, I would say we have allocated enough time from our short existence for this matter.

I wish everyone a good Sunday! :)
Nick

dedndave

QuoteThat being said, I would say we have allocated enough time from our short existence for this matter.

lol - you don't know us very well, huh

i see no harm in using an allocate/free cycle for timing measurements
to be fair, if one method is faster at allocation but slower at freeing it, where is the advantage ?
if you pair them up, you can run them in a loop

the way i normally use large pieces of allocated memory, i am not likely to be that concerned about speed, anyways
i can see some instances where you may want to allocate and free blocks
but, i think it is better if you grab what you intend to use, and manage it from there with code
when you are done with it, then free it
that seems much more efficient that allocating several small blocks, no matter how you slice it (  ::) )

sinsi

The trouble with an alloc/touch/free test regime is things that happen over and over tend to get cached and reused.
Allocate 4k gets a page table entry, but de-allocate it and windows might just leave it for a couple of seconds, just in case.
Then windows is justified because we allocate it again in a few milliseconds.

>i think it is better if you grab what you intend to use, and manage it from there with code
My point exactly. Hell, grab the whole 2 gig and parcel it up how you want.
Use the old /3gb and /largeaddressaware and you get 3 gig.
Light travels faster than sound, that's why some people seem bright until you hear them.

dedndave

well - it is good to use what you need, when you need it, then release it for other processes to use
otherwise, you write memory-hog code, like yahoo messenger - lol
but, there is no need to make the OS micro-manage the blocks for you
i think that is a result of lazy programming   :bg

hutch--

I moved this topic to the Lab where timings and comments are more appropriate.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php