Re: huge table access

TNick · December 13, 2009, 12:04:16 AM

Well, search no more. Here is the source code:


.686
.MODEL FLAT,STDCALL
OPTION CASEMAP:NONE

INCLUDE		windows.inc

INCLUDE		kernel32.inc
INCLUDELIB	kernel32.lib

INCLUDE		masm32.inc
INCLUDELIB	masm32.lib

INCLUDE		msvcrt.inc
INCLUDELIB	msvcrt.lib

INCLUDE		c:\masm32\macros\macros.asm
INCLUDE		c:\masm32\macros\timers.asm



TEST_LIMIT1	=	  100000h
TEST_LIMIT2	=	10000000h

INCREASE_STEP_1	=	    4000h
INCREASE_STEP_2	=	   80000h

LOOP_COUNT	=	1
.DATA
	hHeapApp		DWORD			0
	pMem			DWORD			0
	MemoryCounter		DWORD			1000h
	Rez1			DWORD			0
	Rez2			DWORD			0
.CODE

DoATest	PROC	BytesCount:DWORD
	
	print hex$(BytesCount), 9,"|",9
	
	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
		INVOKE	HeapAlloc,	hHeapApp,0,BytesCount
		mov	pMem,		eax
	counter_end
	mov	Rez1,		eax
	print str$(eax),9
	.IF	pMem!=0
		INVOKE	HeapFree,	hHeapApp,0,pMem
		write	<"OK",9,"|",9>
		mov	pMem,		0
	.ELSE
		write	<"FAILED",9,"|",9>
	.ENDIF
	
	
	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
		INVOKE	VirtualAlloc,	NULL,BytesCount,MEM_COMMIT or MEM_RESERVE,PAGE_READWRITE
		mov	pMem,		eax
	counter_end
	mov	Rez2,		eax
	print str$(eax)
	.IF	pMem!=0
		INVOKE	VirtualFree,	pMem,0,MEM_RELEASE
		write	<9,"OK",9,"|",9>
		mov	pMem,		0
	.ELSE
		write	<9,"FAILED",9,"|",9>
	.ENDIF
	mov	eax,		Rez1
	sub	eax,		Rez2
	print	str$(eax),9,"|",13,10
	
	ret

DoATest		ENDP



App_Entry_Point:

	mov	esi,			INCREASE_STEP_1
	mov	edi,			TEST_LIMIT1
	INVOKE	GetProcessHeap
	mov	hHeapApp,		eax
	write	13,10,"Indepth test...",13,10
	write	"=================================================================================",13,10
	write	"BYTES",9, 9,"|",9,"HA",9,"STS",9,"|",9,"VA",9,"STS",9,"|",9,"HA-VA",9,"|",13,10
	write	"=================================================================================",13,10
	
	LoopTest_Heap:
		INVOKE	DoATest,		MemoryCounter
		add	MemoryCounter,		esi

	cmp	MemoryCounter,		edi
	jbe	LoopTest_Heap
	cmp	edi,		TEST_LIMIT2
	je	@F
		mov	esi,		INCREASE_STEP_2
		mov	edi,		TEST_LIMIT2
		jmp	LoopTest_Heap
	@@:
	write	"=================================================================================",13,10
	write	"=================================================================================",13,10,13,10
	
	
	write	"Here is a small sample that you can post back...",13,10
	write	"=================================================================================",13,10
	write	"BYTES",9, 9,"|",9,"HA",9,"STS",9,"|",9,"VA",9,"STS",9,"|",9,"HA-VA",9,"|",13,10
	write	"=================================================================================",13,10
		INVOKE	DoATest,		    1000h
		INVOKE	DoATest,		   10000h
		INVOKE	DoATest,		  100000h
		INVOKE	DoATest,		 1000000h
		INVOKE	DoATest,		10000000h
	
	write	"=================================================================================",13,10
	write	"=================================================================================",13,10,13,10
	
	inkey	"Print any key to exit ..."
	
	
	
	
	INVOKE	ExitProcess,	NULL
END App_Entry_Point

Here are my timings:

Quote
=================================================================================
BYTES | HA STS | VA STS | HA-VA |
=================================================================================
00001000 | 42601 OK | 15185 OK | 27416 |
00010000 | 39728 OK | 12582 OK | 27146 |
00100000 | 47115 OK | 6856 OK | 40259 |
01000000 | 47301 OK | 5986 OK | 41315 |
10000000 | 47303 OK | 10719 OK | 36584 |
=================================================================================

And here is the exe.

redskull · December 13, 2009, 12:49:28 AM

Quote from: hutch-- on December 13, 2009, 12:00:56 AM
Apart from people quoting out of date Microsoft data without keeping up on what the functions actually do...

:bg

While I can't speak for all MSDN data, this one is current. I just traced through it, and GlobalAlloc (with a GM_FIXED flag) indeed calls the same function as HeapAlloc, on a Vista system. The OP had it right all along: GlobalAlloc calls HeapAlloc, which (when needed) calls VirtualAlloc. Granted I haven't tested for speed myself, but to say doing three functions is faster than one seems bananas. Who's out of date now? :toothy

HeapAlloc() jumps right to here:

Code Select

ntdll!RtlAllocateHeap:
76ed6570 8bff            mov     edi,edi
76ed6572 55              push    ebp
blah blah

while GlobalAlloc eventually calls the same (line 76ae7d8a)

Code Select


kernel32!GlobalAlloc:
76ae7d34 6a18            push    18h
76ae7d36 68a87dae76      push    offset kernel32!GlobalAlloc+0x74 (76ae7da8)
76ae7d3b e8a01b0000      call    kernel32!WaitForSingleObjectEx+0xe0 (76ae98e0)
76ae7d40 8b4508          mov     eax,dword ptr [ebp+8]
76ae7d43 a98d80ffff      test    eax,0FFFF808Dh
76ae7d48 0f85f8650100    jne     kernel32!PulseEvent+0x6845 (76afe346)
<blah blah>
76ae7d84 ff3550e2b676    push    dword ptr [kernel32!WakeConditionVariable+0x4274 (76b6e250)]
76ae7d8a ff152010aa76    call    dword ptr [kernel32+0x1020 (76aa1020)] ds:0023:76aa1020={ntdll!RtlAllocateHeap (76ed6570)}

hutch-- · December 13, 2009, 06:29:32 AM

r,

The clock says it all, one function call to retrieve the pointer and one function call to deallocate it.

Code Select


mov hMem(1024*1024)
; just do_it
free hMem

Below all of the DLL layering memory is memory, pick your packaging to suit your purpose. GlobalAlloc() is simpler to use and with the depth of layering, I seriously doubt you will find a viable speed difference. The GlobalAlloc() of Win3 no longer exists and the warnings of it being deprecated are simply nonsense.

vanjast · December 13, 2009, 06:43:27 AM

Quote from: z941998 on December 10, 2009, 05:25:43 AM
I use this technique for Power Ball, I have all combination in a table with stats, a 4 level search takes less than a second to get the stats I need for a specific combination. Using sequential search access method and something like BASIC, this might take a day or two to complete.

Going off the topic a bit.. I wrote a whole bit of Lotto code in VB6 (yes I know- but the idea was to put it into masm after I'd finished the structures - No time yet), but anyway with all the formulae and stats I threw into it, it could produce any stat within a second, without having a database, except a result table which takes less than 16KB of mem.

It was just the simulation/prediction side of the code which took a long time (I didn't optimise it) - about 3 hours :red I'm sure with masm I could get this down to a few minutes or so.. must do it, must do it... Xmas is coming :P

jj2007 · December 13, 2009, 09:20:45 AM

Quote from: TNick on December 13, 2009, 12:04:16 AM
Well, search no more. Here is the source code:
... Here are my timings:
Code Select Expand
BYTES | HA STS | VA STS | HA-VA | ================================================================================= 00001000 | 42601 OK | 15185 OK | 27416 | 00010000 | 39728 OK | 12582 OK | 27146 | 00100000 | 47115 OK | 6856 OK | 40259 | 01000000 | 47301 OK | 5986 OK | 41315 | 10000000 | 47303 OK | 10719 OK | 36584 |

Sure?

Code Select

LOOP_COUNT = 1

jj2007 · December 13, 2009, 10:19:04 AM

My timings with LOOP_COUNT=10000, in cycles per kByte. Calling the behaviour erratic would be an understatement :wink

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
97      cycles for HeapAlloc            00001000 bytes
1484    cycles for VirtualAlloc         00001000 bytes
151     cycles for GlobalAlloc          00001000 bytes

48      cycles for HeapAlloc            00002000 bytes
1041    cycles for VirtualAlloc         00002000 bytes
76      cycles for GlobalAlloc          00002000 bytes

24      cycles for HeapAlloc            00004000 bytes
818     cycles for VirtualAlloc         00004000 bytes
38      cycles for GlobalAlloc          00004000 bytes

12      cycles for HeapAlloc            00008000 bytes
1050    cycles for VirtualAlloc         00008000 bytes
19      cycles for GlobalAlloc          00008000 bytes

47      cycles for HeapAlloc            00010000 bytes
1412    cycles for VirtualAlloc         00010000 bytes
16      cycles for GlobalAlloc          00010000 bytes

8       cycles for HeapAlloc            00020000 bytes
1371    cycles for VirtualAlloc         00020000 bytes
10      cycles for GlobalAlloc          00020000 bytes

5       cycles for HeapAlloc            00040000 bytes
1354    cycles for VirtualAlloc         00040000 bytes
6       cycles for GlobalAlloc          00040000 bytes

1347    cycles for HeapAlloc            00080000 bytes
1344    cycles for VirtualAlloc         00080000 bytes
1348    cycles for GlobalAlloc          00080000 bytes

UseMem=1, cycles per kByte=1

Code Select

NumBytes=	4096
LOOP_COUNT=	10000
UseMemOn=	1
ShowPerK=	1

UseTheMem MACRO
if UseMemOn
  mov edx, eax
  mov ecx, 0
  .Repeat
	mov byte ptr [edx+ecx+123], 0		; poke a byte in each page
	add ecx, 4096
  .Until ecx>=NumBytes
endif
ENDM

TestH	MACRO
  invoke HeapAlloc, ProHeap, HEAP_GENERATE_EXCEPTIONS or HEAP_NO_SERIALIZE, NumBytes
  .if eax
	UseTheMem
	invoke HeapFree, ProHeap, HEAP_NO_SERIALIZE, eax
  .else
	invoke MessageBox, 0, chr$("HeapAlloc failed"), 0, MB_OK
	exit
  .endif
ENDM

sinsi · December 13, 2009, 11:01:11 AM

You simply cannot time memory allocations accurately. It all depends on the amount of ram you have, how much is free, the size
of the pagefile, how much of that is free, what services are running, what your antivirus is doing, the phase of the moon...you get the idea?

Personally, to allocate a chunk (e.g. in a LoadFile proc) I use GlobalAlloc with GMEM_FIXED - just do_it indeed.
To allocate/deallocate over and over, I use the heap functions.

Just write your code in 64-bit, then you can allocate huge amounts at once. :bdg

TNick · December 13, 2009, 11:09:55 AM

Quote from: jj2007 on December 13, 2009, 09:20:45 AM
Sure?
Code Select Expand
LOOP_COUNT = 1

Yes. No loop between counter_begin and counter_end. Just do it once for heap, get the timing, free what we've just got, do it once for Virtual, get the timing, free.
To have a loop, Free function should be inserted in test, which I did avoid on purpose.

In your test, after first Alloc-Free pair in heap, the Heap has the memory right there for it's use. It only needs to search it's free chunks list and will find a chunk in at least 9999 cases. The test only proves that, if you need to allocate and free again and again same ammount of memory, it's best to use the heap.

EDIT:
@sinsi:
No doubt about it. Still, by using the test program, you may see a trend, I think. I do not argue that is a big gain, I simply say that:
IF you need a big chunk of memory (above a page or two), VirtualAlloc may be faster than Heap or Global alloc.

Nick

jj2007 · December 13, 2009, 11:15:09 AM

Quote from: TNick on December 13, 2009, 11:09:55 AM
Quote from: jj2007 on December 13, 2009, 09:20:45 AM
Sure?
Code Select Expand
LOOP_COUNT = 1

Yes. No loop between counter_begin and counter_end.

Please have a look at timers.asm

@Sinsi: I wouldn't be so pessimistic regarding the accuracy of timings. Of couse, if you need the swapfile, things change, but with today's RAM that would require really large requests. But for a fat 128MB request, for example, the timings are almost identical. The only lesson you can draw from the timings for the smaller amounts is "use HeapAlloc".

Code Select

180559386       cycles for HeapAlloc            08000000 bytes
181753646       cycles for VirtualAlloc         08000000 bytes
180141043       cycles for GlobalAlloc          08000000 bytes

UseMem=1, cycles per kByte=0

sinsi · December 13, 2009, 11:29:48 AM

Code Select


Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
168     cycles for HeapAlloc            00001000 bytes
1966    cycles for VirtualAlloc         00001000 bytes
156     cycles for GlobalAlloc          00001000 bytes

78      cycles for HeapAlloc            00002000 bytes
1587    cycles for VirtualAlloc         00002000 bytes
73      cycles for GlobalAlloc          00002000 bytes

36      cycles for HeapAlloc            00004000 bytes
1278    cycles for VirtualAlloc         00004000 bytes
56      cycles for GlobalAlloc          00004000 bytes

21      cycles for HeapAlloc            00008000 bytes
1061    cycles for VirtualAlloc         00008000 bytes
31      cycles for GlobalAlloc          00008000 bytes

12      cycles for HeapAlloc            00010000 bytes
1021    cycles for VirtualAlloc         00010000 bytes
16      cycles for GlobalAlloc          00010000 bytes

7       cycles for HeapAlloc            00020000 bytes
852     cycles for VirtualAlloc         00020000 bytes
10      cycles for GlobalAlloc          00020000 bytes

4       cycles for HeapAlloc            00040000 bytes
803     cycles for VirtualAlloc         00040000 bytes
6       cycles for GlobalAlloc          00040000 bytes

776     cycles for HeapAlloc            00080000 bytes
798     cycles for VirtualAlloc         00080000 bytes
801     cycles for GlobalAlloc          00080000 bytes

UseMem=1, cycles per kByte=1

Where did you get your timings for '08000000 bytes'?

TNick · December 13, 2009, 11:37:08 AM

Quote from: jj2007 on December 13, 2009, 11:15:09 AM
Please have a look at timers.asm

I did. What now? :)
As I said, the code between macro calls will be executed once. That is because you get strange results like yours when you repeat the request for memory for same amount. The user will - probably - not repeat it's request over and over again. Not the timings are important here,I think, but the difference, which almost always say Virtual is faster.

That being said, I would say we have allocated enough time from our short existence for this matter.

I wish everyone a good Sunday! :)
Nick

dedndave · December 13, 2009, 11:49:31 AM

QuoteThat being said, I would say we have allocated enough time from our short existence for this matter.

lol - you don't know us very well, huh

i see no harm in using an allocate/free cycle for timing measurements
to be fair, if one method is faster at allocation but slower at freeing it, where is the advantage ?
if you pair them up, you can run them in a loop

the way i normally use large pieces of allocated memory, i am not likely to be that concerned about speed, anyways
i can see some instances where you may want to allocate and free blocks
but, i think it is better if you grab what you intend to use, and manage it from there with code
when you are done with it, then free it
that seems much more efficient that allocating several small blocks, no matter how you slice it ( ::) )

sinsi · December 13, 2009, 11:58:26 AM

The trouble with an alloc/touch/free test regime is things that happen over and over tend to get cached and reused.
Allocate 4k gets a page table entry, but de-allocate it and windows might just leave it for a couple of seconds, just in case.
Then windows is justified because we allocate it again in a few milliseconds.

>i think it is better if you grab what you intend to use, and manage it from there with code
My point exactly. Hell, grab the whole 2 gig and parcel it up how you want.
Use the old /3gb and /largeaddressaware and you get 3 gig.

dedndave · December 13, 2009, 12:15:40 PM

well - it is good to use what you need, when you need it, then release it for other processes to use
otherwise, you write memory-hog code, like yahoo messenger - lol
but, there is no need to make the OS micro-manage the blocks for you
i think that is a result of lazy programming :bg

hutch-- · December 13, 2009, 12:27:20 PM

I moved this topic to the Lab where timings and comments are more appropriate.

News:

Re: huge table access