This is the next version that stores the array count in the first array member. I get these results on my old PIV.
Benchmarking array methods on 5 million members
1360 ms array create
843 ms array load data
16 ms array read
2156 ms array delete
Press any key to continue ...
I just found an error: the macro name for the array was still the one from the dev code.
Here is its replacement; it does not seem to be any slower with the additional register preservation code in it.
arrget$ MACRO arr,indx
push esi                      ;; preserve ESI and EDI for the caller
push edi
mov esi, arr                  ;; ESI = base address of the pointer array
mov edi, indx                 ;; EDI = member index
mov esi, [esi+edi*4]          ;; load the pointer stored in that member
mov eax, esi                  ;; return value goes back in EAX
pop edi
pop esi
EXITM <eax>                   ;; the call site substitutes to EAX
ENDM
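For anyone trying it, a minimal usage sketch; pArray$ here is only a placeholder for the handle returned by the create call, not a name taken from the attachment. Member 0 holds the count, so the same macro reads it.

    mov ecx, arrget$(pArray$, 0)      ; ECX = number of members (stored in member 0)
    mov esi, arrget$(pArray$, 250)    ; ESI = pointer to the string in member 250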
I have updated the attachment.
[attachment deleted by admin]
Celeron 2.4 GHz, 448 MB RAM, Win XP SP1
Benchmarking array methods on 5 million members
2032 ms array create
828 ms array load data
16 ms array read
1906 ms array delete
AMD 64 Athlon Mobile
512MB RAM
Win XP Home (SP2) (32-bit)
Benchmarking array methods on 5 million members
1532 ms array create
953 ms array load data
15 ms array read
1563 ms array delete
Ossa
P3-500, 512 MB:
Benchmarking array methods on 5 million members
5418 ms array create
4617 ms array load data
170 ms array read
7100 ms array delete
I did a quick cycle count for arrget$ and got 550 cycles for 100 calls, so that's 9 instructions executing in under 6 cycles per call.
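For anyone who wants to repeat the count, a rough RDTSC sketch of the method; the loop count, label and register use are my own choices (and pArray$ is a placeholder array handle), not the actual timing code.

    rdtsc
    mov edi, eax                          ; low DWORD of the starting timestamp
    mov ebx, 100                          ; repeat count (EBX assumed free here)
  cyc_loop:
    mov esi, arrget$(pArray$, 1000)       ; the call being measured
    dec ebx
    jnz cyc_loop
    rdtsc
    sub eax, edi                          ; EAX ~ cycles for 100 calls, loop overhead included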
Sempron 3000+, 1GB
Benchmarking array methods on 5 million members
1265 ms array create
875 ms array load data
16 ms array read
1703 ms array delete
On my ancient Intel PIII 846 MHz, 512 MB, XP Service Pack 2
.......................................... 1st run
Benchmarking array methods on 5 million members
3896 ms array create
3034 ms array load data
130 ms array read
4837 ms array delete
Press any key to continue ...
............................................. 3rd run
Benchmarking array methods on 5 million members
3715 ms array create
2985 ms array load data
130 ms array read
4887 ms array delete
Press any key to continue ...
Here it is running at 50% capacity
Benchmarking array methods on 5 million members
1453 ms array create
907 ms array load data
15 ms array read
1594 ms array delete
Press any key to continue ...
It would be interesting to compare the runtime allocation/release the Windows black box does when you exchange this for a static array instead.
Unfortunately Windows seems limited to running only one instance of the console app; I had hoped to test-run two instances simultaneously to compare what happens if two cores do this at the same time.
Instead I end up with two console programs running right after each other :(
So Hutch would need to rewrite it for that kind of test on my Core 2 Duo.
Magnus,
A static array only has one memory allocation; you chop it up into pointers yourself and it is far faster than variable length string arrays, but it cannot handle variable length strings. That's the cost of the extra speed.
The test piece is a single thread example; the code can routinely be run in as many threads and on as many cores as you want to write for.
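As a rough sketch of what I mean by a static array; the slot size and the names are picked only for illustration, this is not code from the benchmark.

    .data?
      pStatic dd ?                        ; base address of the single allocation

    .code
    SLOTSIZE equ 64                       ; every member gets the same fixed size

    ; one allocation covers the whole array, nothing is allocated per member
    invoke GlobalAlloc, GMEM_FIXED, 1000000 * SLOTSIZE
    mov pStatic, eax

    ; the address of member ECX is just base + ECX * SLOTSIZE
    imul eax, ecx, SLOTSIZE
    add eax, pStatic                      ; EAX = pointer to member ECX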
on core2 duo 2ghz,
Benchmarking array methods on 5 million members
780 ms array create
780 ms array load data
15 ms array read
671 ms array delete
Press any key to continue ...
core2duo 2.66Ghz 1333mhz/4M winxp pro-sp2
------------------------------------------------------------------
Benchmarking array methods on 5 million members
937 ms array create
547 ms array load data
0 ms array read
938 ms array delete
Press any key to continue ...
Benchmarking array methods on 5 million members
875 ms array create
531 ms array load data
0 ms array read
922 ms array delete
Press any key to continue ...
c2d 1.8@3Ghz,win xp sp2,2GB
Benchmarking array methods on 5 million members
797 ms array create
531 ms array load data
16 ms array read
875 ms array delete
Press any key to continue ...
C2D E6850 @ 3Ghz, 4GB, 32bit XP Pro-SP3
Benchmarking array methods on 5 million members
750 ms array create
484 ms array load data
0 ms array read
781 ms array delete
Press any key to continue ...
Benchmarking array methods on 5 million members
33868 ms array create
27940 ms array load data
802 ms array read
44494 ms array delete
Press any key to continue ...
@433 Mhz XP Pro SP2
This is the timing for the rewrite of the dynamic string array.
5 million element test
15 ms array create
1891 ms array load data
16 ms array read
2172 ms array delete
Press any key to continue ...
Faster creation but slower array load. This is the example out of the latest beta of masm32.
[attachment deleted by admin]
Hi Hutch:
bmark.asm(57) : error A2008: syntax error : ,
bmark.asm(76) : error A2008: syntax error : ,
bmark.asm(41) : error A2006: undefined symbol : arralloc$
bmark.asm(93) : error A2006: undefined symbol : arrfree$
Are we missing some macros?
Thanks.
No,
You are missing the latest beta that the example came from. It builds under version 10i; that's why I posted the working binary as well, so it could be run without building it.
5 million element test
31 ms array create
2563 ms array load data
31 ms array read
1859 ms array delete
Celeron 2.4 GHz, 448 MB Ram
Hi hutch-:
5 million element test
401 ms array create
31625 ms array load data
681 ms array read
29392 ms array delete
Press any key to continue ...
Thanks.
core2duo E8500 3.16Ghz 1111Mhz/2M Vista64
5 million element test
15 ms array create
687 ms array load data
0 ms array read
437 ms array delete
Press any key to continue ...
I am getting either 15 or 0 ms for the same test; it doesn't matter which one (create or read). So I'm thinking maybe it's Windows switching tasks every 15 ms. Maybe a test on a smaller number of elements needs to be devised?
If such a mistake exists then the other tests really don't tell much, because nobody knows how often Windows switches tasks with all your drivers and user apps running.
Thanks for reading.
Hi cmpxchg,
Welcome on board. The results at such a low timing interval are not reliable, as the GetTickCount() API does not have a fine enough granularity. It's OK at over one quarter of a second but nearly useless for figures that small. All it demonstrates is that the create and read functions are almost negligible against the rest, which do more work.
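If anyone wants finer timing than GetTickCount(), something along these lines with QueryPerformanceCounter() will do it; the variable names are only placeholders.

    .data?
      pcfreq  dq ?
      pcstart dq ?
      pcstop  dq ?

    .code
    invoke QueryPerformanceFrequency, ADDR pcfreq
    invoke QueryPerformanceCounter, ADDR pcstart
    ; ----- code being timed goes here -----
    invoke QueryPerformanceCounter, ADDR pcstop
    ; elapsed milliseconds = (pcstop - pcstart) * 1000 / pcfreq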
Windows XP SP3 / Vostro 1400 Laptop / Core 2 Duo T9300 @2.50GHz / 4GB of ram
Benchmarking array methods on 5 million members
1000 ms array create
547 ms array load data
16 ms array read
1000 ms array delete
Press any key to continue ...
5 million element test
31 ms array create
1281 ms array load data
16 ms array read
1000 ms array delete
Press any key to continue ...
Windows XP SP2 / Core 2 Duo E4500 @ 2.31 GHz (overclocked 5%) / 2 GB of RAM
Benchmarking array methods on 5 million members
1234 ms array create
860 ms array load data
15 ms array read
1360 ms array delete
Hey Hutch,
I downloaded J and put it on a different drive. I got your code to compile. I am currently looking at arrset$ to speed it up. I already see several possible speed-ups. How do I manually recompile m32lib? I thought there was a batch file I could run. I have to run now, but I will be back later.
Hi Mark:
@echo off
copy masm32.inc \masm32\include\masm32.inc
del masm32.lib : delete any existing MASM32 Library
dir /b *.asm > ml.rsp : create a response file for ML.EXE
\masm32\bin\ml /c /coff @ml.rsp
if errorlevel 0 goto okml
del ml.rsp
echo ASSEMBLY ERROR BUILDING LIBRARY MODULES
goto theend
:okml
\masm32\bin\link -lib *.obj /out:masm32.lib
if exist masm32.lib goto oklink
echo LINK ERROR BUILDING LIBRARY
echo The MASM32 Library was not built
goto theend
:oklink
copy masm32.lib \masm32\lib\masm32.lib
:theend
if exist masm32.lib del *.obj
dir \masm32\lib\masm32.lib
dir \masm32\include\masm32.inc
This is make.bat and it is usually in the \masm32\m32lib directory.
Regards, herge.
Thanks Herge! :)
Mark
arralloc is fast already, so I am starting with arrset.
I am getting 1188 with the unmodified code.
With my modifications I am getting 922. I still have some bugs to work out.
I tried all the different Allocs (in place of SysAllocStringByteLen) and manually copying the memory. HeapAlloc was the fastest. I call GetHeapMemory in main().
Mark,
Grab the current beta of masm32. I had to fix the original arralloc function as I had made a mistake with ESI that made it dangerous; it has been fixed and no longer has the problem. In the beta it is now part of the masm32 library, so any mods can be built by running the make.bat file.
I looked at HeapAlloc early in the development but was wary of using it due to fragmentation problems, which increase over time as array members are added and removed. OLE is reasonably well geared here, but it has always been a bit slower than the lower level allocation methods. I wanted the characteristic of the length being stored 4 bytes below the start address, as it saves any length calculations.
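To make that characteristic concrete, a two line sketch; EAX is simply assumed here to hold an array member pointer.

    ; with an OLE string member pointer in EAX, its byte length is the DWORD
    ; stored immediately below the string data, so no scan is needed
    mov ecx, [eax-4]                      ; ECX = length of the member in bytes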
Quote from: hutch-- on August 09, 2008, 02:09:10 AM
Mark,
Grab the current beta of masm32. I had to fix the original arralloc function as I had made a mistake with ESI that made it dangerous; it has been fixed and no longer has the problem. In the beta it is now part of the masm32 library, so any mods can be built by running the make.bat file.
I looked at HeapAlloc early in the development but was wary of using it due to fragmentation problems, which increase over time as array members are added and removed. OLE is reasonably well geared here, but it has always been a bit slower than the lower level allocation methods. I wanted the characteristic of the length being stored 4 bytes below the start address, as it saves any length calculations.
I grabbed (J) the same day I posted this; it was the very top beta in the MASM32 forum. I decided not to use makeit.bat and used my own c.bat to compile. To save a step I cut and pasted the code into bmark.asm.
I also got good speeds using GlobalAlloc; it was only 20 ms slower than HeapAlloc.
Have you thought about removing the prologue and epilogue code for arrset? It gets called 5 million times, and that adds up over time.
This is what I am doing to copy the data. I am definitely not using REP (it was slower, since it only starts getting fast around 64 bytes and most of the strings were a LOT smaller than that).
I do a DWORD copy at a time. The buffer will be DWORD aligned, so I copy all the DWORDs first, and then after the DWORD loop I transfer any odd bytes that don't fit into a DWORD. Here is the DWORD loop.
align 16
my_loop:
mov ecx,[esi]        ; read one DWORD from the source
mov [edx],ecx        ; write it to the destination
add esi,4            ; advance both pointers
add edx,4
sub edi,4            ; EDI = remaining byte count
jg my_loop           ; keep looping while anything is left
pop ecx              ; restore the ECX saved before this fragment
Quote from: Mark_Larson on August 09, 2008, 12:49:02 PM
Here is the Dword loop.
align 16
my_loop:
mov ecx,[esi]
mov [edx],ecx
add esi,4
add edx,4
sub edi,4
jg my_loop
About 3% faster (Celeron M) and 4 bytes shorter:
@@: mov eax,[esi+ecx]    ; ECX starts at length-4 and works down to 0
    mov [edi+ecx],eax
    sub ecx,4
    jnc @b               ; stop once ECX drops below zero
Mark,
I tried manually setting the memory size in the OLE string call in arrset$ and tried multiple algos, but they were not faster than placing the address directly into the OLE string call, so I left it at its default in the OLE library, which uses REP MOVSD internally. Apart from the OLE call there is little to improve in the arrset$ algo. I used the DWORD string length algo to set the allocation size for the OLE string, but there is no reasonable way that I know of to make the API itself any faster.
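For anyone following along, the shape of the OLE call I mean is roughly this; lpSrc, slen and the ESI/EDI roles are placeholders for illustration, not the arrset$ source.

    ; lpSrc = source string address, slen = its precomputed byte length
    invoke SysAllocStringByteLen, lpSrc, slen   ; allocates AND copies slen bytes
    mov [esi+edi*4], eax                        ; store the new member pointer (array base in ESI, index in EDI)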
This is the algo I changed in the alloc routine; I had forgotten to correct the array count for its 1 based index and it went BANG with small counts.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 16
arralloc proc mcnt:DWORD
; ----------------------------------------------------------------
; return values = handle of pointer array or 0 on allocation error
; ----------------------------------------------------------------
push esi
mov eax, mcnt ; load the member count into EAX
add eax, 1 ; correct for 1 based array
lea eax, [0+eax*4] ; multiply it by 4 for memory size
invoke GlobalAlloc,GMEM_FIXED,eax
mov esi, eax
test eax, eax ; if allocation failure return zero
jz quit
mov eax, esi
mov ecx, mcnt
mov DWORD PTR [eax], ecx ; write count to 1st member
xor edx, edx
@@:
add edx, 1 ; write the address of the null string to all members
mov [eax+edx*4], OFFSET d_e_f_a_u_l_t__n_u_l_l_$
cmp edx, ecx
jl @B
mov eax, esi ; return pointer array handle
quit:
pop esi
ret
arralloc endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
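A minimal call sketch for it; hArray and the jump label are placeholders of my own.

    invoke arralloc, 5000000              ; ask for 5 million members
    test eax, eax
    jz allocation_failed                  ; zero means the allocation failed
    mov hArray, eax                       ; keep the pointer array handle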
I pre-allocated 8 bytes per array element in arralloc. The code is running a lot faster: 266 to 281 cycles for arrset. I need to add more code, such as support for re-allocating more memory when it runs out of buffer space. Or I might just allocate a second buffer, depending on how slow re-allocating is.
Mark
EDIT: hutch, thanks for cutting and pasting the arralloc routine into the forum :)
EDIT2: my new timings; both arrset and arrfree are considerably faster.
0 ms array create
282 ms array load data
15 ms array read
0 ms array delete
EDIT3: I got rid of HeapAlloc. I am exclusively using GlobalAlloc
EDIT4: attaching the executable.
[attachment deleted by admin]
Oooh, Athlons really like your modifications Mark. :wink
Athlon dual-core x64 4000+ (WinXP x32)
5 million element test
47 ms array create
344 ms array load data
31 ms array read
16 ms array delete
Mark,
These numbers are looking good. This is on my antique Northwood PIV.
5 million element test
16 ms array create
500 ms array load data
31 ms array read
0 ms array delete
Tolerate me at the moment, I am trying to track down minor bits and pieces in version 10 of the masm32 SDK and I don't have much brain left. Just finished 6 weeks of writing a new editor, finishing off a new script engine, gutted and rebuilt the old script engine for backwards compatibility, rebuilt 3 DLLs yesterday for the project, wrote a couple of others, and I am writing web pages at the moment running on auto-pilot. :bg
Quote from: hutch-- on August 12, 2008, 03:08:42 PM
Mark,
These numbers are looking good. This is on my antique Northwood PIV.
5 million element test
16 ms array create
500 ms array load data
31 ms array read
0 ms array delete
Tolerate me at the moment, I am trying to track down minor bits and pieces in version 10 of the masm32 SDK and I don't have much brain left. Just finished 6 weeks of writing a new editor, finishing off a new script engine, gutted and rebuilt the old script engine for backwards compatibility, rebuilt 3 DLLs yesterday for the project, wrote a couple of others, and I am writing web pages at the moment running on auto-pilot. :bg
No worries :bg I'd recommend going to sleep, but I wouldn't do it either if our positions were reversed :bg
I am actually removing the prologue/epilogue code, since it gets called 5 million times. Assuming that adds roughly 2 cycles per call, that saves 10,000,000 cycles, which will also speed it up. I haven't finished that version yet, so you can always wait.
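The mechanics of dropping the stack frame are just the assembler options, roughly like this; the procedure name and the argument count are placeholders.

OPTION PROLOGUE:NONE              ; drop the automatic push ebp / mov ebp, esp
OPTION EPILOGUE:NONE              ; and the matching frame teardown

align 16
fastset proc
    ; with no frame, arguments sit at [esp+4], [esp+8] and so on
    ret 8                         ; caller pushed two DWORD arguments
fastset endp

OPTION PROLOGUE:PrologueDef       ; restore the defaults for later procedures
OPTION EPILOGUE:EpilogueDef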
I installed 64-bit Ubuntu on my computer and it ate my Windows XP. So I am currently running from Linux only. I can't even access the partition. *kicks ubuntu*
Can some people please run the code that I posted and tell me what numbers they get?
It's going to take me a few days to get back up to speed. I can run the program in Wine, but it won't run at full speed, and I can't use MASM with Wine yet, so I have no way of testing it. I do have a copy of Windows XP, but I need to do some stuff in Linux first before I can restore XP.
Quote from: Mark_Larson on August 13, 2008, 08:42:42 PM
I installed 64-bit Ubuntu on my computer and it ate my Windows XP.
Boa constrictor eats elephant? Bon appetit! :bg
I got Masm32 10.0 installed under Linux. I am going to be testing it shortly.
EDIT:
The library compiles correctly. I can get code to compile as long as it doesn't use XMM instructions. It may be because I am now using ML 6.14 instead of 6.15. I need to check the code and make sure that is the problem.
EDIT2: You can do this under Wine: run wine cmd.exe and it will give you a DOS shell for your normal DOS commands.