Benchmark and test for htodw algos.

Started by hutch--, August 03, 2010, 07:05:52 AM

hutch--

I have followed, on and off, the development of algos to replace the htodw that Alex Yakubchik wrote years ago. Without claiming to do justice to the full range of algos developed, I have written a benchmark to test them in real time. The fastest non-SSE algo is Lingo's long version, but for reasons I cannot accurately track down it is highly sensitive to code placement and to the code before and after it, even with controlled padding between the algos. Alex's long version is very consistent over a wide range of tests and on different hardware.

For the short versions, Alex's appears to be the most consistent across different processors; Clive's is faster on this quad but a lot slower on the PIVs.

What I have in mind with these algos is to select a short and a long version, naming them "htodw" and "htodw_ex" respectively, so that the user has the choice of size or speed where they think it matters.
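
For readers without the MASM sources to hand, here is a rough C sketch of the job a size-oriented "short" hex-to-dword routine does. It is not any of the benchmarked algos: the name `htodw_short`, the acceptance of both upper and lower case, and stopping at the first non-hex character are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size-oriented converter: one loop, no tables.
   Assumption: conversion stops at the first character that is not a hex digit. */
uint32_t htodw_short(const char *s)
{
    uint32_t result = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        unsigned c = (unsigned char)*s;
        unsigned nibble;

        if (c >= '0' && c <= '9') {
            nibble = c - '0';
        } else {
            c |= 0x20;                      /* fold 'A'..'F' onto 'a'..'f' */
            if (c < 'a' || c > 'f')
                break;                      /* not a hex digit: stop here */
            nibble = c - 'a' + 10;
        }
        result = (result << 4) | nibble;    /* make room, then add the new nibble */
    }
    return result;
}
```

Calling `htodw_short("FF")` gives 255. A speed-oriented "long" version would typically trade this compact loop for unrolled code and look-up tables.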


Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz
250 atodw library
109 Alex short
47 Lingo long
63 Alex long
94 clive short
Press any key to continue ...

Prescott P4
Genuine Intel(R) CPU 3.80GHz
312 atodw library
141 Alex short
62 Lingo long
78 Alex long
219 clive short
Press any key to continue ...

Northwood P4
Intel(R) Pentium(R) 4 CPU 2.80GHz
390 atodw library
172 Alex short
78 Lingo long
94 Alex long
266 clive short
Press any key to continue ...

Intel(R) Celeron(TM) CPU 1200MHz
1452 atodw library
471 Alex short
241 Lingo long
211 Alex long
541 clive short
Press any key to continue ...


I could not be bothered turning the i7 on, as it's too much messing around to get the data in and out, but it behaves similarly to the Core2 quad.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

frktons

On my Core 2 Duo:

Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
468 atodw library
125 Alex short
94 Lingo long
94 Alex long
140 clive short
Press any key to continue ...


:U
Mind is like a parachute. You know what to do in order to use it :-)

Rockoon

Cannot Identify x86 Processor
265 atodw library
124 Alex short
63 Lingo long
78 Alex long
109 clive short
Press any key to continue ...

That's a Phenom II X6 1055T

(what's up with your CPU detection?)
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

sinsi

Poor showing here  :(

Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
483 atodw library
124 Alex short
78 Lingo long
78 Alex long
140 clive short

Light travels faster than sound, that's why some people seem bright until you hear them.

Rockoon

Note that both frktons and sinsi benched the same CPU, but got significantly different timings. It's all down to the memory timings, it seems.

sinsi

Maybe quad vs duo? Memory shouldn't matter (maybe cache?). I have DDR2 1066.

Commenting out the 'rcnt' macro is a bit better...

Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
327 atodw library
125 Alex short
62 Lingo long
78 Alex long
124 clive short

Alignment is such a finicky thing eh?

frktons

Quote from: sinsi on August 03, 2010, 02:24:21 PM
Maybe quad vs duo? Memory shouldn't matter (maybe cache?). I have DDR2 1066.

Commenting out the 'rcnt' macro is a bit better...

Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
327 atodw library
125 Alex short
62 Lingo long
78 Alex long
124 clive short

Alignment is such a finicky thing eh?

Well, the only difference is quad vs duo. I have DDR2 1066 as well.

frktons

To build the test I had to use the .686 directive; it won't assemble without it.

Like sinsi, I commented out the rcnt macro, and the timings are a bit better:


Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
328 atodw library
140 Alex short
78 Lingo long
93 Alex long
140 clive short
Press any key to continue ...


I tried .486 and .586, but it didn't assemble.  ::)
For the first test I just used the supplied executable.

Trying it many times I get different results, like:

Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
327 atodw library
140 Alex short
78 Lingo long
78 Alex long
156 clive short
Press any key to continue ...


and sometimes Lingo is slower than Alex as well.  ::)

Antariy


Everyone, please test my original version.
Link to the archive: "http://www.masm32.com/board/index.php?action=dlattach;topic=14438.0;id=7889"
The link points to an archive that is already attached in another thread.

The archive contains my original versions of the procs. Hutch tuned my short version of the algo, and on some hardware (my CPU, for one) the tuned version seems to be slower than my original. On most non-high-end CPUs my version is faster.

Please test this, so that it is clear on which hardware the short proc gives the most satisfactory results.

Thanks!


Alex

Antariy

Hutch,
in this post "http://www.masm32.com/board/index.php?topic=14438.msg117187#msg117187" I wrote down some thoughts on why Lingo's long algo may behave inconsistently when things change: placement of the algo, data, padding, etc.
Read the post below it as well; it has some more thoughts.



Alex

Antariy

Guys, you forgot to add these lines to the test sources:

invoke GetCurrentThread               ; eax = pseudo-handle of the current thread
invoke SetThreadAffinityMask,eax,1    ; pin the thread to the first logical CPU
invoke Sleep,0                        ; give up the rest of the time slice so the thread is rescheduled


Maybe this is the reason for the inconsistent timings between Core2 machines with different numbers of cores. Usually it changes nothing, but in tests on a quad-core Core i7 I got similar results: the timings were very inconsistent, differing by tens of clocks.


Alex

Antariy

Quote from: sinsi on August 03, 2010, 02:24:21 PM
Commenting out the 'rcnt' macro is a bit better...
...............skipped............
Alignment is such a finicky thing eh?

Yes, it seems that Lingo's algo, with its big look-up table, depends heavily on alignment and placement in the file.
It takes some "dancing with a tambourine" to make it run faster :)
But it has one advantage: it uses a pre-calculated table holding all the needed values (a very BIG table), so it *may* be the fastest algo in the best case.
My algo has a "trimmed" look-up table and uses some simple math to produce the valid values. So it is slower in theory, but under multi-tasking Windows, with limited (big, yes, but *limited*) cache sizes, my algo (like any algo with a small look-up table) has some advantages.
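
To make the table trade-off concrete, here is a rough C sketch of the two styles. It is an illustration only, not Lingo's or Alex's actual code: the table layouts, the 8 KB and 256-byte sizes, and the names `big_tab`, `small_tab`, `hex8_big` and `hex8_small` are all assumptions, and both functions expect exactly eight hex digits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t  small_tab[256];     /* 256 bytes: character -> nibble value        */
static uint32_t big_tab[8][256];    /* 8 KB: character -> nibble already shifted   */
                                    /* into place for each of the 8 digit positions */

/* Nibble value of a character; non-hex characters map to 0 in this sketch. */
static unsigned nibble_of(unsigned c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return 0;
}

/* Fill both tables once, before any conversion is done. */
void init_tables(void)
{
    for (unsigned c = 0; c < 256; ++c) {
        unsigned n = nibble_of(c);
        small_tab[c] = (uint8_t)n;
        for (unsigned pos = 0; pos < 8; ++pos)
            big_tab[pos][c] = (uint32_t)n << (28 - 4 * pos);
    }
}

/* "Big table" style: eight loads OR'ed together, no shifts at run time. */
uint32_t hex8_big(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL);
    return big_tab[0][p[0]] | big_tab[1][p[1]] | big_tab[2][p[2]] | big_tab[3][p[3]]
         | big_tab[4][p[4]] | big_tab[5][p[5]] | big_tab[6][p[6]] | big_tab[7][p[7]];
}

/* "Trimmed table" style: one small table plus a shift per digit. */
uint32_t hex8_small(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t result = 0;

    assert(s != NULL);
    for (int i = 0; i < 8; ++i)
        result = (result << 4) | small_tab[p[i]];
    return result;
}
```

After `init_tables()`, both functions return the same value for the same input. The big-table version does less arithmetic per digit, but its 8 KB of data has to be in cache to pay off, which is one plausible reading of the cold-run and placement sensitivity discussed in this thread.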



Alex

dedndave

Frank - you might try this to see if it helps get more consistent timings...
[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet002\Control\Session Manager\Throttle]
"PerfEnablePackageIdle"=dword:00000001

take several readings prior to applying the change
then, take several more afterward to verify that it helps
it makes a big difference on my P4 prescott w/htt

hutch--

Just be careful how you modify the test conditions for algos of this type. For all of its irritations, ring3 real-time testing is the closest you will get to the conditions the algorithms are actually used under, and narrowing those conditions can alter the results so that they are no longer representative.

Lingo's algo has been faster on most of the recent hardware, but it's highly sensitive to code placement both before and after it, and this can slow it down by over 50%. I added the variable padding between algos for a reason: it is a way of testing how sensitive the code is to code placement. You can vary the iteration count and the style of padding (nop), (db 0), (int 3) to test whether these make any difference on particular hardware. I also found that Lingo's algo was slower if it had not first been run while the rest were effectively indifferent to this problem.

For general use the take off time of an algo is important as a vast number of algorithms get called from time to time in normal software operation rather than being set up for an optimised pass at the data.
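
One way to look at take-off time separately from steady-state speed is to time every call on its own and keep the first (cold) reading apart from the fastest one. The C sketch below is a minimal illustration under assumptions, not the benchmark from this thread: `time_cold_and_best`, `now_ns` and the stand-in converter `demo_htodw` are hypothetical names, and the clock is C11 `timespec_get`, whose resolution varies by platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*conv_fn)(const char *);

/* Stand-in converter so the harness has something to call (digits and A-F only). */
uint32_t demo_htodw(const char *s)
{
    uint32_t r = 0;
    for (; *s != '\0'; ++s)
        r = (r << 4) | (uint32_t)(*s <= '9' ? *s - '0' : (*s & 0x0F) + 9);
    return r;
}

/* Wall-clock time in nanoseconds (C11 timespec_get). */
static int64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Calls fn(arg) `reps` times, timing each call separately.
   *first_ns receives the time of the very first call (the "take-off" time),
   *best_ns the fastest single call seen, the first one included.
   Returns the result of the last call so the work cannot be optimised away. */
uint32_t time_cold_and_best(conv_fn fn, const char *arg, int reps,
                            int64_t *first_ns, int64_t *best_ns)
{
    uint32_t last = 0;

    assert(fn != NULL && reps > 0);
    for (int i = 0; i < reps; ++i) {
        int64_t t0 = now_ns();
        last = fn(arg);
        int64_t dt = now_ns() - t0;

        if (i == 0) {
            *first_ns = dt;
            *best_ns = dt;
        } else if (dt < *best_ns) {
            *best_ns = dt;
        }
    }
    return last;
}
```

A large gap between the first and the best reading suggests the routine depends on warm caches, which matters for code that is called only occasionally.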

Just as a reference for the results I originally posted, the quad Core2 I use has 1333 memory, which may affect the results.

lingo

"I have a question for Lingo. I have been testing your unrolled algo and in isolation it is the fastest of the non SSE/MMX versions by a reasonable amount but as soon as I add other algos to test against it slows down by over 50%. I have tried various things including putting the tables in the .DATA section which helped a little but when I added another algo for testing it slowed down by a long way. Any idea why its so sensitive to code placement before and after it ?"

But here we see you got the idea and found the way:  :lol

"it is a way of testing how sensitive the code is to code placement."
How did you do it? Could you post some external links about your method?   :lol

"I also found that Lingo's algo was slower if it had not first been run while the rest were effectively indifferent to this problem."
Is it possible to post two different asm and exe files to prove it?  :lol

And the epilog:
"For general use the take off time of an algo is important as a vast number of algorithms get called from time to time in normal software operation rather than being set up for an optimized pass at the data."
It is the same as:
"Yes, it seems, which Lingo's algo with big look-up table is very dependent from alignment and placement in the file. Need some "dancing with tambourine" to do it works faster :)" by the asian lamer with archaic CPU.

So, in other words, Lingo is some kind of liar who tries to manipulate people, and Lingo's algos are achieved by fraud...