Benchmark and test for htodw algos.

hutch-- · August 03, 2010, 07:05:52 AM

I have followed, on and off, the algo development for replacing the htodw that Alex Yakubchik wrote years ago. Without claiming to do justice to the full range of algos developed, I have written a benchmark to test the algos in real time. The fastest non-SSE algo is Lingo's long version, but for reasons I cannot accurately track down it is highly sensitive to code placement and to the code before and after it, even with controlled padding between the algos. Alex's long version is very consistent over a wide range of tests and on different hardware.

For the short versions, Alex's appears to be the most consistent over different processors; Clive's is faster on this quad but a lot slower on the PIVs.

What I have in mind with these algos is to select a short and a long version, naming them "htodw" and "htodw_ex" respectively, so that the user has the choice of size or speed where they think it matters.
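
As a rough illustration of the size-versus-speed choice, here is a minimal C sketch. It assumes htodw takes a zero-terminated hex string and returns a 32-bit value. The function bodies, the `hex_digit` helper and the `HEX_STEP` macro are hypothetical and are not any of the algos benchmarked in this thread: the first version is one small loop, the second does the same work unrolled for up to 8 digits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Size-oriented sketch: one small loop, stops at the first non-hex character. */
uint32_t htodw(const char *s)
{
    uint32_t value = 0;
    int d;
    assert(s != NULL);
    while ((d = hex_digit((unsigned char)*s++)) >= 0)
        value = (value << 4) | (uint32_t)d;
    return value;
}

/* One unrolled step: convert digit i, or return early at the first non-hex character. */
#define HEX_STEP(i)                          \
    d = hex_digit((unsigned char)s[i]);      \
    if (d < 0) return value;                 \
    value = (value << 4) | (uint32_t)d;

/* Speed-oriented sketch: same work, unrolled for up to 8 digits, no loop branch. */
uint32_t htodw_ex(const char *s)
{
    uint32_t value = 0;
    int d;
    assert(s != NULL);
    HEX_STEP(0) HEX_STEP(1) HEX_STEP(2) HEX_STEP(3)
    HEX_STEP(4) HEX_STEP(5) HEX_STEP(6) HEX_STEP(7)
    return value;
}
```

Both functions give the same result for strings of up to 8 hex digits; the unrolled one simply trades code size for fewer branches, which is the kind of trade the two names are meant to expose.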

Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz
250 atodw library
109 Alex short
47 Lingo long
63 Alex long
94 clive short
Press any key to continue ...

Prescott P4
Genuine Intel(R) CPU 3.80GHz
312 atodw library
141 Alex short
62 Lingo long
78 Alex long
219 clive short
Press any key to continue ...

Northwood P4
Intel(R) Pentium(R) 4 CPU 2.80GHz
390 atodw library
172 Alex short
78 Lingo long
94 Alex long
266 clive short
Press any key to continue ...

Intel(R) Celeron(TM) CPU 1200MHz
1452 atodw library
471 Alex short
241 Lingo long
211 Alex long
541 clive short
Press any key to continue ...

I could not be bothered turning the i7 on, as it's too much messing around to get the data in and out, but it behaves similarly to the Core2 quad.

frktons · August 03, 2010, 07:52:02 AM

On my Core 2 Duo:

Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
468 atodw library
125 Alex short
94 Lingo long
94 Alex long
140 clive short
Press any key to continue ...

:U

Rockoon · August 03, 2010, 01:03:09 PM

Cannot Identify x86 Processor
265 atodw library
124 Alex short
63 Lingo long
78 Alex long
109 clive short
Press any key to continue ...

That's a Phenom II X6 1055T

(What's up with your CPU detection?)

sinsi · August 03, 2010, 01:11:06 PM

Poor showing here :(

Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
483 atodw library
124 Alex short
78 Lingo long
78 Alex long
140 clive short

Rockoon · August 03, 2010, 01:58:19 PM

Note that both frktons and sinsi benched the same CPU, but got significantly different timings. It's all down to the memory timings, it seems.

sinsi · August 03, 2010, 02:24:21 PM

Maybe quad vs duo? Memory shouldn't matter (maybe cache?). I have DDR2 1066.

Commenting out the 'rcnt' macro is a bit better...

Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
327 atodw library
125 Alex short
62 Lingo long
78 Alex long
124 clive short

Alignment is such a finicky thing eh?

frktons · August 03, 2010, 03:18:52 PM

Quote from: sinsi on August 03, 2010, 02:24:21 PM
Maybe quad vs duo? Memory shouldn't matter (maybe cache?). I have DDR2 1066.

Commenting out the 'rcnt' macro is a bit better...

Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
327 atodw library
125 Alex short
62 Lingo long
78 Alex long
124 clive short

Alignment is such a finicky thing eh?

Well, the only difference is quad vs duo. I have DDR2 1066 as well.

frktons · August 03, 2010, 03:33:09 PM

To compile the test I had to use the .686 directive; it won't assemble without it.

Like sinsi, I commented out the rcnt macro, and the timings are a bit better:

Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
328 atodw library
140 Alex short
78 Lingo long
93 Alex long
140 clive short
Press any key to continue ...

I tried with .486 - .586 but it didn't compile. ::)
In the first test I just used the executable.

Trying it many times I get different results, like:

Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
327 atodw library
140 Alex short
78 Lingo long
78 Alex long
156 clive short
Press any key to continue ...

and sometimes Lingo is slower than Alex as well. ::)

Antariy · August 03, 2010, 10:57:00 PM

Guys, please test my original version.
Link to the archive: "http://www.masm32.com/board/index.php?action=dlattach;topic=14438.0;id=7889"
The link points to an archive that is already attached in another thread.

The archive contains my original versions of the procs. Hutch tuned my short version of the algo, and on some hardware (on my CPU, for one) the tuned algo seems to be slower than my version. On most non-high-end CPUs, my version is faster.

Please test it, so that it becomes clear on which hardware the short proc gives the most satisfactory results.

Thanks!

Alex

Antariy · August 03, 2010, 11:07:43 PM

Hutch,
at this link "http://www.masm32.com/board/index.php?topic=14438.msg117187#msg117187" I posted some thoughts on why Lingo's long algo may become inconsistent when things change: placement of the algo, data, padding, etc.
Read the post below it as well. It has some more thoughts.

Alex

Antariy · August 03, 2010, 11:12:12 PM

Guys, you forgot to add these lines to the test sources:

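	invoke GetCurrentThread            ; pseudo-handle of the current thread is returned in eax
	invoke SetThreadAffinityMask,eax,1 ; restrict the thread to the first core (mask bit 0)
	invoke Sleep,0                     ; <--- force a switch by dropping the current time slice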

Maybe this is the reason for the inconsistent timings on Core2 with one core and with four cores. Usually this does not change anything, but in tests on a quad-core Core i7 I got similar results: the timings were very inconsistent and differed by tens of clocks.

Alex

Antariy · August 03, 2010, 11:28:32 PM

Quote from: sinsi on August 03, 2010, 02:24:21 PM
Commenting out the 'rcnt' macro is a bit better...
...............skipped............
Alignment is such a finicky thing eh?

Yes, it seems that Lingo's algo with the big look-up table is very dependent on alignment and placement in the file.
It needs some "dancing with a tambourine" to make it work faster :)
But it has one advantage: his algo uses a pre-calculated table with all the needed values (a very BIG table), so it *may* be the fastest algo in the best case.
My algo has a "trimmed" look-up table, and some simple math is used in the algo to produce the valid values. So the algo is slower in theory, but under multi-tasking Windows and with a limited (yes, big, but *limited*) cache size, my algo (and any algo with a small look-up table) has some advantages.
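
The table trade-off can be sketched in C as follows. This is only an assumed layout for illustration, not the actual tables from Lingo's or my procs: the "big" variant keeps one pre-shifted entry per digit position (8 x 256 dwords), the "small" variant keeps raw digit values and shifts at run time. Both functions assume exactly 8 valid hex digits; all names here are hypothetical.

```c
#include <stdint.h>

static uint32_t big_table[8][256];  /* 8 KB: one pre-shifted entry per digit position */
static uint8_t  small_table[256];   /* 256 bytes: raw digit values only */

/* Fill both tables; characters that are not hex digits stay mapped to 0. */
void init_tables(void)
{
    static const char digits[] = "0123456789ABCDEF";
    int pos, v;
    for (v = 0; v < 16; ++v) {
        unsigned char upper = (unsigned char)digits[v];
        /* Setting bit 5 turns 'A'..'F' into 'a'..'f' and leaves '0'..'9' unchanged. */
        unsigned char lower = (unsigned char)(upper | 0x20);
        small_table[upper] = small_table[lower] = (uint8_t)v;
        for (pos = 0; pos < 8; ++pos) {
            uint32_t shifted = (uint32_t)v << (28 - 4 * pos);
            big_table[pos][upper] = big_table[pos][lower] = shifted;
        }
    }
}

/* Big-table sketch: eight independent loads OR-ed together, no shifts at run time. */
uint32_t hex8_big(const unsigned char *s)
{
    return big_table[0][s[0]] | big_table[1][s[1]] |
           big_table[2][s[2]] | big_table[3][s[3]] |
           big_table[4][s[4]] | big_table[5][s[5]] |
           big_table[6][s[6]] | big_table[7][s[7]];
}

/* Small-table sketch: one load plus a shift and an OR per digit. */
uint32_t hex8_small(const unsigned char *s)
{
    uint32_t value = 0;
    int i;
    for (i = 0; i < 8; ++i)
        value = (value << 4) | small_table[s[i]];
    return value;
}
```

After calling `init_tables()`, both functions return the same value for the same 8-digit input. The difference is that the big variant spreads its loads over 8 KB of table data while the small one touches 256 bytes, so the small one has less to lose when other code pushes the table out of cache.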

Alex

dedndave · August 04, 2010, 12:21:29 AM

Frank - you might try this to see if it helps get more consistent timings...

[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet002\Control\Session Manager\Throttle]
"PerfEnablePackageIdle"=dword:00000001

take several readings prior to applying the change
then, take several more afterward to verify that it helps
it makes a big difference on my P4 prescott w/htt

hutch-- · August 04, 2010, 01:15:44 AM

Just be careful in how you modify the test conditions for algos of this type. For all of its irritations, ring3 real-time testing is the closest you will get to the conditions that the algorithms are used under, and narrowing these conditions can alter the results so that they are no longer representative.

Lingo's algo has been faster on most of the recent hardware, but it is highly sensitive to code placement both before and after it, and this can slow it down by over 50%. I added the variable padding between algos for a reason: it is a way of testing how sensitive the code is to code placement. You can vary the iteration count and the style of padding (nop) (db 0) (int 3) to test if these make any difference on particular hardware. I also found that Lingo's algo was slower if it had not first been run while the rest were effectively indifferent to this problem.

For general use the take off time of an algo is important as a vast number of algorithms get called from time to time in normal software operation rather than being set up for an optimised pass at the data.
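
One way to look at the first-run effect is to time two identical passes back to back and compare them. The C sketch below is only an assumption about how such a check could be structured, not the code of the benchmark posted in this thread. `sample_htodw` is a hypothetical stand-in for an algo under test that expects valid hex digits only, and `clock()` is used because it is the only timer in standard C.

```c
#include <stdint.h>
#include <time.h>

typedef uint32_t (*hex_fn)(const char *);

struct pass_times {
    double   cold_ms;   /* first timed pass, nothing run beforehand */
    double   warm_ms;   /* identical second pass */
    uint32_t checksum;  /* sum of all results, keeps the calls from being optimised away */
};

/* Hypothetical stand-in for an algo under test; assumes valid hex digits only. */
uint32_t sample_htodw(const char *s)
{
    uint32_t value = 0;
    for (; *s != '\0'; ++s) {
        unsigned c = (unsigned char)*s;
        /* '0'..'9' map directly; for letters the low 3 bits give 1..6, so add 9. */
        value = (value << 4) | (c <= '9' ? c - '0' : (c & 7u) + 9u);
    }
    return value;
}

/* Time one pass of `iterations` calls, in milliseconds as reported by clock(). */
static double run_pass(hex_fn fn, const char *input, long iterations,
                       volatile uint32_t *sum)
{
    clock_t start = clock();
    long i;
    for (i = 0; i < iterations; ++i)
        *sum += fn(input);
    return (double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC;
}

/* Run a cold pass followed by a warm pass of the same algo on the same input. */
struct pass_times time_algo(hex_fn fn, const char *input, long iterations)
{
    struct pass_times t;
    volatile uint32_t sum = 0;
    t.cold_ms = run_pass(fn, input, iterations, &sum);
    t.warm_ms = run_pass(fn, input, iterations, &sum);
    t.checksum = sum;
    return t;
}
```

Calling `time_algo(sample_htodw, "1A2B3C4D", 10000000)` and comparing `cold_ms` with `warm_ms` gives a crude indication of how much an algo depends on having been run before; a real test would want a finer timer and several repetitions.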

Just as a reference for the results I originally posted, the quad Core2 I use has 1333 memory, which may affect the results.

lingo · August 04, 2010, 11:49:49 AM

"I have a question for Lingo. I have been testing your unrolled algo and in isolation it is the fastest of the non SSE/MMX versions by a reasonable amount but as soon as I add other algos to test against it slows down by over 50%. I have tried various things including putting the tables in the .DATA section which helped a little but when I added another algo for testing it slowed down by a long way. Any idea why its so sensitive to code placement before and after it ?"

But here we see you got the idea and found the way: :lol

"it is a way of testing how sensitive the code is to code placement."
How did you do it? Could you post some external links about your method? :lol

"I also found that Lingo's algo was slower if it had not first been run while the rest were effectively indifferent to this problem."
Is it possible to post two different asm and exe files to prove it? :lol

And the epilog:
"For general use the take off time of an algo is important as a vast number of algorithms get called from time to time in normal software operation rather than being set up for an optimized pass at the data."
It is the same as:
"Yes, it seems, which Lingo's algo with big look-up table is very dependent from alignment and placement in the file. Need some "dancing with tambourine" to do it works faster :)" by the asian lamer with archaic CPU.

So, in other words, Lingo is some kind of liar who tries to manipulate people, and Lingo's algos are achieved by fraud...
