Benchmark and test for htodw algos.

Started by hutch--, August 03, 2010, 07:05:52 AM

ecube

Quote from: jj2007 on August 04, 2010, 09:33:58 PM
If you are not too tired, just read the sensitivity thread - it has very little to do with alignment.

the word align is said 103 times in that thread...

Antariy

Lingo's latest test (with many executables):

exe1

Intel(R) Celeron(R) CPU 2.13GHz
594 atodw library
250 Alex short
141 Lingo long
140 Alex long
407 clive short
Press any key to continue ...

Intel(R) Celeron(R) CPU 2.13GHz
594 atodw library
250 Alex short
141 Lingo long
140 Alex long
407 clive short
Press any key to continue ...

exe2

Intel(R) Celeron(R) CPU 2.13GHz
578 atodw library
250 Alex short
141 Lingo long
140 Alex long
391 clive short
Press any key to continue ...

Intel(R) Celeron(R) CPU 2.13GHz
594 atodw library
250 Alex short
140 Lingo long
141 Alex long
406 clive short
Press any key to continue ...

exe3

Intel(R) Celeron(R) CPU 2.13GHz
579 atodw library
250 Alex short
140 Lingo long
141 Alex long
406 clive short
Press any key to continue ...

Intel(R) Celeron(R) CPU 2.13GHz
579 atodw library
250 Alex short
140 Lingo long
141 Alex long
406 clive short
Press any key to continue ...

exe4

Intel(R) Celeron(R) CPU 2.13GHz
593 atodw library
250 Alex short
141 Lingo long
141 Alex long
422 clive short
Press any key to continue ...

Intel(R) Celeron(R) CPU 2.13GHz
578 atodw library
250 Alex short
141 Lingo long
140 Alex long
406 clive short
Press any key to continue ...


So, what results do we have? Lingo's proc is (probably) not dependent on code placement, but his code is not faster in this test (which is not really a real-world test, since that would use variable-length strings). Considering its VERY BIG size, this code is not very useful. But some people may have a different opinion (especially the author of the BIG algo).



Alex

Antariy


hutch--

I ran into the problem with Lingo's algo while writing the first test piece. I could get it to clock 47 ms with no problems but as I added extra algos to do the comparison against its timing altered back and forth as code was added.

The first test piece was fiddled to make sure Lingo's algo was located within the exe so it ran at its full speed and I reported the problem of code placement.

In response to Lingo flapping his mouth off I posted his identical algorithm in another test piece where his algo speed dropped by half simply due to code placement and it is this speed fluctuation that makes it unreliable.

For an algorithm to be general purpose it needs to be called from anywhere anytime and not be dependent on obscure conditions to work properly and at full speed.

I will add the two most reliable algos to the library but it will be based on real time testing across a wide range of hardware, not a test piece cooked to make one algo look good.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Antariy

You are right, Hutch.

Try testing this: "http://www.masm32.com/board/index.php?action=dlattach;topic=14540.0;id=7903"
This is your algo, Hutch, for testing, but I added variable-length strings for the test only. Real-world testing needs to cover more than fixed-length strings, doesn't it?

This is a real-world, real-time test, not a "clocks" test. Please have a look; maybe you can add some improvements.


Alex
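
The point about variable-length strings can be illustrated with a small correctness check. This is only a sketch in portable C, not any of the algos posted in this thread: `hex_to_dword` and `check_variable_lengths` are hypothetical names, and the reference converter is a plain loop, assumed here purely as a stand-in for the routines under test. It feeds the converter strings of 1 to 8 hex digits and compares each result with the C library's `strtoul`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in converter (NOT one of the thread's algos):
   turns 1..8 hex digits, upper or lower case, into a 32-bit value.
   Conversion stops at the first non-hex character. */
static uint32_t hex_to_dword(const char *s)
{
    uint32_t v = 0;
    for (; *s; ++s) {
        unsigned c = (unsigned char)*s;
        unsigned d;
        if (c >= '0' && c <= '9')      d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else break;
        v = (v << 4) | d;
    }
    return v;
}

/* Generates `count` strings whose lengths cycle through 1..8 digits
   and checks each conversion against strtoul.
   Returns the number of mismatches (0 means all agreed). */
static unsigned check_variable_lengths(unsigned count)
{
    unsigned mismatches = 0;
    uint32_t x = 0x12345678u;              /* state of a simple LCG */
    for (unsigned i = 0; i < count; ++i) {
        char buf[9];
        unsigned len = 1 + i % 8;          /* 1..8 hex digits */
        x = x * 1664525u + 1013904223u;    /* next pseudo-random value */
        /* keep only as many bits as fit in `len` hex digits */
        uint32_t value = (len == 8) ? x : (x & ((1u << (4 * len)) - 1u));
        snprintf(buf, sizeof buf, "%0*lX", (int)len, (unsigned long)value);
        if (hex_to_dword(buf) != (uint32_t)strtoul(buf, NULL, 16))
            ++mismatches;
    }
    return mismatches;
}
```

Calling `check_variable_lengths(1000)` and expecting 0 covers every length from 1 to 8 digits many times over, which a fixed 8-digit test string does not.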

ecube

Quote from: Antariy on August 04, 2010, 11:01:36 PM
E^cube, test this, please:
"http://www.masm32.com/board/index.php?action=dlattach;topic=14438.0;id=7889"
and this:
"http://www.masm32.com/board/index.php?action=dlattach;topic=14540.0;id=7903"

If I don't unpleasant to you, of course.


Alex


the second test did unpleasant me because it took awhile, but here you go.

AMD Athlon(tm) 64 Processor 3000+ (SSE3)
Alex's versions
All algos work on i386+ CPUs
ABCDEF01        Result of Unrolled
ABCDEF01        Result of Unrolled (AMD)
ABCDEF01        Result of Short
35      cycles for Fast version
22      cycles for Fast version under AMD
40      cycles for Small version in faster compilation

Codesizes:
Axhex2dw_Unrolled:      396
Axhex2dw_Unrolled_AMD:  396
Axhex2dw - Small:       69
--- ok ---



Generating test file...
1000000 = item count in file


2172 ms Lingo 1
1500 ms Lingo 2
2016 ms Alex Unrolled
1796 ms Alex Unrolled (AMD)


2172 ms Lingo 1
1500 ms Lingo 2
2032 ms Alex Unrolled
1796 ms Alex Unrolled (AMD)


2172 ms Lingo 1
1500 ms Lingo 2
2000 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)


2172 ms Lingo 1
1500 ms Lingo 2
2000 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)


2172 ms Lingo 1
1484 ms Lingo 2
2016 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)


2156 ms Lingo 1
1484 ms Lingo 2
2016 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)


2172 ms Lingo 1
1500 ms Lingo 2
2031 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)


2156 ms Lingo 1
1516 ms Lingo 2
2015 ms Alex Unrolled
1797 ms Alex Unrolled (AMD)


2168 MS Lingo average
1498 MS Lingo2 average
2015 MS Alex average
1796 MS Alex (AMD) average

Size of code:
171      Lingo 1 proc
2076     Lingo 2 proc
396      Alex proc
396      Alex proc (AMD)


Press any key to continue ...
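
The averages in the output above are consistent with the sum of the eight runs divided by eight with the fraction dropped. A minimal sketch of that arithmetic in C follows; `average_ms` is a hypothetical helper, and truncating division is an assumption inferred from the printed numbers, not taken from the test program's source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: integer average of n timings in milliseconds.
   Uses truncating division (an assumption about the test program). */
static unsigned average_ms(const unsigned *ms, size_t n)
{
    unsigned long sum = 0;
    if (n == 0)
        return 0;                 /* avoid dividing by zero */
    for (size_t i = 0; i < n; ++i)
        sum += ms[i];
    return (unsigned)(sum / n);   /* fraction is discarded */
}
```

For example, the "Alex Unrolled" runs sum to 16126 ms, and 16126 / 8 = 2015.75, which truncates to the reported 2015.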


lingo

"I could get it to clock 47 ms with no problems..
me too.. :toothy
but as I added extra algos to do the comparison against its timing altered back and forth as code was added."
May be it is the "problem"... :lol

"it ran at its full speed and I reported the problem of code placement."
So, it will be interesting where is the "right" code placement to run faster?
Or may be how to write speed optimized code and place it at right place
or what is a bad code placement and how to avoid it, etc...
It will be a new invented chapter with rules in the Intel's Optimization Reference Manual... :lol

"I will add the two most reliable algos to the library..."

as usual, so that every stupid lamer can fix and improve them, but it is your human right and responsibility as an owner... :toothy
Example

hutch--

 :bg

I could also get it to clock twice as slow and that was the problem: with no algorithm modification at all, the algo slowed to the speed of the small versions because it has a problem with code placement.

Now while the rest were reasonably consistent in their timings apart from the old library version "htodw" that I placed first, the problem is you cannot pick its performance without individually benchmarking an application after each piece of code is added to it and that renders the algorithm unworkable in its present form.

For Lingo flapping his mouth off the question is "Does the clock tell lies ?" I suggest it does not. In motor racing there is an expression "When the flag drops the bullsh*t stops" and while having an algo that is fast in some places may be good for flapping your mouth off, when the end result is it's unreliable in terms of timing, then it's no use. An F1 car may win a race but if it can't win a series it's a loser.  :bdg

hutch--

I have got a bit more consistency out of Lingo's algo by setting up the data to be converted for test in this form. All of the other test algos do not change timings.


    .data
      hword db "C0000000",0
      align 16
      pword dd hword

jj2007

Quote from: E^cube on August 04, 2010, 10:48:47 PM
Quote from: jj2007 on August 04, 2010, 09:33:58 PM
If you are not too tired, just read the sensitivity thread - it has very little to do with alignment.

the word align is said 103 times in that thread...

You are good at counting words. If you had understood the contents of the thread, you would know that alignment by itself does not explain the code location sensitivity. As was demonstrated in Hutch's example.

ecube

Quote from: jj2007 on August 05, 2010, 07:53:20 AM
You are good at counting words. If you had understood the contents of the thread, you would know that alignment by itself does not explain the code location sensitivity. As was demonstrated in Hutch's example.
Quote from: E^cube on August 04, 2010, 06:57:55 PM
alignment and code location aren't the same thing, if it were a simple alignment issue i'm sure hutch would have tweaked it by now.

reading and actually comprehending is the key here, of which you're lacking, and you're right I am good at counting, I said the above on the second page (that's 2) and you repeated what I said above on this page which is 3.  And since 2 comes before 3, you do the math  :thumbu

sinsi

One thing no-one has mentioned is things like ASLR and DEP which depend on the version of windows you are using as well as the CPU.
There are way too many variables when timing code, little things like RAM size can play a part.

Maybe we need a 'windows version' as well as a 'cpu version'
Light travels faster than sound, that's why some people seem bright until you hear them.

ecube

Quote from: sinsi on August 05, 2010, 08:21:14 AM
One thing no-one has mentioned is things like ASLR and DEP which depend on the version of windows you are using as well as the CPU.
There are way too many variables when timing code, little things like RAM size can play a part.

Maybe we need a 'windows version' as well as a 'cpu version'

I think that's a good idea, it'll be the template for all future speed tests, and it can include cpu usage(for all cores), detailed windows info,DEP settings,amount of ram etc...

hutch--

I would not hold out any great hope of higher level factors like DEP playing much of a part in intense algorithm timings. Apart from a processor's capacity to schedule instructions depending on their order, the only other factor I can think of that does affect timing is memory speed but it has never been that big a difference. The later P4 generation was running DDR400 followed by DDR2 533, a couple of faster versions and DDR3 at 1333 and I have seen 1666 somewhere. It does speed up memory operation but it has limited impact on raw timings as the processor is still a lot faster than memory.

Now processor task switching intervals can have an effect but it tends to have been reasonably uniform across versions as old as Win95oem through to current Win7 OS versions. From memory, intensive algos get faster with longer time slices so if you are using an OS version that allows you to adjust the time slice you can alter that to speed algos up. Something that really did mess up timings was a hyperthreading processor running on Win2000, you had to turn it off in the BIOS as Win2000 did not properly support hyperthreading and saw it as 2 processors.

jj2007

Quote from: hutch-- on August 05, 2010, 03:53:31 AM
I have got a bit more consistency out of Lingo's algo by setting up the data to be converted for test in this form. All of the other test algos do not change timings.


    .data
      hword db "C0000000",0
      align 16
      pword dd hword


Hutch,
I tested that one using this macro:

hutchtrick = 0
fnMAC MACRO algo, ptr ; second arg ignored
LOCAL tmp$
  if hutchtrick
tmp$ CATSTR <fn >, <algo>, <, pword>
  else
tmp$ CATSTR <fn >, <algo>, <, offset hword>
  endif
  tmp$
ENDM


However, Lingo's algo is consistent with and without the "hutchtrick":

Intel(R) Pentium(R) 4 CPU 3.40GHz
328 htodw JJ short (124 bytes)
1813 atodw library
781 Alex short
438 Lingo long
468 Alex long
1265 clive short


On the contrary, when using the simple fn algo, "c0000", the first run was consistently much slower for Lingo's algo (and only for Lingo's algo).

Given that fn algo, "c0000" creates a new string every time, one explanation could indeed be the size of his code - see Rockoon's last post in the "sensitivity" thread.

For comparison (consistent timings):
Celeron M, icnt = 50000000:
453 htodw JJ short (124 bytes)
3032 atodw library
1484 Alex short
781 Lingo long
797 Alex long
1500 clive short
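
A slower first run, as described in the post above, is easy to lose when all runs are averaged together. One possible way to keep it visible is to report the first run separately from the best of the remaining runs. The sketch below is portable C rather than MASM, and `run_summary`, `summarize_runs` and `time_once_ms` are hypothetical names, not part of any testbed in this thread; `clock()` is used only as a coarse, portable timer.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical summary of a series of timed runs. */
typedef struct {
    double first_ms;  /* the first (possibly cold) run            */
    double best_ms;   /* fastest run among the runs after the first */
} run_summary;

/* Splits the timings into "first run" and "best of the rest".
   With a single timing, both fields hold that one value. */
static run_summary summarize_runs(const double *ms, size_t n)
{
    run_summary r = { 0.0, 0.0 };
    if (n == 0)
        return r;
    r.first_ms = ms[0];
    r.best_ms  = ms[0];
    for (size_t i = 1; i < n; ++i)
        if (i == 1 || ms[i] < r.best_ms)
            r.best_ms = ms[i];
    return r;
}

/* Times one call of `work` in milliseconds of CPU time. */
static double time_once_ms(void (*work)(void))
{
    clock_t t0 = clock();
    work();
    return 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

A caller would fill an array by invoking `time_once_ms` several times on the same routine and then pass it to `summarize_runs`; a large gap between `first_ms` and `best_ms` for only one algo would match the pattern reported for the `fn algo, "c0000"` case.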