The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: nixeagle on May 01, 2012, 02:02:34 AM

Title: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 01, 2012, 02:02:34 AM
First off, I'm new so please be gentle. This post will be long.

Getting a statistically reliable baseline is not easy as we have multiple confounding inputs from operating system interrupts/context switches, cache misses, hyperthreading,  CPU power state/throttling and so on. In the face of this, we want to establish a method of measuring how good our data is, just taking the average or searching for the minimum number of ticks recorded is not enough without characterizing the data quality. Would you consider the following data useful? (Context switches were omitted as they take 1000s of ticks.)

(http://i.imgur.com/M1nJU.png)

My answer is emphatically no! For those wondering, the parts of the plot around 100,000 and 250,000 match up to the portions of the test's runtime where I jiggled the mouse around. Yes, jiggling the mouse or attempting to do other work while a test is running can introduce confounding. Most likely this is obvious and or already known to you guys. But this leads to the question of:
QuoteHow can we identify when our timing results are screwed up?

If we time the code by simply taking the average of all runs we get 74.0773 ticks. Alternatively we can count only the fastest run which is 42 ticks. Both numbers only tell a very small part of the picture though. A program that simply outputs one or both is hiding from us the glaring problem in our test data. Running the test again can very easily give us completely different results. No jiggling of the mouse at all will result in a test that looks something like:

(http://i.imgur.com/aqr9f.png)

Which while not perfect, is a step in the right direction as far as repeatability and reliability go. We can still do better, and we will at at the end of this post. Before we do so, let us ask:
QuoteHow do we tell the difference between graph 1 and graph 2?
2 test!

With the following timing code, placed inside of a loop:
00000011 <timeit()>:
  11:   0f 31                   rdtsc
  13:   0f 31                   rdtsc
  15:   0f 31                   rdtsc
  17:   0f ae f0                mfence
  1a:   0f 31                   rdtsc
  1c:   89 c1                   mov    ecx,eax
  1e:   0f ae f0                mfence
  21:   0f 31                   rdtsc
  23:   29 c8                   sub    eax,ecx
  25:   c3                      ret


we manage to get stable samples. Each run through the loop takes 154 ticks. Slow, but consistantly slow! First a plot, then we will take a quick look at the numerical measurements of sample quality.

(http://i.imgur.com/CGVZ1.png)

While still scattered about a bit, we are much better in terms of standard deviation and percentile. The numbers:

Look at the consistency! Yes, we still have unexplained data points that are not 154 ticks, but the vast, vast majority key in at 154 ticks! Ignoring potential problems, this timing code is "ideal" as the loop will take a consistent 154 ticks every time through. This assists us in fairly measuring competing algorithm speeds.

However, this does not imply that the method is flawless!

At the very least, I hope my post starts some interesting conversation on measuring the quality of timing results. I hope it is clear why a simple average or just taking the minimum value is not sufficient.



Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 01, 2012, 03:13:20 AM
i am interested to see what code you are timing
as well as the timing code, itself

have you tried timing the same code with Michael's macros ?
perhaps not perfect - i doubt any timing code is - but we are all very "comfortable" with the results
it would be interesting to see how it compares with your timing code
Michael uses CPUID to serialize instructions

also - if you notice the example code i posted in the other thread...
on multicore machines, it may be easier to get stable readings by using SetProcessAffinityMask to bind to a single core
i find that works best in here - some of us have older P4's that jump around, otherwise
to compare apples with apples, it is best to bind for all processors
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 01, 2012, 03:19:34 AM
...i might add

the absolute times aren't what we are after, so much as comparing one algorithm to another on diffierent CPU's
our idea of "good timing code" is that which provides repeatable results on a variety of cores, while timing a variety of code
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 01, 2012, 03:42:45 AM
Thanks for taking the time to read through my admittedly long post dedndave! The code for the last plot (the stable one) is as follows: (Please note that this is the result of objdump after compiling some C++ code using GCC). For the purposes of my investigation it is good enough. The code loops through and prints each run's result to the standard output which is then piped to a file. From there I use mathematica to generate pretty plots and compute various statistics; two of which are mentioned above.


00000000 <__tcf_0>:
   0:   83 ec 0c                sub    esp,0xc
   3:   b9 00 00 00 00          mov    ecx,0x0
                        4: dir32        .bss
   8:   e8 00 00 00 00          call   d <__tcf_0+0xd>
                        9: DISP32       std::ios_base::Init::~Init()
   d:   83 c4 0c                add    esp,0xc
  10:   c3                      ret

00000011 <timeit()>:
  11:   0f 31                   rdtsc
  13:   0f 31                   rdtsc
  15:   0f 31                   rdtsc
  17:   0f ae f0                mfence
  1a:   0f 31                   rdtsc
  1c:   89 c1                   mov    ecx,eax
  1e:   0f ae f0                mfence
  21:   0f 31                   rdtsc
  23:   29 c8                   sub    eax,ecx
  25:   c3                      ret

00000026 <main>:
  26:   8d 4c 24 04             lea    ecx,[esp+0x4]
  2a:   83 e4 f0                and    esp,0xfffffff0
  2d:   ff 71 fc                push   DWORD PTR [ecx-0x4]
  30:   55                      push   ebp
  31:   89 e5                   mov    ebp,esp
  33:   57                      push   edi
  34:   56                      push   esi
  35:   53                      push   ebx
  36:   51                      push   ecx
  37:   83 ec 18                sub    esp,0x18
  3a:   e8 00 00 00 00          call   3f <main+0x19>
                        3b: DISP32      __main
  3f:   bf e0 93 04 00          mov    edi,0x493e0
  44:   e8 c8 ff ff ff          call   11 <timeit()>
  49:   89 04 24                mov    DWORD PTR [esp],eax
  4c:   b9 00 00 00 00          mov    ecx,0x0
                        4d: dir32       std::cout
  51:   e8 00 00 00 00          call   56 <main+0x30>
                        52: DISP32      std::ostream& std::ostream::_M_insert<unsigned long>(unsigned long)
  56:   83 ec 04                sub    esp,0x4
  59:   89 c6                   mov    esi,eax
  5b:   8b 00                   mov    eax,DWORD PTR [eax]
  5d:   8b 40 f4                mov    eax,DWORD PTR [eax-0xc]
  60:   8b 5c 06 7c             mov    ebx,DWORD PTR [esi+eax*1+0x7c]
  64:   85 db                   test   ebx,ebx
  66:   75 05                   jne    6d <main+0x47>
  68:   e8 00 00 00 00          call   6d <main+0x47>
                        69: DISP32      std::__throw_bad_cast()
  6d:   80 7b 1c 00             cmp    BYTE PTR [ebx+0x1c],0x0
  71:   74 06                   je     79 <main+0x53>
  73:   0f b6 43 27             movzx  eax,BYTE PTR [ebx+0x27]
  77:   eb 18                   jmp    91 <main+0x6b>
  79:   89 d9                   mov    ecx,ebx
  7b:   e8 00 00 00 00          call   80 <main+0x5a>
                        7c: DISP32      std::ctype<char>::_M_widen_init() const
  80:   8b 03                   mov    eax,DWORD PTR [ebx]
  82:   c7 04 24 0a 00 00 00    mov    DWORD PTR [esp],0xa
  89:   89 d9                   mov    ecx,ebx
  8b:   ff 50 18                call   DWORD PTR [eax+0x18]
  8e:   83 ec 04                sub    esp,0x4
  91:   0f be c0                movsx  eax,al
  94:   89 04 24                mov    DWORD PTR [esp],eax
  97:   89 f1                   mov    ecx,esi
  99:   e8 00 00 00 00          call   9e <main+0x78>
                        9a: DISP32      std::ostream::put(char)
  9e:   83 ec 04                sub    esp,0x4
  a1:   89 c1                   mov    ecx,eax
  a3:   e8 00 00 00 00          call   a8 <main+0x82>
                        a4: DISP32      std::ostream::flush()
  a8:   83 ef 01                sub    edi,0x1
  ab:   75 97                   jne    44 <main+0x1e>
  ad:   b8 00 00 00 00          mov    eax,0x0
  b2:   8d 65 f0                lea    esp,[ebp-0x10]
  b5:   59                      pop    ecx
  b6:   5b                      pop    ebx
  b7:   5e                      pop    esi
  b8:   5f                      pop    edi
  b9:   5d                      pop    ebp
  ba:   8d 61 fc                lea    esp,[ecx-0x4]
  bd:   c3                      ret

000000be <_GLOBAL__sub_I__Z6timeitv>:
  be:   83 ec 1c                sub    esp,0x1c
  c1:   b9 00 00 00 00          mov    ecx,0x0
                        c2: dir32       .bss
  c6:   e8 00 00 00 00          call   cb <_GLOBAL__sub_I__Z6timeitv+0xd>
                        c7: DISP32      std::ios_base::Init::Init()
  cb:   c7 04 24 00 00 00 00    mov    DWORD PTR [esp],0x0
                        ce: dir32       .text
  d2:   e8 00 00 00 00          call   d7 <_GLOBAL__sub_I__Z6timeitv+0x19>
                        d3: DISP32      atexit
  d7:   83 c4 1c                add    esp,0x1c
  da:   c3                      ret
  db:   90                      nop


Again, I used C++ for this testing as the only critical code (as far as I'm aware...) is the timeit() function. It is possible factors like flushing the output after every run through the loop is confounding the results, though I'd be surprised as I'd suspect that would show up in the math.

I plan to test the timing macros you linked to tonight or tomorrow and see how those fare. I suspect they will blow me out of the water!  :wink

The main focus of my post is assessing the quality of timing results. A little history, I found this forum through stack overflow a few weeks ago and enjoyed reading past posts in this subforum. During that, I came across some epic arguments over measurement quality and test harnesses. As a mathematics major, I became interested in how to establish some form of confidence in an out of order CPU measurement. To be very clear, my interest is purely academic here. I simply believe a better job can be done of qualifying how good a set of measurements are.

I am very interested in answers to the various questions posed in my original post. Setting affinity is a great idea. It'll assure that we are using the same L1 cache when I get around to analyzing that.

P.S. In reply to your second post, I understand, and my goal atm is simply to qualify how good a set of results are with respect to those parameters. We want the timing loop to be as consistent as possible across all the CPUs. How fast or slow the loop is does not matter. My problem atm is figuring out if for example the code above is running at full throttle or not. If it is not, then code inside the loop could potentially cause the test to no longer be consistent.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 01, 2012, 04:14:33 AM
i am not so sure that MFENCE is doing what you want it to
it serializes memory load and store instructions (or more correctly, operations)
from what i can see, RDTSC does not fall into that catagory

you might test that by removing the MFENCE instructions for comparison

instruction serialization has been the uphill battle that Michael spent hours upon hours fighting with - lol
i played with it a little bit - trying to use IRET

i think there are other ways, however, to force in-order execution - without using one of the serialization instructions
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 01, 2012, 04:36:44 AM
It works, removing MFENCE causes a huge speedup, but a loss of consistency as compared to with MFENCE.

(http://i.imgur.com/ItiXn.png)

Standard Deviation -> 4.20628
Mean -> 75.8523
Min -> 60
Percentiles -> 60, 60, 60, 60, 64, 64, 64, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77

So there is definitely an effect.

My main problem right now seems to be ensuring the CPU remains in the same throttling mode throughout the test. Around run 270,000 the CPU increased throttling for a good stretch.  :eek Well that and it is clear that MFENCE is doing something. I'll try CPUID soonish, right now messing with the linked to timing code.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 01, 2012, 04:56:44 AM
if i understand correctly, you are timing the RDTSC instruction
here are the results i get using Michael's macros on my old P4 prescott
96, 96, 96, 96, 96, 96, 100, 98, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96,
96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96,
96, 98, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 95, 96, 96, 96,
96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96,
96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 95, 95, 96, 96,


the program is attached, so you can compare on an i7
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 01, 2012, 05:08:05 AM
as for the throttle thing...
i use a little reg file when i set up my OS
REGEDIT4

[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet002\Control\Session Manager\Throttle]
"PerfEnablePackageIdle"=dword:00000001

prescotts are notorious for "wobbly" timing - lol
this helps a lot
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 01, 2012, 05:11:39 AM
I am actually testing the blank loop, that is nothing in the loop at all.

The problem I see with the code you posted is that it essentially gives only an average.


= 00989680 LOOP_COUNT = 10000000 ;try to choose a value so each run takes about 0.5 seconds


This obscures some of the information, are all of the values clustered around the same average, or are they more like the first plot in the original post?

I did run your program and got this output:  53, 45, 38, 38, 35, 35, 32, 30, 30, 25, 26, 26, 24, 24, 24, 19, 21, 21, 20, 15, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 15, 15, 15, 16, 15, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 16, 16, 15, 15, 15, 16, 15, 15, 16, 15, 15, 15, 15, 15, 15.

Now here I can clearly see that the more we run the program, the more we move into higher throttling. Ideally we would get the CPU into the highest possible throttling before taking any measurements.

I have been working on using the macros to do as I did with the C++ code. Spam out a list of numbers, each number matching a single test run, not an average of 10 million runs.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 01, 2012, 05:23:17 AM
i forget now whether Michael takes the average or the fastest time
but - getting stable results depends on running the test enough times
we generally get good results if each run takes about 0.5 seconds
you can do alright with shorter runs - depending on the code you are timing

at any rate - i can change the LOOP_COUNT to 1 and run it   :bg
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 01, 2012, 05:26:21 AM
prescott w/htt - LOOP_COUNT = 1
75, 30, 75, 82, 75, 75, 82, 82, 83, 75, 83, 82, 83, 82, 75, 75, 75, 82, 83, 82,
83, 75, 82, 82, 53, 82, 83, 83, 75, 82, 82, 82, 82, 75, 75, 75, 98, 75, 75, 75,
75, 83, 75, 83, 83, 75, 75, 82, 75, 82, 83, 82, 83, 75, 75, 83, 75, 75, 82, 75,
75, 75, 75, 83, 83, 83, 82, 83, 75, 83, 82, 83, 83, 83, 83, 83, 75, 82, 75, 82,
75, 82, 75, 82, 83, 83, 82, 82, 75, 82, 82, 75, 75, 82, 82, 82, 83, 75, 75, 75


you will probably get more stable results on the i7
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 01, 2012, 05:32:30 AM
Quote from: dedndave on May 01, 2012, 05:23:17 AM
i forget now whether Michael takes the average or the fastest time
but - getting stable results depends on running the test enough times
we generally get good results if each run takes about 0.5 seconds
you can do alright with shorter runs - depending on the code you are timing
Well yea, but that obscures information on what the sample actually looks like. Looking back at my original post, it is very useful to know if we are dealing with a very spread out sample like plot number 1. Taking the average (mean) of plot number 1 yields ~74 ticks, but that number is near meaningless as there is such a large deviation. Something like plot number 3 is much more desirable as far as statistical significance.

Quote
at any rate - i can change the LOOP_COUNT to 1 and run it   :bg

Go for it, and adjust it to spam something like 1000000 numbers :wink.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 01, 2012, 05:41:29 AM
here you go...
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 01, 2012, 05:41:44 AM
Quote from: dedndave on May 01, 2012, 05:26:21 AM
prescott w/htt - LOOP_COUNT = 1
75, 30, 75, 82, 75, 75, 82, 82, 83, 75, 83, 82, 83, 82, 75, 75, 75, 82, 83, 82,
83, 75, 82, 82, 53, 82, 83, 83, 75, 82, 82, 82, 82, 75, 75, 75, 98, 75, 75, 75,
75, 83, 75, 83, 83, 75, 75, 82, 75, 82, 83, 82, 83, 75, 75, 83, 75, 75, 82, 75,
75, 75, 75, 83, 83, 83, 82, 83, 75, 83, 82, 83, 83, 83, 83, 83, 75, 82, 75, 82,
75, 82, 75, 82, 83, 83, 82, 82, 75, 82, 82, 75, 75, 82, 82, 82, 83, 75, 75, 75


you will probably get more stable results on the i7

I tossed your data into my mathematica program :) The plot is
(http://i.imgur.com/TCDxU.png)

StandardDeviation -> 6.94087
Mean -> 78.81
GeometricMean -> 78.3778
HarmonicMean -> 77.6976
Median -> 82
Mode -> {75}
MeanDeviation -> 4.5404
Min -> 30
Max -> 98
TrimmedMean -> 79.3667
MeanDeviation -> 4.5404
InterquartileRange -> 8


The important bit here is we still get an average, but we also get information on how good of an approximation that average actually is. Note that your standard deviation is better than that of Plot/Graph/Sample number 1 in the original post!
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: hutch-- on May 01, 2012, 05:54:37 AM
The technique I still use is to design a benchmark that comes close to how the algo will be used, then run it for at least a second or two to reduce the OS interference range to well under 1%. Just for example if you want an algo that attacks fast on small data then you make a short data test piece and loop it many times to get a duration of a second or more. If alternatively its an algo that hits very large data then I usually make a test data piece that is very large, > 100 mb up to a gigabyte and see how the times run. Its more work but can be tuned very accurately to the required task the algo will perform.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 01, 2012, 06:01:30 AM
that is the main use for timing code, afterall
some of us also use it a slightly different way
we want to make a quick change in code to see which sequence is better
for example...
        push    edx
        cmp     eax,ecx
        jnz     SomeLabel

        cmp     eax,ecx
        push    edx
        jnz     SomeLabel

it may not be readily obvious that this is quite different from benchmarking an algorithm
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 01, 2012, 06:32:20 AM
Alright, I need to go to bed. I wanted to post statistical info from Mathematica using the output from dedndave's 3rd program. But MMA is taking forever to do the computations and I'm tired. In the morning I'll post those results.

Aside from that, I'd love any guesses to the questions I've posed in the original post. The major one atm is figuring out how to force the CPU into high throttle mode before we take a sample point. Doing so will help remove any deviation from the mean (average).

Finally I'm working on a program that does timing, but keeps a count of how many times we encounter each value. From that I can compute additional information about the reliability of the data.

P.S. dedndave, I feel your program ought to omit values greater than 4 billion from the output. Those tend to result from the counter wrapping and thus are not really sample points.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: MichaelW on May 01, 2012, 06:57:06 AM
I fail to see how a statistical analysis of the run to run variations (AKA noise) serves any useful purpose here.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: jj2007 on May 01, 2012, 07:32:59 AM
Quote from: nixeagle on May 01, 2012, 06:32:20 AMFinally I'm working on a program that does timing, but keeps a count of how many times we encounter each value. From that I can compute additional information about the reliability of the data.

If you take that job seriously, buy a Prescott CPU - it's a real pig :boohoo:

Jokes apart, you have done some very interesting things here. A few remarks:

> If we time the code by simply taking the average of all runs we get 74.0773 ticks. Alternatively we can count only the fastest run which is 42 ticks.
Common sense tells us the low number is the right one - nothing can be faster than the speed of light, and everything above suffers from some other activity of the CPU. However, my practical experience with the cyct_xx macros (attached, they build on MichaelW's stuff) says there are a few abnormally low values - and many high outliers. So what I do there to obtain stable values even on a Prescott P4 is to eliminate 5% or so at the low end and 20% or so at the high end - and there are indeed very few values that deviate from the new mean.

> No jiggling of the mouse at all will result in a test that looks something like:
If jiggling has effect, then your design is flawed. Algo testing should be done in Win32 console apps, so no mouse messages from the own process, and it should be short enough to avoid task switches. Therefore the mouse should never influence the results.

By the way, the results for the Sorting a Real8 array thread (http://www.masm32.com/board/index.php?topic=18765.msg158792#msg158792) were obtained with yet another mechanism: Behind the NanoTimer() macro is QPC. Microsoft claims it is better than rdtsc, I am not sure about that, but results are fairly consistent, and it seems convenient for some slightly longer tasks, such as sorting 5 Mio array members.

Again, welcome to the Forum, what you are doing looks very promising :thumbu
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: hutch-- on May 01, 2012, 10:30:56 AM
Dave,

There was a trick to get reliable timings on a Prescott core, go into the BIOS and turn OFF hyperthreading. This was a mod specific to Win2000 which did not specifically support hyperthreading but it certainly improved timing consistency. XP is supposed to support hyperthreading but it may be worth a quick play to see if the timings improve.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 01, 2012, 02:18:38 PM
i get pretty stable timing as long as:
1) the timing code binds to a single core
2) PerfEnablePackageIdle is enabled (registry)
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 01, 2012, 07:35:50 PM
Oh wow! So many replies. I'll try to reply to them all here, shout out if I left you out!

First off I've finally gotten my mathematica program to handle the test input from dedndave's 3rd program. To say that the timing data is scattered about is an understatement. But this is to be expected as this is only during the initial startup period. Later I'll modify the program to take 100,000,000 data points. Long story short, the plot that follows is a neat illustration of why we need to properly handle "spinup"/"startup".1

(http://i.imgur.com/YrKoq.png)

Standard Deviation->918.774
Mean->255.389
GeometricMean->21.3827
HarmonicMean->8.23219
Median->12
Mode->{9}
MeanDeviation->404.787
Min->1
Max->528044
TrimmedMean->180.868


Taking a look at a few numbers, we can see a few things jump out.

As part of calculating an average value, I feel that confounding factors should be omitted to the best of our ability. For short tests, we can easily remove all context switches and RDTSC timer wrap-around problems. That still leaves us the problem of getting the CPU to spin up to the highest power state.3

To MichaelW: Maybe I have failed to communicate my intent. We don't want to analyze noise, as that would be fairly useless :wink. What we want to do is measure when our tests have excess noise in them. Noisy tests are less meaningful and harder to reliably repeat than clean tests.

To jj2007:
To Hutch--: Hyperthreading is another issue. Since that means each "thread" alternates every clock tick, when would it be safe to assume an algorithm runs twice as fast on the same processor without hyperthreading enabled? I'll have to test this sometime this summer as well! :D

Finally a little food for thought. Instead of printing every test value out, I'm thinking it might be better to keep a tally of how many times we see each value. Such a task can easily be done using a 1Kib array with elements that look something like:

struct {
  unsigned int ticks;
  unsigned int count;
}

From that we can compute the mean, harmonic mean, min, max, standard deviation, mode, and median of the sample. That would allow the timing program to emit not only the averages as is usually done, but information on how good of a measurement it really is. Thoughts? If you guys like the idea, I'll break down and hack up something in assembler :wink.



Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: jj2007 on May 01, 2012, 09:20:45 PM
Quote from: nixeagle on May 01, 2012, 07:35:50 PM
[li]I'm not convinced taking the minimum in all cases is the best option, the minimum can easily be an outlier.

That is what I wrote earlier, actually. Let's take a practical example, timngs for finding a substring at index 94 of a 100-byte string:
Loop 2
3       GetTickCount
384     Instr case-sensitive
372     Instr case-insensitive
426     CRT StrStr

Loop 1
3       GetTickCount
384     Instr case-sensitive
372     Instr case-insensitive
426     CRT StrStr

This is on a Celeron, and it is fairly stable. On a P4 it looks less stable, but still much better than "raw" averages. What we do here is to build a 1k table, as you suggested, and see how often values appear. Then they are cut off on both ends, and the average is taken. There are flags to show the process (see cyctShowSlow in the attached source), here for GetTickCount:
Loop 2

0:-4 1:-4 2:-4 3:-4 4:3 5:3 6:3 7:3 8:3 9:3 10:3

...

999:14
GetTickCount

average with cutoff     3       cutoff at n=992 for >4 cycles
average w/o cutoff      3
difference raw-cutoff:  0
Loop 1

0:-267 1:-7 2:-5 3:-5 4:-4 5:3 6:3 7:3 8:3 9:3 10:3

...

999:85
GetTickCount

average with cutoff     3       cutoff at n=993 for >4 cycles
average w/o cutoff      2
difference raw-cutoff:  -1


As you can see, there are a few hefty outliers, such as negative cycle counts (0:-287 ... 10:xx - and this is a one core processor). But 900+ of 1000 values are exactly at 3 cycles for GetTickCount...

This is crude maths, but it works. If you get greater deviations, it is usually because your code is hanging around for too long, allowing the OS to interfere with it. If mouse interrupts or messages affect timings, then something is wrong in design. In most cases, with this cutoff table you can arrive at exactly 370 cycles, not 369 and not 371. Except on the P4, of course :bg
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 01, 2012, 10:15:53 PM
jj2007, your negative cycle count is due to your RDTSC timer wrapping. Since everyone only works with the lower 32 bits of that 64 bit counter, every now and then you wind up with negative numbers. I was using unsigned values and spoke of numbers greater than 4,000,000,000. Either way we are talking about the same thing. Tests should either

Either way, work should be done to remove as much confounding as possible. The same applies for operating system context switches. There is no real reason to record them other than maybe to note as a "stat" how many context switches occurred during the test.

On more modern processors, a way to ensure that the CPU is fully spun up before initiating testing would be nice to have. Again trying to remove as many environmental factors as possible from the test environment to ensure higher quality data. We can measure the resulting data quality by using some of the methods I described in my original post.

Long story short, if you have a set of samples small standard deviation, you can be very confident that the test is repeatable. Thus I'm really starting to think that having the tests tell you not just the average number of ticks per sample, but also some extra statistical information would be really helpful. Then we can tell how significant the test data really is.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: jj2007 on May 01, 2012, 11:02:50 PM
Quote from: nixeagle on May 01, 2012, 10:15:53 PMOn more modern processors, a way to ensure that the CPU is fully spun up before initiating testing would be nice to have.

Why don't you just use a loop?

        mov ecx, 777777777   ; half a second
@@:
        dec ecx
        jns @B
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 02, 2012, 12:11:27 AM
Quote from: jj2007 on May 01, 2012, 11:02:50 PM
Quote from: nixeagle on May 01, 2012, 10:15:53 PMOn more modern processors, a way to ensure that the CPU is fully spun up before initiating testing would be nice to have.

Why don't you just use a loop?

        mov ecx, 777777777   ; half a second
@@:
        dec ecx
        jns @B


Tried that, managed to get fairly unpredictable results.1 Maybe for core i7 (sandy bridge) you have to do something more than just dec ecx; jns @B. If I recall correctly, dec engages one of p0,p1,p52 and the jump engages p5 only. Maybe in order to cause a spinup I need to insert some more useless operations that are not NOP. :eek

As a side note I've played a bit with recording each time run, tallying the number of times a run takes a given number of ticks. For the program corresponding to Plot 3 in the original post, I have the following "dump". Still need to write the mathy stuff for computing standard deviation and whatnot, but I gotta watch a ballgame with friends first.


154:  299191
165:  514
462:  14
176:  132
187:  55
198:  16
330:  4
220:  4
484:  3
319:  2
308:  2
209:  5
363:  3
253:  2
242:  3
352:  3
297:  2
231:  2




Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 02, 2012, 12:46:57 AM
i was giving some thought to the "mouse jiggle" thing
it seems to me that a specially designed message loop would take care of that
maybe one with a semaphore that yields to the timing code

when no messages are in the queue - block the message loop and make the next timing run
when the timing run has completed - unblock the message loop and wait for a "no messages in queue" state to do the next run
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: MichaelW on May 02, 2012, 12:49:22 AM
Quote from: nixeagle on May 01, 2012, 10:15:53 PM
Since everyone only works with the lower 32 bits of that 64 bit counter, every now and then you wind up with negative numbers.

Not everyone. Why put so much effort and complexity into this without first taking care of the simple stuff? And at least in my code, which uses the full 64-bit count, the most common reason for a negative cycle count is that something caused the reference loop to take more cycles than the test loop.

Also, for a desktop system the only reason I can see for a core that is executing a loop to run at less than full speed, and assuming that the OS is not controlling the speed, is temperature management. And if this is happening under anything resembling normal conditions, then IMO the maximum clock speed rating is just marketing BS, not to be taken seriously.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: Neo on May 02, 2012, 02:23:23 AM
Quote from: dedndave on May 02, 2012, 12:46:57 AM
i was giving some thought to the "mouse jiggle" thing
it seems to me that a specially designed message loop would take care of that
maybe one with a semaphore that yields to the timing code

when no messages are in the queue - block the message loop and make the next timing run
when the timing run has completed - unblock the message loop and wait for a "no messages in queue" state to do the next run
He probably doesn't even have a message loop, so introducing one would probably cause worse problems.  The operating system's interrupt handler kicking in would be enough to mess up the timings not just of the interrupted loop iteration (probably way off the top of the chart), but the next loop as the code cache (etc.) updates.

I made a similar sort of time analysis/plotting tool a while back also designed to identify the statistically relevant samples, albeit primarily for scaling as opposed to constant time operations: http://www.masm32.com/board/index.php?topic=14058.msg111612#msg111612

(wow, it's been forever since I've posted here...)
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 02, 2012, 03:33:15 AM
hey Neo - good to see you
you should come by more often   :bg

well - they had mentioned that timing code "must" be done in a console app because of mouse movement
and i think it is quite possible to do it in a GUI app - if you take the proper precautions

i think it might be best to run things in a GUI app
at the end of the day, that's where the code is likely to be used - closer to the real world
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 02, 2012, 04:49:45 AM
Alright, team we were rooting for lost :(. Anyway I've started attempting to convert my C++ code to MASM so I can post on here :). I've figured out how to print numbers and write functions :bg. I'm used to NASM or GAS syntax so figuring out MASM's stuff has been interesting ;).

Some answers to all the posts!

To Neo: Right, there is no event loop. Aside from that, awesome link and graphs. :clap:

To MichaelW I'm not measuring processor speed directly, that is the frequency of the CPU is not terribly important outside of attempting to get it to be as consistent as possible. I'm measuring ticks against the RDTSC counter which is a monotonic timer in sandy bridge.1 I know the CPU can execute the loop faster if you jiggle the mouse. Tomorrow, after I finish converting my tallying code to MASM style syntax, I'll investigate further and try to isolate a cause.

Also earlier jj2007 mentioned the QPC2. I looked into that and it is to my understanding, a timer that gives microsecond accuracy.3 Interesting, but for the time being I'm going to stick with RDTSC.

Hopefully tomorrow I'll have a first draft of a MASM program that runs test code and computes the various statistics of the test run. At this point I'm just writing the MASM program for fun :bg. Start of summer break and all.



Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: jj2007 on May 02, 2012, 05:25:30 AM
Quote from: dedndave on May 02, 2012, 03:33:15 AM
well - they had mentioned that timing code "must" be done in a console app because of mouse movement
and i think it is quite possible to do it in a GUI app - if you take the proper precautions

they thinks one option might be to put the call right here:

.Repeat
call ZeTimings
invoke GetMessage, addr msg, wmNull, wmNull, wmNull
.Break .if !eax
invoke TranslateAccelerator, hwnd, haccl, addr msg
.if !eax
invoke TranslateMessage, addr msg ; msg not an accelerator
invoke DispatchMessage, addr msg
.endif
.Until 0


At least it can't reach WinProc that way ::)
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 02, 2012, 06:24:20 AM
that's a thought, too   :U

another way might be to create the GUI window in another thread - with message loop
then, you can control the thread priority levels seperately
i think it's a good idea to keep the timing code in the root thread
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: hutch-- on May 02, 2012, 07:24:12 AM
 :bg

Set the process priority high enough and the mouse movement will not matter, nothing else will probably run on that core either if you use the critical option.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 02, 2012, 05:14:02 PM
Quote from: hutch-- on May 02, 2012, 07:24:12 AM
:bg

Set the process priority high enough and the mouse movement will not matter, nothing else will probably run on that core either if you use the critical option.

Why did I not think of process priorities! :red I have not tested yet, but I suspect this is the solution to my mouse jitter problem! :clap: At least it makes logical sense to me, the user likely has a higher priority than the default for applications. Now just to test.

To dedndave: I'm not doing anything related to GUI, the timescales I'm operating at are such that I suspect an event loop will cause even more problems than it solves.1 Right now I'm focusing on identifying and removing as many outliers as I can to ensure that each run through the test function takes the same amount of time. This means that the standard deviation of the whole sample should be very small. To summerize, in theory2, if all factors are accounted for, the standard deviation of test samples should be 0.

Thanks everyone for listening and offering suggestions. Every time I check back, I find a pile of things to reply to! Which is awesome! :bg Off to poke at MASM assembly some more! :bdg




Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 02, 2012, 05:19:03 PM
no matter what type of program you are running, there are hundreds of message loops running at the same time
every button, menu, control, or window has one

the fact that they are in your process is meaningless if you use threads and process/thread priority levels, as well as core affinity
for very brief test runs, you can essentially take over the CPU
if you look at Michael's macros (as suggested), you will see that he allows you to control process priority

the Sleep function also has more uses than meet the eye   :U
if you alter the process priority class, thread priority level, or process/thread core affinity...
a Sleep,0 call will expire the current time slice to ensure that the change(s) will take place in the next
of course, it can also be used to let the CPU up for some air - so the system gets some core time
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 02, 2012, 05:39:59 PM
Quote from: dedndave on May 02, 2012, 05:19:03 PM
no matter what type of program you are running, there are hundreds of message loops running at the same time
every button, menu, control, or window has one

the fact that they are in your process is meaningless if you use threads and process/thread priority levels, as well as core affinity
for very brief test runs, you can essentially take over the CPU
if you look at Michael's macros (as suggested), you will see that he allows you to control process priority
Yea, I've looked at those, but missed out on the priority thing. Makes sense now that it was pointed out to me! :red.

Quote from: dedndave on May 02, 2012, 05:19:03 PM
the Sleep function also has more uses than meet the eye   :U
if you alter the process priority class, thread priority level, or process/thread core affinity...
a Sleep,0 call will expire the current time slice to ensure that the change(s) will take place in the next
of course, it can also be used to let the CPU up for some air - so the system gets some core time
Interesting. Thank you! :U
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 02, 2012, 07:01:21 PM
here is some reading you'll enjoy....

About Processes and Threads
http://msdn.microsoft.com/en-us/library/windows/desktop/ms681917%28v=vs.85%29.aspx

Using Processes and Threads
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686937%28v=vs.85%29.aspx

Process and Thread Functions
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684847%28v=vs.85%29.aspx

these are the related functions that i seem to use most
it is important to note that thread priority level is, let's call it, "a modification of the process priority class"
i.e., it is added to or subtracted from the process priority class to obtain the actual thread priority

GetProcessAffinityMask
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683213(v=vs.85).aspx

SetProcessAffinityMask
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686223(v=vs.85).aspx

SetThreadAffinityMask
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686247%28v=vs.85%29.aspx

i forget what function is used to get thread affinity   :P
it is initially the same as the process that created it

GetPriorityClass
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683211(v=vs.85).aspx

GetThreadPriority
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683235(v=vs.85).aspx

SetPriorityClass
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686219(v=vs.85).aspx

SetThreadPriority
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277(v=vs.85).aspx

Sleep
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686298%28v=vs.85%29.aspx
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 02, 2012, 11:00:59 PM
Quote from: dedndave on May 02, 2012, 07:01:21 PM
here is some reading you'll enjoy....

About Processes and Threads
http://msdn.microsoft.com/en-us/library/windows/desktop/ms681917%28v=vs.85%29.aspx

Using Processes and Threads
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686937%28v=vs.85%29.aspx

Process and Thread Functions
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684847%28v=vs.85%29.aspx

these are the related functions that i seem to use most
it is important to note that thread priority level is, let's call it, "a modification of the process priority class"
i.e., it is added to or subtracted from the process priority class to obtain the actual thread priority

GetProcessAffinityMask
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683213(v=vs.85).aspx

SetProcessAffinityMask
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686223(v=vs.85).aspx

SetThreadAffinityMask
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686247%28v=vs.85%29.aspx

i forget what function is used to get thread affinity   :P
it is initially the same as the process that created it

GetPriorityClass
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683211(v=vs.85).aspx

GetThreadPriority
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683235(v=vs.85).aspx

SetPriorityClass
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686219(v=vs.85).aspx

SetThreadPriority
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277(v=vs.85).aspx

Sleep
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686298%28v=vs.85%29.aspx

Thanks.

A quick update, I managed to use up 6GB and entered into swap death trying to plot and compute stats on 10 million runs :boohoo:. So at this point I need to just code this stuff into the program and possibly find another program to produce plots :lol.

Also attached is my current source, just for the heck of it :P. Granted I'm nowhere near done with this and things like ignoring context switches, setting affinity are not done yet and not printing each run's number out to standard out. Please excuse my bad MASM :red.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: jj2007 on May 02, 2012, 11:21:18 PM
You have an innovative approach to allocation here :bg

      g_tallyBucketSize equ 100000
align 16
if 01
dd 2*g_tallyBucketSize dup(0) ; 2800 ms
else
REPEAT 2*g_tallyBucketSize ; 7000 ms
dd 0
ENDM
endif


Normally we use dup(0), or dup(?) in the .data? section. Masm (even recent versions) has a bug: Assembly slows down dramatically if you come near the 100000 dup(?) range. Your macro emulates dup, but unfortunately it's not faster than dup - 7 seconds instead of 2.8. For comparison: The 100000 dup(0) takes only half a second in JWasm...
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 02, 2012, 11:50:19 PM
Quote from: jj2007 on May 02, 2012, 11:21:18 PM
You have an innovative approach to allocation here :bg

      g_tallyBucketSize equ 100000
align 16
if 01
dd 2*g_tallyBucketSize dup(0) ; 2800 ms
else
REPEAT 2*g_tallyBucketSize ; 7000 ms
dd 0
ENDM
endif


Normally we use dup(0), or dup(?) in the .data? section. Masm (even recent versions) has a bug: Assembly slows down dramatically if you come near the 100000 dup(?) range. Your macro emulates dup, but unfortunately it's not faster than dup - 7 seconds instead of 2.8. For comparison: The 100000 dup(0) takes only half a second in JWasm...

:lol, now I know! I was borrowing from what I usually do in NASM ::). I'll modify the program to do it the correct way after this post. I'm actually starting to get some decent stabalization! Not perfect by any means, but the updated program's output can be used to produce this graph.

(http://i.imgur.com/AvcXr.png)

Standard Deviation->95.3503
Mean->170.673
GeometricMean->153.233
HarmonicMean->140.363
Median->107
Mode->{107}
MeanDeviation->72.2174
Min->94,Max->27250
Length->1002001
TrimmedMean->162.845
MeanDeviation->72.2174
MedianDeviation->4.
QuartileDeviation->61.5
InterquartileRange->123


Once I figure out how to meaningfully exclude the initial jitter in these tests I'll move on to having the program itself compute the various statistics of the run. This run is not the greatest due to the wildly varying data points at the start of the run.

[small]Edit[/small]


Whoops! I forgot to attach the program for the last set of data up there. I don't have the exact code for that anymore, sorry. What I do have is some updated data and code :bg.

(http://i.imgur.com/2EoBB.png)

Standard Deviation->89.3507
Mean->106.188
GeometricMean->105.984
HarmonicMean->105.947
Median->107
Mode->{107}
MeanDeviation->1.68561
Min->91
Max->88343
Length->1002001
TrimmedMean->105.909
MeanDeviation->1.68561
MedianDeviation->0.
QuartileDeviation->1.5
InterquartileRange->3.


Unfortunately the standard deviation is telling us that we still have outliers in our dataset not accounted for. Look at our max and we can see that these are context switches.

I'm still not sure what the best way to remove those in a general way is yet. By this I mean I'd like to avoid hard-coding in a number that gets removed. Better would be to infer a number from the test data so that we automatically adjust. A crude way might be to remove anything outside of Mean +/- StandardDeviation. However, I'm not sure how statistically sound that is. My friend is over for the ballgame again, so I'll be heading out for now. :)
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: MichaelW on May 03, 2012, 02:54:06 AM
I think you should be able to eliminate most or all of the initial "jitter" by delaying for 3 to 5 seconds after the app loads before you start testing, to allow the system activities involved in launching an application to subside.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 03, 2012, 03:09:07 AM
Whoo the team we were rooting for won! Plus, I think I've thought of a nearly perfect way to ensure a stable baseline. Some quick coding and the results bear themselves out. :8)

Let us start off with the "traditional" plot:

(http://i.imgur.com/d8WJu.png)

The cool thing here is we can identify all of the effects. That uptick at the start is the core warming up. I think I have a way to correctly discount that effect, just require that timings be stable for 100 sample runs before starting to print numbers to standard output. The stats bear out the improvement:

My method of achieving this boils down to collecting "batches" of sample results. So we run through the code under test n (I chose n=12) times and store the results in a buffer. The only way we allow the results to "count" is if all the results in the buffer are the same. If they are not we discard all of the data for the run and retry. This allows us to cleanly discard cpu throttling and context switches.

Attached is the code, please comment! :dance: After this I'm going to get the program to compute the pretty statistics instead of having mathematica do that. That way just running the program gives you all the info you need to judge confidence in the results.

P.S. this attachment has a working program. Feel free to run it and tell me your results. If it outputs the same number repeatedly, all is good. If it does not I'd really like to know. :bg

P.P.S, MichealW you posted while I was typing this up, we both came to the same conclusion :U.

P.P.P.S dedndave: I'm really curious to see if your "quirky" CPU even prints numbers at all. :bdg The only way for a number to print is for the CPU to take the exact same amount of time 12 times in a row.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 03, 2012, 05:11:46 AM
no output - and i can hear the gears grinding   :lol

you may have a good basic idea, though
take 12 measurements and allow 2 or 3 of them to be tossed out
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 03, 2012, 05:15:47 AM
Quote from: dedndave on May 03, 2012, 05:11:46 AM
no output - and i can hear the gears grinding   :lol

you may have a good basic idea, though
take 12 measurements and allow 2 or 3 of them to be tossed out

Thanks for trying, that was what I was afraid of. If you stick around for another 15 minutes or so... I think I have a fix so this will work on all CPUs. Basically if after 1000 iterations no good results occur, step n down to 11 and repeat until results start happening.  :bg

edit: Make that 30 minutes, silly x86 and its lack of registers :(.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 03, 2012, 05:27:30 AM
ok - i'll wait
no hurry - but i've got a boring movie and a warm wifee - lol
perfect recipe for sleepage   :bg
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 03, 2012, 05:51:35 AM
Quote from: dedndave on May 03, 2012, 05:27:30 AM
ok - i'll wait
no hurry - but i've got a boring movie and a warm wifee - lol
perfect recipe for sleepage   :bg

Alrighty, I think this is bug free ::).

It is setup to ensure that there is enough time for the cpu to spin up. So you know something is happening it'll print out when it adjusts the number of ticks required to output a number.

It'll look something like this:

$ cat 4.dat
Now requiring only 10 entries.
Now requiring only 9 entries.
91
91
91
91
91
91
91
91
...


Thanks for trying :bg

Edit: Oh darned off by one mistake :red, when it says "Now requiring only 10 entries.", it really means to say "Now requiring only 11 entries." Nothing wrong with the actual code at least. :wink
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 03, 2012, 06:07:02 AM
it still spins the hell out of my hard drive - i don't like that
i modified it to CSV instead of CR/LF - and i terminated it early
Now requiring only 10 entries.
Now requiring only 9 entries.
Now requiring only 8 entries.
Now requiring only 7 entries.
Now requiring only 6 entries.
Now requiring only 5 entries.
Now requiring only 4 entries.
473, 472, 473, 472, 472, 473, 472, 473, 473, 473, 472, 473, 473, 532, 472, 473,
472, 472, 472, 473, 472, 472, 472, 472, 473, 472, 473, 473, 473, 472, 472, 472,
473, 472, 472, 472, 472, 472, 473, 473, 472, 473, 473, 472, 472, 473, 473, 472,
472, 472, 473, 472, 472, 472, 473, 473, 472, 473, 472, 473, 472, 517, 473, 473,
472, 473, 548, 472, 473, 472, 473, 473, 472, 473, 473, 473, 472, 473, 472, 472,
473, 473, 472, 473, 472, 473, 540, 472, 473, 473, 472, 472, 472, 472, 473, 473,
473, 473, 472, 472, 525, 472, 473, 472, 510, 472, 473, 472, 472, 472, 472, 472,
473, 472, 473, 472, 472, 472, 473, 473, 472, 472, 472, 473, 472, 473, 510, 472,
472, 473, 540, 473, 473, 473, 472, 472, 472, 472, 472, 473, 473, 473, 473, 473,
473, 472, 472, 473, 473, 473, 472, 472, 472, 472, 472, 473, 473, 473, 472, 472,
472, 472, 473, 472, 473, 473, 472, 473, 473, 472, 473, 473, 472, 473, 473, 473,
473, 472, 473, 472, 473, 472, 473, 472, 473, 473, 472, 525, 473, 472, 473, 473,
473, 472, 472, 472, 472, 473, 473, 473, 510, 473, 472, 473, 473, 472, 473, 473,
472, 472, 472, 473, 473, 472, 472, 495, 472, 472, 472, 472, 472, 473, 472, 510,
473, 473, 495, 495, 495, 473, 472, 473, 472, 495, 473, 472, 473, 472, 472, 473,
473, 472, 540, 510, 473, 473
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 03, 2012, 06:08:21 AM
Quote from: dedndave on May 03, 2012, 06:07:02 AM
it still spins the hell out of my hard drive - i don't like that
i modified it to CSV instead of CR/LF - and i terminated it early
Now requiring only 10 entries.
Now requiring only 9 entries.
Now requiring only 8 entries.
Now requiring only 7 entries.
Now requiring only 6 entries.
Now requiring only 5 entries.
Now requiring only 4 entries.
473, 472, 473, 472, 472, 473, 472, 473, 473, 473, 472, 473, 473, 532, 472, 473,
472, 472, 472, 473, 472, 472, 472, 472, 473, 472, 473, 473, 473, 472, 472, 472,
473, 472, 472, 472, 472, 472, 473, 473, 472, 473, 473, 472, 472, 473, 473, 472,
472, 472, 473, 472, 472, 472, 473, 473, 472, 473, 472, 473, 472, 517, 473, 473,
472, 473, 548, 472, 473, 472, 473, 473, 472, 473, 473, 473, 472, 473, 472, 472,
473, 473, 472, 473, 472, 473, 540, 472, 473, 473, 472, 472, 472, 472, 473, 473,
473, 473, 472, 472, 525, 472, 473, 472, 510, 472, 473, 472, 472, 472, 472, 472,
473, 472, 473, 472, 472, 472, 473, 473, 472, 472, 472, 473, 472, 473, 510, 472,
472, 473, 540, 473, 473, 473, 472, 472, 472, 472, 472, 473, 473, 473, 473, 473,
473, 472, 472, 473, 473, 473, 472, 472, 472, 472, 472, 473, 473, 473, 472, 472,
472, 472, 473, 472, 473, 473, 472, 473, 473, 472, 473, 473, 472, 473, 473, 473,
473, 472, 473, 472, 473, 472, 473, 472, 473, 473, 472, 525, 473, 472, 473, 473,
473, 472, 472, 472, 472, 473, 473, 473, 510, 473, 472, 473, 473, 472, 473, 473,
472, 472, 472, 473, 473, 472, 472, 495, 472, 472, 472, 472, 472, 473, 472, 510,
473, 473, 495, 495, 495, 473, 472, 473, 472, 495, 473, 472, 473, 472, 472, 473,
473, 472, 540, 510, 473, 473


Hmm, why would it spin your harddrive? I don't even touch that!

Edit: After getting over that shock :P... Your outputs look much more consistant than what I remember you showing earlier on in this discussion. Give me a few minutes and I'll have MMA produce some pretty graphs!

Second Edit: Alright, here is the plot and stat output:
(http://i.imgur.com/8MDnO.png)

Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 03, 2012, 11:04:04 AM
well - i think it's the hard drive - maybe it's the CPU fan - lol
either way - it's worrysome
when i have a program that does that, i use Sleep to let the OS have some core time
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 03, 2012, 05:34:34 PM
Quote from: dedndave on May 03, 2012, 11:04:04 AM
well - i think it's the hard drive - maybe it's the CPU fan - lol
either way - it's worrysome
when i have a program that does that, i use Sleep to let the OS have some core time

Yea, sounds like the CPU fan to me. I mean the program is so small that there is no chance of swapping going on and the program does not write or read from the disk during runtime.

Given a night to sleep on the results, I'm starting to suspect that different cpu generations might be best served by their own set of timing code selected at runtime based on CPUID. It really looks to me like your CPU is safe to simply take the minimum for algorithm comparisons. That puts your baseline at 472 ticks, and I suspect no matter how many times you run the program you will always get 472 ticks. Any additional work (in the timeit function) will result in a higher minimum. Most importantly, your results should be consistent over program runs.

In summary: I believe for your processor generation, you should set gu_steps to 4 when running tests using this timing program. You can test my theory out if you like by

Both sets would be really useful to me.

Now I suspect the reason I have been having so much trouble with this is due to the sandy bridge architecture. The i7 is able to "overclock" itself automatically for very short durations of time. Thus taking the absolute minimum in my case is not really the best idea as it is more often then not an outlier. So I have to resort to other methods to analyze results.

Later today I'll update with a new program that computes the statistics of a test run along with some other modifications. :dance:



Title: Updated program.
Post by: nixeagle on May 04, 2012, 12:46:46 AM
For my core i7, this gives extremely stable values. There are still some "interesting" quirks, but we are getting there! The baseline case gives 107 ticks every time you run the program!

If you look at the source code, I've commented some options that you can try tweaking and checking the results. These are mostly directed for dedndave :bdg.

Also note that the output is now in csv format and it all comes at once at the end of the program. So for about 30 seconds to a minute the program will look like it is not doing anything! By default the program will output 100 values. Additionally please note that my program does not "pause" and prompt for keyboard input after running, just start your own shell and run it from there. I build and test my code from inside Emacs and have a single keyboard shortcut I use for invoking it.

Now that the output is stable, I played a bit with the test function and found that if you put a multiplication on a core i7 processor, it'll take much longer to reach stabilization. I'm still mulling ideas how to automatically detect when the CPU is stable and results are worth taking.

Please take some time to try this out and tell me your findings! I'm curious how other processors work. If someone has AMD, I'd love to see this run on that. Does it stabilize? Does it have any other issues? Once I have a stable baseline, I plan to step up to harder problems as I find this whole measuring thing very interesting as far as the mathematics of it goes! :dance:

P.S. Statistics output will be completed after the ballgame. :bg
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: MichaelW on May 04, 2012, 03:11:00 AM
Running on my P3 Windows 2000 system, with testit stripped down to just a RET, I get a uniform 117 cycles. If in the do_sample_run procedure I comment out the "jne retry" then I get an occasional higher value, usually 120, but through perhaps 20 trials I saw values as high as 155.

Running on my P4 Windows XP SP3 system I get a uniform 380 cycles, and with no retries, except for the first few runs where I did see some higher values, and once where the results looked like they came from an RNG, I get a uniform 380 cycles.

How do you intend to remove the overhead from the results?
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 04, 2012, 03:22:56 AM
Quote from: MichaelW on May 04, 2012, 03:11:00 AM
With the retries how do you propose to remove the overhead from the results?
You don't. This is meant to serve as a baseline. Both the baseline and the algorithm under test will have the same overhead :wink.

Awesome test results :U.
Thanks for testing! :bg
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: MichaelW on May 04, 2012, 03:47:34 AM
When I disabled the retries my results were somewhat non-uniform. Cycle counts on my P3 tend to be very repeatable, but what surprised me here was how repeatable the P4 counts were, without retries. Given my experiences with other P4s I was expecting a lot of scatter.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 04, 2012, 04:03:43 AM
Does your P3 or P4 support XMM registers? Or do I need to use the floating point stack to be compatible with those? I ask because I am in the middle of implementing the statistics calculation stuff.

And just thought of this, I'll need to either implement my own sort in assembler ::) or find one I can use.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: jj2007 on May 04, 2012, 05:51:17 AM
Quote from: nixeagle on May 04, 2012, 04:03:43 AM
And just thought of this, I'll need to either implement my own sort in assembler ::) or find one I can use.

There is a lot of choice, see e.g. Sorting a Real8 array (http://www.masm32.com/board/index.php?topic=18765.0), or check \masm32\help\masmlib.chm for "Sorting Functions".
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 04, 2012, 06:06:06 PM
Quote from: jj2007 on May 04, 2012, 05:51:17 AM
Quote from: nixeagle on May 04, 2012, 04:03:43 AM
And just thought of this, I'll need to either implement my own sort in assembler ::) or find one I can use.

There is a lot of choice, see e.g. Sorting a Real8 array (http://www.masm32.com/board/index.php?topic=18765.0), or check \masm32\help\masmlib.chm for "Sorting Functions".
Thanks! :U

In related news, follows is the output of the i7 on battery power. Much higher than when plugged into the wall, but stable!

264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 272, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264,
Min:   264
Max:   272
Mean:  264
Range: 8
Title: Second update, this time with Sleep for dedndave.
Post by: nixeagle on May 04, 2012, 07:20:51 PM
For dedndave's CPU fan thingie, I've worked in a invoke Sleep,1000 call that gets run before every sample. The downside is sampling takes much longer, the upside is... maybe your CPU fan won't go nuts. ::)

The problem I see with doing Sleep for a long period of time is you essentially allow all the caches and such to get polluted with other stuff. Thus when the program gets control back, it has to spend some time doing excess retries to get everything back to a repeatable state. You can only get deterministic behavior if you are able to discard all the confounding elements.

Essentially I'm treating this like a scienctific experiment. I wish to identify and control all possible external sources not related to what is under test. What I cannot eliminate, I ensure will be present during both the baseline run and any algorithm tests afterwards.

Also adjusted are some alignment issues, the critical loops are now aligned on 16 byte boundaries.

The attached version will take 25 stable samples and exit emitting a min, max, mean, and range. It discards only one sample during startup. If you want to adjust this behavior, edit the options prefixed with gu_. The important ones are (IMHO) well commented. Again, emphasize that this version takes a while, as in 2 to 5 minutes for me.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 04, 2012, 08:16:52 PM
that's 1 second   :eek
probably no need to go that far

if you can get it down to where the sleep period is at least as long as the test period, that works pretty well
if you let it run free, it consumes 50 % of CPU time
get it below ~25 %, and you'll be ok for short periods
if you want to get thousands of samples, which takes a while, try to get below ~17 %
that would be sleep period = 2 x test period

you can use the results of the last pass to determine the Sleep parameter   :U
(provided you know the CPU frequency)

QueryPerformanceFrequency should work on most systems
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 04, 2012, 08:24:08 PM
Quote from: dedndave on May 04, 2012, 08:16:52 PM
that's 1 second   :eek
probably no need to go that far

if you can get it down to where the sleep period is at least as long as the test period, that works pretty well
if you let it run free, it consumes 50 % of CPU time
get it below ~25 %, and you'll be ok for short periods
if you want to get thousands of samples, which takes a while, try to get below ~17 %
that would be sleep period = 2 x test period

you can use the results of the last pass to determine the Sleep parameter   :U
(provided you know the CPU frequency)

QueryPerformanceFrequency should work on most systems

Mind playing with the sleep parameter on your CPU? I'm really of mind that it can be done without for the reasons explained in my last post. But if you can figure out some reasonable values that give stable results... I'd be thrilled. Try maybe 15 instead of 1000? The line to change is 159. Remember, not always is the test period knowable ahead of time.

I won't be back on for 5 or 6 hours, see you all later! :bg
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 04, 2012, 08:26:20 PM
ok - i'll play with it a little in a couple hours
have some things to finish up

btw - you can leave your priority class set high if you use Sleep   :P
you only hog as much as you decide you want to - lol
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 05, 2012, 03:58:45 AM
Quote from: dedndave on May 04, 2012, 08:26:20 PM
ok - i'll play with it a little in a couple hours
have some things to finish up

btw - you can leave your priority class set high if you use Sleep   :P
you only hog as much as you decide you want to - lol

Awesome! Btw what version of windows are you using? Windows 7 won't allow you to set a program to realtime, so it just sets at high priority. I ought to lookup the define for that ::).

I'm tired and going to bed, but a quick upload. Updated program does not sleep at all as I got tired of waiting on it ::). But it is just commented out. Honestly though, give it a try without sleep at all. It takes way less time then the one you tried last time. It only takes 25ish samples instead of trying to take 1000. Remember I had it taking lots of samples in order to convince myself of its stability :wink. Edit: dedndave, if you do try it without sleep, don't forget to turn gu_default_sample_batch_size down from 16 to something like 4. That way it won't have to downsample and burn CPU time doing so.

New update also computes variance. This is computed by looping over all the test results, subtracting the mean from each one, then squaring each and summing them all together. This sum is then divided by the total number of elements in the result set (25 for us). It is used to sum up in one number how scattered about the results are. For example, we have the results:
{50,50,50,55,50,45,53,50,50,42}
Our mean is 49.5, but I'm not using floating point numbers right now. So round down to 49. Now go through the list and subtract 49 from each number:
{1,1,1,6,1,-4,4,1,1,-7}.
Thus we now have a list of differences from the mean value. Square them:
{1,1,1,36,1,16,16,1,1,49},
and sum the list giving 123. Finally divide this by the total number of elements in the list, in our case 10 giving 12.3. Again I round down as I have not implemented this using floats yet. Thus the program will emit 12. My core i7 using this program regularly gets a variance of 0.

Once I get this done using floating point numbers... I'll compute the standard deviation... which is just the square root of the variance. :bdg

Please keep on testing this! :dance:

P.S. I'm all ears for any tips from folks on formatting/improving the actual code. Don't be afraid to suggest improvements to my style, etc! :bg
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 05, 2012, 04:36:26 AM
sorry - i never got back to this
my other thing is taking longer than expected   :eek

prescott w/htt - XP media center edition 2005 SP3
525, 525, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495,
495, 495, 495, 495, 495, 495, 495, 495, 495,
Min:      495 ticks
Max:      525 ticks
Range:    30 ticks
Mean:     497 ticks
Variance: 66 ticks^2


you can probably get to realtime if you play with the permissions
generally, there is no need to go beyond HIGH_PRIORITY_CLASS
and, if you do, you better have all your ducks in a row
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 05, 2012, 04:47:39 AM
this ensures that you are starting with a fresh time-slice each time
it should yield more consistent results (may vary from one platform to another)
; warmup
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    ;rdtsc
    ; now go for real!
    INVOKE  Sleep,0                    ;stick this in there
    xor eax, eax
    cpuid
    rdtsc
   
    push eax
    push edx
   
    call [gu_testfunction]

    xor eax, eax
    cpuid
    rdtsc
    push eax
    push edx
    call print_runtime

    ret
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 05, 2012, 04:49:16 PM
Quote from: dedndave on May 05, 2012, 04:47:39 AM
this ensures that you are starting with a fresh time-slice each time
it should yield more consistent results (may vary from one platform to another)
; warmup
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    ;rdtsc
    ; now go for real!
    INVOKE  Sleep,0                    ;stick this in there
    xor eax, eax
    cpuid
    rdtsc
   
    push eax
    push edx
   
    call [gu_testfunction]

    xor eax, eax
    cpuid
    rdtsc
    push eax
    push edx
    call print_runtime

    ret


No offense, but that is likely the worst place to stick it in! :eek The whole purpose of that string of CPUID immediately before calling gu_testfunction is to warmup the CPUID instruction so it takes a constant amount of time for each invocation.

From what I understand a Sleep 0 call causes a context switch back to the operating system, which then checks the scheduler to see if there are any other tasks requiring CPU time before handing back off to you. There are multiple problems with this from a stability standpoint:

If you really want a sleep call, I believe the place I have it commented out, around line 159 or thereabouts is the best location possible. Since Sleep potentially gives you a new timeslice, you want the chance to cause the CPU to spin up through the spinup call. Following that we execute up to gu_sample_retry_attempts * g_steps attempts to get a stable reading. This is the portion of the program where we want to have full control of what is in the hot CPU cache, decoders, ROB, and soforth.

Also dedndave, your baseline measurements are awesome! Seriously compare that to what you were showing when we started this thread up. A variance of 66 ticks^2 means you have a standard deviation of about 8 ticks with very few outliers. I think I can do a few more improvements to drop that closer to 0 and decrease the total amount of time required to get a stable result, but even if I can't, this is a very good result!

Next, assuming nobody finds any huge variances with this program, will be to start setting up the interface to the test driver. That is coming up with a protocol where the user writes one baseline function (see testit_baseline for an example) and up to N functions to test against that baseline. Additionally I'll be adding a few more statistical measurements to assist with assessing the quality of the output. Finally, when all is said and done, the finished program output will emit only the statistics of the run, one line for each function under test.

To folks posting test results, please continue to do so! I'm also interested to know how long the program takes to produce a set of results.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 05, 2012, 04:59:55 PM
ok - so stick the Sleep,0 in before the warmup   :bg
timeslices are very short - although, i can't nail that down with any documentation
i am thinking on the order of 500 cycles or less

i think the intel document suggested only 3 CPUID's for a warmup
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 05, 2012, 05:16:40 PM
ok - timeslices appear to be a bit longer than i thought
i measure about 4300 cycles for prescott w/htt, XP MCE2005 SP3
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 05, 2012, 06:20:31 PM
Quote from: dedndave on May 05, 2012, 04:59:55 PM
ok - so stick the Sleep,0 in before the warmup   :bg
timeslices are very short - although, i can't nail that down with any documentation
i am thinking on the order of 500 cycles or less

i think the intel document suggested only 3 CPUID's for a warmup
Intel suggests 3, but more gives better variance. Feel free to tweak though.

Quote from: dedndave on May 05, 2012, 05:16:40 PM
ok - timeslices appear to be a bit longer than i thought
i measure about 4300 cycles for prescott w/htt, XP MCE2005 SP3
How are you measuring? I suspect you get more than that as 4300 cycles on a 3.0 GHz processor amounts to only 1433 ns or 1.43 microseconds. I suspect the operating system gives you more time than that as the cost of a context switch tends to be around 500 cycles or so (iirc). I suspect you get about 10 microseconds or more in reality. Maybe I'm wrong. :eek
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 05, 2012, 08:02:03 PM
context switching happens very fast
i measured it by timing Sleep,0
i figure the average number of cycles Sleep,0 consumes is roughly half a time slice
it probably consumes a little more for the overhead
i made the measurement with a HIGH_PRIORITY_CLASS setting
i suppose i could go to REALTIME_PRIORITY_CLASS to verify it   :P
that would only reduce the result, though - not increase it
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 05, 2012, 08:15:21 PM
Quote from: dedndave on May 05, 2012, 08:02:03 PM
context switching happens very fast
i measured it by timing Sleep,0
i figure the average number of cycles Sleep,0 consumes is roughly half a time slice
it probably consumes a little more for the overhead
i made the measurement with a HIGH_PRIORITY_CLASS setting
i suppose i could go to REALTIME_PRIORITY_CLASS to verify it   :P
that would only reduce the result, though - not increase it

:eek. I'm afraid that if a timeslice is half the time of a sleep, the operating system is consuming close to 33% of the CPU resources. That simply can't be right. Think about it, if Sleep 0 takes 2000 cycles (little less than half of 4300 cycles) the CPU would constantly be running at 33% utilization. All just for the operating system to context switch, check scheduler, etc.

To get a proper empirical measurement, you would need to loop over RDTSC repeatedly, storing the results (both EDX and EAX in a buffer) over a period of 2 to 5 seconds. Following that go through the collected data and look for large "jumps" in the recorded time on the clock. Those will represent actual context switches.

If you like, I can implement a program that does this. :bg

Edit: Actually a run for 0.5 to 1 second would be enough. I suspect the operating system allows programs at least 10 microseconds worth of runtime. That yields an optimistic 100,000 possible timeslices per second assuming 0 context switch overhead. If you factor in a overhead of 2 microseconds between each timeslice you have 83,333 possible timeslices per second. Even at this though, note that the overhead imposed by the operating system is 20%.

Edit 2: If you assume context switches are faster, say 1 microsecond per, then you have 90,000 possible timeslices (assuming nobody aborts early) with the best case operating system overhead of 10%. You don't see your CPU running at a constant 10% do you? This is why I suspect 10 microseconds is the lowerbound of possible timeslice size.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 05, 2012, 08:18:29 PM
remember - Sleep,0 relinquishes the remainder of the current timeslice
assuming that your code is not syncronized with context switching....
on average, you are throwing out half a time slice   :U

(http://l.yimg.com/us.yimg.com/i/mesg/emoticons7/26.gif)
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 05, 2012, 08:25:52 PM
Quote from: dedndave on May 05, 2012, 08:18:29 PM
remember - Sleep,0 relinquishes the remainder of the current timeslice
assuming that your code is not syncronized with context switching....
on average, you are throwing out half a time slice   :U

(http://l.yimg.com/us.yimg.com/i/mesg/emoticons7/26.gif)

Hmm, my problem is the math is not working out. See my edit in the prior post. Could you toss up the program you used to do the measurements? I'd like to have a looksee.  :bg
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 05, 2012, 08:30:49 PM
here you go
i'd be interested to see how other platforms handle it - OS/CPU
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 05, 2012, 08:35:43 PM
Quote from: nixeagle on May 05, 2012, 08:25:52 PMHmm, my problem is the math is not working out.

as Dr Emmit Brown would say, "You're not thinking four-dimensionally !"

timeslices could easily be shorter than my measurement
but, by nature of the Sleep function, they can't be much longer
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 05, 2012, 08:39:59 PM
Output for me on windows 7 with intel core i7 (sandy bridge).

1681, 1375, 1295, 1111, 1094, 1030, 996, 949, 866, 877, 825, 752, 749, 778, 671,
679,
Press any key to continue ...


What you are measuring is how long it takes to execute the sleep call.

       counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS

;---------

;code to be timed goes here

       INVOKE  Sleep,0

;---------

       counter_end

Remember, coming out of Sleep 0, you are handed a new timeslice. So I suspect you are tossing away timeslices without even going near half of the way through it.

What we want to measure is how long you have before the operating system takes control from you. The best way to measure that is to do something similar to what I described a few posts ago :bg.

Am I making sense? Basically I think you are measuring the wrong thing! :wink
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 05, 2012, 08:43:58 PM
i don't think so
we need thoughts and ideas from other members   :bg

it could be that it is somewhat syncronous, as Michael uses the Sleep function
i don't recall exactly how - but i think it's only at the beginning of the loop
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 05, 2012, 08:50:08 PM
Quote from: dedndave on May 05, 2012, 08:43:58 PM
i don't think so
we need thoughts and ideas from other members   :bg

it could be that it is somewhat syncronous, as Michael uses the Sleep function
i don't recall exactly how - but i think it's only at the beginning of the loop

:lol, ok after I eat I'll do up a test program that shoves numbers in a buffer and after running spams them out. Just pipe the result to a file and attach in a zip file and I'll plot them in Mathematica. The context switches will be obvious as they will be the portions where the linear plot jumps from say 200000000 to 200005000 instead of 200000000 to 200000100. It'll look like a "step" on an upward sloping line. :bdg
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: MichaelW on May 05, 2012, 09:23:12 PM
The Sleep 0 is use to synchronize the test code with the start of a time slice, maximizing the time the code can run before the end of the time slice. The point is to avoid having the system interrupt the execution of the test code.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 05, 2012, 09:28:00 PM
Quote from: nixeagle on May 05, 2012, 08:50:08 PM
Quote from: dedndave on May 05, 2012, 08:43:58 PM
i don't think so
we need thoughts and ideas from other members   :bg

it could be that it is somewhat syncronous, as Michael uses the Sleep function
i don't recall exactly how - but i think it's only at the beginning of the loop

:lol, ok after I eat I'll do up a test program that shoves numbers in a buffer and after running spams them out. Just pipe the result to a file and attach in a zip file and I'll plot them in Mathematica. The context switches will be obvious as they will be the portions where the linear plot jumps from say 200000000 to 200005000 instead of 200000000 to 200000100. It'll look like a "step" on an upward sloping line. :bdg

Alright, this was easier than I thought :bg. I changed timeit.asm such that

gu_default_sample_batch_size equ 1
gu_total_samples equ 25000


Which basically turns the program into a "normal" timing loop that attempts to take 25000 samples. Here is the result:

Min:      186 ticks
Max:      8697 ticks
Range:    8511 ticks
Mean:     238 ticks
Variance: 9877 ticks^2   // This got so large that it wrapped around a 32 bit integer. Needs 64bit support soon! Or do this with doubles...


I piped the result into a csv file and fed it to MMA. Mathematica found 5 points where the computer took longer than 500 ticks. That means this program took 6 slices to execute. Even if we (wrongly) assume that all runs other than context switches take 186 ticks, we have the following:
186*25000 / 6.
Which equals 775,000 ticks on my machine. In other words, you get roughly 750/2=375 microseconds. To put this in human time: you get about 0.000375 seconds. Granted my margin of error here is huge, but I can say with certainity that it is larger than my lowerbound calculation of 10microseconds earlier.

Am I making sense now? If not, where am I wrong. :wink
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 05, 2012, 09:30:54 PM
Quote from: MichaelW on May 05, 2012, 09:23:12 PM
The Sleep 0 is use to synchronize the test code with the start of a time slice, maximizing the time the code can run before the end of the time slice. The point is to avoid having the system interrupt the execution of the test code.


Alright, my method instead detects the context switch and discards the result. Well it does that when gu_default_sample_batch_size is greater than 1.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: MichaelW on May 05, 2012, 09:58:06 PM
Under conditions where the system and the test code are running on the same core, the system must interrupt the test code at the end of a time slice, if only to determine that the current thread is going to get the next time slice.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 05, 2012, 10:12:20 PM
Quote from: MichaelW on May 05, 2012, 09:58:06 PM
Under conditions where the system and the test code are running on the same core, the system must interrupt the test code at the end of a time slice, if only to determine that the current thread is going to get the next time slice.
Right. The question was how often does that occur. :wink I believe I've shown it must be substantially larger than what dedndave had by measuring how long Sleep 0 took. That just tells you how much time the operating system takes to complete the Sleep 0 call, but it does not tell you how long the operating system will allow you to go before pre-emptively taking control from you.

I did an interesting graph with Mathematica that ought to help make it intuitively clear why the OS can't context switch every 2 microseconds. For if it did, the operating system would have to peg the CPU at 50% utilization all of the time!

(http://i.imgur.com/PUiQt.png)

Here the X axes represents the number of nanoseconds a program is permitted before having control taken away by the operating system. The Y axes represents what percentage of the CPU is idle or free for the user's program to make use of. This is done under the assumption that context switching takes 1433 nanoseconds (as on dedndave's processor by his Sleep 0 call).
Title: Baseline program updated, used to actually benchmark an "algorithm"!
Post by: nixeagle on May 06, 2012, 02:41:22 AM
I think you guys will find this interesting. I have a stable baseline that is capable of comparing against test functions. That is we can use this to figure out precisely how many clocks something takes. At least on my i7.

Below is the program output. It ought to speak for itself. Real "production" versions of this will only do one run of the baseline and each algorithm. Plus the baseline won't get shown. Just the difference from the algorithms to the baseline. :)

Spinning up the processor.
Running 4 full runs of testit_baseline.
Batch Size: 8

{107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 104, 107, 107, 107, 107, 110, 107, 107, 107, 107, }

Min:      104 ticks
Max:      110 ticks
Range:    6 ticks
Mean:     107 ticks
Variance: 0 ticks^2
Batch Size: 7
Call Count: 2379447 calls

{107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 92, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 104, }

Min:      92 ticks
Max:      107 ticks
Range:    15 ticks
Mean:     106 ticks
Variance: 8 ticks^2
Batch Size: 7
Call Count: 663870 calls

{107, 107, 104, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, }

Min:      104 ticks
Max:      107 ticks
Range:    3 ticks
Mean:     106 ticks
Variance: 1 ticks^2
Batch Size: 7
Call Count: 625968 calls

{107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 110, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, }

Min:      107 ticks
Max:      110 ticks
Range:    3 ticks
Mean:     107 ticks
Variance: 0 ticks^2
Batch Size: 7
Call Count: 1125312 calls

Running 4 full runs of testit, which contains two imul.

{123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 104, 123, 123, 119, 123, 123, 123, 123, 123, 123, }

Min:      104 ticks
Max:      123 ticks
Range:    19 ticks
Mean:     122 ticks
Variance: 14 ticks^2
Batch Size: 6
Call Count: 2835779 calls

{123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, }

Min:      123 ticks
Max:      123 ticks
Range:    0 ticks
Mean:     123 ticks
Variance: 0 ticks^2
Batch Size: 5
Call Count: 2246256 calls

{123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 104, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 104, 123, }

Min:      104 ticks
Max:      123 ticks
Range:    19 ticks
Mean:     121 ticks
Variance: 26 ticks^2
Batch Size: 5
Call Count: 398668 calls

{123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, }

Min:      123 ticks
Max:      123 ticks
Range:    0 ticks
Mean:     123 ticks
Variance: 0 ticks^2
Batch Size: 5
Call Count: 318908 calls
Press any key to continue ...


Please take the time to test this and paste the outputs!

P.S. dedndave, others the program now requests HIGH priority instead of realtime as we don't need realtime here. Before running this, you may want to adjust gu_default_sample_batch_size down to 4 or 5 from 8. Your call on that :wink
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: jj2007 on May 06, 2012, 03:03:21 AM
Looks promising :U

{336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336,
336, 336, 336, 336, 336, 336, 336, 336, 336, }

Min:      336 ticks
Max:      336 ticks
Range:    0 ticks
Mean:     336 ticks
Variance: 0 ticks^2
Batch Size: 5
Call Count: 13224 calls
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 06, 2012, 03:05:59 AM
Quote from: jj2007 on May 06, 2012, 03:03:21 AM
Looks promising :U

{336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336,
336, 336, 336, 336, 336, 336, 336, 336, 336, }

Min:      336 ticks
Max:      336 ticks
Range:    0 ticks
Mean:     336 ticks
Variance: 0 ticks^2
Batch Size: 5
Call Count: 13224 calls


Is that from testit_baseline or testit? I'd like to see both if possible :bg. But awesome that one works for you :dance:. Btw folks, about how long does it take for the program to prompt you to push a key?
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: hutch-- on May 06, 2012, 06:14:08 AM
Something to note with the SleepEx() API (Sleep is just a wrapper to it). Unless you set some duration you do not get a garranteed yield to another thread, set it to "1" or higher and you do.


invoke SleepEx,1,0
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 06, 2012, 05:03:26 PM
Quote from: hutch-- on May 06, 2012, 06:14:08 AM
Something to note with the SleepEx() API (Sleep is just a wrapper to it). Unless you set some duration you do not get a garranteed yield to another thread, set it to "1" or higher and you do.


invoke SleepEx,1,0


Wow, did not know that. Thanks! :clap:

In other news, to keep everyone up to date, I'm working on a "practical" test example. Pitting qWord's cdiv macro against the naive div we get timings that look something like this:


Spinning up the processor.
Running 4 full runs of testit_baseline.
Batch Size: 8

{129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, }

Min:      129 ticks
Max:      129 ticks
Range:    0 ticks
Mean:     129 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 62580 calls

{132, 132, 132, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 132, }

Min:      129 ticks
Max:      132 ticks
Range:    3 ticks
Mean:     129 ticks
Variance: 1 ticks^2
Batch Size: 8
Call Count: 72996 calls

{132, 132, 132, 132, 132, 132, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 132, 129, 129, 129, }

Min:      129 ticks
Max:      132 ticks
Range:    3 ticks
Mean:     129 ticks
Variance: 2 ticks^2
Batch Size: 8
Call Count: 24983 calls

{129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, }

Min:      129 ticks
Max:      129 ticks
Range:    0 ticks
Mean:     129 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 65051 calls

Running 2 full runs of testit_div which contains integer division.

{440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, }

Min:      440 ticks
Max:      440 ticks
Range:    0 ticks
Mean:     440 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 1953 calls

{440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, }

Min:      440 ticks
Max:      440 ticks
Range:    0 ticks
Mean:     440 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 1925 calls

Running 2 full runs of testit_cdiv, a constant division macro by qWord.

{173, 176, 173, 176, 176, 176, 179, 173, 173, 173, 176, 179, 176, 176, 176, 176, 176, 176, 176, 173, 176, 176, 176, 176, 176, }

Min:      173 ticks
Max:      179 ticks
Range:    6 ticks
Mean:     175 ticks
Variance: 2 ticks^2
Batch Size: 8
Call Count: 36911 calls

{173, 173, 176, 176, 173, 173, 176, 173, 173, 176, 176, 176, 176, 176, 176, 176, 176, 173, 173, 173, 176, 176, 173, 173, 176, }

Min:      173 ticks
Max:      176 ticks
Range:    3 ticks
Mean:     174 ticks
Variance: 2 ticks^2
Batch Size: 8
Call Count: 19656 calls
Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...


Hopefully by the end of today I'll have the code reasonably cleaned up and workable so it'll just say two lines instead of all this spam. One line will indicate time taken for testit_cdiv and the other line will indicate time taken for testit_div. Even as is though, the timings are consistant!

A question for the MASM pros out there:

I have code that looks like:

runner proc
    ...
   
    print " ",13,10
    print "Running 2 full runs of testit_div which contains integer division.",13,10
    mov [gu_testfunction], testit_div
    call run_driver
    print " ", 13,10

    ...
runner endp

where the ... stand for repeated instances of the 5 lines shown, just with different strings and a different function name moved into gu_testfunction. Anyway I can do this better? Code it as a macro or otherwise make it simpler for users of the test harness? Basically with this: you write a function and tell the harness to test it.

P.S. I am planning on releasing this code either under a BSD license (basically do whatever you want with it, just keep the comment saying who wrote the code and don't claim it as your own) or just simply releasing to the public domain. I just want to improve the quality of micro-benchmarking.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 06, 2012, 05:07:42 PM
you could write a little proc with parameters for the test routine and string
then...
        INVOKE  RunTest,Proc1,offset szString1
        INVOKE  RunTest,Proc2,offset szString2
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 06, 2012, 05:09:48 PM
Quote from: dedndave on May 06, 2012, 05:07:42 PM
you could write a little proc with parameters for the test routine and string
then...
        INVOKE  RunTest,Proc1,offset szString1
        INVOKE  RunTest,Proc2,offset szString2

Would that be better done as a macro at this point? Sorry, I am not really that great at MASM :red. Most of my code is a more or less direct translation of the way I'd write it using NASM.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 06, 2012, 05:11:53 PM
i would do it as a proc
the speed isn't important - so go for smaller size
if you expand a macro everytime, the code will be larger
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 06, 2012, 05:15:49 PM
Quote from: dedndave on May 06, 2012, 05:11:53 PM
i would do it as a proc
the speed isn't important - so go for smaller size
if you expand a macro everytime, the code will be larger

Alright, then I'll have to figure out how invoke works :red.

P.S. dedndave, did you test out the latest version of this? It ought to be fairly fast and I'm really concerned with ensuring stability of the test results on your jittery processor. You don't have to paste the results, I'd just appreciate knowing if the outputs for the 8 full sample runs are stable and have a low variance/range.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 06, 2012, 05:28:01 PM
;prototypes should be near the beginning of the source file

RunTest PROTO   :LPVOID,:LPSTR

;
;
;

RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        print   chr$(20,13,10)
        print   lpszString
        mov     eax,lpfnFunction
        mov     [gu_testfunction],eax
        call    run_driver
        print   chr$(20,13,10)
        ret

RunTest ENDP
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 06, 2012, 05:31:07 PM
prescott w/htt - XP MCE2005 SP3
{555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555,
555, 555, 555, 555, 555, 555, 555, 555, 555, }

Min:      555 ticks
Max:      555 ticks
Range:    0 ticks
Mean:     555 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 1491 calls

Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 06, 2012, 05:40:22 PM
Quote from: dedndave on May 06, 2012, 05:28:01 PM
;prototypes should be near the beginning of the source file

RunTest PROTO   :LPVOID,:LPSTR

;
;
;

RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        print   chr$(20,13,10)
        print   lpszString
        mov     eax,lpfnFunction
        mov     [gu_testfunction],eax
        call    run_driver
        print   chr$(20,13,10)
        ret

RunTest ENDP

Cool code! Learned about chr$. So I can now change it to chr$(13,10) :U. However the invoke stuff still confuses me :(. With the code you gave, attempted to invoke it with:

invoke RunTest,testit_baseline,"Running testit_baseline."

but got these two errors:

timeit.asm(519) : error A2084: constant value too large
timeit.asm(519) : error A2114: INVOKE argument type mismatch : argument : 2


Does this mean I have to do something fancy to pass a constant string?

Quote from: dedndave on May 06, 2012, 05:31:07 PM
prescott w/htt - XP MCE2005 SP3
{555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555,
555, 555, 555, 555, 555, 555, 555, 555, 555, }

Min:      555 ticks
Max:      555 ticks
Range:    0 ticks
Mean:     555 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 1491 calls


Wow, very awesome! Was that for the baseline or for the "test" function with imul.

Thanks for the awesome help :8)
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 06, 2012, 05:48:12 PM
Hutch has a fancy macro for doing that - fn or fnx, i think
you'll have to look in the hlhelp.chm file

you could probably also use chr$
        INVOKE  RunTest,testit_baseline,chr$("Running testit_baseline.")

otherwise, define the string in the .DATA section, then pass a pointer
        .DATA

szStr0 db "Running testit_baseline.",0

        .CODE

        INVOKE  RunTest,testit_baseline,offset szStr0


Wow, very awesome! Was that for the baseline or for the "test" function with imul.
i guess it's the IMUL
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 06, 2012, 05:55:31 PM
Quote from: dedndave on May 06, 2012, 05:48:12 PM
Hutch has a fancy macro for doing that - fn or fnx, i think
you'll have to look in the hlhelp.chm file

you could probably also use chr$
        INVOKE  RunTest,testit_baseline,chr$("Running testit_baseline.")

otherwise, define the string in the .DATA section, then pass a pointer
        .DATA

szStr0 db "Running testit_baseline.",0

        .CODE

        INVOKE  RunTest,testit_baseline,offset szStr0


Wow, very awesome! Was that for the baseline or for the "test" function with imul.
i guess it's the IMUL
Awesome, chr$(...) will work just fine for now. If we want prettier syntax later after this works, we can do it then. Great to hear that it is IMUL that is so stable :U. Mind double checking that the baseline was stable for me? Of the functions under test, the most important one to have stable is the baseline as that serves as the yardstick that everything else is compared against. A baseline function with high variance/range can potentially lead to negative timing results.

P.S. the invoke now segfaults, but that I can figure out myself :bg. Probably something with the stack :eek.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 06, 2012, 06:02:24 PM
well - you have been doing some strange stuff with stack parameters - lol
(like popping the return address etc)

anyways...
i made a little mistake in 2 places
        print   chr$(20,13,10)
should have been
        print   chr$(32,13,10)

you could also do it this way...
RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        push    0A0D20h
        print   esp
        print   lpszString
        mov     eax,lpfnFunction
        mov     [gu_testfunction],eax
        call    run_driver
        print   esp
        pop     eax
        ret

RunTest ENDP
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 06, 2012, 06:05:36 PM
or, if you return a value in EAX when you call the test...
RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        push    0A0D20h
        print   esp
        print   lpszString
        mov     eax,lpfnFunction
        mov     [gu_testfunction],eax
        call    run_driver
        push    eax     ;save return value
        print   esp
        pop     eax     ;EAX = return value
        pop     edx
        ret

RunTest ENDP
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 06, 2012, 06:08:44 PM
i bet i know why you are having problems - lol

your run_driver proc probably trashes EBP
quick fix...
RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        push    0A0D20h
        print   esp
        print   lpszString
        mov     eax,lpfnFunction
        mov     gu_testfunction,eax  ;braces not required
        push    ebp
        call    run_driver
        pop     ebp
        push    eax     ;save return value
        print   esp
        pop     eax     ;EAX = return value
        pop     edx
        ret

RunTest ENDP
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 06, 2012, 06:17:55 PM
Quote from: dedndave on May 06, 2012, 06:08:44 PM
i bet i know why you are having problems - lol

your run_driver proc probably trashes EBP
quick fix...
RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        push    0A0D20h
        print   esp
        print   lpszString
        mov     eax,lpfnFunction
        mov     gu_testfunction,eax  ;braces not required
        push    ebp
        call    run_driver
        pop     ebp
        push    eax     ;save return value
        print   esp
        pop     eax     ;EAX = return value
        pop     edx
        ret

RunTest ENDP


Yea, you and I came to the same solution. Trashing ebp has been the cause of most of my segfaults, so it is the first thing I look for ::). Thanks though, I've copied and pasted your code into the program if that is ok with you :bdg.

I think I'm going to go through and do a quick cleanup of the program, remove commented out junk and so on and then do a new release that runs testit_baseline only once, then runs testit_div followed by testit_cdiv. That should help people see some of the uses for this program while I work on actually subtracting the baseline from the other test results and so on. Granted to do so, I'll need to implement a way to detect the commonest result from a sample run. That is if testit_baseline returns '172' twenty-four times and '169' once... we will choose to use 172, treating 169 as an outlier1.

P.S: If I'm doing strange stuff, I'd be thrilled to have that pointed out to me. I don't mind critiques on how I'm doing things.



Title: New update!
Post by: nixeagle on May 06, 2012, 10:24:19 PM
Alright, I've done some cleanup on the assembly file and have the runtime of the test harness down to about 15 seconds on this computer. That includes the time required to spin up the processor!

The part of the program that is meant to be changed by users is found at the very bottom of testit.asm and looks like this:

align 16
runner proc
    invoke RunTest,testit_baseline,chr$("Running testit_baseline.")
    invoke RunTest,testit_div, chr$("Running testit_div.")
    invoke RunTest,testit_cdiv, chr$("Running testit_cdiv.")
   

    print "Please subtract testit_baseline time from both, then divide by 32.",13,10
    print "... yes this program will do that computation automatically soon!",13,10

    ret
runner endp

Where testit_baseline is a function that does the minimal work shared by all the other functions under test. For example here all the functions push ebx, so the baseline function does as well. This way anything that can be considered "overhead" and unrelated to the test will be factored out.

My output looks like this:

Spinning up the processor.

Running testit_baseline.
Min:      129 ticks
Max:      129 ticks
Range:    0 ticks
Mean:     129 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 800800 calls

Running testit_div.
Min:      440 ticks
Max:      440 ticks
Range:    0 ticks
Mean:     440 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 14434 calls

Running testit_cdiv.
Min:      173 ticks
Max:      198 ticks
Range:    25 ticks
Mean:     173 ticks
Variance: 16 ticks^2
Batch Size: 7
Call Count: 3070267 calls
Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!


As we can see qWord's cdiv is much faster than the naive division! As far as computing how much time each function takes, that works something like this:
(440-139)/32
gives 9.71875 ticks per time through the inner loop.1 Compare that to
(173-139)/32
gives 1.375 ticks per time through the inner loop for testit_cdiv. Clearly qWord's implementation is fastest!

Attached is the latest program. Please give it a try and tell us about your results! Is qWord's implementation ever slower? Please note that this version of timeit.zip includes qWord's ConstDiv.inc file to make it easier for you guys to recompile timeit.asm with different settings or whatnot. Enjoy! :dance:



Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: jj2007 on May 06, 2012, 10:33:50 PM
Hi,
This is the complete output. I have not understood, though, where to find testit_baseline ::)
Celeron M:
Min:      456 ticks
Max:      480 ticks
Range:    24 ticks
Mean:     479 ticks
Variance: 2 ticks^2
Batch Size: 7
Call Count: 6920857 calls
☻Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...
Title: Program update to fix my screwup!
Post by: nixeagle on May 06, 2012, 10:43:13 PM
Quote from: jj2007 on May 06, 2012, 10:33:50 PM
Hi,
This is the complete output. I have not understood, though, where to find testit_baseline ::)
Celeron M:
Min:      456 ticks
Max:      480 ticks
Range:    24 ticks
Mean:     479 ticks
Variance: 2 ticks^2
Batch Size: 7
Call Count: 6920857 calls
☻Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...


OH! I'm so dumb. No wonder everyone has only been giving one test result. :red I test my programs in Cygwin's bash shell so cls does not actually do anything. I get the same results when I try to run it by double clicking on the icon. Attached is a new version that clears the screen only once when it starts up and then leaves it unmodified all the way to the end. You should see all 3 test results now. :bg
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 07, 2012, 03:50:45 AM
DOH !

prescott w/htt XP MCE2005 SP3
Running testit_baseline.
Min:      570 ticks
Max:      570 ticks
Range:    0 ticks
Mean:     570 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 555506 calls

Running testit_div.
Min:      1575 ticks
Max:      1575 ticks
Range:    0 ticks
Mean:     1575 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 49938 calls

Running testit_cdiv.
Min:      720 ticks
Max:      720 ticks
Range:    0 ticks
Mean:     720 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3682 calls
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 07, 2012, 04:49:25 AM
Quote from: dedndave on May 07, 2012, 03:50:45 AM
DOH !

prescott w/htt XP MCE2005 SP3
Running testit_baseline.
Min:      570 ticks
Max:      570 ticks
Range:    0 ticks
Mean:     570 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 555506 calls

Running testit_div.
Min:      1575 ticks
Max:      1575 ticks
Range:    0 ticks
Mean:     1575 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 49938 calls

Running testit_cdiv.
Min:      720 ticks
Max:      720 ticks
Range:    0 ticks
Mean:     720 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3682 calls


Whoa! very nice! You can take those numbers to the bank! Tomorrow I'll go ahead and finish off the calculation part. That way the program computes the number of ticks taken instead of mental arithmetic! Since you have 0 variance, computing the number of ticks taken is trivial.

Starting off with ticks taken for timings_cdiv we have

(720 - 570)/32 = 4.6875 ticks.


And time taken for timings_div we have

(1575 - 570)/32 = 31.4063 ticks.


So clearly qWord's constant division macro is faster by a factor of 6.7 times!

Thank you for testing! I'm very happy with the results on your processor! :dance: I'll be thrilled if others take the time to post some results too.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 07, 2012, 05:11:10 AM
well - we all knew it was faster
but, i seriously doubt it is 5 clock cycles
i didn't look at the code - but it probably takes something like twice that, depending on the variation
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 07, 2012, 05:14:28 AM
Quote from: dedndave on May 07, 2012, 05:11:10 AM
well - we all knew it was faster
but, i seriously doubt it is 5 clock cycles
i didn't look at the code - but it probably takes something like twice that, depending on the variation
Depends on how many mul can be in flight at a time. Take a look at the code and tell me what you think. I'll look up agner fog's CPU timings for the prescott (if they exist) and we can sort this out. I am very confident in the test results though.

Plus I wonder if hyperthreading messes with this? Only way to figure that out is to disable it for a run and see what it says.

Edit: here is the disassembly of testit_cdiv

00000480 <testit_cdiv@0>:
480:   53                      push   ebx
481:   b9 1f 00 00 00          mov    ecx,0x1f
486:   b8 7f 96 98 00          mov    eax,0x98967f
48b:   ba 47 ac c5 a7          mov    edx,0xa7c5ac47
490:   83 c0 01                add    eax,0x1
493:   72 02                   jb     497 <testit_cdiv@0+0x17>
495:   f7 e2                   mul    edx
497:   c1 ea 10                shr    edx,0x10
49a:   49                      dec    ecx
49b:   79 e9                   jns    486 <testit_cdiv@0+0x6>
49d:   5b                      pop    ebx
49e:   c3                      ret


Remember the cost of the loop and pushing/popping ebx is factored out of the measurement time due to testit_baseline. Follows is the disassembly for that:

00000470 <testit_baseline@0>:
470:   53                      push   ebx
471:   b9 1f 00 00 00          mov    ecx,0x1f
476:   49                      dec    ecx
477:   79 fd                   jns    476 <testit_baseline@0+0x6>
479:   5b                      pop    ebx
47a:   c3                      ret


Your timings are reasonable assuming the cost of a multiply is less than 5 clocks. Remember the non multiply stuff can be done out of order during the multiply.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 07, 2012, 05:17:26 AM
well - i generally bind to a single core for all timing tests
i think many forum members do, also

the number of MUL's that can happen at once is 1   :bg
they are all dependent on EAX and EDX
not to mention - they probably use some specialized ciruitry to perform MUL
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 07, 2012, 05:21:43 AM
Quote from: dedndave on May 07, 2012, 05:17:26 AM
well - i generally bind to a single core for all timing tests
i think many forum members do, also

the number of MUL's that can happen at once is 1   :bg
they are all dependent on EAX and EDX
not to mention - they probably use some specialized ciruitry to perform MUL

See my edit: and yes the test binds itself to one core. But that does not mean htt is disabled. In a way the test code is still only getting half of your CPU's time.

Edit: ugh, looking at the disassembly for timeit_baseline it looks like I forgot to make that do mov eax, 9999999 like the other two functions under test. So that means in a way the timings should be shorter than what you have!

Edit 2: This is the code actually being timed:

486:   b8 7f 96 98 00          mov    eax,0x98967f
48b:   ba 47 ac c5 a7          mov    edx,0xa7c5ac47
490:   83 c0 01                add    eax,0x1
493:   72 02                   jb     497 <testit_cdiv@0+0x17>
495:   f7 e2                   mul    edx
497:   c1 ea 10                shr    edx,0x10

I can easily see this executing in less than 5 clocks due to out of order execution. While the CPU is busy with that mul instruction, it can start work on the next iteration of the inner loop.

Edit 3: Ugh, according to Agner Fog the latency for MUL is 11 clocks on a prescott. The pipeline would allow everything else to execute as they use different execution ports, but no way around the MUL latency. Now to figure out how in the heck we are undercounting. Maybe we are skipping the mul and taking the jb 497 branch?

Edit 4: The DIV number looks in the ballpark according to Agner Fog's number as DIV has a 34 tick reciprocal throughput. We are at 31.4. A little under, but not nearly as bad as the MUL case. Have you ever overclocked your processor? If yes, I'll have to check into how RDTSC works on the prescott. I know it is different from Core i7 as here it is a monotonic 2Ghz clock, regardless of what speed the actual CPU is at.

Edit 5: This really is weird. The thing is your CPU actually ran at this speed. The function got called, the CPU churned through it and was measured. There is no denying that. The question is why and right now I'm at a loss. I can't see agner forgetting to note a reciprocal throughput for MUL in his timing data. I mean he has IMUL r32,r32 with a reciprocal throughput of 2.5 on an operation that has a latency of 10 clocks.

Edit 6: I just confirmed that the MUL instruction is actually taken by running the program in gdb. Now I'm really stumped! Can someone else please take the program for a spin and post the results along with what processor you have?

The difference in timing between 4.86 ticks and 11 ticks is statistically significant here, so we can't just dismiss this as measurement error.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: jj2007 on May 07, 2012, 09:12:31 PM
Attachment from reply #103, on Celeron M, Win XP SP3:

Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5600 calls

Running testit_div.
Min:      696 ticks
Max:      696 ticks
Range:    0 ticks
Mean:     696 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5663 calls

Running testit_cdiv.
Min:      480 ticks
Max:      480 ticks
Range:    0 ticks
Mean:     480 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5063205 calls
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 07, 2012, 10:05:30 PM
Quote from: jj2007 on May 07, 2012, 09:12:31 PM
Attachment from reply #103, on Celeron M, Win XP SP3:

Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5600 calls

Running testit_div.
Min:      696 ticks
Max:      696 ticks
Range:    0 ticks
Mean:     696 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5663 calls

Running testit_cdiv.
Min:      480 ticks
Max:      480 ticks
Range:    0 ticks
Mean:     480 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5063205 calls


Thanks, from what I understand, the Celeron M is essentially a Pentinum M with less Cache and no speedstep. The lack of speedstep actually makes measuring things easier on your processor. Agner Fog says DIV has a 12-20 clock reciprocal throughput with a 15-39 latency and  MUL has a 1 clock reciprocal throughput with a 5 clock latency. Your numbers are close enough to those timings as you have:
(696 - 348)/32 = 10.875 clocks
for DIV and
(480 - 348)/32 = 4.125 clocks
for MUL. The division on yours is slightly under, just as dedndave's was. The real difference is your multiplcation times. Those are higher than the 1 cycle reciprocal throughput. However the extra time can easily be accounted to pipelining effects and out of order execution. That is executing the instructions around the MUL.

Looking at your results and comparing to dedndave's my two possible conclusions are:

I find agner fog being wrong very unlikely, so there has to be some other effect going on here. Especially as jj2007's measurements are pretty close to being spot on with expected results.

Attached program is just like the prior program in #103, just adds an extra testit_baseline call after all tests are run. The assumption is timeit_baseline will take the same amount of time; both when it is run at the start and when it is run at the end.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: jj2007 on May 08, 2012, 12:01:53 AM
Here it is...
Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5649 calls

Running testit_div.
Min:      696 ticks
Max:      696 ticks
Range:    0 ticks
Mean:     696 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5593 calls

Running testit_cdiv.
Min:      480 ticks
Max:      480 ticks
Range:    0 ticks
Mean:     480 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 4206034 calls

Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5670 calls
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 08, 2012, 12:16:05 AM
Quote from: jj2007 on May 08, 2012, 12:01:53 AM
Here it is...
Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5649 calls

Running testit_div.
Min:      696 ticks
Max:      696 ticks
Range:    0 ticks
Mean:     696 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5593 calls

Running testit_cdiv.
Min:      480 ticks
Max:      480 ticks
Range:    0 ticks
Mean:     480 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 4206034 calls

Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5670 calls


:clap: Looks very good to me. You even have the exact same timings as before, which is a large part of what I'm trying to achieve here. Timings that are consistent no matter how many times you run the program and no matter what order the algorithms are in. For your processor we have that down to a 't'. :bg.

Major thanks for giving us a third opinion. Now just to figure out the oddities with dedndave's processor. If you or anyone else had ideas on the strange measurement results he has... speak up please! Once I'm happy with his processor and we have stable results for everyone, I'll polish the program up some and try testing some different algorithms for fun :D.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 08, 2012, 02:20:24 AM
well - it's not just my processor - lol
XP Media Center Edition does some funny stuff with timers, etc - for fast video rendering
i have an old-style TV card in this machine, as well
Title: Reply to dedndave and call for testing of program in post #111!
Post by: nixeagle on May 08, 2012, 04:03:31 AM
Quote from: dedndave on May 08, 2012, 02:20:24 AM
well - it's not just my processor - lol
XP Media Center Edition does some funny stuff with timers, etc - for fast video rendering
i have an old-style TV card in this machine, as well

Can you please at least run the program in post #111? What that'll do is confirm that the measurements are stable. If the second call to timeit_baseline completes faster, we will know the problem is spinning up your processor. If it is not, I'm down to either:

If anyone else wants to test, please don't hesitate to post some times!

P.S. I know of no way for the operating system to modify the value returned by RDTSC. I simply don't believe it is possible. The simplest solution to our dilemma will be if your test with the program in #111 shows up a discrepancy between testit_baseline runs.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 08, 2012, 04:12:26 AM
from post 111
prescott w/htt - XP MCE2005 SP3
Running testit_baseline.
Min:      570 ticks
Max:      570 ticks
Range:    0 ticks
Mean:     570 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 583233 calls

Running testit_div.
Min:      1575 ticks
Max:      1575 ticks
Range:    0 ticks
Mean:     1575 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 66367 calls

Running testit_cdiv.
Min:      720 ticks
Max:      720 ticks
Range:    0 ticks
Mean:     720 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3584 calls

Running testit_baseline.
Min:      570 ticks
Max:      570 ticks
Range:    0 ticks
Mean:     570 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 563234 calls
Title: Thanks dedndave
Post by: nixeagle on May 08, 2012, 04:22:12 AM
Amazing how those values are exactly the same as the last time you ran this. Plus notice that testit_baseline takes the same amount of time regardless of when you run it. So this disqualifies my best theory. :(

I'll sleep on this and see if I can't think of some reason for this behavior. I might write up another test for you to try tomorrow. On the bright side, at least we have consistent timings on your "jittery" CPU :bg.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: FORTRANS on May 08, 2012, 12:40:03 PM
Hi,

   Results from two older processors: P-III and P-MMX.

Regards,

Steve N.


Spinning up the processor.

Running testit_baseline.
Min:      200 ticks
Max:      200 ticks
Range:    0 ticks
Mean:     200 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3535 calls

Running testit_div.
Min:      1420 ticks
Max:      1420 ticks
Range:    0 ticks
Mean:     1420 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3542 calls

Running testit_cdiv.
Min:      333 ticks
Max:      333 ticks
Range:    0 ticks
Mean:     333 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3542 calls

Running testit_baseline.
Min:      200 ticks
Max:      200 ticks
Range:    0 ticks
Mean:     200 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3528 calls
Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...
Spinning up the processor.

Running testit_baseline.
Min:      89 ticks
Max:      89 ticks
Range:    0 ticks
Mean:     89 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3528 calls

Running testit_div.
Min:      1450 ticks
Max:      1450 ticks
Range:    0 ticks
Mean:     1450 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3647 calls

Running testit_cdiv.
Min:      470 ticks
Max:      470 ticks
Range:    0 ticks
Mean:     470 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3570 calls

Running testit_baseline.
Min:      89 ticks
Max:      89 ticks
Range:    0 ticks
Mean:     89 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3521 calls
Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: lingo on May 08, 2012, 02:19:43 PM
CPU - i7 2600K; Win7 Ult 64 SP1
Spinning up the processor.

Running testit_baseline.
Min:      145 ticks
Max:      158 ticks
Range:    13 ticks
Mean:     145 ticks
Variance: 1 ticks^2
Batch Size: 8
Call Count: 434693 calls

Running testit_div.
Min:      497 ticks
Max:      544 ticks
Range:    47 ticks
Mean:     520 ticks
Variance: 121 ticks^2
Batch Size: 8
Call Count: 780528 calls

Running testit_cdiv.
Min:      198 ticks
Max:      207 ticks
Range:    9 ticks
Mean:     199 ticks
Variance: 6 ticks^2
Batch Size: 8
Call Count: 609322 calls

Running testit_baseline.
Min:      145 ticks
Max:      158 ticks
Range:    13 ticks
Mean:     145 ticks
Variance: 5 ticks^2
Batch Size: 8
Call Count: 394219 calls
☻Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...

Title: Awesome results guys! Plus response to FORTRANS and lingo.
Post by: nixeagle on May 08, 2012, 06:53:24 PM
To FORTRANS: Your results are reasonable and match what I'd expect. I have for the P3 the following timings:

(1420 - 200)/32 = 38.125 ticks  // For div
(333 - 200)/32  = 4.15625 ticks // For cdiv


Based on AgnerFog IMUL ought to take 4 ticks latency and 1 tick reciprocal throughput. So your getting 4 ticks for the whole measurement seems within reason considering there are other operations occurring other than the multiplication. Plus Agner Fog does not have timings for MUL this processor. Just the IMUL so I'll just have to assume they are close enough. Either way, you are certainly within expectations here.

The P-MMX takes

(1450 - 200)/32 = 39.0625 ticks // div
(470 - 200)/32  = 8.4375 ticks  // cdiv

To be honest, I'm not real sure which CPU this is. I'm guessing this is a P1 with MMX. Here MUL takes 9 ticks. I'm going to assume the other operations occur while MUL is occurring and thus the time for the other operations does not contribute to the total time. Still we are half a clock too fast with our measurement. My only guess is testit_baseline is a tad too pessimistic. The same story goes for DIV. Agner's measurement has it at 41 ticks, we have it at about 39 ticks. I'll be doing another release of the timing program today that you might want to try out.

For measuring comparative algorithm speeds, all the results I'm seeing seem spot on. The only possible error I'm seeing is testit_baseline is too pessimistic and should complete sooner than it actually does. But as this is a constant overhead applied to all algorithms under test, no one algorithm is going to gain an unfair advantage with respect to any other.

To lingo: Your results practically mirror mine as far as stability goes. Which makes sense, we are both running a i7. What makes me real happy with your results is that testit_baseline is taking the same amount of time at both program start and program end. This means both algorithms in between the testit_baseline (should) have optimal and consistent timings. That is, you are going to always get  measurements within the standard deviation of the sample.

Now computing the amount of time taken by the algorithms on your processor is going to involve a little more work due to the variance (and thus standard deviation) of the measurements. In the following, I use the notation {lower_bound_of_mean, mean, upper_bound_of_mean}. What I mean by this is that the mean should always appear between the lower and upper bounds of the mean. In a way, this expresses the quality of the measurement and how much confidence we can take in them. Off we go!

({509, 520, 531} - 145)/32                     = {11.375, 11.7188, 12.0625} ticks   // div
({199 - Sqrt[6], 199, 199 + Sqrt[6]} - 145)/32 = {1.61095, 1.6875, 1.76405} ticks   // cdiv

For simplicity I've assumed the baseline is 145 ticks with no standard deviation. Including the deviation of the baseline measurement just makes the math even more complicated. I'll have the final timing program do it, but for this post it is simple enough to disregard it, if only to avoid explaining it! The impact on the final measurements is on the order of ~0.1 to ~0.05 ticks.

As we can see the difference between cdiv and div on limbo's i7 is quite significant. Most importantly, no matter how many times we run this test, the value of Mean will always appear between the bounds specified1.


Major thanks to both FORTRANS and limbo for giving us more test data to analyze! :bg I'll update with a new program later today that will hopefully fix a few potential flaws2 and if I have time, add floating point calculations. For that I'll have to figure out how to print "floats" or "doubles" using masm. ::).



Title: Re: Awesome results guys! Plus response to FORTRANS and lingo.
Post by: FORTRANS on May 08, 2012, 09:26:44 PM
Quote from: nixeagle on May 08, 2012, 06:53:24 PM
To FORTRANS: Your results are reasonable and match what I'd expect.

Hi nixeagle,

   Good.

Quote
To be honest, I'm not real sure which CPU this is. I'm guessing this is a P1 with MMX.

   Yes, that is correct.

Cheers,

Steve N.
Title: New update! Renewed call for testing!
Post by: nixeagle on May 08, 2012, 10:29:02 PM
Alright, this update is mostly focused at dedndave, however I'd be very interested to see testing results from others. What I have done is create some extra copies of the test functions and named them short_testit_div, long_testit_div and so forth. These are mirror images of the original functions, just with a shorter or longer inner timing loop.
This will uncover any pipelining effects that occur as well as allow me to correlate the results to further establish how trustworthy the results really are.

Additionally the newest update includes an output revamp. Instead of spamming 8 lines per function tested, we spam only 4.

Edit: Removed battery power test results to shorten the length of this post and draw attention to the more interesting parts. If anyone knows how to "collapse or otherwise put scrollbars on code sections... I'd greatly appreciate it.

My test results when my i7 is plugged in

Spinning up the processor.

Running testit_long_baseline.
Min:      220 ticks, Max:      223 ticks, Range:    3 ticks
Mean:     220 ticks, Variance: 1 ticks^2
Batch Size: 8      , Call Count: 34230 calls

Running long_testit_div.
Min:      1496 ticks, Max:      1496 ticks, Range:    0 ticks
Mean:     1496 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 18319 calls

Running long_testit_cdiv.
Min:      434 ticks, Max:      478 ticks, Range:    44 ticks
Mean:     476 ticks, Variance: 53 ticks^2
Batch Size: 8      , Call Count: 19483198 calls

Running testit_long_baseline.
Min:      220 ticks, Max:      223 ticks, Range:    3 ticks
Mean:     220 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 26089 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min:      129 ticks, Max:      129 ticks, Range:    0 ticks
Mean:     129 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 172053 calls

Running testit_div.
Min:      440 ticks, Max:      440 ticks, Range:    0 ticks
Mean:     440 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 22743 calls

Running testit_cdiv.
Min:      173 ticks, Max:      179 ticks, Range:    6 ticks
Mean:     175 ticks, Variance: 1 ticks^2
Batch Size: 8      , Call Count: 127876 calls

Running testit_baseline.
Min:      129 ticks, Max:      132 ticks, Range:    3 ticks
Mean:     129 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 168182 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      86 ticks, Max:      94 ticks, Range:    8 ticks
Mean:     93 ticks, Variance: 2 ticks^2
Batch Size: 7      , Call Count: 4516969 calls

Running short_testit_div.
Min:      132 ticks, Max:      132 ticks, Range:    0 ticks
Mean:     132 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 16716 calls

Running short_testit_cdiv.
Min:      97 ticks, Max:      101 ticks, Range:    4 ticks
Mean:     100 ticks, Variance: 1 ticks^2
Batch Size: 6      , Call Count: 1798979 calls

Running short_testit_baseline.
Min:      88 ticks, Max:      94 ticks, Range:    6 ticks
Mean:     93 ticks, Variance: 1 ticks^2
Batch Size: 6      , Call Count: 898280 calls
Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!

Which gives the following results for cdiv:

(100-93)/4= 1.75 ticks  // short_testit_cdiv
(175-129)/32 = 1.4375 ticks  // testit_cdiv
({476-Sqrt[53],476,476+Sqrt[53]}-220)/128 = {1.94312, 2., 2.05688} ticks // long_testit_cdiv


As the first two have insignificent variance, I chose to disregard it in the calculation. Notice how looping 32 times in the inner loop results in a timing of ~1.4 ticks per cycle. The interesting result on my i7 is how looping 128 times in the inner loop results in 2 ticks per cycle, plus or minus ~0.06 ticks. This is very consistent across program runs. I do not have any idea why a longer inner loop would cause this.

And the results for div.

(132-93)/4= 9.75 ticks  // short_testit_div
(440-129)/32= 9.71875 ticks // testit_div
(1496-220)/128 = 9.96875 ticks // long_testit_div


Notice how consistent the timings are for division, no matter how many times you run the program or how many times the inner loop is run. To me this information is interesting :bg. I'm going to have to figure out a way to work in varying the inner loop length into the final timing program :8).

I really would appreciate posts pasting testing results as it helps me validate the approach taken. I believe we are nearly there! Later tonight or tomorrow, depending on the ballgame, I'll post an updated program that does the math I've done manually here automatically. But to do that I need to figure out how to do floating point calculations and print floating point numbers out in masm ::).
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: jj2007 on May 08, 2012, 10:52:39 PM
Here you are :bg

Spinning up the processor.

Running testit_long_baseline.
Min:      480 ticks, Max:      480 ticks, Range:    0 ticks
Mean:     480 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 81466 calls

Running long_testit_div.
Min:      1920 ticks, Max:      1920 ticks, Range:    0 ticks
Mean:     1920 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 72261 calls

Running long_testit_cdiv.
Min:      984 ticks, Max:      996 ticks, Range:    12 ticks
Mean:     991 ticks, Variance: 34 ticks^2
Batch Size: 8      , Call Count: 19328246 calls

Running testit_long_baseline.
Min:      480 ticks, Max:      480 ticks, Range:    0 ticks
Mean:     480 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 77287 calls
☻Please subtract testit_long_baseline time from both, then divide by 128

Running testit_baseline.
Min:      348 ticks, Max:      348 ticks, Range:    0 ticks
Mean:     348 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 32535902 calls

Running testit_div.
Min:      696 ticks, Max:      696 ticks, Range:    0 ticks
Mean:     696 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 4380 calls

Running testit_cdiv.
Min:      480 ticks, Max:      480 ticks, Range:    0 ticks
Mean:     480 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 796815 calls

Running testit_baseline.
Min:      348 ticks, Max:      348 ticks, Range:    0 ticks
Mean:     348 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 23628705 calls
☻Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      264 ticks, Max:      264 ticks, Range:    0 ticks
Mean:     264 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 53740 calls

Running short_testit_div.
Min:      312 ticks, Max:      312 ticks, Range:    0 ticks
Mean:     312 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 3865 calls

Running short_testit_cdiv.
Min:      288 ticks, Max:      288 ticks, Range:    0 ticks
Mean:     288 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 18250 calls

Running short_testit_baseline.
Min:      264 ticks, Max:      264 ticks, Range:    0 ticks
Mean:     264 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 55020 calls
☻Please subtract short_testit_baseline time from both, then divide by 4.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 09, 2012, 01:12:24 AM
prescott w/htt - xp mce2005 sp3
Running testit_long_baseline.
Min:      735 ticks, Max:      1125 ticks, Range:    390 ticks
Mean:     794 ticks, Variance: 7547 ticks^2
Batch Size: 5      , Call Count: 29589995 calls

Running long_testit_div.
Min:      4837 ticks, Max:      4845 ticks, Range:    8 ticks
Mean:     4844 ticks, Variance: 6 ticks^2
Batch Size: 5      , Call Count: 2212 calls

Running long_testit_cdiv.
Min:      1290 ticks, Max:      2235 ticks, Range:    945 ticks
Mean:     1365 ticks, Variance: 53492 ticks^2
Batch Size: 3      , Call Count: 1756543 calls

Running testit_long_baseline.
Min:      765 ticks, Max:      765 ticks, Range:    0 ticks
Mean:     765 ticks, Variance: 0 ticks^2
Batch Size: 3      , Call Count: 9548 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min:      570 ticks, Max:      705 ticks, Range:    135 ticks
Mean:     574 ticks, Variance: 197 ticks^2
Batch Size: 3      , Call Count: 85822 calls

Running testit_div.
Min:      1575 ticks, Max:      2325 ticks, Range:    750 ticks
Mean:     1577 ticks, Variance: 2131 ticks^2
Batch Size: 3      , Call Count: 19086 calls

Running testit_cdiv.
Min:      720 ticks, Max:      720 ticks, Range:    0 ticks
Mean:     720 ticks, Variance: 0 ticks^2
Batch Size: 3      , Call Count: 1006 calls

Running testit_baseline.
Min:      570 ticks, Max:      720 ticks, Range:    150 ticks
Mean:     578 ticks, Variance: 542 ticks^2
Batch Size: 3      , Call Count: 110724 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      480 ticks, Max:      480 ticks, Range:    0 ticks
Mean:     480 ticks, Variance: 0 ticks^2
Batch Size: 3      , Call Count: 1030 calls

Running short_testit_div.
Min:      615 ticks, Max:      615 ticks, Range:    0 ticks
Mean:     615 ticks, Variance: 0 ticks^2
Batch Size: 3      , Call Count: 20838 calls

Running short_testit_cdiv.
Min:      487 ticks, Max:      615 ticks, Range:    128 ticks
Mean:     491 ticks, Variance: 69 ticks^2
Batch Size: 3      , Call Count: 55542 calls

Running short_testit_baseline.
Min:      480 ticks, Max:      570 ticks, Range:    90 ticks
Mean:     480 ticks, Variance: 78 ticks^2
Batch Size: 3      , Call Count: 3310 calls

Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: FORTRANS on May 09, 2012, 01:42:23 PM
Hi,

   Results from P-III, P-MMX, and a Mobile Intel(R) Celeron(R).

Regards,

Steve N.


Spinning up the processor.

Running testit_long_baseline.
Min:      394 ticks, Max:      394 ticks, Range:    0 ticks
Mean:     394 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3542 calls

Running long_testit_div.
Min:      5260 ticks, Max:      5260 ticks, Range:    0 ticks
Mean:     5260 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3682 calls

Running long_testit_cdiv.
Min:      860 ticks, Max:      860 ticks, Range:    0 ticks
Mean:     860 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3549 calls

Running testit_long_baseline.
Min:      394 ticks, Max:      394 ticks, Range:    0 ticks
Mean:     394 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3535 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min:      202 ticks, Max:      202 ticks, Range:    0 ticks
Mean:     202 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3549 calls

Running testit_div.
Min:      1420 ticks, Max:      1420 ticks, Range:    0 ticks
Mean:     1420 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3556 calls

Running testit_cdiv.
Min:      333 ticks, Max:      333 ticks, Range:    0 ticks
Mean:     333 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3549 calls

Running testit_baseline.
Min:      202 ticks, Max:      202 ticks, Range:    0 ticks
Mean:     202 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3528 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      137 ticks, Max:      137 ticks, Range:    0 ticks
Mean:     137 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3598 calls

Running short_testit_div.
Min:      309 ticks, Max:      309 ticks, Range:    0 ticks
Mean:     309 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3535 calls

Running short_testit_cdiv.
Min:      175 ticks, Max:      175 ticks, Range:    0 ticks
Mean:     175 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3535 calls

Running short_testit_baseline.
Min:      137 ticks, Max:      137 ticks, Range:    0 ticks
Mean:     137 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3591 calls
Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!
Press any key to continue ...
Spinning up the processor.

Running testit_long_baseline.
Min:      310 ticks, Max:      310 ticks, Range:    0 ticks
Mean:     310 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 10320282 calls

Running long_testit_div.
Min:      5674 ticks, Max:      5674 ticks, Range:    0 ticks
Mean:     5674 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3710 calls

Running long_testit_cdiv.
Min:      1718 ticks, Max:      1718 ticks, Range:    0 ticks
Mean:     1718 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3598 calls

Running testit_long_baseline.
Min:      310 ticks, Max:      310 ticks, Range:    0 ticks
Mean:     310 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 10638530 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min:      118 ticks, Max:      118 ticks, Range:    0 ticks
Mean:     118 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3507 calls

Running testit_div.
Min:      1450 ticks, Max:      1450 ticks, Range:    0 ticks
Mean:     1450 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3577 calls

Running testit_cdiv.
Min:      470 ticks, Max:      470 ticks, Range:    0 ticks
Mean:     470 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3528 calls

Running testit_baseline.
Min:      118 ticks, Max:      118 ticks, Range:    0 ticks
Mean:     118 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3521 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      62 ticks, Max:      62 ticks, Range:    0 ticks
Mean:     62 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3521 calls

Running short_testit_div.
Min:      213 ticks, Max:      213 ticks, Range:    0 ticks
Mean:     213 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3542 calls

Running short_testit_cdiv.
Min:      106 ticks, Max:      106 ticks, Range:    0 ticks
Mean:     106 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3528 calls

Running short_testit_baseline.
Min:      62 ticks, Max:      62 ticks, Range:    0 ticks
Mean:     62 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3521 calls
Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!
Press any key to continue ...
Spinning up the processor.

Running testit_long_baseline.
Min:      443 ticks, Max:      443 ticks, Range:    0 ticks
Mean:     443 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3542 calls

Running long_testit_div.
Min:      5306 ticks, Max:      5306 ticks, Range:    0 ticks
Mean:     5306 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3542 calls

Running long_testit_cdiv.
Min:      924 ticks, Max:      924 ticks, Range:    0 ticks
Mean:     924 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3542 calls

Running testit_long_baseline.
Min:      443 ticks, Max:      443 ticks, Range:    0 ticks
Mean:     443 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3535 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min:      273 ticks, Max:      273 ticks, Range:    0 ticks
Mean:     273 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3563 calls

Running testit_div.
Min:      1473 ticks, Max:      1473 ticks, Range:    0 ticks
Mean:     1473 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3577 calls

Running testit_cdiv.
Min:      384 ticks, Max:      384 ticks, Range:    0 ticks
Mean:     384 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 91364 calls

Running testit_baseline.
Min:      273 ticks, Max:      273 ticks, Range:    0 ticks
Mean:     273 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3556 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      183 ticks, Max:      183 ticks, Range:    0 ticks
Mean:     183 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3514 calls

Running short_testit_div.
Min:      335 ticks, Max:      335 ticks, Range:    0 ticks
Mean:     335 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3556 calls

Running short_testit_cdiv.
Min:      206 ticks, Max:      206 ticks, Range:    0 ticks
Mean:     206 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3612 calls

Running short_testit_baseline.
Min:      183 ticks, Max:      183 ticks, Range:    0 ticks
Mean:     183 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3507 calls
Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!
Press any key to continue ...
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: mineiro on May 09, 2012, 04:56:01 PM
Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE2)

Spinning up the processor.

Running testit_long_baseline.
Min:      396 ticks, Max:      603 ticks, Range:    207 ticks
Mean:     399 ticks, Variance: 592 ticks^2
Batch Size: 7      , Call Count: 9204369 calls

Running long_testit_div.
Min:      1818 ticks, Max:      2727 ticks, Range:    909 ticks
Mean:     1827 ticks, Variance: 8180 ticks^2
Batch Size: 7      , Call Count: 3006 calls

Running long_testit_cdiv.
Min:      774 ticks, Max:      774 ticks, Range:    0 ticks
Mean:     774 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 9714 calls

Running testit_long_baseline.
Min:      396 ticks, Max:      405 ticks, Range:    9 ticks
Mean:     396 ticks, Variance: 1 ticks^2
Batch Size: 7      , Call Count: 3220512 calls
☻Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min:      333 ticks, Max:      333 ticks, Range:    0 ticks
Mean:     333 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 24516 calls

Running testit_div.
Min:      675 ticks, Max:      675 ticks, Range:    0 ticks
Mean:     675 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 3048 calls

Running testit_cdiv.
Min:      414 ticks, Max:      414 ticks, Range:    0 ticks
Mean:     414 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 436272 calls

Running testit_baseline.
Min:      333 ticks, Max:      333 ticks, Range:    0 ticks
Mean:     333 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 16746 calls
☻Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      252 ticks, Max:      252 ticks, Range:    0 ticks
Mean:     252 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 13290 calls

Running short_testit_div.
Min:      297 ticks, Max:      450 ticks, Range:    153 ticks
Mean:     307 ticks, Variance: 247 ticks^2
Batch Size: 7      , Call Count: 18758514 calls

Running short_testit_cdiv.
Min:      270 ticks, Max:      270 ticks, Range:    0 ticks
Mean:     270 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 6894 calls

Running short_testit_baseline.
Min:      252 ticks, Max:      252 ticks, Range:    0 ticks
Mean:     252 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 11292 calls
☻Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!
Title: Pretty result tables, Thanks to everyone who took measurements!
Post by: nixeagle on May 09, 2012, 07:51:18 PM
Wow! :clap: Thanks to everyone who took the time to run tests for me. We have so many processors being tested now that I had to write a small mathematica function to do the math and show the actual results! Later today I'll upload the latest version of the timing code that will output the "real" numbers. That way you guys don't have to wait on me to translate it and I don't have to spend 30 minutes translating them ::).

The new program also strives to minimize the error bounds. If a full sample run comes back with a large error, the program will re-run that test up to 5 more times in the hopes of getting a smaller error for the total measurement. This is especially important for baseline measurements as any error in the baseline measurement propagates to all of the functions compared against it.

Please let me know what you all think of the timing and such below. Plus to any who have time to look at the code, please suggest improvements to the style/way of doing that. This has become the largest project I've ever done in raw assembly. :lol

Again, thanks to everyone!

jj2007's Unknown processor
No really, I forgot which one this was :lol.

128 iterations32 iterations4 iterations
div11.25 +/- 0 ticks10.88 +/- 0 ticks12.00 +/- 0 ticks
cdiv3.99 +/- 0.52 ticks4.13 +/- 0 ticks6.00 +/- 0 ticks

Note the really nice pipelining effect for cdiv here. This is something div is unable to do.

dedndave's prescott
128 iterations32 iterations4 iterations
div31.87 +/- 0.22 ticks31.34 +/- 8.53 ticks33.75 +/- 0 ticks
cdiv4.69 +/- 20.44 ticks4.56 +/- 2.48 ticks2.75 +/- 4.15 ticks

Note how the measurements for cdiv have large errors when compared to the measurements. Two of the three have an error that is greater than the actual mean value. What this tells us is that we can't really trust those measurements. The updated program I'm working on attempts to check if results are meaningful and if not, it will try to resample/re-run the test in order to get significant results.

The important thing here is: we can tell when our timings are suspect!

FORTRANS' P-III
128 iterations32 iterations4 iterations
div38.02 +/- 0 ticks38.06 +/- 0 ticks43.00 +/- 0 ticks
cdiv3.64 +/- 0 ticks4.09 +/- 0 ticks9.50 +/- 0 ticks

Again note the nice pipelining effect for cdiv that div does not get.

FORTRANS' P-MMX
128 iterations32 iterations4 iterations
div41.91 +/- 0 ticks42.56 +/- 0 ticks37.75 +/- 0 ticks
cdiv11.00 +/- 0 ticks11.00 +/- 0 ticks11.00 +/- 0 ticks

Here no matter how many times you loop, the throughput is the same.

FORTRANS' Mobile Intel Celeron
128 iterations32 iterations4 iterations
div37.99 +/- 0 ticks37.50 +/- 0 ticks38.00 +/- 0 ticks
cdiv3.76 +/- 0 ticks3.47 +/- 0 ticks5.75 +/- 0 ticks

Here we have a nice speedup again when we stay in the inner loop longer.

mineiro's Pentium
128 iterations32 iterations4 iterations
div11.18 +/- 7.99 ticks10.69 +/- 0 ticks13.75 +/- 7.86 ticks
cdiv2.95 +/- 0.09 ticks2.53 +/- 0 ticks4.50 +/- 0 ticks

Here we have a nice warmup effect for cdiv. One might also say div has a warmup effect, but we really can't say anything for it due to the amount of error in the measurements.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: hutch-- on May 10, 2012, 12:01:29 AM
Here is the result on my Core2 quad.



Spinning up the processor.

Running testit_long_baseline.
Min:      369 ticks, Max:      369 ticks, Range:    0 ticks
Mean:     369 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3017630 calls

Running long_testit_div.
Min:      1530 ticks, Max:      1530 ticks, Range:    0 ticks
Mean:     1530 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 370356 calls

Running long_testit_cdiv.
Min:      738 ticks, Max:      747 ticks, Range:    9 ticks
Mean:     745 ticks, Variance: 13 ticks^2
Batch Size: 8      , Call Count: 3584 calls

Running testit_long_baseline.
Min:      369 ticks, Max:      369 ticks, Range:    0 ticks
Mean:     369 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 2946244 calls
☻Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min:      261 ticks, Max:      270 ticks, Range:    9 ticks
Mean:     269 ticks, Variance: 2 ticks^2
Batch Size: 7      , Call Count: 24366835 calls

Running testit_div.
Min:      540 ticks, Max:      549 ticks, Range:    9 ticks
Mean:     548 ticks, Variance: 2 ticks^2
Batch Size: 7      , Call Count: 22754298 calls

Running testit_cdiv.
Min:      342 ticks, Max:      342 ticks, Range:    0 ticks
Mean:     342 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 3030 calls

Running testit_baseline.
Min:      261 ticks, Max:      270 ticks, Range:    9 ticks
Mean:     269 ticks, Variance: 2 ticks^2
Batch Size: 7      , Call Count: 17146668 calls
☻Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      234 ticks, Max:      243 ticks, Range:    9 ticks
Mean:     238 ticks, Variance: 20 ticks^2
Batch Size: 4      , Call Count: 9344100 calls

Running short_testit_div.
Min:      261 ticks, Max:      270 ticks, Range:    9 ticks
Mean:     265 ticks, Variance: 20 ticks^2
Batch Size: 4      , Call Count: 6207672 calls

Running short_testit_cdiv.
Min:      243 ticks, Max:      243 ticks, Range:    0 ticks
Mean:     243 ticks, Variance: 0 ticks^2
Batch Size: 4      , Call Count: 12390 calls

Running short_testit_baseline.
Min:      234 ticks, Max:      243 ticks, Range:    9 ticks
Mean:     238 ticks, Variance: 20 ticks^2
Batch Size: 4      , Call Count: 5224032 calls
☻Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!
Press any key to continue ...
Title: Timings for hutch--'s core 2, finishing next program update after the game :)
Post by: nixeagle on May 10, 2012, 12:39:48 AM
Since hutch-- was nice enough to post timing results, I thought I'd run them through my MMA program. We have the following:

128 iterations32 iterations4 iterations
div9.07 +/- 0 ticks8.72 +/- 0.35 ticks6.75 +/- 3.16 ticks
cdiv2.94 +/- 0.32 ticks2.28 +/- 0.25 ticks1.25 +/- 2.24 ticks

I see some curious results for div when iterating 4 times only, but the error in that sample is more than enough to render the whole measurement meaningless. My next program update will address this case as I'll be having it strive to get the lowest possible error. Not that we are always able to do so; but for those times we can't, the important thing is that we know there is an error. Just taking the mean is not enough here.

The same deal goes with the measurements for cdiv when looping 4 times. But here notice that the error in the measurement is larger than the value with a relative deviation of 174%!

Finally after the ballgame I'll try to finish up what I'll call v0.5 of the timing/benchmark program. Major thanks to everyone for taking the time to test this and answer my questions. :U

P.S. Hutch--, does that have speedstep or something else causing clock variation? Just curious :bdg.



Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: hutch-- on May 10, 2012, 01:47:53 AM
The dev box I use is a Core2 quad running at 3 gig and it has been far more reliable in timings than the last generation of PIVs that I used. My old 2.8 gig Northwood was very consistent in terms of timing but the later 3.8 gig Prescott PIV which had the extra 1 meg cache wandered a lot more with its timings. On the PIVs I ran Win2000 with the hyperthreading turned off in the BIOS as Win2000 did not properly support it. The Core2 does not support it either but the later i7 I have that I rarely ever turn on does. It is also very consistent in terms of timings.

When I time an algo I write a test piece that is as close as I can get to its real world task then bash it to death in ring3 to get its averages. You occasionally get a quirky low reading but what I am after is the average running in ring3 while the OS is running and performing the normal task switching.
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: nixeagle on May 10, 2012, 02:41:39 AM
Quote from: hutch-- on May 10, 2012, 01:47:53 AM
The dev box I use is a Core2 quad running at 3 gig and it has been far more reliable in timings than the last generation of PIVs that I used. My old 2.8 gig Northwood was very consistent in terms of timing but the later 3.8 gig Prescott PIV which had the extra 1 meg cache wandered a lot more with its timings. On the PIVs I ran Win2000 with the hyperthreading turned off in the BIOS as Win2000 did not properly support it. The Core2 does not support it either but the later i7 I have that I rarely ever turn on does. It is also very consistent in terms of timings.

When I time an algo I write a test piece that is as close as I can get to its real world task then bash it to death in ring3 to get its averages. You occasionally get a quirky low reading but what I am after is the average running in ring3 while the OS is running and performing the normal task switching.

These timings are basically micro-benchmarks. We are measuring in terms of ticks here. Once I finish up this program that handles the "optimal" situation, I'd like to move on to measuring algorithm performance when the cache is not dedicated to the algorithm under test. From there I'll work on adding more complex things in. Right now I'm working on getting the most consistent results possible. To do that, I have to identify confounding sources of error in the sample and attempt to remove them. I'm aiming to do computer science here :wink:. With what I have now, I think I've done a decent job.

I'm a little tired, the ballgame is not even over. So I think I'll finish this up tomorrow. However I have quite some progress from my last update. Follows is the code output. Please suggest possible improvements to how the output looks. How would you want it to look?


Spinning up the processor.
long_testit_div: 10.0798 +/- 0.0000 ticks
long_testit_cdiv: 2.0876 +/- 0.1004 ticks
testit_div: 10.7298 +/- 0.0000 ticks
testit_cdiv: 1.4485 +/- 0.0442 ticks
short_testit_div: 10.8500 +/- 0.4546 ticks
short_testit_cdiv: 2.5000 +/- 1.0782 ticks


I'd release tonight, but I believe there is a small mathematical error in how sampling is averaged and I'd like to correct that before releasing what I'll call v0.5. :dance:


Edit: Follows are the timings when my laptop is under battery power. Added just for variety. :U


long_testit_div: 25.4694 +/- 0.0371 ticks
long_testit_cdiv: 5.1557 +/- 0.1759 ticks
testit_div: 25.8500 +/- 0.0000 ticks
testit_cdiv: 4.6625 +/- 0.2624 ticks
short_testit_div: 24.0000 +/- 0.2500 ticks
short_testit_cdiv: 4.0000 +/- 0.2500 ticks
Title: Update to v0.5. Please run and test!
Post by: nixeagle on May 11, 2012, 12:30:28 AM
Alright, finally got this to the point I'm happy with calling this "0.5". Everything mentioned in earlier posts of this thread is in here except for the error minimization code. I need to re-do that and have it work off of the relative deviation rather than just a "range". As is, it causes too much slowdown for what it is worth.

My results on my i7:

Spinning up the processor.
long_testit_div: 10.0148 +/- 1.3194 ticks
long_testit_cdiv: 1.9891 +/- 0.2029 ticks
testit_div: 9.8562 +/- 2.1695 ticks
testit_cdiv: 1.4321 +/- 0.0765 ticks
short_testit_div: 9.5420 +/- 0.2500 ticks
short_testit_cdiv: 1.7339 +/- 0.4330 ticks
Press any key to continue ...


Yey for not having to manually calculate this! :dance:

Again, I'd really appreciate some posts showing your results and suggestions on how to improve the program output. :bg
Title: Re: Figuring out a statistically reliable baseline for out of order processors.
Post by: dedndave on May 11, 2012, 12:35:00 AM
prescott w/htt - xp mce2005 sp3
long_testit_div: 32.0050 +/- 0.3598 ticks
long_testit_cdiv: 4.2274 +/- 0.3646 ticks
testit_div: 31.0509 +/- 1.3254 ticks
testit_cdiv: 4.2843 +/- 0.7849 ticks
short_testit_div: 33.0279 +/- 0.9013 ticks
short_testit_cdiv: 3.5929 +/- 0.5000 ticks


i like the way you versioned it as "0.5" - showing off your new floating point skills, i see - lol
Title: Re: Pretty result tables, Thanks to everyone who took measurements!
Post by: FORTRANS on May 11, 2012, 12:39:03 PM
Quote from: nixeagle on May 09, 2012, 07:51:18 PM
FORTRANS' P-MMX
128 iterations32 iterations4 iterations
div41.91 +/- 0 ticks42.56 +/- 0 ticks37.75 +/- 0 ticks
cdiv11.00 +/- 0 ticks11.00 +/- 0 ticks11.00 +/- 0 ticks

Here no matter how many times you loop, the throughput is the same.

Hi,

   Makes sense probably.  Pentium 1 was not out of order, does
not do branch prediction, and has (mostly?) hard-wired execution,
as opposed to convering to micro-ops or using micro coding.

   And here are some new results.

P-III

Spinning up the processor.
long_testit_div: 38.0156 +/- 0.0000 ticks
long_testit_cdiv: 3.6406 +/- 0.0000 ticks
testit_div: 38.0625 +/- 0.0000 ticks
testit_cdiv: 4.0937 +/- 0.0000 ticks
short_testit_div: 43.0000 +/- 0.0000 ticks
short_testit_cdiv: 9.5000 +/- 0.0000 ticks
Press any key to continue ...

P-MMX

Spinning up the processor.
long_testit_div: 41.9062 +/- 0.0000 ticks
long_testit_cdiv: 11.0000 +/- 0.0000 ticks
testit_div: 41.6250 +/- 0.0000 ticks
testit_cdiv: 11.0000 +/- 0.0000 ticks
short_testit_div: 37.7500 +/- 0.0000 ticks
short_testit_cdiv: 11.0000 +/- 0.0000 ticks
Press any key to continue ...

Mobile Celeron

Spinning up the processor.
long_testit_div: 37.9921 +/- 0.0000 ticks
long_testit_cdiv: 3.7578 +/- 0.0000 ticks
testit_div: 37.5000 +/- 0.0000 ticks
testit_cdiv: 3.4687 +/- 0.0000 ticks
short_testit_div: 38.0000 +/- 0.0000 ticks
short_testit_cdiv: 5.7500 +/- 0.0000 ticks
Press any key to continue ...

Regards,

Steve N.