Figuring out a statistically reliable baseline for out-of-order processors.

Started by nixeagle, May 01, 2012, 02:02:34 AM


nixeagle

Quote from: dedndave on May 04, 2012, 08:16:52 PM
that's 1 second   :eek
probably no need to go that far

if you can get it down to where the sleep period is at least as long as the test period, that works pretty well
if you let it run free, it consumes 50 % of CPU time
get it below ~25 %, and you'll be ok for short periods
if you want to get thousands of samples, which takes a while, try to get below ~17 %
that would be sleep period = 2 x test period

you can use the results of the last pass to determine the Sleep parameter   :U
(provided you know the CPU frequency)

QueryPerformanceFrequency should work on most systems

Mind playing with the sleep parameter on your CPU? I really think it can be done without one, for the reasons explained in my last post. But if you can figure out some reasonable values that give stable results... I'd be thrilled. Try maybe 15 instead of 1000? The line to change is 159. Remember, the test period isn't always knowable ahead of time.

I won't be back on for 5 or 6 hours, see you all later! :bg

dedndave

ok - i'll play with it a little in a couple hours
have some things to finish up

btw - you can leave your priority class set high if you use Sleep   :P
you only hog as much as you decide you want to - lol

nixeagle

Quote from: dedndave on May 04, 2012, 08:26:20 PM
ok - i'll play with it a little in a couple hours
have some things to finish up

btw - you can leave your priority class set high if you use Sleep   :P
you only hog as much as you decide you want to - lol

Awesome! Btw, what version of Windows are you using? Windows 7 won't allow you to set a program to realtime, so it just gets set to high priority. I ought to look up the define for that ::).

I'm tired and going to bed, but here's a quick upload. The updated program does not sleep at all, as I got tired of waiting on it ::). The sleep is just commented out, though. Honestly, give it a try without any sleep at all. It takes way less time than the one you tried last time, taking only 25ish samples instead of trying to take 1000. Remember, I had it taking lots of samples in order to convince myself of its stability :wink. Edit: dedndave, if you do try it without sleep, don't forget to turn gu_default_sample_batch_size down from 16 to something like 4. That way it won't have to downsample and burn CPU time doing so.

The new update also computes variance. This is computed by looping over all the test results, subtracting the mean from each one, squaring each difference, and summing them all together. This sum is then divided by the total number of elements in the result set (25 for us). Variance summarizes in one number how scattered the results are. For example, say we have the results:
{50,50,50,55,50,45,53,50,50,42}
Our mean is 49.5, but I'm not using floating point numbers right now, so round down to 49. Now go through the list and subtract 49 from each number:
{1,1,1,6,1,-4,4,1,1,-7}.
Thus we now have a list of differences from the mean value. Square them:
{1,1,1,36,1,16,16,1,1,49},
and sum the list, giving 123. Finally, divide this by the total number of elements in the list, in our case 10, giving 12.3. Again I round down, as I have not implemented this using floats yet, so the program will emit 12. My Core i7 regularly gets a variance of 0 with this program.

Once I get this done using floating point numbers... I'll compute the standard deviation... which is just the square root of the variance. :bdg

Please keep on testing this! :dance:

P.S. I'm all ears for any tips from folks on formatting/improving the actual code. Don't be afraid to suggest improvements to my style, etc! :bg

dedndave

sorry - i never got back to this
my other thing is taking longer than expected   :eek

prescott w/htt - XP media center edition 2005 SP3
525, 525, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495,
495, 495, 495, 495, 495, 495, 495, 495, 495,
Min:      495 ticks
Max:      525 ticks
Range:    30 ticks
Mean:     497 ticks
Variance: 66 ticks^2


you can probably get to realtime if you play with the permissions
generally, there is no need to go beyond HIGH_PRIORITY_CLASS
and, if you do, you better have all your ducks in a row

dedndave

this ensures that you are starting with a fresh time-slice each time
it should yield more consistent results (may vary from one platform to another)
; warmup
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    ;rdtsc
    ; now go for real!
    INVOKE  Sleep,0                    ;stick this in there
    xor eax, eax
    cpuid
    rdtsc
   
    push eax
    push edx
   
    call [gu_testfunction]

    xor eax, eax
    cpuid
    rdtsc
    push eax
    push edx
    call print_runtime

    ret

nixeagle

Quote from: dedndave on May 05, 2012, 04:47:39 AM
this ensures that you are starting with a fresh time-slice each time
it should yield more consistent results (may vary from one platform to another)
; warmup
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    ;rdtsc
    ; now go for real!
    INVOKE  Sleep,0                    ;stick this in there
    xor eax, eax
    cpuid
    rdtsc
   
    push eax
    push edx
   
    call [gu_testfunction]

    xor eax, eax
    cpuid
    rdtsc
    push eax
    push edx
    call print_runtime

    ret


No offense, but that is likely the worst place to stick it! :eek The whole purpose of that string of CPUIDs immediately before calling gu_testfunction is to warm up the CPUID instruction so it takes a constant amount of time on each invocation.

From what I understand, a Sleep 0 call causes a context switch back to the operating system, which then checks the scheduler to see if any other tasks require CPU time before handing control back to you. There are multiple problems with this from a stability standpoint:
  • The other application can very well cause the CPU to spin back down.
  • The other application will cause the caches to no longer be "hot".
  • Context switch overhead is likely to do the same to the caches, assuming there is a context switch, and I can't imagine how there would not be one.

If you really want a sleep call, I believe the place I have it commented out, around line 159 or thereabouts, is the best location possible. Since Sleep potentially gives you a new timeslice, you want the chance to spin the CPU back up through the spinup call. Following that, we execute up to gu_sample_retry_attempts * g_steps attempts to get a stable reading. This is the portion of the program where we want full control of what is in the hot CPU cache, the decoders, the ROB, and so forth.

Also dedndave, your baseline measurements are awesome! Seriously, compare that to what you were showing when we started this thread. A variance of 66 ticks^2 means you have a standard deviation of about 8 ticks, with very few outliers. I think I can make a few more improvements to drop that closer to 0 and decrease the total time required to get a stable result, but even if I can't, this is a very good result!

Next, assuming nobody finds any huge variances with this program, I'll start setting up the interface to the test driver. That means coming up with a protocol where the user writes one baseline function (see testit_baseline for an example) and up to N functions to test against that baseline. Additionally, I'll be adding a few more statistical measurements to assist with assessing the quality of the output. Finally, when all is said and done, the finished program will emit only the statistics of the run, one line for each function under test.

To folks posting test results, please continue to do so! I'm also interested to know how long the program takes to produce a set of results.

dedndave

ok - so stick the Sleep,0 in before the warmup   :bg
timeslices are very short - although, i can't nail that down with any documentation
i am thinking on the order of 500 cycles or less

i think the intel document suggested only 3 CPUID's for a warmup

dedndave

ok - timeslices appear to be a bit longer than i thought
i measure about 4300 cycles for prescott w/htt, XP MCE2005 SP3

nixeagle

Quote from: dedndave on May 05, 2012, 04:59:55 PM
ok - so stick the Sleep,0 in before the warmup   :bg
timeslices are very short - although, i can't nail that down with any documentation
i am thinking on the order of 500 cycles or less

i think the intel document suggested only 3 CPUID's for a warmup
Intel suggests 3, but more gives a lower variance. Feel free to tweak, though.

Quote from: dedndave on May 05, 2012, 05:16:40 PM
ok - timeslices appear to be a bit longer than i thought
i measure about 4300 cycles for prescott w/htt, XP MCE2005 SP3
How are you measuring? I suspect you get more than that, as 4300 cycles on a 3.0 GHz processor amounts to only about 1433 ns, or 1.43 microseconds. I suspect the operating system gives you more time than that, since the cost of a context switch alone tends to be around 500 cycles or so (IIRC). I suspect you get about 10 microseconds or more in reality. Maybe I'm wrong. :eek

dedndave

context switching happens very fast
i measured it by timing Sleep,0
i figure the average number of cycles Sleep,0 consumes is roughly half a time slice
it probably consumes a little more for the overhead
i made the measurement with a HIGH_PRIORITY_CLASS setting
i suppose i could go to REALTIME_PRIORITY_CLASS to verify it   :P
that would only reduce the result, though - not increase it

nixeagle

Quote from: dedndave on May 05, 2012, 08:02:03 PM
context switching happens very fast
i measured it by timing Sleep,0
i figure the average number of cycles Sleep,0 consumes is roughly half a time slice
it probably consumes a little more for the overhead
i made the measurement with a HIGH_PRIORITY_CLASS setting
i suppose i could go to REALTIME_PRIORITY_CLASS to verify it   :P
that would only reduce the result, though - not increase it

:eek. I'm afraid that if Sleep 0's cost is close to half a timeslice, the operating system is consuming close to 33% of the CPU resources. That simply can't be right. Think about it: if Sleep 0 takes 2000 cycles (a little less than half of 4300 cycles) out of every timeslice, the CPU would constantly be running at about 33% utilization, all just for the operating system to context switch, check the scheduler, etc.

To get a proper empirical measurement, you would need to loop over RDTSC repeatedly, storing the results (both EDX and EAX) in a buffer over a period of 2 to 5 seconds. After that, go through the collected data and look for large "jumps" in the recorded time. Those represent actual context switches.

If you like, I can implement a program that does this. :bg

Edit: Actually, a run of 0.5 to 1 second would be enough. I suspect the operating system allows programs at least 10 microseconds' worth of runtime per slice. That yields an optimistic 100,000 possible timeslices per second, assuming zero context switch overhead. If you factor in an overhead of 2 microseconds between timeslices, you get about 83,333 timeslices per second. Even then, note that the overhead imposed by the operating system is 20% of the useful runtime (2 microseconds for every 10-microsecond slice).

Edit 2: If you assume context switches are faster, say 1 microsecond each, then you have about 90,909 possible timeslices per second (assuming nobody yields early), with a best-case operating system overhead of 10% of the useful runtime. You don't see your CPU running at a constant 10%, do you? This is why I suspect 10 microseconds is the lower bound on possible timeslice size.

dedndave

remember - Sleep,0 relinquishes the remainder of the current timeslice
assuming that your code is not synchronized with context switching....
on average, you are throwing out half a time slice   :U


nixeagle

Quote from: dedndave on May 05, 2012, 08:18:29 PM
remember - Sleep,0 relinquishes the remainder of the current timeslice
assuming that your code is not synchronized with context switching....
on average, you are throwing out half a time slice   :U



Hmm, my problem is that the math is not working out. See my edit in the prior post. Could you toss up the program you used to do the measurements? I'd like to have a looksee.  :bg

dedndave

here you go
i'd be interested to see how other platforms handle it - OS/CPU

dedndave

Quote from: nixeagle on May 05, 2012, 08:25:52 PMHmm, my problem is the math is not working out.

as Dr. Emmett Brown would say, "You're not thinking four-dimensionally!"

timeslices could easily be shorter than my measurement
but, by nature of the Sleep function, they can't be much longer