Figuring out a statistically reliable baseline for out of order processors.

Started by nixeagle, May 01, 2012, 02:02:34 AM


nixeagle

Alright, the team we were rooting for lost :(. Anyway, I've started converting my C++ code to MASM so I can post it here :). I've figured out how to print numbers and write functions :bg. I'm used to NASM or GAS syntax, so figuring out MASM's way of doing things has been interesting ;).

Some answers to all the posts!

To Neo: Right, there is no event loop. Aside from that, awesome link and graphs. :clap:

To MichaelW: I'm not measuring processor speed directly; that is, the frequency of the CPU is not terribly important beyond trying to keep it as consistent as possible. I'm measuring ticks against the RDTSC counter, which is a monotonic timer on Sandy Bridge. I know the CPU can execute the loop faster if you jiggle the mouse. Tomorrow, after I finish converting my tallying code to MASM-style syntax, I'll investigate further and try to isolate a cause.

Also, earlier jj2007 mentioned QPC. I looked into that and, to my understanding, it is a timer that gives microsecond accuracy. Interesting, but for the time being I'm going to stick with RDTSC.

Hopefully tomorrow I'll have a first draft of a MASM program that runs test code and computes the various statistics of the test run. At this point I'm just writing the MASM program for fun :bg. Start of summer break and all.




jj2007

Quote from: dedndave on May 02, 2012, 03:33:15 AM
well - they had mentioned that timing code "must" be done in a console app because of mouse movement
and i think it is quite possible to do it in a GUI app - if you take the proper precautions

I think one option might be to put the call right here:

.Repeat
    call ZeTimings
    invoke GetMessage, addr msg, wmNull, wmNull, wmNull
    .Break .if !eax
    invoke TranslateAccelerator, hwnd, haccl, addr msg
    .if !eax
        invoke TranslateMessage, addr msg ; msg not an accelerator
        invoke DispatchMessage, addr msg
    .endif
.Until 0


At least it can't reach WinProc that way ::)

dedndave

that's a thought, too   :U

another way might be to create the GUI window in another thread - with message loop
then, you can control the thread priority levels separately
i think it's a good idea to keep the timing code in the root thread

hutch--

 :bg

Set the process priority high enough and the mouse movement will not matter, nothing else will probably run on that core either if you use the critical option.

nixeagle

Quote from: hutch-- on May 02, 2012, 07:24:12 AM
:bg

Set the process priority high enough and the mouse movement will not matter, nothing else will probably run on that core either if you use the critical option.

Why did I not think of process priorities! :red I have not tested yet, but I suspect this is the solution to my mouse jitter problem! :clap: At least it makes logical sense to me: user input likely gets a higher priority than the default for applications. Now just to test.

To dedndave: I'm not doing anything related to GUI; the timescales I'm operating at are such that I suspect an event loop will cause even more problems than it solves. [1] Right now I'm focusing on identifying and removing as many outliers as I can to ensure that each run through the test function takes the same amount of time. This means that the standard deviation of the whole sample should be very small. To summarize, in theory [2], if all factors are accounted for, the standard deviation of test samples should be 0.

Thanks everyone for listening and offering suggestions. Every time I check back, I find a pile of things to reply to! Which is awesome! :bg Off to poke at MASM assembly some more! :bdg




  • 1: I might have to add an event loop sometime later this summer, but right now I'm trying to keep this following the KISS principles.
  • 2: Alas rarely do theory and practice match! But the idea is if it is possible to ensure cache and processor states are the same every time through the loop, the result timing should be consistent. Failing that, I want to measure how much deviation from "perfection" there is in a test run.

dedndave

no matter what type of program you are running, there are hundreds of message loops running at the same time
every button, menu, control, or window has one

the fact that they are in your process is meaningless if you use threads and process/thread priority levels, as well as core affinity
for very brief test runs, you can essentially take over the CPU
if you look at Michael's macros (as suggested), you will see that he allows you to control process priority

the Sleep function also has more uses than meet the eye   :U
if you alter the process priority class, thread priority level, or process/thread core affinity...
a Sleep,0 call will expire the current time slice to ensure that the change(s) will take place in the next
of course, it can also be used to let the CPU up for some air - so the system gets some core time
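
As a rough illustration of the priority, affinity and Sleep,0 sequence described above, here is a minimal sketch in Python using ctypes (Python rather than MASM purely for brevity). The helper names affinity_mask and take_over_core are made up for this example; the kernel32 functions and constant values are the documented ones. On anything other than Windows the function simply returns False.

```python
import ctypes
import sys

HIGH_PRIORITY_CLASS = 0x00000080   # documented value for SetPriorityClass
THREAD_PRIORITY_HIGHEST = 2        # documented value for SetThreadPriority


def affinity_mask(core_index):
    """Bit mask that selects a single logical core (hypothetical helper)."""
    return 1 << core_index


def take_over_core(core_index=0):
    """Raise priority, pin the current thread to one core, then yield once.

    Returns True only if every call succeeded; False on non-Windows systems.
    """
    if sys.platform != "win32":
        return False
    from ctypes import wintypes

    k32 = ctypes.WinDLL("kernel32", use_last_error=True)
    k32.GetCurrentProcess.restype = wintypes.HANDLE
    k32.GetCurrentThread.restype = wintypes.HANDLE
    k32.SetPriorityClass.argtypes = [wintypes.HANDLE, wintypes.DWORD]
    k32.SetPriorityClass.restype = wintypes.BOOL
    k32.SetThreadPriority.argtypes = [wintypes.HANDLE, ctypes.c_int]
    k32.SetThreadPriority.restype = wintypes.BOOL
    k32.SetThreadAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
    k32.SetThreadAffinityMask.restype = ctypes.c_size_t  # previous mask, 0 on failure
    k32.Sleep.argtypes = [wintypes.DWORD]
    k32.Sleep.restype = None

    ok = bool(k32.SetPriorityClass(k32.GetCurrentProcess(), HIGH_PRIORITY_CLASS))
    ok = bool(k32.SetThreadPriority(k32.GetCurrentThread(), THREAD_PRIORITY_HIGHEST)) and ok
    ok = k32.SetThreadAffinityMask(k32.GetCurrentThread(), affinity_mask(core_index)) != 0 and ok
    k32.Sleep(0)  # give up the rest of the time slice so the changes apply to the next one
    return ok
```

The timing loop would then run right after take_over_core() returns, and the priority would be put back to normal once the measurements are done.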

nixeagle

Quote from: dedndave on May 02, 2012, 05:19:03 PM
no matter what type of program you are running, there are hundreds of message loops running at the same time
every button, menu, control, or window has one

the fact that they are in your process is meaningless if you use threads and process/thread priority levels, as well as core affinity
for very brief test runs, you can essentially take over the CPU
if you look at Michael's macros (as suggested), you will see that he allows you to control process priority
Yea, I've looked at those, but missed out on the priority thing. Makes sense now that it was pointed out to me! :red.

Quote from: dedndave on May 02, 2012, 05:19:03 PM
the Sleep function also has more uses than meet the eye   :U
if you alter the process priority class, thread priority level, or process/thread core affinity...
a Sleep,0 call will expire the current time slice to ensure that the change(s) will take place in the next
of course, it can also be used to let the CPU up for some air - so the system gets some core time
Interesting. Thank you! :U

dedndave

here is some reading you'll enjoy....

About Processes and Threads
http://msdn.microsoft.com/en-us/library/windows/desktop/ms681917%28v=vs.85%29.aspx

Using Processes and Threads
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686937%28v=vs.85%29.aspx

Process and Thread Functions
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684847%28v=vs.85%29.aspx

these are the related functions that i seem to use most
it is important to note that thread priority level is, let's call it, "a modification of the process priority class"
i.e., it is added to or subtracted from the process priority class to obtain the actual thread priority

GetProcessAffinityMask
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683213(v=vs.85).aspx

SetProcessAffinityMask
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686223(v=vs.85).aspx

SetThreadAffinityMask
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686247%28v=vs.85%29.aspx

i forget what function is used to get thread affinity   :P
it is initially the same as the process that created it

GetPriorityClass
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683211(v=vs.85).aspx

GetThreadPriority
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683235(v=vs.85).aspx

SetPriorityClass
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686219(v=vs.85).aspx

SetThreadPriority
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277(v=vs.85).aspx

Sleep
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686298%28v=vs.85%29.aspx
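
The remark above about thread priority level being added to or subtracted from the process priority class can be made concrete with a small sketch. The numbers follow the table in Microsoft's "Scheduling Priorities" documentation; the function name base_priority and the string keys are only for illustration, and the extra levels that exist only in the realtime class are left out.

```python
# Base priority of a thread at THREAD_PRIORITY_NORMAL, per priority class.
CLASS_BASE = {
    "IDLE": 4,
    "BELOW_NORMAL": 6,
    "NORMAL": 8,
    "ABOVE_NORMAL": 10,
    "HIGH": 13,
    "REALTIME": 24,
}

THREAD_PRIORITY_IDLE = -15
THREAD_PRIORITY_TIME_CRITICAL = 15


def base_priority(priority_class, thread_level):
    """Combine a process priority class with a thread priority level.

    thread_level is -2..+2 (LOWEST..HIGHEST), or one of the two special
    values above, which saturate instead of being added.
    """
    realtime = priority_class == "REALTIME"
    if thread_level == THREAD_PRIORITY_IDLE:
        return 16 if realtime else 1
    if thread_level == THREAD_PRIORITY_TIME_CRITICAL:
        return 31 if realtime else 15
    if not -2 <= thread_level <= 2:
        raise ValueError("level not covered by this sketch")
    return CLASS_BASE[priority_class] + thread_level
```

For example, a THREAD_PRIORITY_HIGHEST (+2) thread in a HIGH_PRIORITY_CLASS process ends up at 13 + 2 = 15, the same level a time-critical thread gets in any non-realtime class.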

nixeagle

Quote from: dedndave on May 02, 2012, 07:01:21 PM
here is some reading you'll enjoy....


Thanks.

A quick update, I managed to use up 6GB and entered into swap death trying to plot and compute stats on 10 million runs :boohoo:. So at this point I need to just code this stuff into the program and possibly find another program to produce plots :lol.

Also attached is my current source, just for the heck of it :P. Granted, I'm nowhere near done with this: things like ignoring context switches, setting affinity, and not printing every run's number to standard out are not done yet. Please excuse my bad MASM :red.
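
One way around the memory blow-up when computing stats on millions of runs is to tally how often each tick count occurs instead of keeping every sample. The sketch below is in Python for illustration only (it is not the attached program), and tally_stats is a made-up name. Memory use grows with the number of distinct tick values, not with the number of runs.

```python
from collections import Counter
import math


def tally_stats(samples):
    """Summary statistics from one pass over an iterable of tick counts."""
    counts = Counter()
    for ticks in samples:
        counts[ticks] += 1
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no samples")
    items = sorted(counts.items())  # (value, occurrences), ascending by value

    def kth(k):
        # Value at 0-based position k of the (virtual) sorted sample list.
        seen = 0
        for value, count in items:
            seen += count
            if k < seen:
                return value

    mean = sum(v * c for v, c in items) / n
    # Sample variance (divides by n - 1).
    var = sum(c * (v - mean) ** 2 for v, c in items) / (n - 1) if n > 1 else 0.0
    top = max(counts.values())
    return {
        "length": n,
        "min": items[0][0],
        "max": items[-1][0],
        "mean": mean,
        "stdev": math.sqrt(var),
        "median": (kth((n - 1) // 2) + kth(n // 2)) / 2,
        "mode": min(v for v, c in items if c == top),  # smallest value on ties
    }
```

Because the function only needs an iterable, the samples can be fed straight from a file or a pipe, one number at a time.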

jj2007

You have an innovative approach to allocation here :bg

g_tallyBucketSize equ 100000
align 16
if 01
    dd 2*g_tallyBucketSize dup(0) ; 2800 ms
else
    REPEAT 2*g_tallyBucketSize ; 7000 ms
        dd 0
    ENDM
endif


Normally we use dup(0), or dup(?) in the .data? section. Masm (even recent versions) has a bug: Assembly slows down dramatically if you come near the 100000 dup(?) range. Your macro emulates dup, but unfortunately it's not faster than dup - 7 seconds instead of 2.8. For comparison: The 100000 dup(0) takes only half a second in JWasm...

nixeagle

Quote from: jj2007 on May 02, 2012, 11:21:18 PM
You have an innovative approach to allocation here :bg

g_tallyBucketSize equ 100000
align 16
if 01
    dd 2*g_tallyBucketSize dup(0) ; 2800 ms
else
    REPEAT 2*g_tallyBucketSize ; 7000 ms
        dd 0
    ENDM
endif


Normally we use dup(0), or dup(?) in the .data? section. Masm (even recent versions) has a bug: Assembly slows down dramatically if you come near the 100000 dup(?) range. Your macro emulates dup, but unfortunately it's not faster than dup - 7 seconds instead of 2.8. For comparison: The 100000 dup(0) takes only half a second in JWasm...

:lol, now I know! I was borrowing from what I usually do in NASM ::). I'll modify the program to do it the correct way after this post. I'm actually starting to get some decent stabilization! Not perfect by any means, but the updated program's output can be used to produce this graph.



Standard Deviation->95.3503
Mean->170.673
GeometricMean->153.233
HarmonicMean->140.363
Median->107
Mode->{107}
MeanDeviation->72.2174
Min->94
Max->27250
Length->1002001
TrimmedMean->162.845
MeanDeviation->72.2174
MedianDeviation->4.
QuartileDeviation->61.5
InterquartileRange->123


Once I figure out how to meaningfully exclude the initial jitter in these tests I'll move on to having the program itself compute the various statistics of the run. This run is not the greatest due to the wildly varying data points at the start of the run.

Edit:


Whoops! I forgot to attach the program for the last set of data up there. I don't have the exact code for that anymore, sorry. What I do have is some updated data and code :bg.



Standard Deviation->89.3507
Mean->106.188
GeometricMean->105.984
HarmonicMean->105.947
Median->107
Mode->{107}
MeanDeviation->1.68561
Min->91
Max->88343
Length->1002001
TrimmedMean->105.909
MeanDeviation->1.68561
MedianDeviation->0.
QuartileDeviation->1.5
InterquartileRange->3.


Unfortunately the standard deviation is telling us that we still have outliers in our dataset not accounted for. Look at our max and we can see that these are context switches.

I'm still not sure of the best general way to remove those. By this I mean I'd like to avoid hard-coding a cutoff number. Better would be to infer the cutoff from the test data so that it adjusts automatically. A crude way might be to remove anything outside of Mean +/- StandardDeviation. However, I'm not sure how statistically sound that is. My friend is over for the ballgame again, so I'll be heading out for now. :)
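
On the Mean +/- StandardDeviation idea: the catch is that the outliers themselves inflate both the mean and the standard deviation (a standard deviation of 89 next to a median deviation of 0 in the numbers above), so the cutoff ends up depending on the very points it should remove. A commonly used alternative, swapped in here as a suggestion and not something from the attached code, is to build the cutoff from the quartiles (Tukey's fences). A minimal Python sketch, with tukey_filter as a made-up name and k=1.5 as the conventional multiplier:

```python
import statistics


def tukey_filter(samples, k=1.5):
    """Keep samples inside [Q1 - k*IQR, Q3 + k*IQR]; return (kept, (low, high))."""
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [s for s in samples if low <= s <= high]
    return kept, (low, high)


# Made-up timings for illustration: mostly 104..107 ticks plus one context switch.
timings = [104, 104, 105, 105, 106, 106, 107, 107, 107, 107, 88343]
kept, fences = tukey_filter(timings)
```

One caveat: when the interquartile range is 0, the fences collapse to a single value and everything else is rejected, which may be stricter than intended.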

MichaelW

I think you should be able to eliminate most or all of the initial "jitter" by delaying for 3 to 5 seconds after the app loads before you start testing, to allow the system activities involved in launching an application to subside.

nixeagle

Whoo, the team we were rooting for won! Plus, I think I've thought of a nearly perfect way to ensure a stable baseline. Some quick coding later, the results bear it out. :8)

Let us start off with the "traditional" plot:



The cool thing here is we can identify all of the effects. That uptick at the start is the core warming up. I think I have a way to correctly discount that effect: just require that timings be stable for 100 sample runs before starting to print numbers to standard output. The stats bear out the improvement:
  • Standard Deviation -> 15.3634
  • Mean -> 94.0829
  • GeometricMean -> 93.2859
  • HarmonicMean -> 92.7632
  • Median -> 91
  • Mode -> {91}
  • MeanDeviation -> 5.83704
  • Min -> 86
  • Max -> 256
  • Length -> 1001
  • TrimmedMean -> 91.1387
  • MeanDeviation -> 5.83704
  • MedianDeviation -> 0.
  • QuartileDeviation -> 0.
  • InterquartileRange -> 0.

My method of achieving this boils down to collecting "batches" of sample results. We run the code under test n times (I chose n=12) and store the results in a buffer. The only way we allow the results to "count" is if all the results in the buffer are the same. If they are not, we discard all of the data for that batch and retry. This allows us to cleanly discard CPU throttling and context switches.
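
The batch idea can be sketched in a few lines of Python (for illustration only; the real thing is the attached MASM). The names collect_stable, measure, wanted and max_batches are made up, and recording one value per accepted batch is an assumption about how the results are counted.

```python
def collect_stable(measure, n=12, wanted=1000, max_batches=1_000_000):
    """Collect timings only from batches in which all n readings agree.

    measure is any zero-argument callable returning one timing (for example
    an RDTSC delta). A batch with even one differing reading is discarded.
    """
    accepted = []
    for _ in range(max_batches):
        batch = [measure() for _ in range(n)]
        if all(reading == batch[0] for reading in batch):
            accepted.append(batch[0])  # assumption: one result per good batch
            if len(accepted) >= wanted:
                break
        # otherwise: throw the whole batch away and try again
    return accepted
```

The max_batches cap is there so that a machine that never produces n identical readings in a row makes the loop end with an empty result instead of spinning forever.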

Attached is the code, please comment! :dance: After this I'm going to get the program to compute the pretty statistics itself instead of having Mathematica do that. That way just running the program gives you all the info you need to judge confidence in the results.

P.S. this attachment has a working program. Feel free to run it and tell me your results. If it outputs the same number repeatedly, all is good. If it does not I'd really like to know. :bg

P.P.S. MichaelW, you posted while I was typing this up; we both came to the same conclusion :U.

P.P.P.S dedndave: I'm really curious to see if your "quirky" CPU even prints numbers at all. :bdg The only way for a number to print is for the CPU to take the exact same amount of time 12 times in a row.

dedndave

no output - and i can hear the gears grinding   :lol

you may have a good basic idea, though
take 12 measurements and allow 2 or 3 of them to be tossed out

nixeagle

Quote from: dedndave on May 03, 2012, 05:11:46 AM
no output - and i can hear the gears grinding   :lol

you may have a good basic idea, though
take 12 measurements and allow 2 or 3 of them to be tossed out

Thanks for trying, that was what I was afraid of. If you stick around for another 15 minutes or so... I think I have a fix so this will work on all CPUs. Basically if after 1000 iterations no good results occur, step n down to 11 and repeat until results start happening.  :bg

edit: Make that 30 minutes, silly x86 and its lack of registers :(.
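
A rough Python sketch of the step-down idea, for illustration only. The names collect_adaptive, patience and min_n are made up, and counting consecutive failed batches (resetting the count whenever a batch succeeds) is one possible reading of "after 1000 iterations no good results occur".

```python
def collect_adaptive(measure, start_n=12, min_n=2, patience=1000, wanted=100):
    """Batch collection that relaxes the batch size n when nothing is accepted.

    Returns (accepted_timings, final_n). Raises RuntimeError if even batches
    of min_n readings never agree within `patience` attempts.
    """
    n = start_n
    failures = 0
    accepted = []
    while len(accepted) < wanted:
        batch = [measure() for _ in range(n)]
        if len(set(batch)) == 1:        # all n readings identical
            accepted.append(batch[0])
            failures = 0
        else:
            failures += 1
            if failures >= patience:    # no good batch for a long time
                if n <= min_n:
                    raise RuntimeError("timings never stabilised")
                n -= 1                  # e.g. 12 -> 11, then retry
                failures = 0
    return accepted, n
```

Returning the final n alongside the timings lets the caller report how strict the acceptance test ended up being on that particular CPU.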