Code timing macros

Started by MichaelW, February 16, 2005, 03:21:52 AM


Webring

Great macros MichaelW, they've helped me to optimize what I thought was already a fast byte conversion alg and make it truly fast. :dance: And Citric, I'll take a look at your code when I get home.

Ratch

MichaelW,
     I have two reservations about your timing macros.

1)  To quote INTEL, the people who should know:
Quote: When using the CPUID instruction, however, the programmer must also take into account the cycles it takes for the
instruction to complete, and subtract this from the recorded number of cycles. A strange quirk of the CPUID
instruction is that it can take longer to complete the first couple of times it is called. Thus, the best policy is to call the
instruction three times, measure the elapsed time on the third call, then subtract this measurement from all future
measurements.
http://www.math.uwaterloo.ca/~jamuir/rdtscpm1.pdf

     I do not see your macro calling the CPUID/RDTSC three times.

2)  I believe you loop through the test code X number of times, accumulate the times/counts, and divide by the loop count for an average.  Sounds logical.  My method is to loop the test code X number of times, select the shortest time, and discard the longer times.  I believe this method gives a lower bound on what the machine and algo are capable of doing.  It automatically discards "unlucky" combinations of time slices, interrupts, or other time/count stealers over which the user has little or only indirect control.  Since I switched to this method, my results have been remarkably consistent compared to the average method.  Try it, you will probably like it.  Ratch

MichaelW

#32
Ratch,

Thanks for the information on the CPUID instruction. As I read it I would simply need to call CPUID a few times before I set up the empty loop. The example code calls CPUID and RDTSC three times, but I see nothing in the text about needing to "warm-up" RDTSC. Also, the code contains an obvious error in that it fails to control the CPUID function number, which on my P3 has a relatively large effect on the timing.
EAX = 0, 79 cycles
EAX = 1, 81 cycles
EAX = 2, 105 cycles
EAX >= 3, 129 cycles

I originally considered using only the lowest count, but I decided against it because I could not, and do not, see any benefit in doing so. I understand the principle well enough, but what could be the benefit of knowing how the code performs under optimum execution conditions when it will not typically be used under optimum execution conditions. A code sequence that is optimal for optimum execution conditions may not be optimal for average execution conditions. But I could be wrong, and if I am I hope that someone will enlighten me.

In any event, consistency seems to me to be the most important aspect here, and with high loop counts, I get very consistent results for most code, even when using HIGH_PRIORITY_CLASS instead of REALTIME_PRIORITY_CLASS, and even with a bunch of other crap running on the system. For example, for the timings above, running the code below on my P3-500 with 38 processes showing in Task Manager, including a firewall and a live Internet connection, I ran the tests 10+ times and the results were identical each time.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
    include timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT EQU 10000000

    FOR func,<0,1,2,3,4,5>
      counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
        mov  eax, func
        cpuid
      counter_end
      print ustr$(eax)," cycles",13,10
    ENDM

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


EDIT:

As a test I added 3 calls to CPUID as described above, and there was no significant change in the timings (the EAX = 2 cycles was 104 instead of 105 on most runs).

eschew obfuscation

Petroizki

Note that the cpuid instruction changes the ebx register when it is called with eax = 0.

The counter_begin and counter_end macros change the ebx register without preserving it.

MichaelW

Hi petroizki,

Yes, because of the CPUID calls EBX is being changed, and this could be a problem for user code that depends on EBX being preserved. The change in EBX is not specific to CPUID function 0 -- all of the standard and extended functions (that I have seen documented) change EBX. The only reasonable solution that I can see would be to preserve EBX at the start of counter_begin and restore it at the end of counter_end. This is one of the changes that I am planning on making just as soon as I get a chance to focus on something other than work.

eschew obfuscation

SoKaR

29 cycles
29 cycles
192 ms
192 ms

Centrino 1.5 GHz Dothan
XP Home, SP2

Codewarp

Quote
Posted on: June 23, 2005, 11:09:35 pm
Posted by: MichaelW

I originally considered using only the lowest count, but I decided against it because I could not, and do not, see any benefit in doing so. I understand the principle well enough, but what could be the benefit of knowing how the code performs under optimum execution conditions when it will not typically be used under optimum execution conditions. A code sequence that is optimal for optimum execution conditions may not be optimal for average execution conditions. But I could be wrong, and if I am I hope that someone will enlighten me.

In any event, consistency seems to me to be the most important aspect here, and with high loop counts, I get very consistent results for most code, even when using HIGH_PRIORITY_CLASS instead of REALTIME_PRIORITY_CLASS, and even with a bunch of other crap running on the system. For example, for the timings above, running the code below on my P3-500 with 38 processes showing in Task Manager, including a firewall and a live Internet connection, I ran the tests 10+ times and the results were identical each time.

I have been disturbed by these comments ever since I read them a while back, but it took me some time to become clear about why.  Let me take a stab at it and see where it goes...

(1)  Taking the smallest clock count

There is always a model to these things, whether explicitly stated or not.  The model here is peak-performance--the smallest clock count obtainable, usually found and reported after many reruns.  It's simple and easy to reproduce, to evaluate, and to understand.  But now, you have arbitrarily decided that you are not going to report the lowest clock count, based on some real-world computing notion you have.

Now, on the surface of it, that sounds really great!  The trouble is, you have not offered any basis for a different model.  I would love to have another model to evaluate cycle counts against--then we could show both metrics together, getting ever closer to the truth we seek...  However, I suggest that until you can clearly and unambiguously define this "real world computing" notion and connect it to what you are actually doing, you must admit that your method is to insert noise into the results of a peak-performance model, making fuzzy results that are harder to interpret and compare.  The fact that your results are "very consistent" is good, but not sufficient to overcome this deficiency.

(2)  Messing with thread priority for "benchmarking" purposes

I have been successfully benchmarking at the clock cycle level for many years and I have never found it necessary to raise thread priority in order to measure performance.  Many applications will run differently than "normal", often skewing the timing results--so much for real-world benchmarking...  Once a benchmark is rerun many times and the outliers peeled off (as your methods do quite nicely), you can pretty much run at any priority--so why not use the level it will ultimately be deployed with?  Actually, I routinely run my software at slightly below :eek normal priority, unless I really need a different specific level.  Everything runs just as fast, but not even hardcore computation and infinite loops at this priority will ever bog down the machine for anything else running.  It leaves other programmers scratching their heads... :lol and my benchmarks are stable as well. :dance:  Messing with thread priority for "benchmarking" purposes is a bad idea rooted in lack of experience and superstition. :tdown


l_d_allan

<alert comment="masm32 newbie">

Would it be appropriate to have timers.asm (or something equivalent) be part of the standard distribution ... rather than something you have to track down to get test code to work?

Quote  ; These macros require a .586 or higher processor directive.

Is there code that works adequately with a .486 processor? .386 processor? This could perhaps be something like the Windows hi-res QueryPerformanceCounter.

Could a "production release version" detect the processor and use the applicable code?

</alert>

MichaelW

Actually, although I had not previously noticed, the statement regarding the .586 or higher processor directive is not correct. The counter macros do require .586 or higher because of the CPUID and RDTSC instructions, but the timer macros, assuming _EMMS is not defined, should work OK with .486, or even .386 combined with .387.

Detecting the processor/processor features is no problem, but I know of no method of counting clock cycles that does not depend on RDTSC.


eschew obfuscation

jdoe


Hi,

I have downloaded the timing macros to do some tests, and it seems they don't work anymore with MASM32 v9.

The error comes from ustr$ (macros.asm) using crt__ultoa instead of dwtoa.
Simply commenting out crt__ultoa and uncommenting dwtoa makes it work again.


MichaelW

Hi jdoe,

Thanks for pointing this out. I knew about the change in macros.asm, but I had not thought to update the test app. The problem is in the includes, and adding these statements will correct it:

include \masm32\include\msvcrt.inc
includelib \masm32\lib\msvcrt.lib


eschew obfuscation

johnsa

Hey all,

I was just reading an article on Intel's website regarding the inaccuracies of using RDTSC due to SpeedStep and other energy-saving mechanisms which alter the CPU frequency. Apparently they recommend ETimer, which uses the Windows QueryPerformanceCounter and QueryPerformanceFrequency functions; these apparently won't use RDTSC if energy saving is present on the machine, but will fall back to another timer such as the power-management or BIOS timer.

Any thoughts on this? (Personally I'd much prefer to stick to rdtsc)  :wink

Ossa

As far as I understand it, RDTSC counts CPU cycles, which is what we're interested in - we don't care about the accurate "real" time. Therefore, when SpeedStep changes the frequency, the only errors from RDTSC occur when the switch happens mid-way through a timing. That is why taking a series of identical measurements is important.

The actual timings from RDTSC when a processor is at low or high speed are identical (I have a laptop that does 800 MHz at "low speed" and 2200 MHz at "full speed" - they come out with identical timings - this was very hard to find out empirically, though).

However, to avoid measuring during the change, I often add "spin-up" code at the start (like a loop doing many adds/mults). But as I said before, the best idea is to take multiple timings and use the mean (or better yet, the mode).

Ossa
Website (very old): ossa.the-wot.co.uk

johnsa

I agree that the key information we want is the cycle-count, but having the accurate ms value is useful too. I'm sure there are scenarios where you might have a higher cycle count but fewer ms depending on parallel execution etc etc.. so some simple code may use more cycles but run in fewer ms... or am I smoking my socks?  :bg

Ratch

Ossa,

     I have found it extremely difficult to get timing results which I am confident are correct.  So much so that I have stopped doing timings.  It doesn't seem to make much difference anyway.  I just try to use basic techniques like avoiding extensive use of LOOP, REP MOVSB, etc., and arranging instructions so that a register has time to "settle" before being used again.  Other "tricks" are found in optimization manuals.  Although sometimes I can get consistency, oftentimes just a minor thing like changing the position of a block of code, rebooting, or even using one different instruction makes a large timing change that is unexplainable.  I believe that is because of cache effects and the "mix" of other background programs.  I think that the best timing result is the fastest time you can get.  Anything slower is caused by factors other than a slow instruction sequence or algorithm.  So to reiterate, if you want to see what your best instruction sequence is, use the fastest single time.  If you want to see how well your code gets along with the cache and its neighbor programs, use the mean, median, or mode averages.  Ratch