News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Measuring execution time of procedure

Started by Gunnar Vestergaard, October 18, 2009, 05:41:22 PM

Previous topic - Next topic

Gunnar Vestergaard

Let us say that I implement an algorithm. And there may be more than one way to implement it, where speed is the primary concern. I can imagine that I will often find the need for measuring the execution time of such two different implementations.

For instance, one could implement the algorithm as simply as possible and then implement the same algorithm once more, instead utilizing the CPU pipeline and cache better.

So, for finding the execution time for two implementations, does the Windows API provide any function that simply measures time down to the CPU cycle level? Or even better, does the Pentium have a register that counts CPU cycles?

dedndave

#1
hi Gunnar
measuring clock cycles is our favourite pass-time - lol
MichaelW has written a set of timing macros that most of us use
they may be found in the first post of the first thread in the Laboratory Forum
timers is a good one for most uses - unzip it and put it in the masm32\macros folder
if you look through the rest of that forum, you will find many example programs that use these macros

here is a simple recent example.....
http://www.masm32.com/board/index.php?action=dlattach;topic=12368.0;id=6694

BogdanOntanu

Windows has QueryPerformaceCounter and QueryPerformanceFrequency API that can be used for this.
Additionally in ASM you can use the RDTSC instruction (Read Time Stamp Counter) for this purpose beacuse it is allowed for applications from user mode (ring-3 mode).

The macros that MichaelW has created for this are very good and you should start there.

If you go deeper than this ... in kernel mode and if you write a kernel mode driver... then you can have access to additional counters and performance specific registers of the CPU. Here you can get additional info about the pipeline,cache and predictions.

However the problem with performance counting under a multitasking OS is that other threads or kernel / driver dispatch routines will interfere with your measurement. In order to measure in a  reliable way you must average a huge amount of reads and "hope" to eliminate the preemptive multitasking effect (other things running in the background) .

Alternatively you can try the measurement under a non multitasking OS like DOS or my SOLAR OS (where you have full control to disable "other" things)  or write your own minimal OS or a testbed bootable application and embed your test routines in there. This way you can eliminate the errors... kind of complicated ... but it should give you "pure" results.

Oh and one last thing: beware of synthetic tests.

Small clean little tests can show some results but in real code situations the results can be reversed. I have noticed this a lot when doing performance testing in large applications. The bottle neck moves around and the results are contradictory when compared to small simple "synthetic" test routines.

Hence you must measure in the real life context of the application in order to obtain valid results. This suggests that the most important optimization you can make is/are algorithmic and simplification of code rather than "low level code tricks" that look too well when tested outside of the context.

Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

Gunnar Vestergaard

By using RDTSC, I can confirm what is mentioned in the MASM32 optimization section: Using the
LOOP Label

is slow, but using

DEC CX
JNZ Label

is much faster.

hutch--

Gunnar,

You conclusion is correct, the old LOOP instruction is very slow. RE timing algorithms, I echo what Bogdan has said here, design your test to be as close as possible to the circumstances that it runs under and test it with a big enough sample to get the variations down under 1%. In application level ring3 you always have other factors that interfere with very accurate timing so it is nearly impossible to get much under 1%.

ring0 level access is where the action is in terms of high precision timing but the level of hassle is rarely ever worth it as the software you run is run in ring3.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Mirno

Word of warning - using RDTSC may lead to strange results when using dual processors, or multi-core with independant clock throttling.

RDTSC keeps a count of the number of clock ticks per core, but if one CPU is throttled (due to low load), then their counts may get out of sync. Starting on core a, and finishing on core b, the returns of RDTSC not in sync leads to strange results.

Mirno

dedndave

Mirno is right - if you look at the example i linked, it selects core 0 prior to running the timing tests
it is easy to do - and all machines have a core 0 - no need to see if it is multicore first
although real-world conditions are the best way to go,
you can get an idea of which instructions are eating time in simpler tests
it can take a considerable amount of time to devise real-world simulations for every little instruction
if you are shuffling data or searching etc, then a real-world like test is the only way to get meaningful results

Gunnar Vestergaard

When using the CPU cycle counting macros from the first topic of The Laboratory, I found that some register values are not preserved. If I remember correctly, the source of these macros doesn't mention at all what registers are preserved and what registers are changed. It should. I am pretty sure that the CPUID instruction changes the EAX, ECX and EDX registers. Is that correct?

jj2007

esi, edi, ebp and esp are being preserved, the rest are not.

MichaelW

Gunnar,

CPUID changes EAX, EBX, ECX, and EDX. The second set of macros (counter2.asm) includes instructions to preserve EBX around the CPUID instruction, and the first set (timers.asm) does not.
eschew obfuscation

dedndave

for some reason, i get "very odd" numbers from counters2
i get merely "odd" numbers from timers - lol
i have modified mine to save EBX