I've been exploring the title topic and thought a lab report might be of mutual benefit. Michael has put together several macros, timers.asm and counter2.asm, that provide a starting point and a syntax to build upon. The essential idea is to use the RDTSC instruction to read the time-stamp counter before and after execution of a block of code and infer something from the difference. One problem I've encountered with this instruction is that, on my P4, the values returned are always multiples of 4, reducing the precision of a single measurement to 4 clocks.
Serialization, ensuring that all prior instructions have completed before the clock is read, is a related issue. Convention says to insert a CPUID instruction before each RDTSC. There's a 1997 Intel document, rdtscpm1.pdf, that can be found on the web. For more recent discussion:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30226599.aspx
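For reference, a minimal serialized read looks something like the sketch below (illustrative only; T1_lo/T1_hi are local dwords, and note that CPUID trashes EAX, EBX, ECX, and EDX, so anything live in those registers must be saved first):

    xor eax,eax        ; CPUID function 0
    cpuid              ; serialize: wait for all prior instructions to complete
    rdtsc              ; time-stamp counter -> EDX:EAX
    mov T1_lo,eax      ; save the low dword
    mov T1_hi,edx      ; high dword rarely matters at these time scales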
Whether or not I use CPUID inside the timing loops, results are unchanged on either a P1 or a P4 CPU. On the P4, the instruction burns 450 cycles, a not inconsiderable overhead, so my preference is to comment the code out but leave it in place for insertion should questions arise. The 'naked' RDTSCs all follow JNZs, and this may circumvent out-of-sequence RDTSC executions.
Multitasking interruptions have been suggested as one cause of erratic results. My own experience has been that, running at the lowest priority, erratic results are encountered only for high loop counts where the estimated time is of order milliseconds, commensurate with guesstimates of time-slice quanta. It's often the case that clock differences of 10K or more can be reproduced to the cycle. To maximize sensitivity to interruptions, I prefer running at the lowest priority (IDLE); if results reproduce there, interruptions aren't an issue.
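For anyone reproducing these runs, dropping to idle priority is a single standard Win32 call (ordinary API usage, not part of the timing macros):

    invoke GetCurrentProcess                          ; pseudo-handle returned in EAX
    invoke SetPriorityClass, eax, IDLE_PRIORITY_CLASS ; lowest priority class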
If results are to be meaningful, as well as reproducible, they should be independent of:
1. algorithm
2. loop repetitions over a significant range
3. warmup repetitions beyond some minimum count for cache stabilization
4. process priorities when not time-slice limited
To date I've compared 3 algorithms (a sketch of the first follows the descriptions below):
1. Mean Clocks
a. Read clock - T1
b. Execute N empty reference loops and read clock - T2
c. Read clock - T3
d. Execute N loops with embedded code and read clock - T4
e. Calculate [(T4-T3)-(T2-T1)]/N
2. Minimum Clocks (quad-loops)
a. Repeat N times:
1. Read clock - T1(i)
2. Execute 4 empty reference loops and read clock - T2(i)
3. If T2(i)-T1(i) < Trmin then Trmin=T2(i)-T1(i)
b. Repeat N times:
1. Read clock - T3(i)
2. Execute 4 loops with embedded code and read clock - T4(i)
3. If T4(i)-T3(i) < Tcmin then Tcmin=T4(i)-T3(i)
c. Calculate (Tcmin - Trmin)/4
3. Adjunct Mean Clocks (quad-loops)
a. Repeat N times:
1. Read clock - T1(i)
2. Execute 4 empty reference loops and read clock - T2(i)
3. Execute 4 loops with embedded code and read clock - T3(i)
4. Tadj = sum[ T3(i) - 2*T2(i) + T1(i)]
b. Calculate Tadj/(4*N)
Because of the 4-clock resolution of the RDTSC instruction, the latter two algorithms were coded with a 4-fold subloop to achieve 1-clock precision.
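As a sketch of algorithm 1 (illustrative only, not the actual clock_counters.asm macro; T1..T4 are local dwords, and serialization and the XCHG business discussed below are omitted):

    rdtsc                  ; T1
    mov T1,eax             ; low dwords suffice at these time scales
    mov ecx,N
@@: push ecx               ; empty reference loop
    pop ecx
    sub ecx,1
    jnz @B
    rdtsc                  ; T2
    mov T2,eax
    rdtsc                  ; T3
    mov T3,eax
    mov ecx,N
@@: push ecx               ; same loop with the code under test embedded
    ....
    pop ecx
    sub ecx,1
    jnz @B
    rdtsc                  ; T4
    mov T4,eax             ; result = [(T4-T3)-(T2-T1)]/N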
On the first go-round, the latter two algorithms returned agreeably similar results, while the Mean Clocks algorithm invariably returned smaller, sometimes even negative, values. When this algorithm was examined more closely by looking at raw cycle counts, it was obvious that an empty loop required more clock cycles than a loop containing one or two simple instructions, e.g.
looptest.asm
Iterations       1     10    100    1000    10000   100000   1000000

Empty loop
  ref count    104    156   1296   12008   119108  1210316  11912140
  code count    88    156   1296   12008   119108  1190588  12052108

NOPS 4
  ref count     88    156   1296   12008   119108  1190720  11932788
  code count    92    144    480    4080    96504   984108   9648700

MOV EAX,0
  ref count     88    156   1296   12008   119300  1191276  11932836
  code count   100    128    916    8616    86524   863724   8643656

XCHG ECX,ECX
XCHG EDX,EDX
  ref count     88    156   1296   12008   119412  1209256  11939708
  code count    92    168    676    6192    60192   600104   6000088
From the 1M counts, we infer that the empty loop takes 12 clocks, and this drops to 6 clocks when two XCHGs are inserted! The loop itself is simply:
@@: push ecx        ; preserve the loop counter
    ....            ; code under test (absent in the reference loop)
    pop ecx
    sub ecx,1
    jnz @B
with @@: aligned on a 16-byte boundary. Loop latency should be about 6 clocks, so the empty loop's 12-clock tally is definitely the problem. The branch predictor table for the jump seems the most likely culprit, but it's hard to envision a predictor algorithm that hasn't learned, after several thousand passes, that the jump is always taken.
The XCHG lines do seem to resolve the issue but, as far as I've been able to find out, this is an undocumented P4 feature. (XCHG does assert a LOCK when memory access is involved, but that's not the case here.)
To use this 'feature', I've added a pair of XCHGs to the reference and code loops of all three algorithms. The choice of registers doesn't appear to matter, and the same register can be used for both XCHGs (though XCHG EAX,EAX, which encodes as NOP, shouldn't be used). Within my limits of reproducibility, only the MEAN results are affected.
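Concretely, both timed loops now read (register choice illustrative, per the remarks above):

@@: push ecx
    xchg edx,edx           ; 'do nothing' pair that cures the 12-clock anomaly
    xchg edx,edx
    ....                   ; code under test (absent in the reference loop)
    pop ecx
    sub ecx,1
    jnz @B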
As a first example, I picked the 9 simple instructions Michael used for counter2 tests.
MWTests.exe
MEAN CLOCKS:                     MINIMUM CLOCKS:                  ADJUNCT MEAN CLOCKS:
100 iterations, 100 skips        10 iterations, 10 skips          10 iterations, 10 skips

 0 cycles, empty                  0 cycles, empty                  0 cycles, empty
 0 cycles, mov eax,1              0 cycles, mov eax,1              0 cycles, mov eax,1
 0 cycles, mov eax,1 mov eax,2    2 cycles, mov eax,1 mov eax,2    2 cycles, mov eax,1 mov eax,2
 0 cycles, nops 4                 1 cycles, nops 4                 1 cycles, nops 4
10 cycles, mul ecx                6 cycles, mul ecx                6 cycles, mul ecx
 0 cycles, rol ecx,32             0 cycles, rol ecx,32             0 cycles, rol ecx,32
14 cycles, rcr ecx,31            11 cycles, rcr ecx,31            11 cycles, rcr ecx,31
28 cycles, div ecx               25 cycles, div ecx               24 cycles, div ecx
42 cycles, StrLen                34 cycles, StrLen                41 cycles, StrLen
For longer code snippets, I've compared two Park-Miller random number routines from the past:
ParkMiller.exe
Mean:         100 iterations, 100 skips
Minimum:       25 iterations, 25 skips
Adjunct Mean:  25 iterations, 25 skips
Mean: 86 cycles, Park-Miller Low Bits
Minimum: 88 cycles, Park-Miller Low Bits
Adjunct Mean: 88 cycles, Park-Miller Low Bits
Mean: 106 cycles, Park-Miller High Bits
Minimum: 95 cycles, Park-Miller High Bits
Adjunct Mean: 95 cycles, Park-Miller High Bits
The only discrepancy is the Mean result for High Bits. What's surprising is that the Low-Bits coding involves a MUL-DIV-DIV sequence, whereas the High-Bits code is MUL-DIV-MUL, which might be supposed the more efficient (and is, on a P1).
As a final demo, we'll take four Park-Miller modifications discussed in this forum recently:
RNGs.exe
MEAN CLOCKS:                  MINIMUM CLOCKS:               ADJUNCT MEAN CLOCKS:
40000 iterations, 400 skips   10000 iterations, 10 skips    10000 iterations, 10 skips

64 cycles, Michael            49 cycles, Michael            65 cycles, Michael
68 cycles, Abel Jcc           50 cycles, Abel Jcc           67 cycles, Abel Jcc
65 cycles, Lingo              66 cycles, Lingo              66 cycles, Lingo
54 cycles, Abel CMOVcc        55 cycles, Abel CMOVcc        57 cycles, Abel CMOVcc
All are clear improvements over prior codings, but the differences are too close to call. Two of the MINIMUM readings seem on the low side and might warrant a closer look, but any choice would be better based on something besides clock counting.
The attached file contains the current macros (clock_counters.asm) for the masm32/macros folder, plus ASM/EXE files to illustrate usage and reproduce the results above. CPU comparisons are of obvious interest. Most work here has been with a 2.0 GHz P4, but results with a 0.12 GHz P1 were yet more self-consistent.
Abel
Abel,
I have this wariness of anything that tries for an interpreted timing and prefer real-time testing based on relative performance between algorithms. I am much of the view that if something does not perform in real time, then it does not perform at all. This approach means going after a sample large enough, or looped enough times, and with some understanding of the intent of the algorithm: whether, for example, it's aimed at a short takeoff with short data to process, or alternatively will be pointed at very large data.
The more accurately the test matches the intent of the algo, the more useful its results will be in terms of real-world performance.
Hutch,
I've gone over your comments several times and still can't figure out whether we have a substantive difference. The question I'm asking is to what degree we can understand the macroscopic, real world from analysis of the microscopic. For the former, I'd restrict experiments to time scales where the operating system isn't a major factor: milliseconds by my estimate, or of order 1M clocks. If the clock count per loop remains the same when the number of loops changes from 1,000,000 to 100, I'll argue we've gotten somewhere. My personal experience is that a lower limit of 20 loops is generally achievable, but that's a parameter for the tool/algorithm of choice, and user-verifiable. I did propose several criteria for meaningful results, independence of loop counts, warmups, and process priorities, but focused on the algorithms. To the best of my knowledge, the results presented match these criteria.
The MEAN algorithm makes single measurements of 1M clocks or more if asked. The other two make a series of 400+ clock measurements totaling 1M clocks. Consistency among them tests the limitations of short-time data. Some may have spotted the raw count of 6,000,088 in one of the 1M looptest results. That's not a fluke: repeats were usually 6,000,xxx, and of the xxx, about 100 are out-of-loop overhead. It doesn't always work out so nicely, though.
One of my incentives is figuring out for myself how much of the mind-numbing Pentium optimization lore is historical baggage and what remains relevant for a Pentium 4 (Family 15, Model 2); and thence the question of how much the whole is greater or less than the sum of its parts.
Regards,
Abel
My rule of thumb is to never time code in isolation, and never to impose factors that won't be true during typical usage of the algorithm.
That is, time it only within a major subset of THE real-world program it is meant for, working on the data it is MEANT to process. No mock trials. Mock trials yield mock results.
There are exceptions where I would time code under mock conditions. These are cases with little chance of there being a meaningful difference between a mock trial and the real thing, such as very long tight loops which only work on small sets of data.
Two points for consideration:
1. How accurately can one clock executing code?
2. What use can be made of it?
The previous macros for the three algorithms were modified to furnish decimal results and to normalize for the macro repeat count REPS. (A number of loops are clocked, either individually or collectively, and each loop may contain a repeat macro, so the total number of code executions is LOOPS*REPS.) High REPS are especially useful for short sequences because they create a better-defined environment, but care is advisable in distinguishing latencies from throughputs, and the L1 cache size (8K for the P4) sets an upper limit on REPS.
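In MASM terms, the REPS-fold repetition amounts to expanding the snippet inline with a REPT block, something like this (illustrative, not the macros' exact text):

@@: push ecx
    REPT REPS              ; assembles the test code REPS times in a row
      add eax,1            ; <- instruction under test
    ENDM
    pop ecx
    sub ecx,1
    jnz @B

With that in place, the settings for the run below were: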
LOOPS EQU 100
SKIPS EQU 5
REPS EQU 1000
PRIORITY EQU IDLE_PRIORITY_CLASS
Mean clock counts:                    Minimum clock counts:                 Adjunct mean clock counts:
 0.000 cycles, no code                 0.000 cycles, no code                -0.00 cycles, no code
 0.339 cycles, nop                     0.342 cycles, nop                     0.345 cycles, nop
 1.393 cycles, nops 4                  1.370 cycles, nops 4                  1.370 cycles, nops 4
 6.992 cycles, bswap eax               6.992 cycles, bswap eax               6.989 cycles, bswap eax
 0.339 cycles, mov eax, 1              0.344 cycles, mov eax, 1              0.343 cycles, mov eax, 1
 0.339 cycles, movzx eax,cl            0.341 cycles, movzx eax,cl            0.340 cycles, movzx eax,cl
 0.497 cycles, add eax, 1              0.492 cycles, add eax, 1              0.490 cycles, add eax, 1
 0.680 cycles, cmp eax,7fffffffh       0.678 cycles, cmp eax,7fffffffh       0.678 cycles, cmp eax,7fffffffh
 6.986 cycles, adc eax, 1              6.985 cycles, adc eax, 1              6.983 cycles, adc eax, 1
 3.993 cycles, ror eax, 8              3.991 cycles, ror eax, 8              3.988 cycles, ror eax, 8
18.98  cycles, rcr eax, 8             18.98  cycles, rcr eax, 8             19.06  cycles, rcr eax, 8
13.99  cycles, mul ebx                13.99  cycles, mul ebx                14.03  cycles, mul ebx
48.99  cycles, xor edx,edx : div ebx  48.99  cycles, xor edx,edx : div ebx  48.98  cycles, xor edx,edx : div ebx
Presuming that many instructions take an integer number of clock cycles, the results above indicate accuracy into the 2nd decimal place. One instruction shows a half-cycle value, but several come in at 0.34 clocks and its multiples, 0.68 and 1.37. Agner Fog's tables assign these instructions 0.25-clock throughputs. But:
"The trace cache can submit only three uops per clock cycle to the queue. This sets a limit to the execution speed if all
uops are of the type that can execute in alu0 and alu1. The throughput of four uops per clock cycle can thus only be obtained
if uops have been queued during a preceding period of lower throughput (due to slow instructions or cache misses). My
measurements show that a throughput of four uops per clock cycle can be obtained for a maximum of 11 consecutive clock cycles
if the queue has been filled during a preceding period of lower throughput."
Agner Fog, "How to Optimize for the Pentium Family of Microprocessors" (http://www.agner.org/)
In other words, the measured 0.34 clocks is 1/3 clock within experimental error: what's being seen is the trace cache's three-uops-per-clock limit, not the 0.25-clock ALU throughput of the tables. While there is something to be gained from analyzing individual instructions, Agner has already assembled a quite thorough compilation. It's more interesting to see how pipelining, out-of-order execution, and other CPU exotica facilitate or encumber one's own efforts.
Consider the textbook add reg,reg classic:
LOOPS EQU 100
SKIPS EQU 4
REPS EQU 500
PRIORITY EQU IDLE_PRIORITY_CLASS
                  Mean    Min     AdMean
add eax,ebx       0.487   0.486   0.480
add eax,ecx       0.987   0.984   0.980
add eax,edx       1.488   1.498   1.490

add eax,ecx       0.486   0.486   0.480
add ebx,edx       0.679   0.676   0.676
add eax,ebx       1.017   1.024   1.018
The basic instruction has a 0.5-cycle latency, and the first triad shows what's expected for a dependent sequence. The second triad shows the oft-described 3-for-2 optimization. The second ADD does take 1/6 cycle to decide on the alternative ALU, but this appears to be recovered in the third. Nice to see that what's in the books still applies, but is a 1/2-cycle gain worth a rewrite?
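Spelled out, the two triads differ only in their dependency structure:

    add eax,ebx            ; triad 1: each ADD waits on the EAX result of
    add eax,ecx            ;   its predecessor, a pure dependence chain at
    add eax,edx            ;   0.5 cycles apiece

    add eax,ecx            ; triad 2: the next ADD is independent of this one
    add ebx,edx            ;   and goes to the alternate ALU (+1/6 cycle);
    add eax,ebx            ;   the final ADD needs both results and recovers it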
Recently, there has been some perplexity wrt clocking random number generator codings. The purpose of the RNG code is to read in a seed and spew out a new one, with pseudo-randoms being derived from each seed. For illustration, let's examine rand32 and look at two cases: one with the -in and -out seeds at a common memory address, and one where they differ. (The latter isn't especially practical, as it always yields the same results.)
LOOPS EQU 100
SKIPS EQU 4
REPS EQU 100
PRIORITY EQU IDLE_PRIORITY_CLASS
seed addresses:                       in <> out                      in == out
                                Mean     Min     AdMean        Mean     Min     AdMean
rand32 proc base:DWORD          7.947    7.870    7.860        7.952    7.870    7.860
    mov eax, 16807              7.938    7.870    7.860        7.947    7.870    7.862
    mul seed_in                17.937   17.880   17.870       17.946   17.880   17.870
    mov ecx,7fffffffh          17.946   17.880   17.870       17.951   17.880   17.870
    add edx,edx                17.945   17.890   17.880       17.946   17.890   17.880
    cmp eax,ecx                19.947   19.890   19.880       19.946   19.890   19.881
    jna @F                     19.992   19.940   23.035?      19.982   19.940   19.947
    sub eax,ecx                19.994   19.940   19.946       19.982   19.940   19.950
@@: add eax,edx                19.992   19.940   19.947       19.995   19.940   19.948
    jns @F                     23.947   19.950   19.948       23.947   23.940   23.930
    sub eax,ecx                23.947   23.940   19.940       23.950   23.940   23.929
@@: mov seed_out,eax           23.947   23.940   19.940       39.246   39.250   40.989
    xor edx,edx                23.947   19.950   19.939       39.209   39.210   40.978
    div dword ptr[esp+4]       49.099   43.600   44.141       74.705   71.350   74.300
    mov eax,edx                47.055   45.490   45.498       74.314   70.450   74.174
    ret 4
rand32 endp
The experiment begins by commenting out all the instructions except the final ret. We start with 8 cycles from an invoke (a push and a call) plus the ret. De-commenting the mov eax,16807 adds nothing; it falls under the shadow of the call. Exposing the mul adds 10 cycles, with several more instructions slipping in behind it. But when we get to the bottom line, one choice gives 45 cycles, the other 75. The differences first appear at the mov seed_out,eax instruction, which adds nothing in one case and 15 cycles in the other, in addition to discombobulating the later div with its stack reference. Bear in mind that it isn't the mov instruction itself that takes 15 clocks; it has increased the count for all code up to and including itself.
Insight! The procedure is trying to read seed_in before the previous call has completed writing seed_out! Let's use EBX instead of a memory variable to test this hypothesis. Not! Northwood, we have a problem!
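The failing pattern, schematically (the base argument of 1000 is purely illustrative): back-to-back calls form a continued dependence chain through the seed, whether it lives in memory or in a register:

    invoke rand32, 1000    ; this call ends by producing the new seed
    invoke rand32, 1000    ; the next call's MUL needs that seed and must wait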
"On processors with out-of-order execution, it is likely that the second calculation will start before the first calculation is
finished. In some situations, the processor may start several iterations until the maximum throughput of the execution units is
reached. The out-of-order capabilities cannot be utilized, however, in the case where each calculation depends on the result of the
previous one. Such a continued dependence chain is the worst situation you can have, and you should definitely try to find a way to
break down the dependence chain." - A. F., op. cit.
The rand32 code above was an alternative suggestion I had made for a procedure of Michael's. It turns out both suffer from this cyclic dependency, while the alternative suggestions by Lingo and the CMOVcc variants do not.
Agner's experimental results are based on his file TSCTestB32.asm. His algorithm clocks an empty loop a number of times and picks the minimum as overhead. It then clocks a code block NUMTESTS times, subtracts the overhead, and saves each result in an array for display, analysis, or whatever. I've taken the liberty of splitting his procedure into two MASM32 macros with a few compatibility tweaks, which I believe conforms to the relevant GNU General Public License. My original zip has been updated, and I've added these macros with a demo for Michael's coding that shows a reproducible 532 clocks per 10 repetitions when the seed is reset every 10 repetitions, and garbage ranging from 50 to 100+ clocks otherwise. Feel free to report a useful solution. Another CPU perhaps?
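Schematically, the measurement kernel of those macros does something like this (a paraphrase, not Agner's or my exact code; 'overhead', 'starttime', and 'results' are illustrative names, and the code block under test is assumed to preserve ESI and EDI):

    mov edi,NUMTESTS
    mov esi,OFFSET results ; dword array, NUMTESTS entries
@@: xor eax,eax
    cpuid                  ; serialize before the first read
    rdtsc
    mov starttime,eax      ; low dword is enough per test
    ; ---- code block under test ----
    xor eax,eax
    cpuid                  ; serialize again before the second read
    rdtsc
    sub eax,starttime      ; gross count for this pass
    sub eax,overhead       ; minimum empty-block count found beforehand
    mov [esi],eax          ; save for display or analysis
    add esi,4
    sub edi,1
    jnz @B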
Caveat: Results reported were obtained using a Pentium 4 CPU, 2.00 GHz, Family 15, Model 2. At times one sees questionable values that aren't reproducible; they are best ignored. One man's noise is another's Nobel prize.
Abel