
Code Clocking Algorithms

Started by Abel, February 24, 2007, 11:04:22 AM


Abel

I've been exploring the title topic and thought a lab report might be of mutual benefit.  Michael has put together several macros (timers.asm and counter2.asm) that provide a starting point and a syntax to build upon.  The essential idea is to use the RDTSC instruction to read the time-stamp counter before and after execution of a block of code and infer something from the difference.  One problem I've encountered with this instruction is that, on my P4, values returned are always multiples of 4, reducing precision to 4 clocks for a single measurement.

Serialization, ensuring that all prior instructions have completed before the clock is read, is a related issue.  Convention says to insert a CPUID instruction before each RDTSC.  There's a 1997 Intel document, rdtscpm1.pdf, that can be found on the web.  For more recent discussion:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30226599.aspx
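
For reference, the serialized read boils down to the following pattern (a minimal sketch in MASM32 syntax; time_lo and time_hi are assumed DWORD variables, not names from the macros):

    xor   eax, eax        ; select CPUID leaf 0 - any leaf serializes
    cpuid                 ; serializing instruction (clobbers EAX,EBX,ECX,EDX)
    rdtsc                 ; EDX:EAX = time-stamp counter
    mov   time_lo, eax
    mov   time_hi, edx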

Whether or not I use CPUID inside timing loops, results are unchanged with either a P1 or P4 CPU.  With the P4, the instruction burns 450 cycles, a not inconsiderable overhead, so my preference is to comment out the code but keep it available for insertion should questions arise.  The 'naked' RDTSCs all follow JNZs, and this may circumvent out-of-sequence RDTSC executions.

Multitasking interruptions have been suggested as one cause of erratic results.  My own experience has been that, running at the lowest priority, erratic results are only encountered at high loop counts where the estimated time is of order milliseconds, commensurate with guesstimates for time-slice quantums.  It's often the case that clock differences of 10K or more can be reproduced to the cycle.  To maximize sensitivity to interruptions, I prefer running at the lowest priority (IDLE); if results remain reproducible there, interruptions aren't an issue.
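
Setting that priority is a one-liner against the Win32 API (a sketch in MASM32 invoke syntax):

    invoke GetCurrentProcess                           ; pseudo-handle in EAX
    invoke SetPriorityClass, eax, IDLE_PRIORITY_CLASS  ; yield to everything else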

If results are to be meaningful, as well as reproducible, they should be independent of:
1.  algorithm
2.  loop repetitions over a significant range
3.  warmup repetitions beyond some minimum count for cache stabilization
4.  process priorities when not time-slice limited

To date I've compared 3 algorithms:
1.  Mean Clocks
    a. Read clock - T1
    b. Execute N empty reference loops and read clock - T2
    c. Read clock - T3
    d. Execute N loops with embedded code and read clock - T4
    e. Calculate [(T4-T3)-(T2-T1)]/N

2.  Minimum Clocks (quad-loops)
   a. Repeat N times:
      1.  Read clock - T1(i)
      2.  Execute 4 empty reference loops and read clock - T2(i)
      3.  If T2(i)-T1(i) < Trmin then Trmin=T2(i)-T1(i)
   b. Repeat N times:
      1.  Read clock - T3(i)
      2.  Execute 4 loops with embedded code and read clock - T4(i)
      3.  If T4(i)-T3(i) < Tcmin then Tcmin=T4(i)-T3(i)
   c. Calculate (Tcmin - Trmin)/4

3.  Adjunct Mean Clocks (quad-loops)
   a. Repeat N times:
      1.  Read clock - T1(i)
      2.  Execute 4 empty reference loops and read clock - T2(i)
      3.  Execute 4 loops with embedded code and read clock - T3(i)
      4.  Tadj = sum[ T3(i) - 2*T2(i) + T1(i)]
   b. Calculate Tadj/(4*N)

Because of the 4-clock resolution of the RDTSC instruction, the latter two algorithms were coded with a 4-fold subloop to achieve 1-clock precision.
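
As a concrete skeleton, the Mean Clocks algorithm reduces to something like this (a sketch only - label and variable names are mine, the loop bodies stand in for the actual macro expansions, and serialization is omitted for brevity):

    rdtsc
    mov   t1, eax            ; T1 (low dword suffices for these spans)
    mov   ecx, N
ref_loop:                    ; N empty reference loops
    sub   ecx, 1
    jnz   ref_loop
    rdtsc
    mov   t2, eax            ; T2
    rdtsc
    mov   t3, eax            ; T3
    mov   ecx, N
code_loop:                   ; N loops around the code under test
    ; <code being clocked>
    sub   ecx, 1
    jnz   code_loop
    rdtsc
    mov   t4, eax            ; T4
    ; mean clocks per loop = ((t4-t3) - (t2-t1)) / N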

On the first go-round, the latter two algorithms returned agreeably similar results, while the Mean Clocks algorithm invariably returned smaller, if not negative, values.  When this algorithm was examined more closely by looking at raw cycle counts, it was obvious that an empty loop required more clock cycles than a loop containing one or two simple instructions, e.g.:

looptest.asm

Iterations            1      10      100     1000    10000   100000  1000000

Empty loop
ref count          104     156     1296    12008   119108  1210316 11912140
code count          88     156     1296    12008   119108  1190588 12052108

NOPS 4
ref count           88     156     1296    12008   119108  1190720 11932788
code count          92     144      480     4080    96504   984108  9648700

MOV EAX,0
ref count           88     156     1296    12008   119300  1191276 11932836
code count         100     128      916     8616    86524   863724  8643656

XCHG ECX,ECX
XCHG EDX,EDX
ref count           88     156     1296    12008   119412  1209256 11939708
code count          92     168      676     6192    60192   600104  6000088


From the 1M counts, we infer that the empty loop takes 12 clocks, and this drops to 6 clocks when two XCHGs are inserted!  The loop itself is simply:


@@: push ecx          ; preserve the loop counter around the test code
    ....              ; <code being clocked>
    pop ecx
    sub ecx,1
    jnz @B            ; loop back while the count is nonzero

      
with @@: aligned on a 16-byte boundary.  Loop latency should be about 6 clocks, and the empty loop's 12-clock tally is definitely the problem.  The branch predictor table for the jump seems the most likely culprit - but it's hard to envision a predictor algorithm that hasn't learned, after several thousand passes, that the jump is always taken.

The XCHG lines do seem to resolve the issue but, as far as I've found out, this is an undocumented P4 feature.  (XCHG does add a LOCK when memory access is involved, but that's not the case here.)

To use this 'feature', I've added a pair of XCHGs to the reference and code loops for all three algorithms.  The choice of registers doesn't appear to matter, and the same register can be used for both XCHGs (XCHG EAX,EAX is equivalent to NOP and shouldn't be used).  Within my limits of reproducibility, only MEAN results are affected.
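
For illustration, here is the timing loop with the pair inserted (one possible placement):

@@: push ecx
    ....              ; <code being clocked>
    pop ecx
    xchg ecx, ecx     ; register self-exchange, the 'feature' in question
    xchg edx, edx
    sub ecx,1
    jnz @B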

As a first example, I picked the 9 simple instructions Michael used for counter2 tests.

MWTests.exe

MEAN CLOCKS:                     MINIMUM CLOCKS:                 ADJUNCT MEAN CLOCKS:               
100 iterations, 100 skips        10 iterations, 10 skips         10 iterations, 10 skips       
0 cycles, empty                  0 cycles, empty                 0 cycles, empty                 
0 cycles, mov eax,1              0 cycles, mov eax,1             0 cycles, mov eax,1             
0 cycles, mov eax,1 mov eax,2    2 cycles, mov eax,1 mov eax,2   2 cycles, mov eax,1 mov eax,2   
0 cycles, nops 4                 1 cycles, nops 4                1 cycles, nops 4               
10 cycles, mul ecx                6 cycles, mul ecx               6 cycles, mul ecx               
0 cycles, rol ecx,32             0 cycles, rol ecx,32            0 cycles, rol ecx,32           
14 cycles, rcr ecx,31            11 cycles, rcr ecx,31           11 cycles, rcr ecx,31           
28 cycles, div ecx               25 cycles, div ecx              24 cycles, div ecx             
42 cycles, StrLen                34 cycles, StrLen               41 cycles, StrLen               


For longer code snippets, I've compared two Park-Miller random number routines from the past:

ParkMiller.exe

Mean:         100 iterations, 100 skips
Minimum        25 iterations,  25 skips 
Adjunct Mean   25 iterations,  25 skips 

Mean:          86 cycles, Park-Miller Low Bits
Minimum:       88 cycles, Park-Miller Low Bits
Adjunct Mean:  88 cycles, Park-Miller Low Bits

Mean:         106 cycles, Park-Miller High Bits
Minimum:       95 cycles, Park-Miller High Bits
Adjunct Mean:  95 cycles, Park-Miller High Bits


The only discrepancy is the Mean result for High Bits.  What's surprising is that the Low Bits coding involves a MUL-DIV-DIV sequence whereas the High Bits code is MUL-DIV-MUL, which might be supposed more efficient (and is on a P1).
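
For reference, both routines implement the minimal-standard recurrence seed = (16807 * seed) mod (2^31 - 1).  The most direct, unoptimized MASM rendering is a plain MUL/DIV (a sketch, with seed an assumed DWORD variable):

    mov   eax, 16807
    mul   seed            ; EDX:EAX = 16807 * seed, at most 46 bits
    mov   ecx, 7fffffffh  ; modulus 2^31 - 1
    div   ecx             ; EDX = remainder, the new seed
    mov   seed, edx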

As a final demo, we'll take four Park-Miller modifications discussed in this forum recently:

RNGs.exe

MEAN CLOCKS:                     MINIMUM CLOCKS:                 ADJUNCT MEAN CLOCKS:               
40000 iterations, 400 skips      10000 iterations, 10 skips      10000 iterations, 10 skips       
64 cycles, Michael               49 cycles, Michael              65 cycles, Michael   
68 cycles, Abel Jcc              50 cycles, Abel Jcc             67 cycles, Abel Jcc   
65 cycles, Lingo                 66 cycles, Lingo                66 cycles, Lingo     
54 cycles, Abel CMOVcc           55 cycles, Abel CMOVcc          57 cycles, Abel CMOVcc


All are clear improvements over prior codings, but the differences are too close to call.  Two of the MINIMUM readings seem on the low side and might warrant a closer look, but any choice would be better based on something besides clock counting.

The attached file contains current macros (clock_counters.asm) for the masm32/macros folder and ASM/EXE files to illustrate usage and reproduce the results above.  CPU comparisons are of obvious interest.  Most work here has been with a 2.0 GHz P4, but results with a 0.12 GHz P1 were even more self-consistent.

Abel


[attachment deleted by admin]

hutch--

Abel,

I have this wariness of anything that tries for an interpreted timing and prefer real-time testing based on relative performance between algorithms.  I am much of the view that if something does not perform in real time, then it does not perform at all.  This approach means going after a sample large enough, or looped enough times, and with some understanding of the intent of the algorithm: whether, for example, it's aimed at short takeoff and short data to process, or alternatively will be pointed at very large data.

The more accurately the test matches the intent of the algo, the more useful its results will be in terms of real world performance.

Abel

Hutch,
I've gone over your comments several times and still can't figure out whether we have a substantive issue of difference.  The question I'm asking is to what degree we can understand the macroscopic, real world from analysis of the microscopic.  For the former, I'd restrict experiments to time scales where the operating system isn't a major factor: milliseconds by my estimate, or of order 1M clocks.  If the clock count per loop remains the same when the number of loops changes from 1,000,000 to 100, I'll argue we've gotten somewhere.  My personal experience is that a lower limit of 20 loops is generally achievable, but that's a parameter for the tool/algorithm of choice and user verifiable.  I did propose several criteria for meaningful results - independence of loop counts, warmups, and process priorities - but focused on the algorithms.  To the best of my knowledge, the results presented match these criteria.

The MEAN algorithm makes single measurements of 1M clocks or more if asked.  The other two will make a series of 400+ clock measurements totaling 1M clocks.  Their inconsistency tests the limitations of short-time data.  Some may have spotted the raw count of 6,000,088 in one of the 1M looptest results.  That's not a fluke - repeats were usually 6,000,xxx.  Of the xxx, about 100 are out-of-loop overhead.  Doesn't always work out so nicely though.

One of my incentives is figuring out for myself how much of the mind-numbing Pentium optimization lore is historical baggage and what remains relevant for a Pentium IV (Family 15, Model 2); and thence to the question of how much the whole is greater or less than the sum of its parts.

Regards,
Abel

Rockoon

My rule of thumb is to never time code in isolation, and never to impose factors that won't be true during typical usage of the algorithm.

I.e., only within a major subset of THE real world program it is meant for, working on the data it is MEANT to process.  No mock trials.  Mock trials yield mock results.

There are exceptions where I would time code under mock conditions.  These are cases with little chance of there being a meaningful difference between a mock trial and the real thing, such as very long tight loops which only work on small sets of data.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Abel

Two points for consideration:
1.  How accurately can one clock executing code?
2.  What use can be made of it?

The previous macros for the three algorithms were modified to furnish decimal results and normalize for the macro repeat count REPS.  (A number of loops are clocked, either individually or collectively, and each loop may contain a repeat macro, so that the total number of code executions is LOOPS*REPS.)  High REPS are especially useful for short sequences as they create a better-defined environment, but care is advisable in distinguishing latency from throughput, and the L1 cache size (8K for the P4) sets an upper limit on REPS.
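
In MASM, such a repeat macro can be as simple as the built-in REPEAT directive (a sketch; the actual macros presumably do something similar):

    REPEAT REPS           ; expand the snippet REPS times inline
        add eax, 1        ; <code being clocked>
    ENDM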



LOOPS       EQU 100
SKIPS       EQU 5
REPS        EQU 1000
PRIORITY    EQU IDLE_PRIORITY_CLASS

Mean clock counts:                    Minimum clock counts:                Adjunct mean clock counts:
0.000 cycles, no code                 0.000 cycles, no code                -0.00 cycles, no code
0.339 cycles, nop                     0.342 cycles, nop                    0.345 cycles, nop
1.393 cycles, nops 4                  1.370 cycles, nops 4                 1.370 cycles, nops 4
6.992 cycles, bswap eax               6.992 cycles, bswap eax              6.989 cycles, bswap eax
0.339 cycles, mov eax, 1              0.344 cycles, mov eax, 1             0.343 cycles, mov eax, 1
0.339 cycles, movzx eax,cl            0.341 cycles, movzx eax,cl           0.340 cycles, movzx eax,cl
0.497 cycles, add eax, 1              0.492 cycles, add eax, 1             0.490 cycles, add eax, 1
0.680 cycles, cmp eax,7fffffffh       0.678 cycles, cmp eax,7fffffffh      0.678 cycles, cmp eax,7fffffffh
6.986 cycles, adc eax, 1              6.985 cycles, adc eax, 1             6.983 cycles, adc eax, 1
3.993 cycles, ror eax, 8              3.991 cycles, ror eax, 8             3.988 cycles, ror eax, 8
18.98 cycles, rcr eax, 8              18.98 cycles, rcr eax, 8             19.06 cycles, rcr eax, 8
13.99 cycles, mul ebx                 13.99 cycles, mul ebx                14.03 cycles, mul ebx
48.99 cycles, xor edx,edx : div ebx   48.99 cycles, xor edx,edx : div ebx  48.98 cycles, xor edx,edx : div ebx


Presuming many instructions take an integer number of clock cycles, results above indicate accuracy into the 2nd decimal place.
One instruction shows a half-cycle value, but several come in at 0.34 clocks and its multiples 0.68 and 1.37.  Agner Fog's
tables assign these 0.25 clock throughputs.  But:

"The trace cache can submit only three uops per clock cycle to the queue. This sets a limit to the execution speed if all
uops are of the type that can execute in alu0 and alu1. The throughput of four uops per clock cycle can thus only be obtained
if uops have been queued during a preceding period of lower throughput (due to slow instructions or cache misses).  My
measurements show that a throughput of four uops per clock cycle can be obtained for a maximum of 11 consecutive clock cycles
if the queue has been filled during a preceding period of lower throughput."
Agner Fog,  "How to Optimize for the Pentium Family of Microprocessors" (http://www.agner.org/)

While there is something to be gained from analyzing individual instructions, Agner has already assembled a quite thorough
compilation.  It's more interesting to see how pipelining, out-of-order execution, and other CPU exotica facilitate or encumber
one's own efforts.

Consider the textbook add reg,reg classic:

LOOPS   EQU 100
SKIPS   EQU 4
REPS    EQU 500
PRIORITY    EQU IDLE_PRIORITY_CLASS
                         Mean    Min    AdMean
        add eax,ebx     0.487   0.486   0.480
        add eax,ecx     0.987   0.984   0.980
        add eax,edx     1.488   1.498   1.490

        add eax,ecx     0.486   0.486   0.480
        add ebx,edx     0.679   0.676   0.676
        add eax,ebx     1.017   1.024   1.018


The basic instruction has a 0.5 cycle latency, and the first triad shows what's expected for a dependent sequence.  The
second triad shows the oft-described 3-for-2 optimization.  The second ADD does take 1/6 cycle to decide on the alternative ALU,
but this appears to be recovered in the third.  Nice to see that what's in the books still applies, but is a 1/2 cycle gain
worth a rewrite?

Recently, there has been some perplexity wrt clocking random number generator codings.  The purpose of RNG code is to read
in a seed and spew out a new one, with pseudo-randoms being derived from each seed.  For illustration, let's examine rand32
and look at two cases, one where the in- and out-seeds share a common memory address and one where they differ.  (The latter
isn't especially practical as it always yields the same results.)


LOOPS       EQU 100
SKIPS       EQU 4
REPS        EQU 100
PRIORITY    EQU IDLE_PRIORITY_CLASS

seed addresses:                    in <> out                       in == out
                             Mean    Min    AdMean           Mean    Min    AdMean
rand32 proc base:DWORD       7.947   7.870   7.860           7.952   7.870   7.860
    mov eax, 16807           7.938   7.870   7.860           7.947   7.870   7.862
    mul seed_in             17.937  17.880  17.870          17.946  17.880  17.870
    mov ecx,7fffffffh       17.946  17.880  17.870          17.951  17.880  17.870
    add edx,edx             17.945  17.890  17.880          17.946  17.890  17.880
    cmp eax,ecx             19.947  19.890  19.880          19.946  19.890  19.881
    jna @F                  19.992  19.940  23.035?         19.982  19.940  19.947
    sub eax,ecx             19.994  19.940  19.946          19.982  19.940  19.950
@@: add eax,edx             19.992  19.940  19.947          19.995  19.940  19.948
    jns @F                  23.947  19.950  19.948          23.947  23.940  23.930
    sub eax,ecx             23.947  23.940  19.940          23.950  23.940  23.929
@@: mov seed_out,eax        23.947  23.940  19.940          39.246  39.250  40.989
    xor edx,edx             23.947  19.950  19.939          39.209  39.210  40.978
    div dword ptr[esp+4]    49.099  43.600  44.141          74.705  71.350  74.300
    mov eax,edx             47.055  45.490  45.498          74.314  70.450  74.174
    ret 4
rand32  endp

The experiment begins by commenting out all the instructions except the final ret.  We start with 8 cycles from an invoke
(push & call) and the ret.  De-commenting the mov eax adds nothing - it falls under the shadow of the call.  Exposing
the mul adds 10 cycles, with several more instructions slipping in behind.  But when we get to the bottom line, one
choice gives 45 cycles, the other 75.  Differences first appear at the mov seed_out,eax instruction, which adds nothing in
one case and 15 cycles in the other, in addition to discombobulating the later div with its stack reference.  Bear in mind
that it isn't the mov instruction itself that takes 15 clocks; it has increased the count for all code up to and including itself.

Insight!  The procedure is trying to read seed_in before the previous call has completed writing seed_out!  Let's use EBX instead
of a memory variable to test this hypothesis.  Not!  Northwood, we have a problem!

"On processors with out-of-order execution, it is likely that the second calculation will start before the first calculation is
finished. In some situations, the processor may start several iterations until the maximum throughput of the execution units is
reached.  The out-of-order capabilities cannot be utilized, however, in the case where each calculation depends on the result of the
previous one. Such a continued dependence chain is the worst situation you can have, and you should definitely try to find a way to
break down the dependence chain." - A. F., op. cit.

The rand32 code above was an alternative suggestion I had made for a procedure of Michael's.  It turns out both suffer from this
cyclic dependency, while the alternative suggestions by Lingo and the CMOVcc variants do not.
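
To sketch how a CMOVcc variant can avoid the branches (register usage and ordering here are illustrative, not necessarily the posted coding; .686 is required for CMOV):

    mov   eax, 16807
    mul   seed             ; EDX:EAX = p = 16807 * seed
    mov   ecx, eax
    shr   ecx, 31          ; ECX = bit 31 of the low dword
    and   eax, 7fffffffh   ; EAX = p AND (2^31 - 1)
    lea   edx, [ecx+edx*2] ; EDX = p SHR 31
    add   eax, edx         ; folded sum, < 2^31 + 2^15
    mov   ecx, eax
    sub   ecx, 7fffffffh   ; candidate reduction; SF clear when sum >= 2^31-1
    cmovns eax, ecx        ; branchless conditional subtract
    mov   seed, eax        ; new seed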

Agner's experimental results are based on his file TSCTestB32.asm.  His algorithm clocks an empty loop a number of times and picks
the minimum as overhead.  It then clocks a code block NUMTESTS times, subtracts the overhead, and saves each result in an array for
display, analysis, or whatever.  I've taken the liberty of splitting his procedure into two MASM32 macros with a few compatibility
tweaks, which I believe conforms to the relevant GNU General Public License.  My original zip has been updated, and I've added these
macros with a demo for Michael's coding that shows a reproducible 532 clocks/10 repetitions when the seed is reset every 10
repetitions, and garbage ranging from 50 to 100+ clocks otherwise.  Feel free to report a useful solution.  Another CPU perhaps?
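
The overhead-calibration half of that algorithm amounts to the following (a sketch; the variable names are mine, not Agner's):

.data
overhead  dd 0ffffffffh        ; running minimum, initialized high
start_lo  dd 0
.code
    mov   esi, NUMTESTS        ; ESI survives the CPUID/RDTSC clobbers
trial:
    xor   eax, eax
    cpuid                      ; serialize before each read
    rdtsc
    mov   start_lo, eax        ; low dword suffices for short spans
    xor   eax, eax
    cpuid                      ; nothing in between: pure overhead is measured
    rdtsc
    sub   eax, start_lo
    cmp   eax, overhead
    jae   @F
    mov   overhead, eax        ; keep the minimum across all trials
@@: dec   esi
    jnz   trial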

Caveat:  Results reported were obtained using a Pentium 4 CPU, 2.00 GHz, Family 15, Model 2.  At times, one sees questionable
values that aren't reproducible and are best ignored - one man's noise is another's Nobel prize.

Abel