News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Code timing macros

Started by MichaelW, February 16, 2005, 03:21:52 AM

Previous topic - Next topic

Abel

Michael has provided us with two code clocking programs.  The first, timers.asm, takes timestamp counter readings before and after executing
an empty reference loop N times and before and after executing test code N times, and then finds a cycle count for the test code.

The second, counter2.asm, takes timestamp counter readings before and after each of N executions of a reference loop, saving only the least
difference.  It then does the same with the test code and calculates a cycle count from the two least values.

Those who have tried the latter report negative results for short code, a sign that the empty loop count is too large (vide supra).  Browsing
the posted asm file shows a jnz @B instruction at the end of the reference loop that doesn't go back to the intended aa: label so that the
empty loop count stays fixed at the value set by the first pass.

When that's remedied, counts are more realistic but, at least on a P4, all are multiples of four, e.g. -4, 0, 4, 8,...  P4 tests show the values
returned by rdtsc are always multiples of 4.  (On an old Pentium(1), multiples of 2.)  This limits clock resolution to 4 cycles.

One improvement is to find the lowest cycle count for successive loops of 4 repetitions, thereby increasing resolution to 1 cycle/iteration. 
With good fortune, it might be possible to whip through 4 iterations without interruption.  As a further check on multitasking sensitivity,
one may compare results with the extreme priority settings, IDLE_PRIORITY_CLASS and REALTIME_PRIORITY_CLASS.  With the former,
code can appear faster as it doesn't get to play until the A-codes have reached a hiatus.

The changes take but a few tweaks of Michael's work and attached is a suggested modification for the file counters.asm and some demo
executables.  While most results look promising, at times one hits a very reproducible but higher than expected result that can be shortened
by inserting NOPs or irrelevant code elsewhere in the source.  A cache or piping penalty??  So far, nothing yet found faster than -1 cycle.

FWIW, running c2test2x.exe on a 120MHz, Pentium(1), W95 box gave:
0 0 0 1 9 1 8 42 52 cycle counts for both priorities, which, apart from a div and StrLen call, are nearly identical with results on a 2GHz P4.

Abel


[attachment deleted by admin]

MichaelW

Abel,

Thanks for finding the reference loop problem, and for testing. Shortly after I first noticed your post I became very busy at work, and by the time I got over that hump I had forgotten about your post.

Until very recently virtually all of my testing was done on a P3, and assuming I included a several-second delay at the start of my code, the original version (of the counter2 macros), even with the problem, would return repeatable results. Without the delay there would be significant variations, mostly in the first set of results. I see now why the delay had an effect on the repeatability, and in hindsight, considering that the major point of these macros was to improve repeatability, I should have investigated this effect further.

Correcting the reference loop eliminated the run to run variation, even without the delay. On a P3 the cycle counts for single instructions or pairs of instructions looked reasonable, for example:

0 empty
1 nop
3 nops 4
1 bswap eax
1 mov eax, 1
1 movzx eax, cl
2 mov eax, 1 : mov eax, 2
1 add eax, 1
1 adc eax, 1
1 ror eax, 8
4 rcr eax, 8
5 mul ebx
38 xor edx, edx : div ebx


But on a P4 some of the counts appeared to be ~4 times higher than they should have been:

0 empty
4 nop
0 nops 4
4 bswap eax
4 mov eax, 1
4 movzx eax, cl
4 mov eax, 1 : mov eax, 2
4 add eax, 1
8 adc eax, 1
4 ror eax, 8
20 rcr eax, 8
16 mul ebx
68 xor edx, edx : div ebx


After I eliminated the instructions that preserve EBX around the CPUID instructions the counts for a P3 were little changed:

0        empty
0        nop
2        nops 4
0        bswap eax
1        mov eax, 1
1        movzx eax, cl
1        add eax, 1
1        adc eax, 1
1        ror eax, 8
3        rcr eax, 8
3        mul ebx
37       xor edx, edx : div ebx


And most of the counts for a P4 were more or less reasonable:

0        empty
0        nop
8        nops 4
8        bswap eax
0        mov eax, 1
0        movzx eax, cl
0        add eax, 1
12       adc eax, 1
8        ror eax, 8
28       rcr eax, 8
16       mul ebx
68       xor edx, edx : div ebx


As I have stated previously, I can see no way to get good cycle counts for instructions or sequences of instructions that execute in less than a few clock cycles. With the reference loop working correctly the second set of macros can effectively filter out the bad counts caused by context switches in the loops, but this still leaves a significant problem with the timed instructions not being completely isolated from the timing instructions. AFAIK this is what accounts for the difference in the P4 counts after eliminating the POP EBX following the first CPUID instruction. The only way I can see to correct this problem would be to insert a stand-alone serializing instruction after the timed instructions. CPUID will not work in this position, because for consistent results a fixed CPUID function number must be loaded into EAX, and the instruction that does this will not be isolated from the timed instructions. AFAICT there are no suitable instructions that can be used from a normal Windows app.

And there is still the question of why the second set of macros return significantly different counts than the first set and, more importantly, why the difference varies widely depending on the code being timed. For example:

(timers.asm)
86 cycles, nrandom
38 cycles, nrand
42 cycles, rand32
39 cycles, rand32_lingo


(counter2.asm)
98 cycles, nrandom
56 cycles, nrand
53 cycles, rand32
62 cycles, rand32_lingo

I have spent hours on this, and I still cannot see the reason.

eschew obfuscation

Abel

Michael,
FWIY, on the home P4 and macros:





   MEAN                                 MINIMUM                             ADJ MEAN

1,000,000 Loops, IDLE_CLASS_PRIORITY
0 cycles, empty                     1 cycles, empty                     1 cycles, empty
0 cycles, nop                       -1 cycles, nop                      0 cycles, nop
0 cycles, nops 4                    1 cycles, nops 4                    1 cycles, nops 4
1 cycles, bswap eax                 1 cycles, bswap eax                 -1 cycles, bswap eax
0 cycles, mov eax, 1                0 cycles, mov eax, 1                0 cycles, mov eax, 1
0 cycles, movzx eax,cl              -1 cycles, movzx eax,cl             -1 cycles, movzx eax,cl
1 cycles, add eax, 1                -1 cycles, add eax, 1               1 cycles, add eax, 1
0 cycles, adc eax, 1                1 cycles, adc eax, 1                1 cycles, adc eax, 1
7 cycles, ror eax, 8                0 cycles, ror eax, 8                0 cycles, ror eax, 8
14 cycles, rcr eax, 8               11 cycles, rcr eax, 8               11 cycles, rcr eax, 8
10 cycles, mul ebx                  7 cycles, mul ebx                   6 cycles, mul ebx
46 cycles, xor edx, edx : div ebx   40 cycles, xor edx, edx : div ebx   38 cycles, xor edx, edx : div ebx

100 Loops, IDLE_CLASS_PRIORITY
1 cycles, empty                     0 cycles, empty                     0 cycles, empty
0 cycles, nop                       0 cycles, nop                       0 cycles, nop
0 cycles, nops 4                    1 cycles, nops 4                    1 cycles, nops 4
2 cycles, bswap eax                 1 cycles, bswap eax                 1 cycles, bswap eax
0 cycles, mov eax, 1                0 cycles, mov eax, 1                0 cycles, mov eax, 1
0 cycles, movzx eax,cl              0 cycles, movzx eax,cl              0 cycles, movzx eax,cl
0 cycles, add eax, 1                2 cycles, add eax, 1                2 cycles, add eax, 1
0 cycles, adc eax, 1                2 cycles, adc eax, 1                2 cycles, adc eax, 1
0 cycles, ror eax, 8                0 cycles, ror eax, 8                0 cycles, ror eax, 8
14 cycles, rcr eax, 8               11 cycles, rcr eax, 8               11 cycles, rcr eax, 8
10 cycles, mul ebx                  6 cycles, mul ebx                   6 cycles, mul ebx
41 cycles, xor edx, edx : div ebx   40 cycles, xor edx, edx : div ebx   40 cycles, xor edx, edx : div ebx


The first table may be pushing into the OS temporal regime but I didn't checkout REALTIME effects.

Abel
























Mark Jones

Just curious, has anyone considered running these timings in real-mode? Maybe disable the cache if possible? Seems like the best possible speed approximation would be to just include say 10000 repetitions of each instruction under test, surrounded by the serialization routines. Or am I missing something?
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

MichaelW

Hi Mark,

Testing in real mode is a good idea that I had not considered. It should be fairly easy to implement, and it would eliminate the variables that Windows introduces into the tests. Real mode might allow the use of a free standing serializing instruction to completely isolate the timed instruction(s) from the timing instructions. I want to further investigate the 4-clock-multiple behavior of the P4, and real mode might provide a better way to do this. The Intel documents seem to indicate that for a P4 the time-stamp counter is incremented in step with the bus clock, instead of with the processor internal clock, and I'm thinking that this might explain the multiples of 4.
eschew obfuscation

Adamanteus

#65
I have such program, it's now ported to Win32 but functionality for DOS is stayed. Possible mark that in Win32 with REALTIME_PRIORITY_CLASS results absolutely the same. Comments on Russian, but I'm to lazy to translate such "gefilta fish" :boohoo:
P. S. Sorry, I here's mistaken on QueryPerformanceFrequency divisor experimentally 10000, that on 10 more then by MSDN documentation. So result of program is : Command - 100 repeat DIV EBX : 32 ns

[attachment deleted by admin]

KeepingRealBusy

Hutch,

As a "Johnny come lately", am I allowed to ask a question? What assembler was used to build the executables in this thread? This first entry in this topic is dated Feb 2005, so many things may have changed from then to now (last post was dec 29 2007). Is JWASM a good choiceto use, or just use MASM 8.0?

Dave,

hutch--

Dave,

I would be pretty sure that Michael used MASM as the macros were designed to work in MASM. For JWASM I would test it to see if you get the same results. Japheth has done a great job in terms of compatibility so it will probably work the same but he is still in beta so it is worth testing against a MASM built version.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

KeepingRealBusy


KeepingRealBusy

Well, the MASM32 install was successful. Read all of the macros. Then assembled test.asm successfully.

Here is my timing for my HP Pavilion, AMD 4200+ 64x2, still running 32 bit mode XP.



24 cycles
26 cycles
117 ms
110 ms

Press enter to exit...



Dave.

KeepingRealBusy

Now the fun begins. I assembled c2test2.asm and executed it. This is what I got:



HIGH_PRIORITY_CLASS
-138 cycles, empty
0 cycles, mov eax,1
2 cycles, mov eax,1 mov eax,2
1 cycles, nops 4
-136 cycles, mul ecx
0 cycles, rol ecx,32
4 cycles, rcr ecx,31



And then it blew up with the following VS error.


VS Error Message: Unhandled exception at 0x00401a20 in c2test2.exe: 0xC0000095: Integer overflow.

Call stack: c2test2.exe!00401a20()

Instruction in error: 00401A20  div         eax,ecx

Regs:       EAX     00000001
ECX 444D4163
EBX 68747541
        EDX 69746e65



As nearly as I can determine, "div         eax,ecx " is invalid, at least on the AMD Athlon. The div instruction does not support two parameters (not like the imul which supports one, two, or three operands). div only supports a single parameter of reg/mem, and if the divisor is 32 bits (take your pick, not knowing the Intel format, but guessing from imul I would say ecx), then edx:eax is the dividend and ecx is the divisor. The quotient is returned in eax, the remainder in edx. Since edx was not cleared before the div, the quotient would overflow.

What is the Intel format for div? All the references I have agree with AMD. Where did the two parameter call come from?

Dave.

KeepingRealBusy

Please ignore my post above. The problem was not a two parameter DIV, it was just VS displaying that this was a 64 bit divide by 32 bits. The real problem was that the EDX register was not cleared before the DIV instruction (at least this must be done on AMD). I made the following changes to counter2.asm:

--c2test2asm.new----------------------------------------------- 53
        xor edx,edx
--c2test2.asm.old----------------------------------------------53

--c2test2.asm.new---------------------------------------------- 101
        xor edx,edx
--c2test2.asm.old----------------------------------------------100


The following are my timings:


HIGH_PRIORITY_CLASS
0 cycles, empty
0 cycles, mov eax,1
2 cycles, mov eax,1 mov eax,2
1 cycles, nops 4
-135 cycles, mul ecx
0 cycles, rol ecx,32
4 cycles, rcr ecx,31
40 cycles, div ecx
-297 cycles, StrLen

REALTIME_PRIORITY_CLASS
0 cycles, empty
-136 cycles, mov eax,1
2 cycles, mov eax,1 mov eax,2
1 cycles, nops 4
1 cycles, mul ecx
0 cycles, rol ecx,32
4 cycles, rcr ecx,31
40 cycles, div ecx
31 cycles, StrLen



I also found a problem trying to assemble test.asm in timing macros using the latest MASM32.zip. This was a clean install on an otherwise unused drive. The problem was that the crt__xxx symbols were undefined. I changed the header to match the c2test2.asm header as follows and this correctly assembles:


--test.asm.new-------------------------------------------------- 2
    include \masm32\include\masm32rt.inc
    .586
--test.asm.old--------------------------------------------------2
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm


My timings for test.asm are as follows:


19 cycles
19 cycles
201 ms
191 ms


Dave.

jj2007

I stumbled over an odd problem - probably I am just blind to see what's wrong here. I tried to improve the accuracy of timings with a REPEAT 10, but instead of getting 10*the cycle count, I seem getting 100*cycles...

Testing (on a P4) whether
        mov ecx, dword ptr [esp-8]
        jmp ecx
is faster than
        jmp dword ptr [esp-8]

4       cycles jmp directly
5       cycles mov+jmp
474     cycles jmp directly, divide by 10
488     cycles mov+jmp, divide by 10


Expected value for the last two lines would be 47 and 48.

Here is the code:
.nolist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm

p1 PROTO: DWORD
p2 PROTO: DWORD

.code
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

p1 proc a1
  pop eax
  pop eax
  mov ecx, dword ptr [esp-8]
  jmp ecx
p1 endp

p2 proc a1
  pop eax
  pop eax
  jmp dword ptr [esp-8]
p2 endp

start:
LOOP_COUNT = 10000000
print chr$("Testing whether",13,10,9,"mov ecx, dword ptr [esp-8]",13,10,9,"jmp ecx",13,10,"is faster than",13,10,9,"jmp dword ptr [esp-8]",13,10,10)

invoke Sleep, 200
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke p2, 123
counter_end
print str$(eax), 9, "cycles jmp directly", 13, 10

invoke Sleep, 200
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke p1, 123
counter_end
print str$(eax), 9, "cycles mov+jmp", 13, 10

invoke Sleep, 200
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke p2, 123 ; manual REPEAT 10 to exclude that REPEAT is causing the problem
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
counter_end
print str$(eax), 9, "cycles jmp directly, divide by 10", 13, 10

invoke Sleep, 200
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 10
invoke p1, 123
  ENDM
counter_end
print str$(eax), 9, "cycles mov+jmp, divide by 10", 13, 10

print chr$(10,"hit any key")

exit

end start


Any explanation?? ::)

sinsi


C:\Users\me\Desktop>jj
Testing whether
        mov ecx, dword ptr [esp-8]
        jmp ecx
is faster than
        jmp dword ptr [esp-8]

0       cycles jmp directly
1       cycles mov+jmp
46      cycles jmp directly, divide by 10
44      cycles mov+jmp, divide by 10

hit any key
C:\Users\me\Desktop>

If I add

  repeat 10
    invoke p2, 123
  endm

I get 46 again (all copy+paste, none of my code except the 'repeat' bits)
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Thanks, Sinsi. Mysterious. I remember that it worked correctly on my other puter, with a Core CPU, but can't test it right now.
A P4-specific problem?