News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Timing Macros for PB

Started by MikeT, January 04, 2008, 03:01:01 PM

Previous topic - Next topic

MichaelW

On my P3 the FB version consistently returns 36, 9, 16, 15, and the PB version consistently returns 38, 11, 17, 35.

For reference, the same tests with second set of MASM32 macros (counter2.asm) consistently return 38, 12, 19, 16, and the same tests with first set of MASM32 macros (timers.asm) typically return 34, 8, 11, 5.

The large difference for the PB version in the last test can be accounted for by differences in the compiler-generated code for the GetTickCount function call. Where the FB and MASM32 code calls GetTickCount through a jump table, the PB code calls GetTickCount directly, and then calls some other function after GetTickCount returns (as it seems to do for all of the API function calls), and the cycle count includes both function calls. I was attempting to test only compiler-independent code, but I made a bad choice for the last test.

The small differences in the first three tests can be accounted for by the differences in the structure of the macros. The method used by these macros has a weakness in that the timed instructions cannot be completely isolated from the timing instructions, so the timing instructions can affect the cycle count for the adjacent timed instructions. The effect can make it impossible to get meaningful cycle counts for single instructions or short sequences of instructions, but as the length of the sequence increases the effect on the cycle count quickly becomes negligible.

I can now see only two remaining problems. Somehow I made a mistake in calculating the number of alignment nops in the PB code, and instead of 9 there should be only 4. And to minimize the variations you should add a delay of several seconds before the first test (as I did with the sleep 3000 in the FB code), to allow time for the various system activities involved in launching an application to complete.

eschew obfuscation

MichaelW

It now appears that my statement regarding this code not changing registers that must be preserved is wrong. Per this page:

"compiled PowerBASIC code is designed to expect that the EBX, ESI, and EDI registers will remain unchanged between lines of BASIC code."

Even though there are no apparent problems, the CPUID instruction is changing the value of EBX, so to be safe EBX needs to be preserved around the CPUID instruction. The attachment contains the updated code, with the last test replaced with a longer block of assembly code derived from the MASM32 nrandom procedure.

Edit: I forgot to add a several second delay before the first test.

[attachment deleted by admin]
eschew obfuscation

MikeT

on my P4 ver 12 (sleep 3000 added) results:
16
16
12
88

fbtest results:
12
16
16
92

Results for both are consistent

QuotePB code calls GetTickCount directly, and then calls some other function after GetTickCount returns
Let me ask our PB guru about the secondary function call...

Added:
I dont see the 4 nops... should I add those?



[attachment deleted by admin]

MichaelW

#33
Running on my P3, the FB version returns 36, 9, 16, 91 and the PB version 38, 12, 17, 91, both with no variations observed. At this point I doubt that the core code can be further improved. I got the alignment right this time. After I added the instructions to preserve EBX around the CPUID calls, the distance between the for loop labels is 112 bytes (16*7) without any padding. The SLEEP 3000 needs to run one time after the program is launched, so it should be placed before the first COUNTER_BEGIN MACRO call, as it is in the FB code. With the SLEEP 3000 in the COUNTER_BEGIN macro it runs each time the macro is called, and while this should not cause problems, only the first run does anything useful.

If you wish to copy or copy and modify any of my comments or explanations from the MASM32 or FB macros, feel free.
eschew obfuscation

MikeT

Thx Michael
I changed nLoops and the position of the Sleep command
Results
16
16
12
88

Last two numbers are consistant.
first two are 20 one in every 10 executions.

In general is 1000 loops enough to measure?

What would be the lowest number of loops to get consitent results do you think?


[attachment deleted by admin]

MichaelW

On my P3 I get 38, 11, 17, 93 with no variations observed. Within my experience the variations you are seeing on the short sequences of instructions are normal for a P4. These sequences are not representative of real-world code, they are just something that I threw together as a quick test. Most code that would need to be optimized would involve longer sequences. You can also expect to see variations on code were a single pass through the code takes more than, very roughly, a few hundred clock cycles, but at least on my system the variations seldom reach even one percent.

Within my experience 1000 loops is more than enough. On my P3 the FB version running the same test code produces good repeatability down to 2 loops. I use 1000 loops normally, but I commonly reduce this for code where 1000 loops would take too long to execute. For code that cannot be looped I reduce the number of loops to 1, and then average the results over multiple trials.
eschew obfuscation

MikeT

Michael,
Thank you for developing this with me. I has been educational.
Today I wanted to put it to a real world test, in this case timing a function that converts a string representation of a number to a LONG.
http://www.powerbasic.com/support/pbforums/showthread.php?t=14373&highlight=VAL

In the end I created the attached test bed to evaluate the functions.
The timing variations I got with CPU Clock count method & the Counter Cycle MACROs were:
for 1000 loops:
420-438
279-299
77-231
&
344112-370288
226056-252064
74724-76052

For 100 Loops:
460-465
282-311
82-111
&
40816-43616
25244-27836
7724-10908

I was hoping for less variation, but I am probably just not understanding the nature of the beast.
Also, Is there any direct relationship between Clocks and CounterCycles?




[attachment deleted by admin]

MichaelW

I see now that I should have explained the macros in more detail. I was assuming that you understood what I meant by "test loop", and when I saw evidence that you did not, I failed to recognize it. The test loop is built into the macros, with the first part of it in the COUNTER_BEGIN macro, and the last part of it in the COUNTER_END macro. When the macros are expanded in memory, they look something like this:

Initialize CounterOverhead and CounterCycles
Set priority class.

FOR LoopCounter = 1 TO nLoops '' Start of empty reference loop.
    ! XOR eax, eax            '' CPUID function 0
    ! push ebx                '' preserve EBX
    ! db &H0F,&HA2            '' CPUID
    ! pop ebx                 '' recover EBX
    ! db &H0F,&H31            '' RDTSC
    ! mov CounterTSC1, eax    '' store low-order dword
    ! mov CounterTSC1[4], edx '' store high-order dword
   
    ! XOR eax, eax
    ! push ebx
    ! db &H0F,&HA2
    ! pop ebx
    ! db &H0F,&H31
    ! mov CounterTSC2, eax
    ! mov CounterTSC2[4], edx
   
    '' Capture the lowest cycle count that occurs.
   
    IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN
        CounterOverhead = CounterTSC2 - CounterTSC1
    END IF   
NEXT                          '' End of empty reference loop.

SLEEP 0                       '' Start a new timeslice.

FOR LoopCounter = 1 TO nLoops '' Start of test loop.

    ! XOR eax, eax
    ! push ebx
    ! db &H0F,&HA2
    ! pop ebx
    ! db &H0F,&H31
    ! mov CounterTSC1, eax
    ! mov CounterTSC1[4], edx

    '' Code placed between the end of the COUNTER_BEGIN
    '' macro and the start of the COUNTER_END macro
    '' code goes here.

    ! XOR eax, eax
    ! push ebx
    ! db &H0F,&HA2
    ! pop ebx
    ! db &H0F,&H31
    ! mov CounterTSC2, eax
    ! mov CounterTSC2[4], edx

    '' Capture the lowest cycle count that occurs.
   
    IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN
        CounterCycles = CounterTSC2 - CounterTSC1
    END IF   
NEXT                          '' End of test loop.

Restore priority class to normal.
Calculate counter cycles, correcting it for the loop overhead.


So to pick one example, that illustrates how the macros are intended to be used, and assuming that nLoops = 1000, this code:

COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS)
  RetVal = VAL(sNum)
COUNTER_END

Will execute the RetVal = VAL(sNum) statement 1000 times, capturing the lowest count that occurs in any single pass, and correcting it for the loop overhead, so the count will include only the cycles for the RetVal = VAL(sNum) statement, and not the cycles for the test loop overhead.

This code:

COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS)
    FOR k = 1 TO nLoops    ' inner loop
        RetVal = VAL(sNum)
    NEXT
COUNTER_END

Will execute the outer test loop (the test loop built into the macros) 1000 times, and the inner loop 1000 times. So the outer test loop will capture the lowest count that occurs in any single pass through the inner test loop, and this count will include 1000 executions of the RetVal = VAL(sNum) statement, as well as the loop overhead for the inner loop.

The attachment tests a FB version of Paul's code, and on my P3 it returns 88, 88, 88 with no variations observed.




[attachment deleted by admin]
eschew obfuscation

MikeT

aaaah... NOW i get the concept. I couldn't see the wood for the trees... I got so focused on other things i missed this obvious piece of the puzzle.

Your fbtest on my machine returns:
132-144
132-144
128-140

Could you test the atached code to see if there are any variations.
On my machine 1000 loops:
420-438
279-299
77-231
&
416-424
300-332
68 - no variation

100 loops:
439-465
282-312
82-111
&
400-424
288-308
68 - no variation

Is it possible that my P4 is responsible for the variations?
Would it help to escalate the priority?
(I think I remember you mentioned the risk is high if the code is not stable)

What puzzles me most, is that there is a 25 Clk overhead for the orginal method yet it reports consistently lower clks for the second algo than the Macros?

The purpose of investigating these Macros is to find a more stable way of timing code at the fine level for development. The absolute number is not as relevant to me as the stability of the number. A 30 Clk variation at 300Clks is a 10% error. Thats not so great.



[attachment deleted by admin]

MichaelW

Running on my P3 the macro return values, for 100 or 1000 loops, are very consistent, with only an occasional 2-cycle variation in the second test. For the time_stamp_count tests, the return values are usually within a few cycles of the return values for the macros, but occasionally they are way off. The way the tests are being performed the overhead for the time_stamp_count procedure call is negligible, accounting for less than one cycle in the final results, even when nLoops = 100.

The P4 is at least part of the problem, but there could be other things going on in your system that could affect the timing. Assuming you are running Windows 2000 or XP, you might take a look at the CPU usage and time on the Processes tab in Task Manager, looking for processes where the CPU usage is variable and frequently high. You might also want to examine the base priority of the processes, but you will probably first need to add it to the displayed columns in the Select Columns dialog accessible from the View menu. The greatest effect would be from a process running at an elevated priority, where the CPU usage is variable and frequently high. Testing with REALTIME_PRIORITY_CLASS might reduce or eliminate the effects of these other things. If the code is stable I see no reason to avoid REALTIME_PRIORITY_CLASS, but before using it I would make sure that I had no applications open with unsaved data.

Another possibility might be to warm up the CPUID instruction, as I have seen done in an Intel example source. This seems to have no effect on a P3 or on the AMD processors that I have tested, but it might have an effect on a P4. The code would be added in the COUNTER_BEGIN macro before the start of the reference loop, so adding it would not disturb the relative alignment of the FOR loop labels.

...
    CounterOverhead = 2000000000
    CounterCycles   = 2000000000

    CALL SetPriorityClass( GetCurrentProcess, priority_class )

    '' Warm up the CPUID instruction

    ! push ebx
    ! XOR eax, eax
    ! db &H0F,&HA2
    ! XOR eax, eax
    ! db &H0F,&HA2
    ! XOR eax, eax
    ! db &H0F,&HA2
    ! pop ebx

    FOR LoopCounter = 1 TO nLoops
...


A more stable way of timing code at the fine level, where the consistency is more important than the actual count, was my goal in developing the second set of MASM32 macros and the FB macros that use the same basic method.

Also, with the FB macros running a P3, and assuming that I remember to add the several-second delay before the first test, if I run an empty test loop:

counter_begin( NLOOPS, HIGH_PRIORITY_CLASS )
counter_end
print counter_cycles


I get a consistent return value of zero. What do your macros do on a P4?
eschew obfuscation

MikeT

#40
Quotebut you will probably first need to add it to the displayed columns in the Select Columns dialog accessible from the View menu.
Ah OK. Well all I see in High proiority is
taskmgr.exe
winlogon.exe
csrss.exe
Everything else is normal priority
I keep this machine clean as a whistle.
Nothing is taking up processor time as i watch for several minutes.
When I run your test, it reports about 91% CPU usage for a moment

counter_begin( NLOOPS, HIGH_PRIORITY_CLASS )
counter_end
print counter_cycles

consitently returns zero yes

Adding the warm up code to your testing makes it a little more stable. only the third test fluctuates occasionally.
Adding the warm up code to the VAL testing makes little difference at 100 or 1000 loops on this P4 with XP Pro SP2.
But I am relieved to learn that on your P3 it is consitent.


Thank you for your help with this Michael. I learned a lot.


ecube

wow MichaelW you really put lot of time into this, you should tell the PB developers about it, sure they can give you recognition and use it internally to speed test things they do.