News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Timing macros rebirth

Started by Petroizki, January 22, 2006, 07:40:46 AM

Previous topic - Next topic

Petroizki

Since MichaelW has not been updating his timing macros, I decided to make my own. The idea is to make more stable results, with much less looping and a better average calculation. The final count calculation has the ability to ignore obvious interrupted timings, that are made by OS interference. The timing macros also don't trash registers, flags, stack or FPU. So it doesn't really matter what part of the code you are timing, but of course; the code being benchmarked with the loop counting macro, must always behave EXACTLY the same way!

This "inline" method of timing is not accurate! Problems arise with cache, branching, you name it. But it should give you a rough feeling of what the timing should be on a specific computer.

Like MichaelW's macros, there is a start-macro, and the end-macro. But instead, my macros have two different clock cycle counters, the other one is a single-pass counter, and the other one is the looping one.

PTIMER_START_COUNT ; <- starts the clock cycle count

; <- code at this point is executed only once
; you can freely use registers and flags, like they were set before the timing macro

PTIMER_END_COUNT ; <- ends the clock cycle count
; the clock cycle count is in: 'qword ptr [qwPTIMER_RESULT]'


This single-pass method is fine, if you wan't to time code somewhere middle of your program. And like said, the macros do not change stack, registers or anything else, so you can benchmark pretty much anything. You must understand that this macro will not produce stable results, as the code being times is executed only once. Especially the first run trough the code, will give very weird timings. :naughty:

To get more stable results, you should use the looping macro, which works just like the other one. The only exception is, that the code you are timing, must behave exactly the same way every time it's ran. You don't need to change the loop count, it is constant, and should produce good results.

PTIMER_START_LOOP_COUNT ; <- starts the loop count

; <- code at this point is executed multiple times
; you can expect the registers, flags, FPU and stack to be always the same at this point
; as the start count macro restores the original values

PTIMER_END_LOOP_COUNT ; <- ends the loop count
; the clock cycle count is in: 'qword ptr [qwPTIMER_RESULT]'


My macros won't change the process priority, I decided that it's better to left outside the macros. Of course I advice you to set it to something like HIGH_PRIORITY_CLASS, if you are having very unstable results.

QuoteYou can make your timing more accurate by following these rules:
1. Keep the code that is being timed, small. Large code will result in OS interference, thus resulting in wrong clock cycle count.
2. Try not to benchmark code with lot's and lot's of branches. Processors branch predicition is not unlimited, this will result in very weird timings.
3. The looped timing code should behave exactly the same way everytime it's ran. strcat -function for example, does not follow this rule.
4. And of course; don't run other high priority processes at the same time.

Note that the 64bit counter will wrap-around every 2^64 cycles, in 2GHz processor this means that the timer will give false results about every 292.5 years. Don't say i didn't warn you! :lol

Here is an output on my computer:  :wink
QuoteProcessor: AMD Athlon(tm) 64 Processor 3000+

Petroizki's macros clock counts:
lstrcmp:  968, 968, 968, 968, 968, 968, 968, 968, 968, 968,
szCmp:    29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
1 x nop:  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
4 x nop:  1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

MichaelW's macros clock counts:
lstrcmp:  1097, 1100, 1109, 1085, 1085, 1085, 1075, 1081, 1080, 1075,
szCmp:    32, 30, 29, 30, 30, 31, 31, 31, 29, 30,
1 x nop:  4294967295, 2, 1, 2, 1, 2, 1, 2, 1, 1,
4 x nop:  1, 1, 0, 1, 1, 0, 2, 1, 1, 2,

dl: http://personal.inet.fi/atk/partsu/ptimers.zip

Petroizki

New upload.

The comparison code with MichaelW's macros totally choke up older CPU's (or at least Pentium 4). I aligned, and reorganized the code so the cache won't flush every millisecond...

(It worked just fine with AMD CPU's  :toothy )

QuoteProcessor: Intel(R) Pentium(R) 4 CPU 2.80GHz

Petroizki's macros clock counts:
lstrcmp:  1140, 1124, 1148, 1148, 1144, 1140, 1148, 1148, 1144, 1140, 
szCmp:    40, 44, 44, 44, 44, 44, 44, 44, 44, 44, 
1 x nop:  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
4 x nop:  4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292,

MichaelW's macros clock counts:
lstrcmp:  1266, 1342, 1191, 1280, 1216, 1255, 1460, 1183, 1247, 1215, 
szCmp:    27, 24, 24, 23, 26, 49, 23, 46, 23, 26, 
1 x nop:  1, 4294967295, 4294967295, 1, 4294967294, 0, 1, 4294967294, 4294967294, 3, 
4 x nop:  4294967290, 4294967290, 4294967293, 4294967292, 4294967293, 4294967291, 4294967293, 4294967290, 4294967291, 4294967292,

MichaelW

Hi Petroizki, I'm glad to see you working on this :U These days I seldom have the time to get involved in anything that I cannot complete in an hour or so.

I seem to recall an Intel document with code that "warmed up" CPUID or RDTSC or both, but after a quick search I could not find it. I have seen at least two other documents that contained similar code, but in both cases the programmer was failing to control the EAX value prior to executing CPUID, which IMO cast a large amount of doubt on the whole concept. And when I added warm-up code to my macros I could not detect any effect. But I suppose it would be reasonable to do the warm up, on the basis of it can't hurt, and it might help.

I think it was you who pointed out that my macros were not preserving EBX. In my own use I have had to work around the EBX problem more than a few times.

I did not do any test of the single pass macros, but it occurs to me that the sleep 0 is not going to do any good without a priority boost, and not much good without REALTIME_PRIORITY_CLASS.

BTW I like the way you bypassed the alignment filler by having the Sleep function return to a label.

As you can see in the timings below, your loop timing macros have some sort of problem timing the nops on my system, a 500 MHz P3 running Windows 2000 SP4. In the time I've had to work on this I have not been able to determine what exactly is causing the problem.
[EDIT] I see you likely solved the problem while I was composing this [/EDIT]

When I alter your comparison code to use my macros as I intended with a priority boost and longer loops, the repeatability of my raw results are reasonably comparable to the repeatability of your filtered results.
I'm not convinced that the filtering is necessarily a good thing, as it masks a run to run variation that could, I think, possibly indicate execution problems with the code. I'll try to come up with some example code that demonstrates this.

On my system the cycle counts returned by your macros are slightly higher than the cycle counts returned by mine.

Timing instruction sequences that take only a few clock cycles in a 100x loop allows you to avoid returning a cycle count of zero (an "obviously wrong" result).


.586
.model flat, stdcall
    OPTION casemap:NONE

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    include \masm32\include\msvcrt.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    includelib \masm32\lib\msvcrt.lib

    include \masm32\macros\macros.asm

    include ptimers.inc   ; include ptimers
    include timers.asm    ; include MichaelW's timers

    COUNT_LOOPS EQU 10
.data?
    szCPU db 48 dup (?) ; will hold the processor information
.data
    str1 db "void void void void void void void void void void void", 0
    str2 db "void null null", 0
.code
start:
    mov eax, 80000000h
    cpuid
    cmp eax, 80000004h
    jb @nocpuid

    mov eax, 80000002h
    cpuid
    mov dword ptr [szCPU], eax
    mov dword ptr [szCPU + 4], ebx
    mov dword ptr [szCPU + 8], ecx
    mov dword ptr [szCPU + 12], edx

    mov eax, 80000003h
    cpuid
    mov dword ptr [szCPU + 16], eax
    mov dword ptr [szCPU + 20], ebx
    mov dword ptr [szCPU + 24], ecx
    mov dword ptr [szCPU + 28], edx

    mov eax, 80000004h
    cpuid
    mov dword ptr [szCPU + 32], eax
    mov dword ptr [szCPU + 36], ebx
    mov dword ptr [szCPU + 40], ecx
    mov dword ptr [szCPU + 44], edx

    print chr$("Processor: ")
    print ADDR szCPU, 13, 10, 13, 10

@nocpuid:
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, 8000h

    print chr$("Petroizki's macros clock counts:",13,10,"lstrcmp:  ")
    mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
    invoke lstrcmp, ADDR str1, ADDR str2
    PTIMER_END_LOOP_COUNT
    print ustr$(dword ptr [qwPTIMER_RESULT])
    print chr$(", ")
    sub ebp, 1
    jnz @B

    print chr$(13,10,"szCmp:    ")
    mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
    invoke szCmp, ADDR str1, ADDR str2
    PTIMER_END_LOOP_COUNT
    print ustr$(dword ptr [qwPTIMER_RESULT])
    print chr$(", ")
    sub ebp, 1
    jnz @B

    print chr$(13,10,"1 x nop:  ")
    mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
    nop
    PTIMER_END_LOOP_COUNT
    print ustr$(dword ptr [qwPTIMER_RESULT])
    print chr$(", ")
    sub ebp, 1
    jnz @B

    print chr$(13,10,"4 x nop:  ")
    mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
    nop
    nop
    nop
    nop
    PTIMER_END_LOOP_COUNT
    print ustr$(dword ptr [qwPTIMER_RESULT])
    print chr$(", ")
    sub ebp, 1
    jnz @B

    print chr$(13,10,13,10,"MichaelW's macros clock counts:",13,10,"lstrcmp:  ")
    mov ebp, COUNT_LOOPS
loop0:
    counter_begin 1000000, HIGH_PRIORITY_CLASS
    invoke lstrcmp, ADDR str1, ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(", ")
    sub ebp, 1
    jnz loop0

    print chr$(13,10,"szCmp:    ")
    mov ebp, COUNT_LOOPS
loop1:
    counter_begin 10000000, HIGH_PRIORITY_CLASS
    invoke szCmp, ADDR str1, ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(", ")
    sub ebp, 1
    jnz loop1

    print chr$(13,10,"1 x nop x 100:  ")
    mov ebp, COUNT_LOOPS

REPEAT_COUNT EQU 100

loop2:
    counter_begin 1000000, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        nop
      ENDM 
    counter_end
    print ustr$(eax)
    print chr$(", ")
    sub ebp, 1
    Jnz loop2

    print chr$(13,10,"4 x nop x 100:  ")
    mov ebp, COUNT_LOOPS
loop3:
    counter_begin 1000000, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        nop
        nop
        nop
        nop
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(", ")
    sub ebp, 1
    jnz loop3

    inkey chr$(13,10,13,10,"Press any key to exit...")

    ret

end start


Petroizki's macros clock counts:
lstrcmp:  1174, 1174, 1174, 1169, 1174, 1169, 1174, 1170, 1174, 1174,
szCmp:    51, 50, 50, 50, 50, 56, 50, 50, 50, 50,
1 x nop:  554189319, 4294967287, 4294967287, 4294967287, 4294967287, 4294967287,
4294967287, 4294967287, 4294967287, 4294967287,
4 x nop:  4294967290, 4294967290, 4294967290, 4294967290, 4294967290, 4294967290
, 554189322, 4294967290, 4294967290, 4294967290,

MichaelW's macros clock counts:
lstrcmp:  1157, 1159, 1156, 1159, 1156, 1160, 1162, 1154, 1158, 1154,
szCmp:    51, 50, 49, 48, 50, 48, 49, 50, 48, 49,
1 x nop x 100:  45, 45, 45, 45, 45, 45, 41, 45, 45, 45,
4 x nop x 100:  197, 197, 196, 196, 196, 196, 197, 196, 203, 196,

eschew obfuscation

Petroizki

Hello MichaelW,

Yes, I have the intel document, which recommends to warmup rdtsc, and I thought it wouldn't hurt to do so. And I also think that setting the eax to the same value is very important on every call to cpuid, as different values change the time taken by cpuid.

I played with the Sleep function, and it seemed to stabilize the timing a little, so I left it there. It may become important on large codes, where OS interrupt is more likely. It also might be important on some CPU's.

As to the problems with nop instructions.. The timer gives negative values (4294967292 is actually -4), because the initialization code ran faster than the real one. This may be because of pairing, cache or just about anything. I think it's something that must be accepted on "inline" timing. At least the timings are pretty much the same on every run, and that's what I have been aiming at.. :wink

The code that you posted was the bad one, it put Pentium 4 to it's knees, you should test the new one. Note that I temporarily disabled the priority level setting on the your timers.asm!

I would be very interested on seeing results on lower-end CPU's. I have tested with different AMD XP/64, and the results have been very precise so far.

QuoteProcessor: AMD Athlon(tm) XP 2400+

Petroizki's macros clock counts:
lstrcmp:  905, 905, 905, 905, 905, 905, 905, 905, 905, 905,
szCmp:    34, 47, 47, 47, 47, 47, 47, 47, 47, 34,
1 x nop:  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
4 x nop:  1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

MichaelW's macros clock counts:
lstrcmp:  943, 919, 914, 918, 923, 919, 920, 923, 935, 928,
szCmp:    32, 32, 31, 31, 31, 32, 33, 32, 33, 32,
1 x nop:  0, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 0,
4 x nop:  0, 0, 0, 1, 0, 4294967295, 1, 0, 4294967295, 4294967295,

EDIT: I think the main reason for the large differences in the result between your macros and mine; is the average calculation, pretty much on every run my macros drop some bad numbers, because the difference is too large.

MichaelW

Running your comparison.exe on my P3-500:

Petroizki's macros clock counts:
lstrcmp:  1162, 1162, 1162, 1162, 1162, 1161, 1162, 1161, 1162, 1162,
szCmp:    50, 50, 50, 50, 50, 50, 50, 50, 50, 50,
1 x nop:  2290649215, 554189319, 4294967287, 4294967287, 4294967287, 4294967287,
4294967287, 4294967287, 4294967287, 4294967287,
4 x nop:  4294967290, 4294967290, 554189322, 4294967290, 554189322, 4294967290,
554189322, 554189322, 4294967290, 4294967290,

MichaelW's macros clock counts:
lstrcmp:  1135, 1129, 1131, 1130, 1129, 1129, 1129, 1132, 1131, 1129,
szCmp:    47, 54, 48, 48, 52, 48, 48, 52, 49, 46,
1 x nop:  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
4 x nop:  0, 1, 1, 1, 0, 1, 0, 1, 0, 2,


Which Intel document is it, and where can I get it?

eschew obfuscation


MichaelW

Thanks, I had already tried that link and I got a not found error. But I did find it here:

http://www.eecg.toronto.edu/~de/RDTSCPM1.pdf

And here is the code that IMO casts a large amount of doubt on the concept (and the competence of the author). Note how the code neglects to load a consistent value into EAX ahead of the CPUID instruction.
void main () {
int time, subtime;
float x = 5.0f;
__asm {
// Make three warm-up passes through the timing routine to make
// sure that the CPUID and RDTSC instruction are ready
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax // Only the last value of subtime is kept
// subtime should now represent the overhead cost of the
// MOV and CPUID instructions
fld x
fld x
cpuid // Serialize execution
rdtsc // Read time stamp to EAX
mov time, eax
fdiv // Perform division
cpuid // Serialize again for time-stamp read
rdtsc
sub eax, time // Find the difference
mov time, eax
}
time = time - subtime; // Subtract the overhead
printf ("%d\n", time); // Print total time of divide to screen
}


eschew obfuscation

Petroizki

Yes, i noticed the bad cpuid calling on that code too.

I don't know if rdtsc itself really needs to be warmed up, but caching of the initial time code is still very important. Currently I go trough the code twice, and then get the real value for the timer initial clock count. I would be very interested of knowing, if the initial time count is miscalculated on low-end computers. Very unlikely that it would be interrupted by OS, and it seems like an overkill to count average from it, but it might prevent negative values?

Could you check this code on your computer, it prints the setup time instead of the result. It is a value that should NOT have any difference at all. :dazzled:
I think it should always be the same, even when the exe is restarted.
mov ebp, 20
@@: PTIMER_START_LOOP_COUNT
nop
PTIMER_END_LOOP_COUNT
print ustr$(dword ptr [__PTIMER_INIT_TIME]) ; <- get the setup time of the timer
print chr$(", ")
sub ebp, 1
jnz @B

Jimg

In my tests http://www.masmforum.com/simple/index.php?topic=2555.msg20246#msg20246 I found it took six times to really settle down.
I've also found that unless one uses the exact same memory address, that is move the test data to the exact same spots in memory before testing, and likewise move the test code itself for the routines under test to the exact same spot before testing, one cannot get consistant results.  After moving the data and test code, the results are consistant with only a few itereations needed rather than looping many thousands of times.

MichaelW

As coded:

234, 234, 235, 164, 164, 189, 234, 198, 238, 235, 242, 198, 207, 235, 235, 235,
193, 236, 197, 234,


With HIGH_PRIORITY_CLASS:

151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151,
151, 151, 151, 151,


eschew obfuscation

Petroizki

I have tested the setup time, and added one extra warmup loop (now three), making the rdtsc&cpuid used 6 times before the actual timing. On my machine, only two warmups sometimes give bad setup time for the first use of the time macro.

MichaelW's timings are shocking, it seems that i probably have to count average for the setup time too (very easily done). It definetly has to be used with HIGH_PRIORITY_CLASS on older CPU's, to provide stable results, at least for now.

MichaelW

The average changed from ~217 to 151. I suspect the same is happening with the faster processors, just to a lesser degree because the total loop time is less.

eschew obfuscation

Petroizki

Ok, new upload. I'm trying to find a simple way to stabilize the setup time. I also trashed the nop tests, instead i added multiply and division.

MichaelW


Petroizki's macros clock counts:
lstrcmp:  1162, 1162, 1162, 1162, 1162, 1162, 1162, 1162, 1162, 1162,
szCmp:    51, 51, 51, 51, 50, 51, 51, 51, 51, 51,
multiply: 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
division: 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,

MichaelW's macros clock counts:
lstrcmp:  1132, 1137, 1131, 1127, 1130, 1127, 1130, 1130, 1147, 1128,
szCmp:    56, 54, 51, 47, 49, 53, 50, 50, 49, 49,
multiply: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
division: 34, 34, 34, 34, 34, 34, 36, 34, 34, 34,

eschew obfuscation

Petroizki

:eek

I guess these timings are due to HIGH_PRIORITY_CLASS i compiled the source with, what if you test it with normal?

Very stable timings indeed.  :8)