The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: Petroizki on January 22, 2006, 07:40:46 AM

Title: Timing macros rebirth
Post by: Petroizki on January 22, 2006, 07:40:46 AM
Since MichaelW has not been updating his timing macros (http://www.masmforum.com/simple/index.php?topic=770.0), I decided to make my own. The idea is to make more stable results, with much less looping and a better average calculation. The final count calculation has the ability to ignore obvious interrupted timings, that are made by OS interference. The timing macros also don't trash registers, flags, stack or FPU. So it doesn't really matter what part of the code you are timing, but of course; the code being benchmarked with the loop counting macro, must always behave EXACTLY the same way!

This "inline" method of timing is not accurate! Problems arise with cache, branching, you name it. But it should give you a rough feeling of what the timing should be on a specific computer.

Like MichaelW's macros, there is a start-macro, and the end-macro. But instead, my macros have two different clock cycle counters, the other one is a single-pass counter, and the other one is the looping one.

PTIMER_START_COUNT ; <- starts the clock cycle count

; <- code at this point is executed only once
; you can freely use registers and flags, like they were set before the timing macro

PTIMER_END_COUNT ; <- ends the clock cycle count
; the clock cycle count is in: 'qword ptr [qwPTIMER_RESULT]'


This single-pass method is fine, if you wan't to time code somewhere middle of your program. And like said, the macros do not change stack, registers or anything else, so you can benchmark pretty much anything. You must understand that this macro will not produce stable results, as the code being times is executed only once. Especially the first run trough the code, will give very weird timings. :naughty:

To get more stable results, you should use the looping macro, which works just like the other one. The only exception is, that the code you are timing, must behave exactly the same way every time it's ran. You don't need to change the loop count, it is constant, and should produce good results.

PTIMER_START_LOOP_COUNT ; <- starts the loop count

; <- code at this point is executed multiple times
; you can expect the registers, flags, FPU and stack to be always the same at this point
; as the start count macro restores the original values

PTIMER_END_LOOP_COUNT ; <- ends the loop count
; the clock cycle count is in: 'qword ptr [qwPTIMER_RESULT]'


My macros won't change the process priority, I decided that it's better to left outside the macros. Of course I advice you to set it to something like HIGH_PRIORITY_CLASS, if you are having very unstable results.

QuoteYou can make your timing more accurate by following these rules:
1. Keep the code that is being timed, small. Large code will result in OS interference, thus resulting in wrong clock cycle count.
2. Try not to benchmark code with lot's and lot's of branches. Processors branch predicition is not unlimited, this will result in very weird timings.
3. The looped timing code should behave exactly the same way everytime it's ran. strcat -function for example, does not follow this rule.
4. And of course; don't run other high priority processes at the same time.

Note that the 64bit counter will wrap-around every 2^64 cycles, in 2GHz processor this means that the timer will give false results about every 292.5 years. Don't say i didn't warn you! :lol

Here is an output on my computer:  :wink
QuoteProcessor: AMD Athlon(tm) 64 Processor 3000+

Petroizki's macros clock counts:
lstrcmp:  968, 968, 968, 968, 968, 968, 968, 968, 968, 968,
szCmp:    29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
1 x nop:  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
4 x nop:  1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

MichaelW's macros clock counts:
lstrcmp:  1097, 1100, 1109, 1085, 1085, 1085, 1075, 1081, 1080, 1075,
szCmp:    32, 30, 29, 30, 30, 31, 31, 31, 29, 30,
1 x nop:  4294967295, 2, 1, 2, 1, 2, 1, 2, 1, 1,
4 x nop:  1, 1, 0, 1, 1, 0, 2, 1, 1, 2,

dl: http://personal.inet.fi/atk/partsu/ptimers.zip
Title: Re: Timing macros rebirth
Post by: Petroizki on January 22, 2006, 10:19:28 AM
New upload.

The comparison code with MichaelW's macros totally choke up older CPU's (or at least Pentium 4). I aligned, and reorganized the code so the cache won't flush every millisecond...

(It worked just fine with AMD CPU's  :toothy )

QuoteProcessor: Intel(R) Pentium(R) 4 CPU 2.80GHz

Petroizki's macros clock counts:
lstrcmp:  1140, 1124, 1148, 1148, 1144, 1140, 1148, 1148, 1144, 1140, 
szCmp:    40, 44, 44, 44, 44, 44, 44, 44, 44, 44, 
1 x nop:  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
4 x nop:  4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292,

MichaelW's macros clock counts:
lstrcmp:  1266, 1342, 1191, 1280, 1216, 1255, 1460, 1183, 1247, 1215, 
szCmp:    27, 24, 24, 23, 26, 49, 23, 46, 23, 26, 
1 x nop:  1, 4294967295, 4294967295, 1, 4294967294, 0, 1, 4294967294, 4294967294, 3, 
4 x nop:  4294967290, 4294967290, 4294967293, 4294967292, 4294967293, 4294967291, 4294967293, 4294967290, 4294967291, 4294967292,
Title: Re: Timing macros rebirth
Post by: MichaelW on January 22, 2006, 11:16:16 AM
Hi Petroizki, I'm glad to see you working on this :U These days I seldom have the time to get involved in anything that I cannot complete in an hour or so.

I seem to recall an Intel document with code that "warmed up" CPUID or RDTSC or both, but after a quick search I could not find it. I have seen at least two other documents that contained similar code, but in both cases the programmer was failing to control the EAX value prior to executing CPUID, which IMO cast a large amount of doubt on the whole concept. And when I added warm-up code to my macros I could not detect any effect. But I suppose it would be reasonable to do the warm up, on the basis of it can't hurt, and it might help.

I think it was you who pointed out that my macros were not preserving EBX. In my own use I have had to work around the EBX problem more than a few times.

I did not do any test of the single pass macros, but it occurs to me that the sleep 0 is not going to do any good without a priority boost, and not much good without REALTIME_PRIORITY_CLASS.

BTW I like the way you bypassed the alignment filler by having the Sleep function return to a label.

As you can see in the timings below, your loop timing macros have some sort of problem timing the nops on my system, a 500 MHz P3 running Windows 2000 SP4. In the time I've had to work on this I have not been able to determine what exactly is causing the problem.
[EDIT] I see you likely solved the problem while I was composing this [/EDIT]

When I alter your comparison code to use my macros as I intended with a priority boost and longer loops, the repeatability of my raw results are reasonably comparable to the repeatability of your filtered results.
I'm not convinced that the filtering is necessarily a good thing, as it masks a run to run variation that could, I think, possibly indicate execution problems with the code. I'll try to come up with some example code that demonstrates this.

On my system the cycle counts returned by your macros are slightly higher than the cycle counts returned by mine.

Timing instruction sequences that take only a few clock cycles in a 100x loop allows you to avoid returning a cycle count of zero (an "obviously wrong" result).


.586
.model flat, stdcall
    OPTION casemap:NONE

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    include \masm32\include\msvcrt.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    includelib \masm32\lib\msvcrt.lib

    include \masm32\macros\macros.asm

    include ptimers.inc   ; include ptimers
    include timers.asm    ; include MichaelW's timers

    COUNT_LOOPS EQU 10
.data?
    szCPU db 48 dup (?) ; will hold the processor information
.data
    str1 db "void void void void void void void void void void void", 0
    str2 db "void null null", 0
.code
start:
    mov eax, 80000000h
    cpuid
    cmp eax, 80000004h
    jb @nocpuid

    mov eax, 80000002h
    cpuid
    mov dword ptr [szCPU], eax
    mov dword ptr [szCPU + 4], ebx
    mov dword ptr [szCPU + 8], ecx
    mov dword ptr [szCPU + 12], edx

    mov eax, 80000003h
    cpuid
    mov dword ptr [szCPU + 16], eax
    mov dword ptr [szCPU + 20], ebx
    mov dword ptr [szCPU + 24], ecx
    mov dword ptr [szCPU + 28], edx

    mov eax, 80000004h
    cpuid
    mov dword ptr [szCPU + 32], eax
    mov dword ptr [szCPU + 36], ebx
    mov dword ptr [szCPU + 40], ecx
    mov dword ptr [szCPU + 44], edx

    print chr$("Processor: ")
    print ADDR szCPU, 13, 10, 13, 10

@nocpuid:
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, 8000h

    print chr$("Petroizki's macros clock counts:",13,10,"lstrcmp:  ")
    mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
    invoke lstrcmp, ADDR str1, ADDR str2
    PTIMER_END_LOOP_COUNT
    print ustr$(dword ptr [qwPTIMER_RESULT])
    print chr$(", ")
    sub ebp, 1
    jnz @B

    print chr$(13,10,"szCmp:    ")
    mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
    invoke szCmp, ADDR str1, ADDR str2
    PTIMER_END_LOOP_COUNT
    print ustr$(dword ptr [qwPTIMER_RESULT])
    print chr$(", ")
    sub ebp, 1
    jnz @B

    print chr$(13,10,"1 x nop:  ")
    mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
    nop
    PTIMER_END_LOOP_COUNT
    print ustr$(dword ptr [qwPTIMER_RESULT])
    print chr$(", ")
    sub ebp, 1
    jnz @B

    print chr$(13,10,"4 x nop:  ")
    mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
    nop
    nop
    nop
    nop
    PTIMER_END_LOOP_COUNT
    print ustr$(dword ptr [qwPTIMER_RESULT])
    print chr$(", ")
    sub ebp, 1
    jnz @B

    print chr$(13,10,13,10,"MichaelW's macros clock counts:",13,10,"lstrcmp:  ")
    mov ebp, COUNT_LOOPS
loop0:
    counter_begin 1000000, HIGH_PRIORITY_CLASS
    invoke lstrcmp, ADDR str1, ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(", ")
    sub ebp, 1
    jnz loop0

    print chr$(13,10,"szCmp:    ")
    mov ebp, COUNT_LOOPS
loop1:
    counter_begin 10000000, HIGH_PRIORITY_CLASS
    invoke szCmp, ADDR str1, ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(", ")
    sub ebp, 1
    jnz loop1

    print chr$(13,10,"1 x nop x 100:  ")
    mov ebp, COUNT_LOOPS

REPEAT_COUNT EQU 100

loop2:
    counter_begin 1000000, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        nop
      ENDM 
    counter_end
    print ustr$(eax)
    print chr$(", ")
    sub ebp, 1
    Jnz loop2

    print chr$(13,10,"4 x nop x 100:  ")
    mov ebp, COUNT_LOOPS
loop3:
    counter_begin 1000000, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        nop
        nop
        nop
        nop
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(", ")
    sub ebp, 1
    jnz loop3

    inkey chr$(13,10,13,10,"Press any key to exit...")

    ret

end start


Petroizki's macros clock counts:
lstrcmp:  1174, 1174, 1174, 1169, 1174, 1169, 1174, 1170, 1174, 1174,
szCmp:    51, 50, 50, 50, 50, 56, 50, 50, 50, 50,
1 x nop:  554189319, 4294967287, 4294967287, 4294967287, 4294967287, 4294967287,
4294967287, 4294967287, 4294967287, 4294967287,
4 x nop:  4294967290, 4294967290, 4294967290, 4294967290, 4294967290, 4294967290
, 554189322, 4294967290, 4294967290, 4294967290,

MichaelW's macros clock counts:
lstrcmp:  1157, 1159, 1156, 1159, 1156, 1160, 1162, 1154, 1158, 1154,
szCmp:    51, 50, 49, 48, 50, 48, 49, 50, 48, 49,
1 x nop x 100:  45, 45, 45, 45, 45, 45, 41, 45, 45, 45,
4 x nop x 100:  197, 197, 196, 196, 196, 196, 197, 196, 203, 196,

Title: Re: Timing macros rebirth
Post by: Petroizki on January 22, 2006, 11:58:18 AM
Hello MichaelW,

Yes, I have the intel document, which recommends to warmup rdtsc, and I thought it wouldn't hurt to do so. And I also think that setting the eax to the same value is very important on every call to cpuid, as different values change the time taken by cpuid.

I played with the Sleep function, and it seemed to stabilize the timing a little, so I left it there. It may become important on large codes, where OS interrupt is more likely. It also might be important on some CPU's.

As to the problems with nop instructions.. The timer gives negative values (4294967292 is actually -4), because the initialization code ran faster than the real one. This may be because of pairing, cache or just about anything. I think it's something that must be accepted on "inline" timing. At least the timings are pretty much the same on every run, and that's what I have been aiming at.. :wink

The code that you posted was the bad one, it put Pentium 4 to it's knees, you should test the new one. Note that I temporarily disabled the priority level setting on the your timers.asm!

I would be very interested on seeing results on lower-end CPU's. I have tested with different AMD XP/64, and the results have been very precise so far.

QuoteProcessor: AMD Athlon(tm) XP 2400+

Petroizki's macros clock counts:
lstrcmp:  905, 905, 905, 905, 905, 905, 905, 905, 905, 905,
szCmp:    34, 47, 47, 47, 47, 47, 47, 47, 47, 34,
1 x nop:  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
4 x nop:  1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

MichaelW's macros clock counts:
lstrcmp:  943, 919, 914, 918, 923, 919, 920, 923, 935, 928,
szCmp:    32, 32, 31, 31, 31, 32, 33, 32, 33, 32,
1 x nop:  0, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 0,
4 x nop:  0, 0, 0, 1, 0, 4294967295, 1, 0, 4294967295, 4294967295,

EDIT: I think the main reason for the large differences in the result between your macros and mine; is the average calculation, pretty much on every run my macros drop some bad numbers, because the difference is too large.
Title: Re: Timing macros rebirth
Post by: MichaelW on January 22, 2006, 12:40:56 PM
Running your comparison.exe on my P3-500:

Petroizki's macros clock counts:
lstrcmp:  1162, 1162, 1162, 1162, 1162, 1161, 1162, 1161, 1162, 1162,
szCmp:    50, 50, 50, 50, 50, 50, 50, 50, 50, 50,
1 x nop:  2290649215, 554189319, 4294967287, 4294967287, 4294967287, 4294967287,
4294967287, 4294967287, 4294967287, 4294967287,
4 x nop:  4294967290, 4294967290, 554189322, 4294967290, 554189322, 4294967290,
554189322, 554189322, 4294967290, 4294967290,

MichaelW's macros clock counts:
lstrcmp:  1135, 1129, 1131, 1130, 1129, 1129, 1129, 1132, 1131, 1129,
szCmp:    47, 54, 48, 48, 52, 48, 48, 52, 49, 46,
1 x nop:  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
4 x nop:  0, 1, 1, 1, 0, 1, 0, 1, 0, 2,


Which Intel document is it, and where can I get it?

Title: Re: Timing macros rebirth
Post by: Petroizki on January 22, 2006, 12:51:35 PM
Thanks,

The Intel doc: http://www.math.uwaterloo.ca/~jamuir/rdtscpm1.pdf
Title: Re: Timing macros rebirth
Post by: MichaelW on January 22, 2006, 02:02:07 PM
Thanks, I had already tried that link and I got a not found error. But I did find it here:

http://www.eecg.toronto.edu/~de/RDTSCPM1.pdf

And here is the code that IMO casts a large amount of doubt on the concept (and the competence of the author). Note how the code neglects to load a consistent value into EAX ahead of the CPUID instruction.
void main () {
int time, subtime;
float x = 5.0f;
__asm {
// Make three warm-up passes through the timing routine to make
// sure that the CPUID and RDTSC instruction are ready
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax // Only the last value of subtime is kept
// subtime should now represent the overhead cost of the
// MOV and CPUID instructions
fld x
fld x
cpuid // Serialize execution
rdtsc // Read time stamp to EAX
mov time, eax
fdiv // Perform division
cpuid // Serialize again for time-stamp read
rdtsc
sub eax, time // Find the difference
mov time, eax
}
time = time - subtime; // Subtract the overhead
printf ("%d\n", time); // Print total time of divide to screen
}


Title: Re: Timing macros rebirth
Post by: Petroizki on January 22, 2006, 02:27:46 PM
Yes, i noticed the bad cpuid calling on that code too.

I don't know if rdtsc itself really needs to be warmed up, but caching of the initial time code is still very important. Currently I go trough the code twice, and then get the real value for the timer initial clock count. I would be very interested of knowing, if the initial time count is miscalculated on low-end computers. Very unlikely that it would be interrupted by OS, and it seems like an overkill to count average from it, but it might prevent negative values?

Could you check this code on your computer, it prints the setup time instead of the result. It is a value that should NOT have any difference at all. :dazzled:
I think it should always be the same, even when the exe is restarted.
mov ebp, 20
@@: PTIMER_START_LOOP_COUNT
nop
PTIMER_END_LOOP_COUNT
print ustr$(dword ptr [__PTIMER_INIT_TIME]) ; <- get the setup time of the timer
print chr$(", ")
sub ebp, 1
jnz @B
Title: Re: Timing macros rebirth
Post by: Jimg on January 22, 2006, 02:40:35 PM
In my tests http://www.masmforum.com/simple/index.php?topic=2555.msg20246#msg20246 I found it took six times to really settle down.
I've also found that unless one uses the exact same memory address, that is move the test data to the exact same spots in memory before testing, and likewise move the test code itself for the routines under test to the exact same spot before testing, one cannot get consistant results.  After moving the data and test code, the results are consistant with only a few itereations needed rather than looping many thousands of times.
Title: Re: Timing macros rebirth
Post by: MichaelW on January 22, 2006, 03:18:51 PM
As coded:

234, 234, 235, 164, 164, 189, 234, 198, 238, 235, 242, 198, 207, 235, 235, 235,
193, 236, 197, 234,


With HIGH_PRIORITY_CLASS:

151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151,
151, 151, 151, 151,


Title: Re: Timing macros rebirth
Post by: Petroizki on January 22, 2006, 03:28:41 PM
I have tested the setup time, and added one extra warmup loop (now three), making the rdtsc&cpuid used 6 times before the actual timing. On my machine, only two warmups sometimes give bad setup time for the first use of the time macro.

MichaelW's timings are shocking, it seems that i probably have to count average for the setup time too (very easily done). It definetly has to be used with HIGH_PRIORITY_CLASS on older CPU's, to provide stable results, at least for now.
Title: Re: Timing macros rebirth
Post by: MichaelW on January 22, 2006, 03:44:05 PM
The average changed from ~217 to 151. I suspect the same is happening with the faster processors, just to a lesser degree because the total loop time is less.

Title: Re: Timing macros rebirth
Post by: Petroizki on January 22, 2006, 03:59:46 PM
Ok, new upload. I'm trying to find a simple way to stabilize the setup time. I also trashed the nop tests, instead i added multiply and division.
Title: Re: Timing macros rebirth
Post by: MichaelW on January 22, 2006, 04:09:27 PM

Petroizki's macros clock counts:
lstrcmp:  1162, 1162, 1162, 1162, 1162, 1162, 1162, 1162, 1162, 1162,
szCmp:    51, 51, 51, 51, 50, 51, 51, 51, 51, 51,
multiply: 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
division: 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,

MichaelW's macros clock counts:
lstrcmp:  1132, 1137, 1131, 1127, 1130, 1127, 1130, 1130, 1147, 1128,
szCmp:    56, 54, 51, 47, 49, 53, 50, 50, 49, 49,
multiply: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
division: 34, 34, 34, 34, 34, 34, 36, 34, 34, 34,

Title: Re: Timing macros rebirth
Post by: Petroizki on January 22, 2006, 04:15:11 PM
:eek

I guess these timings are due to HIGH_PRIORITY_CLASS i compiled the source with, what if you test it with normal?

Very stable timings indeed.  :8)
Title: Re: Timing macros rebirth
Post by: MichaelW on January 22, 2006, 04:26:32 PM
[EDIT] Got this wrong the first time, these are now for NORMAL_PRIORITY_CLASS (big difference)[/EDIT]


Petroizki's macros clock counts:
lstrcmp:  3179, 3216, 3229, 3286, 3222, 3255, 3262, 3257, 3241, 3236,
szCmp:    140, 152, 134, 143, 151, 146, 146, 137, 139, 143,
multiply: 9, 19, 4, 13, 4, 8, 3, 9, 14, 8,
division: 31, 27, 35, 36, 33, 38, 34, 29, 28, 45,

MichaelW's macros clock counts:
lstrcmp:  1199, 1204, 1206, 1208, 1207, 1200, 1215, 1204, 1205, 1202,
szCmp:    50, 55, 56, 54, 51, 48, 50, 52, 50, 51,
multiply: 0, 1, 1, 0, 0, 0, 0, 1, 4294967295, 1,
division: 35, 35, 34, 36, 37, 35, 35, 34, 35, 36,

Title: Re: Timing macros rebirth
Post by: Petroizki on January 23, 2006, 09:22:22 AM
New Upload.

In this version i disabled Sleep calls, and made some other changes, that seemed to steady the setup time timing in debugger. But it's very difficult to know what would be better for slower CPU. I think i have to buy some old 300MHz for testing.  :bg

I compiled the compare.exe with NORMAL_PRIORITY_CLASS.
Title: Re: Timing macros rebirth
Post by: MichaelW on January 23, 2006, 07:09:51 PM
Much better.

Petroizki's macros clock counts:

lstrcmp:  1125, 1149, 1149, 1126, 1126, 1149, 1126, 1149, 1149, 1126,
szCmp:    43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,

MichaelW's macros clock counts:
lstrcmp:  1133, 1148, 1141, 1147, 1132, 1134, 1136, 1132, 1143, 1138,
szCmp:    47, 49, 48, 49, 49, 47, 52, 47, 49, 48,
multiply: 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
division: 34, 34, 34, 34, 34, 34, 34, 34, 35, 41,

Title: Re: Timing macros rebirth
Post by: Petroizki on January 23, 2006, 10:29:34 PM
Thanks Michael, seems like it's giving pretty much perfect values.

New Upload.

I have added a new timer based on the macros, it's entirely for timing procedures only. The reason is to get rid of the inline timing problems, and to bypass some aligning problems. The procedure timer creates it's own 8kb page, and it does increase your exe size, but who cares, no one is going to use this stuff in their programs release versions anyway. If you are timing procedures, I strongly advice you to use this timer, instead of the inline macros. Only use inline macros when you are timing some important parts of the code.

The procedure timer is as easy to use as possible, just like using invoke:
PTIMER_PROC_COUNT lstrlen, ADDR str1 ; <- calls procedure timer, single-pass
; <- result is in '[qwPROCTIMER_RESULT]'


For looping the procedure several times, use the looping call:
PTIMER_PROC_LOOP_COUNT lstrlen, ADDR str1 ; <- call procedure timing, multiple times
; <- like always, result is in '[qwPROCTIMER_RESULT]'


And again, the looping version must always behave exactly the same way on every call, to get stable results. The procedure can expect the registers, flags, and FPU to be like they were set before the timer call. Parameters are automatically repushed on every loop, so you can mess with them, however the timer does push some of it's own values to the stack, so don't expect esp-register to be the same. The timed procedure must be using calling convention 'stdcall' (procedure pops parameters from stack).

The resulting value ignores clock count of ret instruction and parameter pushing..

QuotePetroizki's inline macros clock counts:
lstrcmp:  1017, 1010, 1010, 1010, 1010, 1010, 1010, 1010, 1010, 1010,
szCmp:    31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
multiply: 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
division: 38, 38, 38, 38, 38, 38, 38, 38, 38, 38,

Petroizki's procmacros clock counts:
lstrcmp:  1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012,
szCmp:    28, 28, 28, 28, 28, 28, 28, 28, 28, 28,
Title: Re: Timing macros rebirth
Post by: zooba on January 24, 2006, 01:39:25 AM
Quote from: Petroizki on January 23, 2006, 09:22:22 AM
It's very difficult to know what would be better for slower CPU. I think i have to buy some old 300MHz for testing.  :bg

Have you tried getting a program which eats up CPU time? CPU Killer is one that comes to mind... :U
Title: Re: Timing macros rebirth
Post by: hutch-- on January 24, 2006, 02:31:43 AM
Easy,

Write an app in C++ with MFC, OOP(S) and the works.  :bg
Title: Re: Timing macros rebirth
Post by: Petroizki on January 24, 2006, 06:54:13 AM
:toothy

Oh, and new upload too. Some registers were messed up, should work perfectly now. :red

How does the comparison.exe perform on other machines?
Title: Re: Timing macros rebirth
Post by: zooba on January 24, 2006, 07:41:18 AM
Quote from: hutch-- on January 24, 2006, 02:31:43 AM
Write an app in C++ with MFC, OOP(S) and the works.  :bg

And make it loop from 1 to 10. That should be long enough to do all these tests :U

Processor:         Intel(R) Pentium(R) M processor 1400MHz

Petroizki's inline macros clock counts:
lstrcmp:  806, 821, 806, 806, 806, 806, 806, 806, 806, 806,
szCmp:    31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
multiply: 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
division: 36, 36, 36, 36, 36, 36, 36, 36, 36, 36,

Petroizki's procmacros clock counts:
lstrcmp:  796, 797, 797, 797, 797, 797, 797, 797, 797, 797,
szCmp:    25, 25, 25, 25, 25, 25, 25, 25, 25, 25,

MichaelW's macros clock counts:
lstrcmp:  811, 806, 801, 810, 817, 804, 807, 806, 821, 816,
szCmp:    22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
multiply: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
division: 35, 35, 35, 35, 35, 34, 35, 35, 36, 34,


It was actually running at 1400MHz too. Damn speedstep takes it down to 700MHz most of the time :wink
Title: Re: Timing macros rebirth
Post by: MichaelW on January 24, 2006, 09:08:52 AM
This is on a P3-500, but for some reason the processor does not display.

Petroizki's inline macros clock counts:
lstrcmp:  1123, 1123, 1147, 1123, 1148, 1137, 1123, 1147, 1165, 1123,
szCmp:    43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,

Petroizki's procmacros clock counts:
lstrcmp:  1123, 1128, 1128, 1128, 1128, 1128, 1128, 1128, 1128, 1128,
szCmp:    37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

MichaelW's macros clock counts:
lstrcmp:  1205, 1174, 1172, 1192, 1173, 1171, 1173, 1175, 1201, 1171,
szCmp:    49, 48, 54, 50, 49, 50, 51, 56, 53, 53,
multiply: 0, 1, 0, 0, 0, 0, 0, 4294967295, 1, 0,
division: 35, 36, 35, 35, 36, 35, 36, 35, 35, 36,


And this is on a K5-PR166:

Processor: AMD-K5(tm) Processor

Petroizki's inline macros clock counts:
lstrcmp:  162, 162, 162, 162, 162, 162, 162, 162, 162, 162,
szCmp:    40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
multiply: 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
division: 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

Petroizki's procmacros clock counts:
lstrcmp:  186, 186, 187, 187, 187, 187, 187, 187, 187, 187,
szCmp:    32, 32, 32, 32, 32, 32, 32, 32, 32, 32,

MichaelW's macros clock counts:
lstrcmp:  190, 191, 192, 192, 192, 190, 191, 191, 191, 190,
szCmp:    36, 36, 36, 36, 36, 36, 36, 36, 36, 36,
multiply: 4294967295, 4294967295, 4294967294, 4294967295, 4294967295, 4294967295
, 4294967295, 4294967295, 4294967295, 4294967295,
division: 35, 34, 34, 34, 34, 34, 34, 34, 34, 34,


And just in case anyone noticed, the K5 was a very advanced processor for its time  :8)



Title: Re: Timing macros rebirth
Post by: Petroizki on January 24, 2006, 09:27:45 AM
Quote from: MichaelW on January 24, 2006, 09:08:52 AMThis is on a P3-500, but for some reason the processor does not display.
The processor probably does not support extended cpuid level 80000004h? Which is weird.. Maybe the CPU string is only in 80000002h-80000003h. I made changes to the comparison.exe, so it should print the CPU string, if it's there.

Quote from: MichaelW on January 24, 2006, 09:08:52 AMAnd just in case anyone noticed, the K5 was a very advanced processor for its time  :8)
lstrcmp sure does look impressive, what OS is it? Windows XP's lstrlen is almost ten times slower...  :eek


It seems that the procmacros always work well on szCmp, shows to be slightly faster than ptimers (no parameter pushing and ret timed), like it should be. lstrlen however seems to get slightly more clock counts sometimes, maybe it's an alignment issue.

New Upload. Procedure timer now corrects stack, if the timed procedure failed to do so (C-calling convention for example). Timing procedures with C-calling convention might not always give the exact clock count (+-1 clock cycles in some cases?), but I don't think nobody cares. However, when possible; use stdcall.
Title: Re: Timing macros rebirth
Post by: MichaelW on January 24, 2006, 10:32:56 AM
The K5 is running under Windows 98 SE, and the P3 under Windows 2000 SP4.

In response to CPUID function 80000000h the P3 returns 03020101h. In response to CPUID function 0 it returns 2. AFAIK the Processor Brand String is available only on the more recent Intel processors, and the extended functions were first implemented on the P4. For the older (Intel) processors you must get the Brand Index using CPUID function 1, and then look up the Processor Brand String in a table. See the IA-32 Intel Architecture Software Developer's Manual Volume 2A, Table 3-19, Mapping of Brand Indices and IA-32 Processor Brand Strings.

I think Intel started out making a mess of this, and then decided to follow AMD's lead.

Title: Re: Timing macros rebirth
Post by: Mark_Larson on January 24, 2006, 06:56:01 PM
Quote from: MichaelW on January 22, 2006, 11:16:16 AM
I seem to recall an Intel document with code that "warmed up" CPUID or RDTSC or both, but after a quick search I could not find it.

CPUID is a "serializing" instruction.  It flushes the pipeline.  If I remember right the first p4's had a 21 stage pipeline.  I might be mis-remembering.  If you have a big program, and you are timing just one routine in a part of that program, then CPUID will give you more accurate timings because it'll flush the pipeline before executing the code.  If you just have a small snippet of code that runs all by itself, then you aren't going to notice a difference.

And I hear you about only having an hour of free time to do stuff.  I don't have much time to code either.

Mark
Title: Re: Timing macros rebirth
Post by: MichaelW on January 25, 2006, 10:17:27 PM
This was a quick test to determine if it is possible to get a consistent cycle count for the timing overhead, under the best of conditions. Running this code on my P3 the overhead is 139 cycles for perhaps 90-95% of the loops. The count for the first loop is out of line for obvious reasons, but other smaller inconsistencies, where the count is always > 139, seem to occur more or less randomly, even with REALTIME_PRIORITY_CLASS, and even when starting at the beginning of a time slice. I see no way to eliminate the inconsistencies without filtering of the results, and I now think the filtering should just select the shortest time, as Ratch suggested  here (http://www.masmforum.com/simple/index.php?topic=770.30.msg16141#msg16141).


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
        startcount  dd 0
        counts      dd 20 dup (0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep,2000

    invoke GetCurrentProcess
    ;invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
    invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS

    mov   edi, 19*4

    align 16
  @@:
    invoke Sleep, 0

    xor   eax, eax
    cpuid
    rdtsc
    mov   startcount, eax
    ;;;;
    xor   eax, eax
    cpuid
    rdtsc

    sub   eax, startcount
    mov   [edi+counts], eax
    sub   edi, 4
    jnz   @B
   
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS

    mov   edi, 19*4
  @@:
    print ustr$([edi+counts]),13,10
    sub   edi, 4
    jnz   @B
   
    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Title: Re: Timing macros rebirth
Post by: Petroizki on January 26, 2006, 07:08:39 AM
Ok, new upload.

I now use the shortest time choosing, for the setup time overhead.

Choosing the shortest time to the actual loop counting does not work for me. For example, when I loop 32 times through 4xnop instructions in my comp; I get 31x1 clock cycles and 1x0 clock cycles. I just think the current average counting gives more stable results. I don't know how it is with other CPU's, but i just want to filter the abnormal situations on small code.
Title: Re: Timing macros rebirth
Post by: MichaelW on January 26, 2006, 09:32:24 AM
I have doubts that the nop results are valid because AFAIK nop is a special case instruction. This code returns very consistent and more or less reasonable counts for most of the instructions that I tested (the exceptions are noted). A result of zero seems reasonable (for this code) for an instruction that executes in less than one clock cycle and/or runs in parallel with one of the overhead instructions. Perhaps there is some simple method of isolating the test instruction(s) from the overhead instructions so the two cannot run in parallel. Perhaps by bracketing the test instructions with CPUIDs and including them in the overhead?

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
        startcount  dd 0
        overhead    dd -1
        count       dd -1
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep,2000

    invoke GetCurrentProcess
    ;invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
    invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS

    mov   esi, 10
    .WHILE( esi )
      mov   edi, 30
      align 16   
      .WHILE( edi )
        invoke Sleep, 0
        xor   eax, eax
        cpuid
        rdtsc
        mov   startcount, eax
        ;;;;
        xor   eax, eax
        cpuid
        rdtsc
        sub   eax, startcount
        .IF( eax < overhead )
            mov   overhead, eax
        .ENDIF
        sub   edi, 1
      .ENDW 

      mov   edi, 30
      align 16
      .WHILE( edi )
        invoke Sleep, 0
        xor   eax, eax
        cpuid
        rdtsc
        mov   startcount, eax
        ;;;;;;
        ;nop              ; 0
        ;nops 2           ; 1
        ;nops 3           ; 1
        nops 4           ; 1
        ;nops 5           ; 0 usually, sometimes 1 or 2
        ;ror   eax, 1     ; 1
        ;mov   eax, 1     ; 0
        ; mul   ebx       ; 1 (in parallel with overhead?)
        ;xor   edx, edx   ; 0 by itself
        ;div   ebx        ; 7 with (xor  edx, edx)
        ;jmp   @F         ; 0
      @@:
        ;push  eax        ; 0 for both
        ;pop   eax        ;
        ;fdiv             ; 77
        ;cmpxchg dl, bh   ; 2 for 5 iterations of esi loop, then 0 ??
        ;;;;;;
        xor   eax, eax
        cpuid
        rdtsc
        sub   eax, startcount
        sub   eax, overhead
        .IF( eax < count )
            mov   count, eax
        .ENDIF
        sub   edi, 1
      .ENDW

      print "overhead="
      print ustr$(overhead),13,10
      print "count="
      print ustr$(count),13,10
      sub   esi, 1
    .ENDW 
   
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
   
    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Title: Re: Timing macros rebirth
Post by: MichaelW on January 26, 2006, 09:55:09 AM
P3:

Petroizki's inline macros clock counts:
lstrcmp:  1123, 1123, 1147, 1123, 1148, 1123, 1123, 1149, 1123, 1122,
szCmp:    43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,

Petroizki's procmacros clock counts:
lstrcmp:  1123, 1128, 1123, 1128, 1128, 1123, 1128, 1123, 1128, 1123,
szCmp:    37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

MichaelW's macros clock counts:
lstrcmp:  1374, 1395, 1354, 1341, 1353, 1384, 1353, 1465, 1376, 1378,
szCmp:    56, 58, 60, 61, 56, 54, 61, 57, 54, 53,
multiply: 3, 0, 1, 4294967295, 2, 4294967295, 0, 4294967295, 4294967295, 1,
division: 44, 42, 43, 40, 43, 39, 42, 41, 43, 43,


Looking at my last results, the cycle count for div seems far too low. I tested again and I still get 7. Agner Fog's Pentium optimization manual shows a latency of 39 and a reciprocal throughput of 37 for the PPro, P2 and P3.
Title: Re: Timing macros rebirth
Post by: Petroizki on January 26, 2006, 11:04:49 AM
The problem kinda is; should we get the average, or the fastest possible time, even if it would appear only once in 32 loops? So far i have been aiming at stable average. Should i just dump all the average calculations?

_func0 segment
FUNC0_START label dword
func0:
mov ebx, 1
xor edx, edx
div ebx

retn
db 4096*2-($-FUNC0_START) dup (0)
_func0 ends

start:
print chr$("Procedure macros TEST.",13,10,)

mov ebp, 3
@loop:
PTIMER_PROC_LOOP_COUNT func0

mov ebx, __PROCTIMER_LOOPS - 1
or edi, -1
print chr$(13,10,"Loop times: ")
@@: print ustr$(dword ptr [__PROCTIMER_TABLE + ebx*8])
print chr$(", ")

mov ecx, edi
mov eax, dword ptr [__PROCTIMER_TABLE + ebx*8]
sub ecx, eax
sbb edi, edi
and edi, ecx
add edi, eax

sub ebx, 1
jnc @B

print chr$(13,10,"Average as calculated: ")
print ustr$(dword ptr [qwPROCTIMER_RESULT])
print chr$(13,10,"Minimum value: ")
print ustr$(edi)
print chr$(13,10)
sub ebp, 1
jnz @loop

inkey chr$(13,10,13,10,"Press any key to exit...")
ret
end start


Quote from: MichaelW on January 26, 2006, 09:55:09 AMLooking at my last results, the cycle count for div seems far too low. I tested again and I still get 7.
mul clock count is 7, div clock count is 41?
Title: Re: Timing macros rebirth
Post by: MichaelW on January 26, 2006, 02:56:29 PM
QuoteThe problem kinda is; should we get the average, or the fastest possible time, even if it would appear only once in 32 loops?

Assuming the results from my limited testing are typical, if you discarded the counts for the first few trials, and were running say 20 or more trials, the average would equal the lowest value, so I would lean towards doing whatever is easiest.

I changed my code to add:

xor  eax, eax
cpuid

Immediately before the test instructions (and to the overhead loop), and this changed the result for div to 37 cycles. So far I cannot see any way to isolate the test instructions from the xor  eax, eax that precedes the next cpuid, because the only other non-privileged serializing instructions are IRET and RSM. I at first thought an IRET might be possible, but setting the stack up correctly is going to be difficult at best.

Title: Re: Timing macros rebirth
Post by: zooba on January 26, 2006, 11:46:34 PM
Quote from: MichaelW on January 26, 2006, 09:55:09 AM
P3:

multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,


Looking at my last results, the cycle count for div seems far too low. I tested again and I still get 7. Agner Fog's Pentium optimization manual shows a latency of 39 and a reciprocal throughput of 37 for the PPro, P2 and P3.


According to the results you've posted, you're actually getting 41 for division and 7 for multiplication. I think 41 is much more appropriate :U
Title: Re: Timing macros rebirth
Post by: MichaelW on January 27, 2006, 01:12:44 AM
QuoteAccording to the results you've posted, you're actually getting 41 for division and 7 for multiplication. I think 41 is much more appropriate.

Sorry, my statement was not clear. I was not referring to the results that I posted, but to the results of the code that I posted.

I am now fairly certain that an IRET cannot be made to work from ring 3. With no way to completely isolate the test instructions from the timing instructions there will always be the potential for one or more of the test instructions to execute in parallel with the trailing xor  eax,eax. For long sequences of instructions the effect of this would probably be negligible, but not for short sequences.
Title: Re: Timing macros rebirth
Post by: Mark Jones on January 31, 2006, 06:21:33 PM
Keep up the great work on the timing macros. :bg

Processor: AMD Athlon(tm) XP 2500+

Petroizki's inline macros clock counts:
lstrcmp:  881, 881, 881, 881, 881, 881, 881, 881, 881, 906,
szCmp:    32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
multiply: 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
division: 38, 38, 38, 38, 38, 38, 38, 38, 38, 38,

Petroizki's procmacros clock counts:
lstrcmp:  877, 877, 877, 877, 877, 877, 877, 877, 877, 877,
szCmp:    32, 32, 32, 32, 32, 32, 32, 32, 32, 32,

MichaelW's macros clock counts:
lstrcmp:  1391, 1264, 1264, 1118, 1177, 1189, 1353, 1124, 1286, 1141,
szCmp:    35, 34, 25, 44, 40, 49, 37, 49, 49, 49,
multiply: 4294967287, 0, 4294967295, 4294967288, 4294967293, 0, 4294967283, 4294
967295, 4294967293, 55,
division: 49, 48, 58, 43, 49, 42, 49, 47, 32, 49,

Press any key to exit...


proctimers example:
multiple-pass lstrlen: 80 clock counts
single-pass lstrlen: 267 clock counts


ptimers example:
multiple-pass lstrlen: 81 clock counts
single-pass lstrlen: 328 clock counts
Title: Re: Timing macros rebirth
Post by: Mark Jones on February 12, 2006, 03:18:32 AM
Interesting, if the project is built as a release version, we get:


Petroizki's inline macros clock counts:
lstrcmp:  899, 899, 899, 899, 899, 899, 899, 899, 877, 877,


and if we build as a debug version, we get:


Petroizki's inline macros clock counts:
lstrcmp:  877, 877, 877, 877, 877, 877, 877, 877, 877, 877,


  :wink  Everything else matches. The command lines are:

Quote
ml /c /Cp /coff /nologo /I"\masm32\Include" comparison.asm
polink /SUBSYSTEM:CONSOLE /VERSION:4.0 /LIBPATH:"\masm32\lib" /OUT:"TT.exe" *.obj

Quote
ml /c /Cp /coff /Fm /Zi /Zd /nologo /I"\masm32\Include" comparison.asm
polink /SUBSYSTEM:CONSOLE /DEBUG /DEBUGTYPE:BOTH /VERSION:4.0 /LIBPATH:"\masm32\lib" /OUT:"TT.exe" *.obj