Since MichaelW has not been updating his timing macros (http://www.masmforum.com/simple/index.php?topic=770.0), I decided to make my own. The idea is to make more stable results, with much less looping and a better average calculation. The final count calculation has the ability to ignore obvious interrupted timings, that are made by OS interference. The timing macros also don't trash registers, flags, stack or FPU. So it doesn't really matter what part of the code you are timing, but of course;
the code being benchmarked with the loop counting macro, must always behave EXACTLY the same way!This "inline" method of timing is not accurate! Problems arise with cache, branching, you name it. But it should give you a rough feeling of what the timing should be on a specific computer.
Like MichaelW's macros, there is a start-macro, and the end-macro. But instead, my macros have two different clock cycle counters, the other one is a single-pass counter, and the other one is the looping one.
PTIMER_START_COUNT ; <- starts the clock cycle count
; <- code at this point is executed only once
; you can freely use registers and flags, like they were set before the timing macro
PTIMER_END_COUNT ; <- ends the clock cycle count
; the clock cycle count is in: 'qword ptr [qwPTIMER_RESULT]'
This single-pass method is fine, if you wan't to time code somewhere middle of your program. And like said, the macros do not change stack, registers or anything else, so you can benchmark pretty much anything. You must understand that this macro will not produce stable results, as the code being times is executed only once. Especially the first run trough the code, will give very weird timings. :naughty:
To get more stable results, you should use the looping macro, which works just like the other one. The only exception is, that the code you are timing, must behave exactly the same way every time it's ran. You don't need to change the loop count, it is constant, and should produce good results.
PTIMER_START_LOOP_COUNT ; <- starts the loop count
; <- code at this point is executed multiple times
; you can expect the registers, flags, FPU and stack to be always the same at this point
; as the start count macro restores the original values
PTIMER_END_LOOP_COUNT ; <- ends the loop count
; the clock cycle count is in: 'qword ptr [qwPTIMER_RESULT]'
My macros won't change the process priority, I decided that it's better to left outside the macros. Of course I advice you to set it to something like HIGH_PRIORITY_CLASS, if you are having very unstable results.
QuoteYou can make your timing more accurate by following these rules:
1. Keep the code that is being timed, small. Large code will result in OS interference, thus resulting in wrong clock cycle count.
2. Try not to benchmark code with lot's and lot's of branches. Processors branch predicition is not unlimited, this will result in very weird timings.
3. The looped timing code should behave exactly the same way everytime it's ran. strcat -function for example, does not follow this rule.
4. And of course; don't run other high priority processes at the same time.
Note that the 64bit counter will wrap-around every 2^64 cycles, in 2GHz processor this means that the timer will give false results about every 292.5 years. Don't say i didn't warn you! :lol
Here is an output on my computer: :wink
QuoteProcessor: AMD Athlon(tm) 64 Processor 3000+
Petroizki's macros clock counts:
lstrcmp: 968, 968, 968, 968, 968, 968, 968, 968, 968, 968,
szCmp: 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
1 x nop: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
4 x nop: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
MichaelW's macros clock counts:
lstrcmp: 1097, 1100, 1109, 1085, 1085, 1085, 1075, 1081, 1080, 1075,
szCmp: 32, 30, 29, 30, 30, 31, 31, 31, 29, 30,
1 x nop: 4294967295, 2, 1, 2, 1, 2, 1, 2, 1, 1,
4 x nop: 1, 1, 0, 1, 1, 0, 2, 1, 1, 2,
dl: http://personal.inet.fi/atk/partsu/ptimers.zip
New upload.
The comparison code with MichaelW's macros totally choke up older CPU's (or at least Pentium 4). I aligned, and reorganized the code so the cache won't flush every millisecond...
(It worked just fine with AMD CPU's :toothy )
QuoteProcessor: Intel(R) Pentium(R) 4 CPU 2.80GHz
Petroizki's macros clock counts:
lstrcmp: 1140, 1124, 1148, 1148, 1144, 1140, 1148, 1148, 1144, 1140,
szCmp: 40, 44, 44, 44, 44, 44, 44, 44, 44, 44,
1 x nop: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
4 x nop: 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292, 4294967292,
MichaelW's macros clock counts:
lstrcmp: 1266, 1342, 1191, 1280, 1216, 1255, 1460, 1183, 1247, 1215,
szCmp: 27, 24, 24, 23, 26, 49, 23, 46, 23, 26,
1 x nop: 1, 4294967295, 4294967295, 1, 4294967294, 0, 1, 4294967294, 4294967294, 3,
4 x nop: 4294967290, 4294967290, 4294967293, 4294967292, 4294967293, 4294967291, 4294967293, 4294967290, 4294967291, 4294967292,
Hi Petroizki, I'm glad to see you working on this :U These days I seldom have the time to get involved in anything that I cannot complete in an hour or so.
I seem to recall an Intel document with code that "warmed up" CPUID or RDTSC or both, but after a quick search I could not find it. I have seen at least two other documents that contained similar code, but in both cases the programmer was failing to control the EAX value prior to executing CPUID, which IMO cast a large amount of doubt on the whole concept. And when I added warm-up code to my macros I could not detect any effect. But I suppose it would be reasonable to do the warm up, on the basis of it can't hurt, and it might help.
I think it was you who pointed out that my macros were not preserving EBX. In my own use I have had to work around the EBX problem more than a few times.
I did not do any test of the single pass macros, but it occurs to me that the sleep 0 is not going to do any good without a priority boost, and not much good without REALTIME_PRIORITY_CLASS.
BTW I like the way you bypassed the alignment filler by having the Sleep function return to a label.
As you can see in the timings below, your loop timing macros have some sort of problem timing the nops on my system, a 500 MHz P3 running Windows 2000 SP4. In the time I've had to work on this I have not been able to determine what exactly is causing the problem.
[EDIT] I see you likely solved the problem while I was composing this [/EDIT]
When I alter your comparison code to use my macros as I intended with a priority boost and longer loops, the repeatability of my raw results are reasonably comparable to the repeatability of your filtered results.
I'm not convinced that the filtering is necessarily a good thing, as it masks a run to run variation that could, I think, possibly indicate execution problems with the code. I'll try to come up with some example code that demonstrates this.
On my system the cycle counts returned by your macros are slightly higher than the cycle counts returned by mine.
Timing instruction sequences that take only a few clock cycles in a 100x loop allows you to avoid returning a cycle count of zero (an "obviously wrong" result).
.586
.model flat, stdcall
OPTION casemap:NONE
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
include \masm32\include\msvcrt.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\msvcrt.lib
include \masm32\macros\macros.asm
include ptimers.inc ; include ptimers
include timers.asm ; include MichaelW's timers
COUNT_LOOPS EQU 10
.data?
szCPU db 48 dup (?) ; will hold the processor information
.data
str1 db "void void void void void void void void void void void", 0
str2 db "void null null", 0
.code
start:
mov eax, 80000000h
cpuid
cmp eax, 80000004h
jb @nocpuid
mov eax, 80000002h
cpuid
mov dword ptr [szCPU], eax
mov dword ptr [szCPU + 4], ebx
mov dword ptr [szCPU + 8], ecx
mov dword ptr [szCPU + 12], edx
mov eax, 80000003h
cpuid
mov dword ptr [szCPU + 16], eax
mov dword ptr [szCPU + 20], ebx
mov dword ptr [szCPU + 24], ecx
mov dword ptr [szCPU + 28], edx
mov eax, 80000004h
cpuid
mov dword ptr [szCPU + 32], eax
mov dword ptr [szCPU + 36], ebx
mov dword ptr [szCPU + 40], ecx
mov dword ptr [szCPU + 44], edx
print chr$("Processor: ")
print ADDR szCPU, 13, 10, 13, 10
@nocpuid:
invoke GetCurrentProcess
invoke SetPriorityClass, eax, 8000h
print chr$("Petroizki's macros clock counts:",13,10,"lstrcmp: ")
mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
invoke lstrcmp, ADDR str1, ADDR str2
PTIMER_END_LOOP_COUNT
print ustr$(dword ptr [qwPTIMER_RESULT])
print chr$(", ")
sub ebp, 1
jnz @B
print chr$(13,10,"szCmp: ")
mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
invoke szCmp, ADDR str1, ADDR str2
PTIMER_END_LOOP_COUNT
print ustr$(dword ptr [qwPTIMER_RESULT])
print chr$(", ")
sub ebp, 1
jnz @B
print chr$(13,10,"1 x nop: ")
mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
nop
PTIMER_END_LOOP_COUNT
print ustr$(dword ptr [qwPTIMER_RESULT])
print chr$(", ")
sub ebp, 1
jnz @B
print chr$(13,10,"4 x nop: ")
mov ebp, COUNT_LOOPS
@@: PTIMER_START_LOOP_COUNT
nop
nop
nop
nop
PTIMER_END_LOOP_COUNT
print ustr$(dword ptr [qwPTIMER_RESULT])
print chr$(", ")
sub ebp, 1
jnz @B
print chr$(13,10,13,10,"MichaelW's macros clock counts:",13,10,"lstrcmp: ")
mov ebp, COUNT_LOOPS
loop0:
counter_begin 1000000, HIGH_PRIORITY_CLASS
invoke lstrcmp, ADDR str1, ADDR str2
counter_end
print ustr$(eax)
print chr$(", ")
sub ebp, 1
jnz loop0
print chr$(13,10,"szCmp: ")
mov ebp, COUNT_LOOPS
loop1:
counter_begin 10000000, HIGH_PRIORITY_CLASS
invoke szCmp, ADDR str1, ADDR str2
counter_end
print ustr$(eax)
print chr$(", ")
sub ebp, 1
jnz loop1
print chr$(13,10,"1 x nop x 100: ")
mov ebp, COUNT_LOOPS
REPEAT_COUNT EQU 100
loop2:
counter_begin 1000000, HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
nop
ENDM
counter_end
print ustr$(eax)
print chr$(", ")
sub ebp, 1
Jnz loop2
print chr$(13,10,"4 x nop x 100: ")
mov ebp, COUNT_LOOPS
loop3:
counter_begin 1000000, HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
nop
nop
nop
nop
ENDM
counter_end
print ustr$(eax)
print chr$(", ")
sub ebp, 1
jnz loop3
inkey chr$(13,10,13,10,"Press any key to exit...")
ret
end start
Petroizki's macros clock counts:
lstrcmp: 1174, 1174, 1174, 1169, 1174, 1169, 1174, 1170, 1174, 1174,
szCmp: 51, 50, 50, 50, 50, 56, 50, 50, 50, 50,
1 x nop: 554189319, 4294967287, 4294967287, 4294967287, 4294967287, 4294967287,
4294967287, 4294967287, 4294967287, 4294967287,
4 x nop: 4294967290, 4294967290, 4294967290, 4294967290, 4294967290, 4294967290
, 554189322, 4294967290, 4294967290, 4294967290,
MichaelW's macros clock counts:
lstrcmp: 1157, 1159, 1156, 1159, 1156, 1160, 1162, 1154, 1158, 1154,
szCmp: 51, 50, 49, 48, 50, 48, 49, 50, 48, 49,
1 x nop x 100: 45, 45, 45, 45, 45, 45, 41, 45, 45, 45,
4 x nop x 100: 197, 197, 196, 196, 196, 196, 197, 196, 203, 196,
Hello MichaelW,
Yes, I have the intel document, which recommends to warmup rdtsc, and I thought it wouldn't hurt to do so. And I also think that setting the eax to the same value is very important on every call to cpuid, as different values change the time taken by cpuid.
I played with the Sleep function, and it seemed to stabilize the timing a little, so I left it there. It may become important on large codes, where OS interrupt is more likely. It also might be important on some CPU's.
As to the problems with nop instructions.. The timer gives negative values (4294967292 is actually -4), because the initialization code ran faster than the real one. This may be because of pairing, cache or just about anything. I think it's something that must be accepted on "inline" timing. At least the timings are pretty much the same on every run, and that's what I have been aiming at.. :wink
The code that you posted was the bad one, it put Pentium 4 to it's knees, you should test the new one. Note that I temporarily disabled the priority level setting on the your timers.asm!
I would be very interested on seeing results on lower-end CPU's. I have tested with different AMD XP/64, and the results have been very precise so far.
QuoteProcessor: AMD Athlon(tm) XP 2400+
Petroizki's macros clock counts:
lstrcmp: 905, 905, 905, 905, 905, 905, 905, 905, 905, 905,
szCmp: 34, 47, 47, 47, 47, 47, 47, 47, 47, 34,
1 x nop: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
4 x nop: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
MichaelW's macros clock counts:
lstrcmp: 943, 919, 914, 918, 923, 919, 920, 923, 935, 928,
szCmp: 32, 32, 31, 31, 31, 32, 33, 32, 33, 32,
1 x nop: 0, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 0,
4 x nop: 0, 0, 0, 1, 0, 4294967295, 1, 0, 4294967295, 4294967295,
EDIT: I think the main reason for the large differences in the result between your macros and mine; is the average calculation, pretty much on every run my macros drop some bad numbers, because the difference is too large.
Running your comparison.exe on my P3-500:
Petroizki's macros clock counts:
lstrcmp: 1162, 1162, 1162, 1162, 1162, 1161, 1162, 1161, 1162, 1162,
szCmp: 50, 50, 50, 50, 50, 50, 50, 50, 50, 50,
1 x nop: 2290649215, 554189319, 4294967287, 4294967287, 4294967287, 4294967287,
4294967287, 4294967287, 4294967287, 4294967287,
4 x nop: 4294967290, 4294967290, 554189322, 4294967290, 554189322, 4294967290,
554189322, 554189322, 4294967290, 4294967290,
MichaelW's macros clock counts:
lstrcmp: 1135, 1129, 1131, 1130, 1129, 1129, 1129, 1132, 1131, 1129,
szCmp: 47, 54, 48, 48, 52, 48, 48, 52, 49, 46,
1 x nop: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
4 x nop: 0, 1, 1, 1, 0, 1, 0, 1, 0, 2,
Which Intel document is it, and where can I get it?
Thanks,
The Intel doc: http://www.math.uwaterloo.ca/~jamuir/rdtscpm1.pdf
Thanks, I had already tried that link and I got a not found error. But I did find it here:
http://www.eecg.toronto.edu/~de/RDTSCPM1.pdf
And here is the code that IMO casts a large amount of doubt on the concept (and the competence of the author). Note how the code neglects to load a consistent value into EAX ahead of the CPUID instruction.
void main () {
int time, subtime;
float x = 5.0f;
__asm {
// Make three warm-up passes through the timing routine to make
// sure that the CPUID and RDTSC instruction are ready
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax // Only the last value of subtime is kept
// subtime should now represent the overhead cost of the
// MOV and CPUID instructions
fld x
fld x
cpuid // Serialize execution
rdtsc // Read time stamp to EAX
mov time, eax
fdiv // Perform division
cpuid // Serialize again for time-stamp read
rdtsc
sub eax, time // Find the difference
mov time, eax
}
time = time - subtime; // Subtract the overhead
printf ("%d\n", time); // Print total time of divide to screen
}
Yes, i noticed the bad cpuid calling on that code too.
I don't know if rdtsc itself really needs to be warmed up, but caching of the initial time code is still very important. Currently I go trough the code twice, and then get the real value for the timer initial clock count. I would be very interested of knowing, if the initial time count is miscalculated on low-end computers. Very unlikely that it would be interrupted by OS, and it seems like an overkill to count average from it, but it might prevent negative values?
Could you check this code on your computer, it prints the setup time instead of the result. It is a value that should NOT have any difference at all. :dazzled:
I think it should always be the same, even when the exe is restarted.
mov ebp, 20
@@: PTIMER_START_LOOP_COUNT
nop
PTIMER_END_LOOP_COUNT
print ustr$(dword ptr [__PTIMER_INIT_TIME]) ; <- get the setup time of the timer
print chr$(", ")
sub ebp, 1
jnz @B
In my tests http://www.masmforum.com/simple/index.php?topic=2555.msg20246#msg20246 I found it took six times to really settle down.
I've also found that unless one uses the exact same memory address, that is move the test data to the exact same spots in memory before testing, and likewise move the test code itself for the routines under test to the exact same spot before testing, one cannot get consistant results. After moving the data and test code, the results are consistant with only a few itereations needed rather than looping many thousands of times.
As coded:
234, 234, 235, 164, 164, 189, 234, 198, 238, 235, 242, 198, 207, 235, 235, 235,
193, 236, 197, 234,
With HIGH_PRIORITY_CLASS:
151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151, 151,
151, 151, 151, 151,
I have tested the setup time, and added one extra warmup loop (now three), making the rdtsc&cpuid used 6 times before the actual timing. On my machine, only two warmups sometimes give bad setup time for the first use of the time macro.
MichaelW's timings are shocking, it seems that i probably have to count average for the setup time too (very easily done). It definetly has to be used with HIGH_PRIORITY_CLASS on older CPU's, to provide stable results, at least for now.
The average changed from ~217 to 151. I suspect the same is happening with the faster processors, just to a lesser degree because the total loop time is less.
Ok, new upload. I'm trying to find a simple way to stabilize the setup time. I also trashed the nop tests, instead i added multiply and division.
Petroizki's macros clock counts:
lstrcmp: 1162, 1162, 1162, 1162, 1162, 1162, 1162, 1162, 1162, 1162,
szCmp: 51, 51, 51, 51, 50, 51, 51, 51, 51, 51,
multiply: 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
division: 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,
MichaelW's macros clock counts:
lstrcmp: 1132, 1137, 1131, 1127, 1130, 1127, 1130, 1130, 1147, 1128,
szCmp: 56, 54, 51, 47, 49, 53, 50, 50, 49, 49,
multiply: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
division: 34, 34, 34, 34, 34, 34, 36, 34, 34, 34,
:eek
I guess these timings are due to HIGH_PRIORITY_CLASS i compiled the source with, what if you test it with normal?
Very stable timings indeed. :8)
[EDIT] Got this wrong the first time, these are now for NORMAL_PRIORITY_CLASS (big difference)[/EDIT]
Petroizki's macros clock counts:
lstrcmp: 3179, 3216, 3229, 3286, 3222, 3255, 3262, 3257, 3241, 3236,
szCmp: 140, 152, 134, 143, 151, 146, 146, 137, 139, 143,
multiply: 9, 19, 4, 13, 4, 8, 3, 9, 14, 8,
division: 31, 27, 35, 36, 33, 38, 34, 29, 28, 45,
MichaelW's macros clock counts:
lstrcmp: 1199, 1204, 1206, 1208, 1207, 1200, 1215, 1204, 1205, 1202,
szCmp: 50, 55, 56, 54, 51, 48, 50, 52, 50, 51,
multiply: 0, 1, 1, 0, 0, 0, 0, 1, 4294967295, 1,
division: 35, 35, 34, 36, 37, 35, 35, 34, 35, 36,
New Upload.
In this version i disabled Sleep calls, and made some other changes, that seemed to steady the setup time timing in debugger. But it's very difficult to know what would be better for slower CPU. I think i have to buy some old 300MHz for testing. :bg
I compiled the compare.exe with NORMAL_PRIORITY_CLASS.
Much better.
Petroizki's macros clock counts:
lstrcmp: 1125, 1149, 1149, 1126, 1126, 1149, 1126, 1149, 1149, 1126,
szCmp: 43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
MichaelW's macros clock counts:
lstrcmp: 1133, 1148, 1141, 1147, 1132, 1134, 1136, 1132, 1143, 1138,
szCmp: 47, 49, 48, 49, 49, 47, 52, 47, 49, 48,
multiply: 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
division: 34, 34, 34, 34, 34, 34, 34, 34, 35, 41,
Thanks Michael, seems like it's giving pretty much perfect values.
New Upload.
I have added a new timer based on the macros, it's entirely for timing procedures only. The reason is to get rid of the inline timing problems, and to bypass some aligning problems. The procedure timer creates it's own 8kb page, and it does increase your exe size, but who cares, no one is going to use this stuff in their programs release versions anyway. If you are timing procedures, I strongly advice you to use this timer, instead of the inline macros. Only use inline macros when you are timing some important parts of the code.
The procedure timer is as easy to use as possible, just like using invoke:
PTIMER_PROC_COUNT lstrlen, ADDR str1 ; <- calls procedure timer, single-pass
; <- result is in '[qwPROCTIMER_RESULT]'
For looping the procedure several times, use the looping call:
PTIMER_PROC_LOOP_COUNT lstrlen, ADDR str1 ; <- call procedure timing, multiple times
; <- like always, result is in '[qwPROCTIMER_RESULT]'
And again, the looping version must always behave exactly the same way on every call, to get stable results. The procedure can expect the registers, flags, and FPU to be like they were set before the timer call. Parameters are automatically repushed on every loop, so you can mess with them, however the timer does push some of it's own values to the stack, so don't expect esp-register to be the same. The timed procedure must be using calling convention 'stdcall' (procedure pops parameters from stack).
The resulting value ignores clock count of ret instruction and parameter pushing..
QuotePetroizki's inline macros clock counts:
lstrcmp: 1017, 1010, 1010, 1010, 1010, 1010, 1010, 1010, 1010, 1010,
szCmp: 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
multiply: 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
division: 38, 38, 38, 38, 38, 38, 38, 38, 38, 38,
Petroizki's procmacros clock counts:
lstrcmp: 1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012,
szCmp: 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,
Quote from: Petroizki on January 23, 2006, 09:22:22 AM
It's very difficult to know what would be better for slower CPU. I think i have to buy some old 300MHz for testing. :bg
Have you tried getting a program which eats up CPU time? CPU Killer is one that comes to mind... :U
Easy,
Write an app in C++ with MFC, OOP(S) and the works. :bg
:toothy
Oh, and new upload too. Some registers were messed up, should work perfectly now. :red
How does the comparison.exe perform on other machines?
Quote from: hutch-- on January 24, 2006, 02:31:43 AM
Write an app in C++ with MFC, OOP(S) and the works. :bg
And make it loop from 1 to 10. That should be long enough to do all these tests :U
Processor: Intel(R) Pentium(R) M processor 1400MHz
Petroizki's inline macros clock counts:
lstrcmp: 806, 821, 806, 806, 806, 806, 806, 806, 806, 806,
szCmp: 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
multiply: 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
division: 36, 36, 36, 36, 36, 36, 36, 36, 36, 36,
Petroizki's procmacros clock counts:
lstrcmp: 796, 797, 797, 797, 797, 797, 797, 797, 797, 797,
szCmp: 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
MichaelW's macros clock counts:
lstrcmp: 811, 806, 801, 810, 817, 804, 807, 806, 821, 816,
szCmp: 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
multiply: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
division: 35, 35, 35, 35, 35, 34, 35, 35, 36, 34,
It was actually running at 1400MHz too. Damn speedstep takes it down to 700MHz most of the time :wink
This is on a P3-500, but for some reason the processor does not display.
Petroizki's inline macros clock counts:
lstrcmp: 1123, 1123, 1147, 1123, 1148, 1137, 1123, 1147, 1165, 1123,
szCmp: 43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
Petroizki's procmacros clock counts:
lstrcmp: 1123, 1128, 1128, 1128, 1128, 1128, 1128, 1128, 1128, 1128,
szCmp: 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
MichaelW's macros clock counts:
lstrcmp: 1205, 1174, 1172, 1192, 1173, 1171, 1173, 1175, 1201, 1171,
szCmp: 49, 48, 54, 50, 49, 50, 51, 56, 53, 53,
multiply: 0, 1, 0, 0, 0, 0, 0, 4294967295, 1, 0,
division: 35, 36, 35, 35, 36, 35, 36, 35, 35, 36,
And this is on a K5-PR166:
Processor: AMD-K5(tm) Processor
Petroizki's inline macros clock counts:
lstrcmp: 162, 162, 162, 162, 162, 162, 162, 162, 162, 162,
szCmp: 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
multiply: 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
division: 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
Petroizki's procmacros clock counts:
lstrcmp: 186, 186, 187, 187, 187, 187, 187, 187, 187, 187,
szCmp: 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
MichaelW's macros clock counts:
lstrcmp: 190, 191, 192, 192, 192, 190, 191, 191, 191, 190,
szCmp: 36, 36, 36, 36, 36, 36, 36, 36, 36, 36,
multiply: 4294967295, 4294967295, 4294967294, 4294967295, 4294967295, 4294967295
, 4294967295, 4294967295, 4294967295, 4294967295,
division: 35, 34, 34, 34, 34, 34, 34, 34, 34, 34,
And just in case anyone noticed, the K5 was a very advanced processor for its time :8)
Quote from: MichaelW on January 24, 2006, 09:08:52 AMThis is on a P3-500, but for some reason the processor does not display.
The processor probably does not support extended cpuid level 80000004h? Which is weird.. Maybe the CPU string is only in 80000002h-80000003h. I made changes to the comparison.exe, so it should print the CPU string, if it's there.
Quote from: MichaelW on January 24, 2006, 09:08:52 AMAnd just in case anyone noticed, the K5 was a very advanced processor for its time :8)
lstrcmp sure does look impressive, what OS is it? Windows XP's lstrlen is almost ten times slower... :eek
It seems that the procmacros always work well on szCmp, shows to be slightly faster than ptimers (no parameter pushing and ret timed), like it should be. lstrlen however seems to get slightly more clock counts sometimes, maybe it's an alignment issue.
New Upload. Procedure timer now corrects stack, if the timed procedure failed to do so (C-calling convention for example). Timing procedures with C-calling convention might not always give the exact clock count (+-1 clock cycles in some cases?), but I don't think nobody cares. However, when possible; use stdcall.
The K5 is running under Windows 98 SE, and the P3 under Windows 2000 SP4.
In response to CPUID function 80000000h the P3 returns 03020101h. In response to CPUID function 0 it returns 2. AFAIK the Processor Brand String is available only on the more recent Intel processors, and the extended functions were first implemented on the P4. For the older (Intel) processors you must get the Brand Index using CPUID function 1, and then look up the Processor Brand String in a table. See the IA-32 Intel Architecture Software Developer's Manual Volume 2A, Table 3-19, Mapping of Brand Indices and IA-32 Processor Brand Strings.
I think Intel started out making a mess of this, and then decided to follow AMD's lead.
Quote from: MichaelW on January 22, 2006, 11:16:16 AM
I seem to recall an Intel document with code that "warmed up" CPUID or RDTSC or both, but after a quick search I could not find it.
CPUID is a "serializing" instruction. It flushes the pipeline. If I remember right the first p4's had a 21 stage pipeline. I might be mis-remembering. If you have a big program, and you are timing just one routine in a part of that program, then CPUID will give you more accurate timings because it'll flush the pipeline before executing the code. If you just have a small snippet of code that runs all by itself, then you aren't going to notice a difference.
And I hear you about only having an hour of free time to do stuff. I don't have much time to code either.
Mark
This was a quick test to determine if it is possible to get a consistent cycle count for the timing overhead, under the best of conditions. Running this code on my P3 the overhead is 139 cycles for perhaps 90-95% of the loops. The count for the first loop is out of line for obvious reasons, but other smaller inconsistencies, where the count is always > 139, seem to occur more or less randomly, even with REALTIME_PRIORITY_CLASS, and even when starting at the beginning of a time slice. I see no way to eliminate the inconsistencies without filtering of the results, and I now think the filtering should just select the shortest time, as Ratch suggested here (http://www.masmforum.com/simple/index.php?topic=770.30.msg16141#msg16141).
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.586
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
startcount dd 0
counts dd 20 dup (0)
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke Sleep,2000
invoke GetCurrentProcess
;invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS
mov edi, 19*4
align 16
@@:
invoke Sleep, 0
xor eax, eax
cpuid
rdtsc
mov startcount, eax
;;;;
xor eax, eax
cpuid
rdtsc
sub eax, startcount
mov [edi+counts], eax
sub edi, 4
jnz @B
invoke GetCurrentProcess
invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
mov edi, 19*4
@@:
print ustr$([edi+counts]),13,10
sub edi, 4
jnz @B
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Ok, new upload.
I now use the shortest time choosing, for the setup time overhead.
Choosing the shortest time to the actual loop counting does not work for me. For example, when I loop 32 times through 4xnop instructions in my comp; I get 31x1 clock cycles and 1x0 clock cycles. I just think the current average counting gives more stable results. I don't know how it is with other CPU's, but i just want to filter the abnormal situations on small code.
I have doubts that the nop results are valid because AFAIK nop is a special case instruction. This code returns very consistent and more or less reasonable counts for most of the instructions that I tested (the exceptions are noted). A result of zero seems reasonable (for this code) for an instruction that executes in less than one clock cycle and/or runs in parallel with one of the overhead instructions. Perhaps there is some simple method of isolating the test instruction(s) from the overhead instructions so the two cannot run in parallel. Perhaps by bracketing the test instructions with CPUIDs and including them in the overhead?
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.586
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
startcount dd 0
overhead dd -1
count dd -1
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke Sleep,2000
invoke GetCurrentProcess
;invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS
mov esi, 10
.WHILE( esi )
mov edi, 30
align 16
.WHILE( edi )
invoke Sleep, 0
xor eax, eax
cpuid
rdtsc
mov startcount, eax
;;;;
xor eax, eax
cpuid
rdtsc
sub eax, startcount
.IF( eax < overhead )
mov overhead, eax
.ENDIF
sub edi, 1
.ENDW
mov edi, 30
align 16
.WHILE( edi )
invoke Sleep, 0
xor eax, eax
cpuid
rdtsc
mov startcount, eax
;;;;;;
;nop ; 0
;nops 2 ; 1
;nops 3 ; 1
nops 4 ; 1
;nops 5 ; 0 usually, sometimes 1 or 2
;ror eax, 1 ; 1
;mov eax, 1 ; 0
; mul ebx ; 1 (in parallel with overhead?)
;xor edx, edx ; 0 by itself
;div ebx ; 7 with (xor edx, edx)
;jmp @F ; 0
@@:
;push eax ; 0 for both
;pop eax ;
;fdiv ; 77
;cmpxchg dl, bh ; 2 for 5 iterations of esi loop, then 0 ??
;;;;;;
xor eax, eax
cpuid
rdtsc
sub eax, startcount
sub eax, overhead
.IF( eax < count )
mov count, eax
.ENDIF
sub edi, 1
.ENDW
print "overhead="
print ustr$(overhead),13,10
print "count="
print ustr$(count),13,10
sub esi, 1
.ENDW
invoke GetCurrentProcess
invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
P3:
Petroizki's inline macros clock counts:
lstrcmp: 1123, 1123, 1147, 1123, 1148, 1123, 1123, 1149, 1123, 1122,
szCmp: 43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
Petroizki's procmacros clock counts:
lstrcmp: 1123, 1128, 1123, 1128, 1128, 1123, 1128, 1123, 1128, 1123,
szCmp: 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,
MichaelW's macros clock counts:
lstrcmp: 1374, 1395, 1354, 1341, 1353, 1384, 1353, 1465, 1376, 1378,
szCmp: 56, 58, 60, 61, 56, 54, 61, 57, 54, 53,
multiply: 3, 0, 1, 4294967295, 2, 4294967295, 0, 4294967295, 4294967295, 1,
division: 44, 42, 43, 40, 43, 39, 42, 41, 43, 43,
Looking at my last results, the cycle count for div seems far too low. I tested again and I still get 7. Agner Fog's Pentium optimization manual shows a latency of 39 and a reciprocal throughput of 37 for the PPro, P2 and P3.
The problem kinda is; should we get the average, or the fastest possible time, even if it would appear only once in 32 loops? So far i have been aiming at stable average. Should i just dump all the average calculations?
_func0 segment
FUNC0_START label dword
func0:
mov ebx, 1
xor edx, edx
div ebx
retn
db 4096*2-($-FUNC0_START) dup (0)
_func0 ends
start:
print chr$("Procedure macros TEST.",13,10,)
mov ebp, 3
@loop:
PTIMER_PROC_LOOP_COUNT func0
mov ebx, __PROCTIMER_LOOPS - 1
or edi, -1
print chr$(13,10,"Loop times: ")
@@: print ustr$(dword ptr [__PROCTIMER_TABLE + ebx*8])
print chr$(", ")
mov ecx, edi
mov eax, dword ptr [__PROCTIMER_TABLE + ebx*8]
sub ecx, eax
sbb edi, edi
and edi, ecx
add edi, eax
sub ebx, 1
jnc @B
print chr$(13,10,"Average as calculated: ")
print ustr$(dword ptr [qwPROCTIMER_RESULT])
print chr$(13,10,"Minimum value: ")
print ustr$(edi)
print chr$(13,10)
sub ebp, 1
jnz @loop
inkey chr$(13,10,13,10,"Press any key to exit...")
ret
end start
Quote from: MichaelW on January 26, 2006, 09:55:09 AMLooking at my last results, the cycle count for div seems far too low. I tested again and I still get 7.
mul clock count is 7, div clock count is 41?
QuoteThe problem kinda is; should we get the average, or the fastest possible time, even if it would appear only once in 32 loops?
Assuming the results from my limited testing are typical, if you discarded the counts for the first few trials, and were running say 20 or more trials, the average would equal the lowest value, so I would lean towards doing whatever is easiest.
I changed my code to add:
xor eax, eax
cpuid
Immediately before the test instructions (and to the overhead loop), and this changed the result for div to 37 cycles. So far I cannot see any way to isolate the test instructions from the xor eax, eax that precedes the next cpuid, because the only other non-privileged serializing instructions are IRET and RSM. I at first thought an IRET might be possible, but setting the stack up correctly is going to be difficult at best.
Quote from: MichaelW on January 26, 2006, 09:55:09 AM
P3:
multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
Looking at my last results, the cycle count for div seems far too low. I tested again and I still get 7. Agner Fog's Pentium optimization manual shows a latency of 39 and a reciprocal throughput of 37 for the PPro, P2 and P3.
According to the results you've posted, you're actually getting 41 for division and 7 for multiplication. I think 41 is much more appropriate :U
QuoteAccording to the results you've posted, you're actually getting 41 for division and 7 for multiplication. I think 41 is much more appropriate.
Sorry, my statement was not clear. I was not referring to the results that I posted, but to the results of the code that I posted.
I am now fairly certain that an IRET cannot be made to work from ring 3. With no way to completely isolate the test instructions from the timing instructions there will always be the potential for one or more of the test instructions to execute in parallel with the trailing xor eax,eax. For long sequences of instructions the effect of this would probably be negligible, but not for short sequences.
Keep up the great work on the timing macros. :bg
Processor: AMD Athlon(tm) XP 2500+
Petroizki's inline macros clock counts:
lstrcmp: 881, 881, 881, 881, 881, 881, 881, 881, 881, 906,
szCmp: 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
multiply: 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
division: 38, 38, 38, 38, 38, 38, 38, 38, 38, 38,
Petroizki's procmacros clock counts:
lstrcmp: 877, 877, 877, 877, 877, 877, 877, 877, 877, 877,
szCmp: 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
MichaelW's macros clock counts:
lstrcmp: 1391, 1264, 1264, 1118, 1177, 1189, 1353, 1124, 1286, 1141,
szCmp: 35, 34, 25, 44, 40, 49, 37, 49, 49, 49,
multiply: 4294967287, 0, 4294967295, 4294967288, 4294967293, 0, 4294967283, 4294
967295, 4294967293, 55,
division: 49, 48, 58, 43, 49, 42, 49, 47, 32, 49,
Press any key to exit...
proctimers example:
multiple-pass lstrlen: 80 clock counts
single-pass lstrlen: 267 clock counts
ptimers example:
multiple-pass lstrlen: 81 clock counts
single-pass lstrlen: 328 clock counts
Interesting, if the project is built as a release version, we get:
Petroizki's inline macros clock counts:
lstrcmp: 899, 899, 899, 899, 899, 899, 899, 899, 877, 877,
and if we build as a debug version, we get:
Petroizki's inline macros clock counts:
lstrcmp: 877, 877, 877, 877, 877, 877, 877, 877, 877, 877,
:wink Everything else matches. The command lines are:
Quote
ml /c /Cp /coff /nologo /I"\masm32\Include" comparison.asm
polink /SUBSYSTEM:CONSOLE /VERSION:4.0 /LIBPATH:"\masm32\lib" /OUT:"TT.exe" *.obj
Quote
ml /c /Cp /coff /Fm /Zi /Zd /nologo /I"\masm32\Include" comparison.asm
polink /SUBSYSTEM:CONSOLE /DEBUG /DEBUGTYPE:BOTH /VERSION:4.0 /LIBPATH:"\masm32\lib" /OUT:"TT.exe" *.obj