Has anyone converted these timing macro's for PB by chance?
http://www.masm32.com/board/index.php?topic=770.0
Hi Mike,
Welcome on board. As far as I know no-one has ever ported Michael's macros to PB and there is the problem that the two languages are not the same in their macro capacity.
What I would be inclined to do is write a normal function or two and code the timing method then put it in a seperate file so it can be included when needed.
Hi Hutch,
That would work. How hard is it to port that ASM to PB?
Are there many PB ASM members?
Mike,
the timing macros that Michael has written are very useful in comparing and optimising low level algorithms but I seriously doubt they would be much use to you in application development. If you need to benchmark different algos there are many ways to do this without the level of specialty involved in these timing macros. I still personally time most algos on a large sample with GetTickCount but there are other API functions with better resolution and of course RDTSC if you know how to use its 64 bit resolution.
Perehaps let us know what you are trying to do and we can probably help you in the direction you need.
Mike,
I'm not sure if this is doable in PB, but I did manage to create a system that consists of one macro one define and two procedures that works well for FreeBASIC. Here are the essential parts of the code:
#include once "windows.bi"
dim shared _counter_tsc1_ as ulongint, _counter_tsc2_ as ulongint
dim shared _counter_overhead_ as ulongint, counter_cycles as ulongint
dim shared _counter_loop_counter_ as uinteger
sub _counter_code1_
asm
'
' Use same CPUID input value for each call.
'
xor eax, eax
'
' Flush pipe and wait for pending ops to finish.
'
cpuid
'
' Read Time Stamp Counter.
'
rdtsc
'
' Save count.
'
mov [_counter_tsc1_], eax
mov [_counter_tsc1_+4], edx
end asm
end sub
sub _counter_code2_
asm
xor eax, eax
cpuid
rdtsc
mov [_counter_tsc2_], eax
mov [_counter_tsc2_+4], edx
end asm
end sub
'' Unlike the #define directive, the #macro directive
'' allows inline asm.
''
#macro COUNTER_BEGIN( loop_count, priority_class )
_counter_overhead_ = 2000000000
counter_cycles = 2000000000
SetPriorityClass( GetCurrentProcess(), priority_class )
Sleep_(0) '' Start a new time slice
''
'' The nops compensate for the 10-byte instruction (that
'' initializes _counter_loop_counter_) between the alignment
'' directive and the loop label, which ideally needs to be
'' aligned on a 16-byte boundary.
''
asm
.balign 16
nop
nop
nop
nop
nop
nop
end asm
for _counter_loop_counter_ = 1 to loop_count
_counter_code1_
_counter_code2_
if (_counter_tsc2_ - _counter_tsc1_) < _counter_overhead_ then
_counter_overhead_ = _counter_tsc2_ - _counter_tsc1_
endif
next
Sleep_(0) '' Start a new time slice
asm
.balign 16
nop
nop
nop
nop
nop
nop
end asm
for _counter_loop_counter_ = 1 to loop_count
_counter_code1_
#endmacro
''
'' *** Note the open FOR loop ***
''
#define COUNTER_END _
_counter_code2_ :_
if (_counter_tsc2_ - _counter_tsc1_) < counter_cycles then :_
counter_cycles = _counter_tsc2_ - _counter_tsc1_ :_
endif :_
next :_
SetPriorityClass( GetCurrentProcess(), NORMAL_PRIORITY_CLASS ) :_
counter_cycles -= _counter_overhead_
Hutch,
I am sure you are right, but I would like to explore this anyway. I learn by exploring :)
#COMPILE EXE
#DIM ALL
GLOBAL hDbg AS LONG
' #INCLUDE once "windows.bi" ?
GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS LONG
GLOBAL LoopCounter AS INTEGER
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1
'
' Use same CPUID input value for each call.
'
! XOR eax, eax
'
' Flush pipe and wait for pending ops to finish.
'
! cpuid
'
' Read Time Stamp Counter.
'
! rdtsc ' <---------------- PB Compiler does not recognize this ***
'
' Save count.
'
! mov [CounterTSC1], eax
! mov [CounterTSC1+4], edx
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2
! XOR eax, eax
! cpuid
! rdtsc ' <---------------- PB Compiler does not recognize this ***
! mov [CounterTSC2], eax
! mov [CounterTSC2+4], edx
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( loop_count, priority_class )
CounterOverhead = 2000000000
CounterCycles = 2000000000
SetPriorityClass( GetCurrentProcess(), priority_class )
Sleep_(0) '' Start a new time slice
'' The nops compensate for the 10-byte instruction (that
'' initializes LoopCounter) between the alignment
'' directive and the loop label, which ideally needs to be
'' aligned on a 16-byte boundary.
! .balign 16
! nop
! nop
! nop
! nop
! nop
! nop
FOR LoopCounter = 1 TO loop_count
CounterCode1
CounterCode2
IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN CounterOverhead = CounterTSC2 - CounterTSC1
NEXT
Sleep_(0) '' Start a new time slice
! .balign 16
! nop
! nop
! nop
! nop
! nop
! nop
FOR LoopCounter = 1 TO loop_count '' *** Note the open FOR loop ***
GOSUB CounterCode1
END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_END
GOSUB CounterCode2
IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1
NEXT
CALL SetPriorityClass( GetCurrentProcess(), NORMAL_PRIORITY_CLASS )
CounterCycles = CounterCycles - CounterOverhead
END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB time_stamp_count(tick AS QUAD) ' CPU Clock count Charles E V Pegge
'---------------------------'
' ' approx because it is not a serialised instruction
' ' it may execute before or after other instructions
' ' in the pipeline.
! mov ebx,tick ' var address where count is to be stored.
! db &h0f,&h31 ' RDTSC read time-stamp counter into edx:eax hi lo.
! mov [ebx],eax ' save low order 4 bytes.
! mov [ebx+4],edx ' save high order 4 bytes.
'---------------------------'
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
LOCAL i, Rounding, nLoops, RetVal AS LONG
LOCAL cBeg, cEnd AS QUAD ' for time stamp, measuring cpu clock cycles
LOCAL d AS DOUBLE
LOCAL s, sTemp AS STRING
hDbg = FREEFILE '
OPEN "Debug.txt" FOR OUTPUT LOCK WRITE AS hDbg ' PRINT #hDbg, "MetersToFt="+STR$(MetersToFt)
nLoops = 100000
d = 523.34#
COUNTER_BEGIN '
FOR i = 1 TO nLoops
d = d * 3
d = d / 3
NEXT
COUNTER_END '
s = s + "VAL =" + STR$(d) + ", Clock Cycles="+STR$( (CounterCycles)\nLoops ) + $CRLF + $CRLF
PRINT #hDbg, s
MSGBOX s,64,"All Done" : EXIT FUNCTION
CLOSE hDbg
END FUNCTION
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
My ASM knowledge is limited, but this will nto compile.
Synthesizing unsupported instructions with db (e.g. ! db &H0F, &H31 for RDTSC) is nothing unusual.
The:
#INCLUDE once "windows.bi"
was necessary for FreeBASIC to pull in the header files and import libraries necessary to call the SetPriorityClass, GetCurrentProcess, and Sleep functions. I have no idea how you might do this with PB.
The Sleep_(0) calls the Windows API Sleep (http://msdn2.microsoft.com/en-us/library/ms686298.aspx) function passing a dwMilliseconds value of 0. The function was renamed to avoid a conflict with the FreeBASIC SLEEP function.
The .balign is a GAS alignment directive. Again, I have no idea how you might do this with PB.
Mike,
here are two PB macros for the two instructions.
MACRO rdtsc = ! db &H0F,&H31
MACRO cpuid = ! db &H0F,&HA2
Note that you use them WITHOUT the leading "!" as they ae macros, not inline assembler.
Thx hutch.
I get a compiler error:
Assembler Syntax Error
for
! mov [CounterTSC1], eax
I suspect the square brackets. Can I replace them with round brackets?
Also, is the PB sleep interchangable with sleep in this code?
ie starting a new time slice?
#COMPILE EXE
#DIM ALL
#INCLUDE "WIN32API.inc" ' Basic Win API definitions
GLOBAL hDbg AS LONG
' #INCLUDE once "windows.bi" ?
GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS LONG
GLOBAL LoopCounter AS INTEGER
MACRO rdtsc = ! db &H0F,&H31
MACRO cpuid = ! db &H0F,&HA2
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1
'
' Use same CPUID input value for each call.
'
! XOR eax, eax
'
' Flush pipe and wait for pending ops to finish.
'
! cpuid
'
' Read Time Stamp Counter.
'
rdtsc
'
' Save count.
'
! mov [CounterTSC1], eax
! mov [CounterTSC1+4], edx
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2
! XOR eax, eax
cpuid
rdtsc
! mov [CounterTSC2], eax
! mov [CounterTSC2+4], edx
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( loop_count, priority_class )
CounterOverhead = 2000000000
CounterCycles = 2000000000
CALL SetPriorityClass( GetCurrentProcess, priority_class )
SLEEP 0 ' Start a new time slice
'' The nops compensate for the 10-byte instruction (that
'' initializes LoopCounter) between the alignment
'' directive and the loop label, which ideally needs to be
'' aligned on a 16-byte boundary.
! .balign 16
! nop
! nop
! nop
! nop
! nop
! nop
FOR LoopCounter = 1 TO loop_count
CounterCode1
CounterCode2
IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN CounterOverhead = CounterTSC2 - CounterTSC1
NEXT
SLEEP 0 ' Start a new time slice
! .balign 16
! nop
! nop
! nop
! nop
! nop
! nop
FOR LoopCounter = 1 TO loop_count '' *** Note the open FOR loop ***
GOSUB CounterCode1
END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_END
GOSUB CounterCode2
IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1
NEXT
CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
CounterCycles = CounterCycles - CounterOverhead
END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB time_stamp_count(tick AS QUAD) ' CPU Clock count Charles E V Pegge
'---------------------------'
' ' approx because it is not a serialised instruction
' ' it may execute before or after other instructions
' ' in the pipeline.
! mov ebx,tick ' var address where count is to be stored.
! db &h0f,&h31 ' RDTSC read time-stamp counter into edx:eax hi lo.
! mov [ebx],eax ' save low order 4 bytes.
! mov [ebx+4],edx ' save high order 4 bytes.
'---------------------------'
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
LOCAL i, Rounding, nLoops, RetVal AS LONG
LOCAL cBeg, cEnd AS QUAD ' for time stamp, measuring cpu clock cycles
LOCAL d AS DOUBLE
LOCAL s, sTemp AS STRING
hDbg = FREEFILE '
OPEN "Debug.txt" FOR OUTPUT LOCK WRITE AS hDbg ' PRINT #hDbg, "MetersToFt="+STR$(MetersToFt)
nLoops = 100000
d = 523.34#
COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS) ' REALTIME_PRIORITY_CLASS
FOR i = 1 TO nLoops
d = d * 3
d = d / 3
NEXT
COUNTER_END '
s = s + "VAL =" + STR$(d) + ", Clock Cycles="+STR$( (CounterCycles)\nLoops ) + $CRLF + $CRLF
PRINT #hDbg, s
MSGBOX s,64,"All Done" : EXIT FUNCTION
CLOSE hDbg
END FUNCTION
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
http://www.powerbasic.com/support/help/pbcc/sleep_statement.htm
Still trying to figure out how to convert
! mov [CounterTSC2+4], edx
to PB. The Compiler baulks at the square brackets...
Mike,
PB is a bit deviant in the notation for complex addressing mode.
replace this
! mov [CounterTSC2+4], edx
with
! mov CounterTSC2[4], edx
See if this works.
Thx Hutch,
Tha seemed to keep the compler happy, now it baulks at:
! .balign 16
Mike,
Delete that line, PowerBASIC does not support code alignment.
And if the compiler cannot align the code there is no need for the two groups of 6 nops, and you should expect the cycle counts to vary somewhat with the memory address of the counting (FOR) loop labels. You should be able to minimize this effect by doing all of your comparison tests in the same macro at the same address, and by adding a sufficient number of nops ahead of the second Sleep(0) to ensure that both of the loop labels (the reference loop that determines the counter overhead and the test loop) are at the same relative alignment (i.e. that they are some multiple of 16 bytes apart).
This will compile and run, but CounterCycles = 0
I suspect it has to do with returuning the values:
CounterTSC1
CounterTSC2
I currently use Charles E V Pegge code.
he returns values with
! mov [ebx],eax ' save low order 4 bytes.
! mov [ebx+4],edx ' save high order 4 bytes.
When I try
! mov [CounterTSC1],eax ' save low order 4 bytes.
! mov [CounterTSC1+4],edx ' save high order 4 bytes.
I get a compiler error again
so perhaps I can do:
! mov ebx,CounterTSC1
! mov CounterTSC1, eax ' Save count.
! mov CounterTSC1[4], edx ' Save count.
(I have no idea what I am doing, just using pattern recognition!)
#COMPILE EXE
#DIM ALL
#INCLUDE "WIN32API.inc" ' Basic Win API definitions
GLOBAL hDbg AS LONG
GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS LONG
GLOBAL LoopCounter AS INTEGER
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1
! XOR eax, eax ' Use same CPUID input value for each call.
! db &H0F,&HA2 ' cpuid - Flush pipe and wait for pending ops to finish.
! db &H0F,&H31 ' rdtsc - Read Time Stamp Counter.
! mov CounterTSC1, eax ' Save count.
! mov CounterTSC1[4], edx ' Save count.
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2
! XOR eax, eax ' Use same CPUID input value for each call.
! db &H0F,&HA2 ' cpuid - Flush pipe and wait for pending ops to finish.
! db &H0F,&H31 ' rdtsc - Read Time Stamp Counter.
! mov CounterTSC2, eax ' Save count.
! mov CounterTSC2[4], edx ' Save count.
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( nLoops, priority_class )
CounterOverhead = 2000000000
CounterCycles = 2000000000
CALL SetPriorityClass( GetCurrentProcess, priority_class )
SLEEP 0 ' Start a new time slice
FOR LoopCounter = 1 TO nLoops
CounterCode1
CounterCode2
IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN CounterOverhead = CounterTSC2 - CounterTSC1
NEXT
' The nops compensate for the 10-byte instruction (that initializes LoopCounter)
' between the alignment directive and the loop label, which ideally needs to be
' aligned on a 16-byte boundary.
! nop
! nop
! nop
! nop
! nop
! nop
SLEEP 0 ' Start a new time slice
FOR LoopCounter = 1 TO nLoops '' *** Note the open FOR loop ***
CounterCode1
END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_END
CounterCode2
IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1
NEXT
CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
CounterCycles = CounterCycles - CounterOverhead
END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB time_stamp_count(tick AS QUAD) ' CPU Clock count Charles E V Pegge
'---------------------------'
' ' approx because it is not a serialised instruction
' ' it may execute before or after other instructions
' ' in the pipeline.
! mov ebx,tick ' var address where count is to be stored.
! db &h0f,&h31 ' RDTSC read time-stamp counter into edx:eax hi lo.
! mov [ebx],eax ' save low order 4 bytes.
! mov [ebx+4],edx ' save high order 4 bytes.
'---------------------------'
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
LOCAL i, Rounding, nLoops, RetVal AS LONG
LOCAL cBeg, cEnd AS QUAD ' for time stamp, measuring cpu clock cycles
LOCAL d AS DOUBLE
LOCAL s, sTemp AS STRING
hDbg = FREEFILE '
OPEN "Debug.txt" FOR OUTPUT LOCK WRITE AS hDbg ' PRINT #hDbg, "MetersToFt="+STR$(MetersToFt)
nLoops = 100000
d = 523.34#
COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS) ' REALTIME_PRIORITY_CLASS
FOR i = 1 TO nLoops
d = d * 3
d = d / 3
NEXT
COUNTER_END '
s = s + "CounterCycles=" + STR$(CounterCycles) + $CRLF
s = s + "nLoops=" + STR$(nLoops) + $CRLF
s = s + "VAL =" + STR$(d) + ", Clock Cycles="+STR$( CounterCycles\nLoops ) + $CRLF + $CRLF
PRINT #hDbg, s
MSGBOX s,64,"All Done" : EXIT FUNCTION
CLOSE hDbg
END FUNCTION
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
In my code the statement:
if (_counter_tsc2_ - _counter_tsc1_) < counter_cycles then :_
counter_cycles = _counter_tsc2_ - _counter_tsc1_ :_
endif :_
Is intended to capture the lowest cycle count that occurs in any loop. Here is the complete header for the FreeBASIC version:
'====================================================================
'' The COUNTER_BEGIN and COUNTER_END macros provide a convenient
'' method of measuring the processor clock cycle count for a block
'' of code. These macros must be called in pairs, and the block of
'' code must be placed in between the COUNTER_BEGIN and COUNTER_END
'' macro calls. The clock cycle count for a single loop through the
'' block of code, corrected for the test loop overhead, is returned
'' in the shared variable COUNTER_CYCLES.
''
'' These macros capture the lowest cycle count that occurs in a
'' single loop through the block of code, on the assumption that
'' the lowest count is the correct count. The higher counts that
'' occur are the result of one or more context switches within the
'' loop. Context switches can occur at the end of a time slice, so
'' to minimize the possibility of the loop overlapping the time
'' slice the COUNTER_BEGIN macro starts a new time slice at the
'' beginning of the loop. If the execution time for a single loop
'' is greater than the duration of a time slice (approximately 20ms
'' under Windows), then the loop will overlap the time slice, and
'' if another thread of equal priority is ready to run, then a
'' context switch will occur.
''
'' A higher-priority thread can "preempt" a lower priority thread,
'' causing a context switch to occur before the end the time slice.
'' Raising the priority of the process (thread) can reduce the
'' frequency of these context switches. To do so, specify a higher
'' than normal priority class in the priority_class parameter of
'' the COUNTER_BEGIN macro. REALTIME_PRIORITY_CLASS specifies the
'' highest possible priority, but using it involves some risk,
'' because your process will preempt *all* other processes,
'' including critical Windows processes, and if the timed code
'' takes too long to execute then Windows may hang, even the NT
'' versions. HIGH_PRIORITY_CLASS is a safer alternative that in
'' most cases will produce the same cycle count.
''
'' You can avoid the bulk of the context switches by simply
'' inserting a few second delay in front of the first macro
'' call, to allow the system disk cache activity to subside.
''
'' For the loop_count parameter of the COUNTER_BEGIN macro, larger
'' values increase the number of samples and so tend to improve
'' the consistency of the returned cycle counts, but in most cases
'' you are unlikely to see any improvement beyond a value of about
'' 1000.
''
'' Note that the block of code must be able to withstand repeated
'' looping, and that this method will not work for timing a loop
'' where the number of loops is variable, because the lowest cycle
'' count that occurs will be when the loop falls through.
''
'' These macros require a Pentium-class processor that supports
'' the CPUID (function 0 only) and RDTSC instructions.
'====================================================================
To get an average cycle count, as is done in the first set of MASM32 macros, you would need to accumulate the total number of cycles for all loops, and then divide it by the number of loops. The FreeBASIC version is similar to the second set of MASM32 macros, and the end result is simply the value of counter_cycles. I selected odd names for the globals to reduce the possibility of naming conflicts.
Hello Michael,
Thank you for your continued help, but I must admit I am struggling here. I do not know ASM at all and I'm generally considered not overly bright on most forums, so if you could tell me what I should change I would be gratefull.
In reading the description, I suspect:
IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1
and
CounterCycles = CounterCycles - CounterOverhead
but I am not sure if I should just remove the subtraction of CounterOverhead or what to do?
I am also very unclear about the nops and how many to add. How do you figure that out?
Kind Regards
Mike
PS
I made all the Global names easily understandable while I grapple to understand how to make this work in PB.
I usually add a prefix to all my Globals when deploying code like this, something like TIMING_MACRO_xxxx
The only big problem that I can see is the CounterCycles\nLoops, where you are taking the lowest cycle count that occurs in any loop and dividing it by the number of loops, when (I assume) your goal is to display the lowest cycle count that occurs in any loop. But I doubt that this accounts for a result of zero, because I doubt that the test (FOR) loop could execute in anything close to nLoops cycles. I think you should just remove the nops and ignore any alignment considerations until you get the code working. If I were doing this I think I would set nLoops to 1 and reduce the test code to just d = d / 3 (no FOR loop and no multiply), which should take something on the order of 10 cycles to execute. Then in the macro I would add print statements to display the values of the globals as the code executes.
ok Done.
With the GLOBALs declared a LONG I get negative numbers(?)
1 CounterTSC1=-257241836
1 CounterTSC2=-257087332
1 CounterOverhead= 9343
-------
2 CounterTSC1=-256895800
2 CounterTSC2=-256835048
2 CounterOverhead= 9343
CounterCycles= 51409
nLoops= 1
VAL = 174.446666666667, Clock Cycles= 51409
With the GLOBALs declared as DWORD...
1 CounterTSC1= 3695436770
1 CounterTSC2= 3695552118
1 CounterOverhead= 9370
-------
2 CounterTSC1= 3695729618
2 CounterTSC2= 3695790974
2 CounterOverhead= 9370
CounterCycles= 51986
nLoops= 1
VAL = 174.446666666667, Clock Cycles= 51986
#COMPILE EXE
#DIM ALL
#INCLUDE "WIN32API.inc" ' Basic Win API definitions
GLOBAL hDbg AS LONG
GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS DWORD
GLOBAL LoopCounter AS INTEGER
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1
! XOR eax, eax ' Use same CPUID input value for each call.
! db &H0F,&HA2 ' cpuid - Flush pipe and wait for pending ops to finish.
! db &H0F,&H31 ' rdtsc - Read Time Stamp Counter.
! mov CounterTSC1, eax ' Save count.
! mov CounterTSC1[4], edx ' Save count.
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2
! XOR eax, eax ' Use same CPUID input value for each call.
! db &H0F,&HA2 ' cpuid - Flush pipe and wait for pending ops to finish.
! db &H0F,&H31 ' rdtsc - Read Time Stamp Counter.
! mov CounterTSC2, eax ' Save count.
! mov CounterTSC2[4], edx ' Save count.
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( nLoops, priority_class )
CounterOverhead = 2000000000
CounterCycles = 2000000000
CALL SetPriorityClass( GetCurrentProcess, priority_class )
SLEEP 0 ' Start a new time slice
FOR LoopCounter = 1 TO nLoops
CounterCode1 '
PRINT #hDbg, "1 CounterTSC1="+STR$(CounterTSC1)
CounterCode2 '
PRINT #hDbg, "1 CounterTSC2="+STR$(CounterTSC2)
IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN CounterOverhead = CounterTSC2 - CounterTSC1 '
PRINT #hDbg, "1 CounterOverhead="+STR$(CounterOverhead)
NEXT '
PRINT #hDbg, "-------"
' The nops compensate for the 10-byte instruction (that initializes LoopCounter)
' between the alignment directive and the loop label, which ideally needs to be
' aligned on a 16-byte boundary.
' ! nop
' ! nop
' ! nop
' ! nop
' ! nop
' ! nop
SLEEP 0 ' Start a new time slice
FOR LoopCounter = 1 TO nLoops '' *** Note the open FOR loop ***
CounterCode1 '
PRINT #hDbg, "2 CounterTSC1="+STR$(CounterTSC1)
END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_END
CounterCode2 '
PRINT #hDbg, "2 CounterTSC2="+STR$(CounterTSC2)
IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1 '
PRINT #hDbg, "2 CounterOverhead="+STR$(CounterOverhead)
NEXT
CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
CounterCycles = CounterCycles - CounterOverhead
END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB time_stamp_count(tick AS QUAD) ' CPU Clock count Charles E V Pegge
'---------------------------'
' ' approx because it is not a serialised instruction
' ' it may execute before or after other instructions
' ' in the pipeline.
! mov ebx,tick ' var address where count is to be stored.
! db &h0f,&h31 ' RDTSC read time-stamp counter into edx:eax hi lo.
! mov [ebx],eax ' save low order 4 bytes.
! mov [ebx+4],edx ' save high order 4 bytes.
'---------------------------'
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
LOCAL i, Rounding, nLoops, RetVal AS LONG
LOCAL cBeg, cEnd AS QUAD ' for time stamp, measuring cpu clock cycles
LOCAL d AS DOUBLE
LOCAL s, sTemp AS STRING
hDbg = FREEFILE '
OPEN "TMDebug.txt" FOR OUTPUT LOCK WRITE AS hDbg ' PRINT #hDbg, "MetersToFt="+STR$(MetersToFt)
nLoops = 1 '00000
d = 523.34#
COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS) ' REALTIME_PRIORITY_CLASS
FOR i = 1 TO nLoops
' d = d * 3
d = d / 3
NEXT
COUNTER_END '
s = s + "CounterCycles=" + STR$(CounterCycles) + $CRLF
s = s + "nLoops=" + STR$(nLoops) + $CRLF
s = s + "VAL =" + STR$(d) + ", Clock Cycles="+STR$( CounterCycles\nLoops ) + $CRLF + $CRLF
PRINT #hDbg, s
MSGBOX s,64,"All Done" : EXIT FUNCTION
CLOSE hDbg
END FUNCTION
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
with nLoops set to 1000
1 CounterTSC1= 120560912
1 CounterTSC2= 120700904
1 CounterOverhead= 9415
1 CounterTSC1= 120845036
1 CounterTSC2= 120904000
1 CounterOverhead= 9415
1 CounterTSC1= 121077796
1 CounterTSC2= 121137880
1 CounterOverhead= 9415
1 CounterTSC1= 121255360
1 CounterTSC2= 121342196
1 CounterOverhead= 9415
1 CounterTSC1= 121459804
1 CounterTSC2= 121518368
1 CounterOverhead= 9415
1 CounterTSC1= 121663192
1 CounterTSC2= 121722916
1 CounterOverhead= 9415
'
'
'
1 CounterTSC2= 213222812
1 CounterOverhead= 9415
1 CounterTSC1= 213276708
1 CounterTSC2= 213303540
1 CounterOverhead= 9415
-------
2 CounterTSC1= 213388548
2 CounterTSC2= 213536432
2 CounterOverhead= 9415
2 CounterTSC1= 214828816
'
'
'
2 CounterOverhead= 9415
2 CounterTSC1= 350547788
2 CounterTSC2= 350615896
2 CounterOverhead= 9415
CounterCycles= 58241
nLoops= 1000
VAL = 0, Clock Cycles= 58
The attachment contains the complete FreeBASIC source and the compiled EXE for a test app that you might find useful for testing. Running on my P3 the results are very consistent run to run:
d = d / 3 -> 10 cycles
d = d * 3 -> 6 cycles
d = d + 3 -> 4 cycles
d = d - 3 -> 4 cycles
d = sqr(d) -> 9 cycles
d = log(d) -> 58 cycles
d = sin(d) -> 87 cycles
[attachment deleted by admin]
On my P4 i get:
Quoted = d / 3 -> 18446744073709551612 cycles
d = d * 3 -> 18446744073709551612 cycles
d = d + 3 -> 18446744073709551612 cycles
d = d - 3 -> 18446744073709551612 cycles
d = sqr(d) -> 4 cycles
d = log(d) -> 204 cycles
d = sin(d) -> 128 cycles
Do you have an email address? I have an idea.
I think 18446744073709551612 is effectively -4, indicating that the reference count was larger than the test count. The problem with not being able to align the loop labels is that you can have a situation where the loop overhead in the test loop is not the same as the loop overhead in the reference loop. As I stated, I think I know of a way to minimize the alignment problems, but implementing it will require access to the compiler output, as assembly code.
Also, counting cycles on a P4 presents problems that AFAIK are unique to the P4. I have never gotten around to doing any definitive tests, but I think the problem has something to do with the Time Stamp Counter being updated in step with the external clock, which runs at 1/4 the speed of the internal clock. Using the method of these macros, taking the lowest count that occurs in any single pass through the test code the count is always a multiple of 4. So obviously, you cannot get useful counts for instructions/statements that execute in less than 4 cycles, or even several times that. For the P4 the method that measures the total cycle count for all passes and calculates an average, as implemented in the first set of MASM32 macros, is probably preferable. But since the P4 architecture is now history, and not likely to be revived, and the method of these macros works so well on the other processors, I did not bother to implement the other method for FB. These macros can be adapted to work reasonably on a P4 for instructions/statements that execute in less than, say 12-16 cycles, the same way that they can be adapted to work on the other processors for instructions/statements that execute in less than a few cycles. Basically, you repeat the test instructions/statements some number of times (by duplicating the instructions/statements in the source, not as a loop) and then divide the result by the number of repeats.
>I assume that the test included no debugging code.
Yes, just ran the .exe you provided
>add three instances of ! nop ahead of the first Sleep 0 statement in the COUNTER_BEGIN macro
done
#COMPILE EXE
#DIM ALL
#INCLUDE "WIN32API.inc" ' Basic Win API definitions
GLOBAL hDbg AS LONG
GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS DWORD
GLOBAL LoopCounter AS INTEGER
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1
! XOR eax, eax ' Use same CPUID input value for each call.
! db &H0F,&HA2 ' cpuid - Flush pipe and wait for pending ops to finish.
! db &H0F,&H31 ' rdtsc - Read Time Stamp Counter.
! mov CounterTSC1, eax ' Save count.
! mov CounterTSC1[4], edx ' Save count.
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2
! XOR eax, eax ' Use same CPUID input value for each call.
! db &H0F,&HA2 ' cpuid - Flush pipe and wait for pending ops to finish.
! db &H0F,&H31 ' rdtsc - Read Time Stamp Counter.
! mov CounterTSC2, eax ' Save count.
! mov CounterTSC2[4], edx ' Save count.
END SUB
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( nLoops, priority_class )
CounterOverhead = 2000000000
CounterCycles = 2000000000
CALL SetPriorityClass( GetCurrentProcess, priority_class )
' The nops compensate for the 10-byte instruction (that initializes LoopCounter)
' between the alignment directive and the loop label, which ideally needs to be
' aligned on a 16-byte boundary.
! nop
! nop
! nop
SLEEP 0 ' Start a new time slice
FOR LoopCounter = 1 TO nLoops
CounterCode1 ' PRINT #hDbg, "1 CounterTSC1="+STR$(CounterTSC1)
CounterCode2 ' PRINT #hDbg, "1 CounterTSC2="+STR$(CounterTSC2)
IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN CounterOverhead = CounterTSC2 - CounterTSC1 ' PRINT #hDbg, "1 CounterOverhead="+STR$(CounterOverhead)
NEXT ' PRINT #hDbg, "-------"
SLEEP 0 ' Start a new time slice
FOR LoopCounter = 1 TO nLoops '' *** Note the open FOR loop ***
CounterCode1 ' PRINT #hDbg, "2 CounterTSC1="+STR$(CounterTSC1)
END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_END
CounterCode2 ' PRINT #hDbg, "2 CounterTSC2="+STR$(CounterTSC2)
IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1 ' PRINT #hDbg, "2 CounterOverhead="+STR$(CounterOverhead)
NEXT
CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
CounterCycles = CounterCycles - CounterOverhead
END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
LOCAL i, Rounding, nLoops, RetVal AS LONG
LOCAL cBeg, cEnd AS QUAD ' for time stamp, measuring cpu clock cycles
LOCAL d AS DOUBLE
LOCAL s, sTemp AS STRING
nLoops = 1000
d = 523.34#
COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS) ' REALTIME_PRIORITY_CLASS
FOR i = 1 TO nLoops
d = d / 3
NEXT
COUNTER_END '
s = s + "CounterCycles=" + STR$(CounterCycles) + $CRLF
s = s + "nLoops=" + STR$(nLoops) + $CRLF
s = s + "VAL =" + STR$(d) + ", Clock Cycles="+STR$( CounterCycles\nLoops ) + $CRLF + $CRLF
MSGBOX s,64,"All Done" : EXIT FUNCTION
END FUNCTION
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
[attachment deleted by admin]
For comparing the PB results to the FB results, I think a more reasonable starting point would be something like this, where the tested code will be the same for both compilers.
#COMPILE EXE
#DIM ALL
#INCLUDE "WIN32API.inc"
GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS DWORD
GLOBAL LoopCounter AS LONG ' Use a 32-bit loop counter
SUB CounterCode1
! XOR eax, eax
! db &H0F,&HA2
! db &H0F,&H31
! mov CounterTSC1, eax
! mov CounterTSC1[4], edx
END SUB
SUB CounterCode2
! XOR eax, eax
! db &H0F,&HA2
! db &H0F,&H31
! mov CounterTSC2, eax
! mov CounterTSC2[4], edx
END SUB
MACRO COUNTER_BEGIN( nLoops, priority_class )
CounterOverhead = 2000000000
CounterCycles = 2000000000
CALL SetPriorityClass( GetCurrentProcess, priority_class )
SLEEP 0
! nop
! nop
! nop
FOR LoopCounter = 1 TO nLoops
CounterCode1
CounterCode2
IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN CounterOverhead = CounterTSC2 - CounterTSC1
NEXT
SLEEP 0
! nop
! nop
! nop
FOR LoopCounter = 1 TO nLoops
CounterCode1
END MACRO
MACRO COUNTER_END
CounterCode2
IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1
NEXT
CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
CounterCycles = CounterCycles - CounterOverhead
END MACRO
FUNCTION PBMAIN
LOCAL s AS STRING
COUNTER_BEGIN(1000, %HIGH_PRIORITY_CLASS)
! xor edx, edx
! mov eax, 1
! div ecx
COUNTER_END
s = STR$(CounterCycles) + $CRLF
COUNTER_BEGIN(1000, %HIGH_PRIORITY_CLASS)
! rcr eax, 31
! rcl eax, 31
COUNTER_END
s = s + STR$(CounterCycles) + $CRLF
COUNTER_BEGIN(1000, %HIGH_PRIORITY_CLASS)
! pushad
! popad
COUNTER_END
s = s + STR$(CounterCycles) + $CRLF
MSGBOX s
EXIT FUNCTION
END FUNCTION
The results for the FB version, again very consistent on my P3:
36
9
16
If you could compile this and post the EXE, I should be able to work out the alignment padding.
Compiled exactly as you wrote it:
Results:
4294930039
4294930047
4294930023
[attachment deleted by admin]
There is a problem in the code that I failed to see. The globals CounterTSC1, CounterTSC2, CounterOverhead, and CounterCycles should all be unsigned 64-bit integers. And I noticed in the compiled code that the compiler is using the FPU to perform what should have been 32-bit integer comparisons. This is workable, but it seems somewhat odd for a performance-oriented compiler, and makes me suspect that the declaration statement might have problems (according to the documentation the declaration is correct for version 8, but I'm not sure about earlier versions). So for testing purposes I think the declarations for the globals should be changed to:
GLOBAL CounterTSC1 AS QUAD
GLOBAL CounterTSC2 AS QUAD
GLOBAL CounterOverhead AS QUAD
GLOBAL CounterCycles AS QUAD
GLOBAL LoopCounter AS LONG ' Use a 32-bit loop counter
And it would be helpful if in the attachment you would include the source with the EXE.
The FPU should NOT be involved in a DWORD operation? Now PowerBASIC is famous for using the FPU for QUAD's:
http://www.powerbasic.com/support/pbforums/showthread.php?t=35768
Results:
16
32
12
Executable and code attached.
I have included the statement
#REGISTER NONE
to ensure that the compiler is not attempting to use the registers.
[attachment deleted by admin]
Running on my P3 the PB code returns 35, 12, 9, and the FB code 36, 9, 16, both with no run-to-run variation through multiple trials. As it happened the for loop labels were 80 bytes apart, so both of the loops were operating at the same relative (mod 16) alignment. One more problem that I failed to anticipate is that the PB compiler apparently does not align the entry point of procedures, so the entry points for the CounterCode1 and CounterCode2 procedures may not be (and probably will not be) at the optimal alignment. To get around this I eliminated the procedures and placed the essential code inline in the macros. I think I have the for loop labels aligned correctly, but I will need to examine the compiled EXE to be sure.
BTW, this code does not modify any of the four registers that must be preserved per this (http://www.powerbasic.com/support/help/pbwin/html/ASM_statement.htm) page.
[attachment deleted by admin]
fbtest.exe yeilds
12
16
16
8
on my P4 with no run to run variation.
TimingMacro09.exe yields:
20
16
16
20
mostly with the occasional
20
16
16
16
In ver10, I increased the loops to 1m and the variation got a little wider
20
16
12 - 16
12 - 20
Attached
[attachment deleted by admin]
On my P3 the FB version consistently returns 36, 9, 16, 15, and the PB version consistently returns 38, 11, 17, 35.
For reference, the same tests with second set of MASM32 macros (counter2.asm) consistently return 38, 12, 19, 16, and the same tests with first set of MASM32 macros (timers.asm) typically return 34, 8, 11, 5.
The large difference for the PB version in the last test can be accounted for by differences in the compiler-generated code for the GetTickCount function call. Where the FB and MASM32 code calls GetTickCount through a jump table, the PB code calls GetTickCount directly, and then calls some other function after GetTickCount returns (as it seems to do for all of the API function calls), and the cycle count includes both function calls. I was attempting to test only compiler-independent code, but I made a bad choice for the last test.
The small differences in the first three tests can be accounted for by the differences in the structure of the macros. The method used by these macros has a weakness in that the timed instructions cannot be completely isolated from the timing instructions, so the timing instructions can affect the cycle count for the adjacent timed instructions. The effect can make it impossible to get meaningful cycle counts for single instructions or short sequences of instructions, but as the length of the sequence increases the effect on the cycle count quickly becomes negligible.
I can now see only two remaining problems. Somehow I made a mistake in calculating the number of alignment nops in the PB code, and instead of 9 there should be only 4. And to minimize the variations you should add a delay of several seconds before the first test (as I did with the sleep 3000 in the FB code), to allow time for the various system activities involved in launching an application to complete.
It now appears that my statement regarding this code not changing registers that must be preserved is wrong. Per this (http://www.powerbasic.com/support/help/pbwin/html/Intermixing_ASM_and_BASIC_code.htm) page:
"compiled PowerBASIC code is designed to expect that the EBX, ESI, and EDI registers will remain unchanged between lines of BASIC code."
Even though there are no apparent problems, the CPUID instruction is changing the value of EBX, so to be safe EBX needs to be preserved around the CPUID instruction. The attachment contains the updated code, with the last test replaced with a longer block of assembly code derived from the MASM32 nrandom procedure.
Edit: I forgot to add a several second delay before the first test.
[attachment deleted by admin]
on my P4 ver 12 (sleep 3000 added) results:
16
16
12
88
fbtest results:
12
16
16
92
Results for both are consistent
QuotePB code calls GetTickCount directly, and then calls some other function after GetTickCount returns
Let me ask our PB guru about the secondary function call...
Added:
I dont see the 4 nops... should I add those?
[attachment deleted by admin]
Running on my P3, the FB version returns 36, 9, 16, 91 and the PB version 38, 12, 17, 91, both with no variations observed. At this point I doubt that the core code can be further improved. I got the alignment right this time. After I added the instructions to preserve EBX around the CPUID calls, the distance between the for loop labels is 112 bytes (16*7) without any padding. The SLEEP 3000 needs to run one time after the program is launched, so it should be placed before the first COUNTER_BEGIN MACRO call, as it is in the FB code. With the SLEEP 3000 in the COUNTER_BEGIN macro it runs each time the macro is called, and while this should not cause problems, only the first run does anything useful.
If you wish to copy or copy and modify any of my comments or explanations from the MASM32 or FB macros, feel free.
Thx Michael
I changed nLoops and the position of the Sleep command
Results
16
16
12
88
Last two numbers are consistant.
first two are 20 one in every 10 executions.
In general is 1000 loops enough to measure?
What would be the lowest number of loops to get consitent results do you think?
[attachment deleted by admin]
On my P3 I get 38, 11, 17, 93 with no variations observed. Within my experience the variations you are seeing on the short sequences of instructions are normal for a P4. These sequences are not representative of real-world code, they are just something that I threw together as a quick test. Most code that would need to be optimized would involve longer sequences. You can also expect to see variations on code were a single pass through the code takes more than, very roughly, a few hundred clock cycles, but at least on my system the variations seldom reach even one percent.
Within my experience 1000 loops is more than enough. On my P3 the FB version running the same test code produces good repeatability down to 2 loops. I use 1000 loops normally, but I commonly reduce this for code where 1000 loops would take too long to execute. For code that cannot be looped I reduce the number of loops to 1, and then average the results over multiple trials.
Michael,
Thank you for developing this with me. I has been educational.
Today I wanted to put it to a real world test, in this case timing a function that converts a string representation of a number to a LONG.
http://www.powerbasic.com/support/pbforums/showthread.php?t=14373&highlight=VAL
In the end I created the attached test bed to evaluate the functions.
The timing variations I got with CPU Clock count method & the Counter Cycle MACROs were:
for 1000 loops:
420-438
279-299
77-231
&
344112-370288
226056-252064
74724-76052
For 100 Loops:
460-465
282-311
82-111
&
40816-43616
25244-27836
7724-10908
I was hoping for less variation, but I am probably just not understanding the nature of the beast.
Also, Is there any direct relationship between Clocks and CounterCycles?
[attachment deleted by admin]
I see now that I should have explained the macros in more detail. I was assuming that you understood what I meant by "test loop", and when I saw evidence that you did not, I failed to recognize it. The test loop is built into the macros, with the first part of it in the COUNTER_BEGIN macro, and the last part of it in the COUNTER_END macro. When the macros are expanded in memory, they look something like this:
Initialize CounterOverhead and CounterCycles
Set priority class.
FOR LoopCounter = 1 TO nLoops '' Start of empty reference loop.
! XOR eax, eax '' CPUID function 0
! push ebx '' preserve EBX
! db &H0F,&HA2 '' CPUID
! pop ebx '' recover EBX
! db &H0F,&H31 '' RDTSC
! mov CounterTSC1, eax '' store low-order dword
! mov CounterTSC1[4], edx '' store high-order dword
! XOR eax, eax
! push ebx
! db &H0F,&HA2
! pop ebx
! db &H0F,&H31
! mov CounterTSC2, eax
! mov CounterTSC2[4], edx
'' Capture the lowest cycle count that occurs.
IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN
CounterOverhead = CounterTSC2 - CounterTSC1
END IF
NEXT '' End of empty reference loop.
SLEEP 0 '' Start a new timeslice.
FOR LoopCounter = 1 TO nLoops '' Start of test loop.
! XOR eax, eax
! push ebx
! db &H0F,&HA2
! pop ebx
! db &H0F,&H31
! mov CounterTSC1, eax
! mov CounterTSC1[4], edx
'' Code placed between the end of the COUNTER_BEGIN
'' macro and the start of the COUNTER_END macro
'' code goes here.
! XOR eax, eax
! push ebx
! db &H0F,&HA2
! pop ebx
! db &H0F,&H31
! mov CounterTSC2, eax
! mov CounterTSC2[4], edx
'' Capture the lowest cycle count that occurs.
IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN
CounterCycles = CounterTSC2 - CounterTSC1
END IF
NEXT '' End of test loop.
Restore priority class to normal.
Calculate counter cycles, correcting it for the loop overhead.
So to pick one example, that illustrates how the macros are intended to be used, and assuming that nLoops = 1000, this code:
COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS)
RetVal = VAL(sNum)
COUNTER_END
Will execute the RetVal = VAL(sNum) statement 1000 times, capturing the lowest count that occurs in any single pass, and correcting it for the loop overhead, so the count will include only the cycles for the RetVal = VAL(sNum) statement, and not the cycles for the test loop overhead.
This code:
COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS)
FOR k = 1 TO nLoops ' inner loop
RetVal = VAL(sNum)
NEXT
COUNTER_END
Will execute the outer test loop (the test loop built into the macros) 1000 times, and the inner loop 1000 times. So the outer test loop will capture the lowest count that occurs in any single pass through the inner test loop, and this count will include 1000 executions of the RetVal = VAL(sNum) statement, as well as the loop overhead for the inner loop.
The attachment tests a FB version of Paul's code, and on my P3 it returns 88, 88, 88 with no variations observed.
[attachment deleted by admin]
aaaah... NOW i get the concept. I couldn't see the wood for the trees... I got so focused on other things i missed this obvious piece of the puzzle.
Your fbtest on my machine returns:
132-144
132-144
128-140
Could you test the atached code to see if there are any variations.
On my machine 1000 loops:
420-438
279-299
77-231
&
416-424
300-332
68 - no variation
100 loops:
439-465
282-312
82-111
&
400-424
288-308
68 - no variation
Is it possible that my P4 is responsible for the variations?
Would it help to escalate the priority?
(I think I remember you mentioned the risk is high if the code is not stable)
What puzzles me most, is that there is a 25 Clk overhead for the orginal method yet it reports consistently lower clks for the second algo than the Macros?
The purpose of investigating these Macros is to find a more stable way of timing code at the fine level for development. The absolute number is not as relevant to me as the stability of the number. A 30 Clk variation at 300Clks is a 10% error. Thats not so great.
[attachment deleted by admin]
Running on my P3 the macro return values, for 100 or 1000 loops, are very consistent, with only an occasional 2-cycle variation in the second test. For the time_stamp_count tests, the return values are usually within a few cycles of the return values for the macros, but occasionally they are way off. The way the tests are being performed the overhead for the time_stamp_count procedure call is negligible, accounting for less than one cycle in the final results, even when nLoops = 100.
The P4 is at least part of the problem, but there could be other things going on in your system that could affect the timing. Assuming you are running Windows 2000 or XP, you might take a look at the CPU usage and time on the Processes tab in Task Manager, looking for processes where the CPU usage is variable and frequently high. You might also want to examine the base priority of the processes, but you will probably first need to add it to the displayed columns in the Select Columns dialog accessible from the View menu. The greatest effect would be from a process running at an elevated priority, where the CPU usage is variable and frequently high. Testing with REALTIME_PRIORITY_CLASS might reduce or eliminate the effects of these other things. If the code is stable I see no reason to avoid REALTIME_PRIORITY_CLASS, but before using it I would make sure that I had no applications open with unsaved data.
Another possibility might be to warm up the CPUID instruction, as I have seen done in an Intel example source. This seems to have no effect on a P3 or on the AMD processors that I have tested, but it might have an effect on a P4. The code would be added in the COUNTER_BEGIN macro before the start of the reference loop, so adding it would not disturb the relative alignment of the FOR loop labels.
...
CounterOverhead = 2000000000
CounterCycles = 2000000000
CALL SetPriorityClass( GetCurrentProcess, priority_class )
'' Warm up the CPUID instruction
! push ebx
! XOR eax, eax
! db &H0F,&HA2
! XOR eax, eax
! db &H0F,&HA2
! XOR eax, eax
! db &H0F,&HA2
! pop ebx
FOR LoopCounter = 1 TO nLoops
...
A more stable way of timing code at the fine level, where the consistency is more important than the actual count, was my goal in developing the second set of MASM32 macros and the FB macros that use the same basic method.
Also, with the FB macros running a P3, and assuming that I remember to add the several-second delay before the first test, if I run an empty test loop:
counter_begin( NLOOPS, HIGH_PRIORITY_CLASS )
counter_end
print counter_cycles
I get a consistent return value of zero. What do your macros do on a P4?
Quotebut you will probably first need to add it to the displayed columns in the Select Columns dialog accessible from the View menu.
Ah OK. Well all I see in High proiority is
taskmgr.exe
winlogon.exe
csrss.exe
Everything else is normal priority
I keep this machine clean as a whistle.
Nothing is taking up processor time as i watch for several minutes.
When I run your test, it reports about 91% CPU usage for a moment
counter_begin( NLOOPS, HIGH_PRIORITY_CLASS )
counter_end
print counter_cycles
consitently returns zero yes
Adding the warm up code to your testing makes it a little more stable. only the third test fluctuates occasionally.
Adding the warm up code to the VAL testing makes little difference at 100 or 1000 loops on this P4 with XP Pro SP2.
But I am relieved to learn that on your P3 it is consitent.
Thank you for your help with this Michael. I learned a lot.
wow MichaelW you really put lot of time into this, you should tell the PB developers about it, sure they can give you recognition and use it internally to speed test things they do.