News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Timing Macros for PB

Started by MikeT, January 04, 2008, 03:01:01 PM

Previous topic - Next topic

MikeT

This will compile and run, but CounterCycles = 0
I suspect it has to do with returuning the values:
CounterTSC1
CounterTSC2

I currently use Charles E V Pegge code.
he returns values with
  ! mov [ebx],eax             ' save low order 4 bytes.
  ! mov [ebx+4],edx           ' save high order 4 bytes.

When I try
  ! mov [CounterTSC1],eax             ' save low order 4 bytes.
  ! mov [CounterTSC1+4],edx           ' save high order 4 bytes.
I get a compiler error again
so perhaps I can do:
    ! mov ebx,CounterTSC1     
    ! mov CounterTSC1, eax     ' Save count.
    ! mov CounterTSC1[4], edx  ' Save count.

(I have no idea what I am doing, just using pattern recognition!)

                         

     

#COMPILE EXE
#DIM ALL
       
#INCLUDE "WIN32API.inc"         ' Basic Win API definitions   

GLOBAL hDbg AS LONG             
               

GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS LONG
GLOBAL LoopCounter AS INTEGER
   
       
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter.
    ! mov CounterTSC1, eax     ' Save count.
    ! mov CounterTSC1[4], edx  ' Save count.

END SUB
     
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter.   
    ! mov CounterTSC2, eax     ' Save count.
    ! mov CounterTSC2[4], edx  ' Save count.

END SUB


'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( nLoops, priority_class ) 

    CounterOverhead = 2000000000
    CounterCycles   = 2000000000 
                         
    CALL SetPriorityClass( GetCurrentProcess, priority_class )
           
    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops
        CounterCode1
        CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN  CounterOverhead = CounterTSC2 - CounterTSC1
    NEXT   
         
       
    ' The nops compensate for the 10-byte instruction (that initializes LoopCounter)
    ' between the alignment directive and the loop label, which ideally needs to be
    ' aligned on a 16-byte boundary.
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop
    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops  '' *** Note the open FOR loop ***
        CounterCode1
         


END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'         
MACRO COUNTER_END
                 


        CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1
    NEXT
    CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
    CounterCycles = CounterCycles - CounterOverhead       

END MACRO
             

           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB time_stamp_count(tick AS QUAD) ' CPU Clock count    Charles E V Pegge

  '---------------------------'
  '                           ' approx because it is not a serialised instruction
  '                           ' it may execute before or after other instructions
  '                           ' in the pipeline.
  ! mov ebx,tick              ' var address where count is to be stored.
  ! db  &h0f,&h31             ' RDTSC read time-stamp counter into edx:eax hi lo.
  ! mov [ebx],eax             ' save low order 4 bytes.
  ! mov [ebx+4],edx           ' save high order 4 bytes.
  '---------------------------'

END SUB
           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
               
  LOCAL i, Rounding, nLoops, RetVal AS LONG 
  LOCAL cBeg, cEnd AS QUAD  ' for time stamp, measuring cpu clock cycles
  LOCAL d AS DOUBLE
  LOCAL s, sTemp AS STRING

               
  hDbg = FREEFILE '
  OPEN "Debug.txt" FOR OUTPUT LOCK WRITE AS hDbg ' PRINT #hDbg, "MetersToFt="+STR$(MetersToFt)
             

    nLoops = 100000
                           
           
    d = 523.34#
   
    COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS) ' REALTIME_PRIORITY_CLASS

      FOR i = 1 TO nLoops
        d = d * 3 
        d = d / 3
      NEXT               

    COUNTER_END ' 
               
    s = s + "CounterCycles=" + STR$(CounterCycles) + $CRLF 
    s = s + "nLoops=" + STR$(nLoops) + $CRLF
    s = s + "VAL =" + STR$(d) + ",   Clock Cycles="+STR$( CounterCycles\nLoops ) + $CRLF + $CRLF
                 
PRINT #hDbg, s
                   
MSGBOX s,64,"All Done"  : EXIT FUNCTION


  CLOSE hDbg

END FUNCTION

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

MichaelW

In my code the statement:

if (_counter_tsc2_ - _counter_tsc1_) < counter_cycles then :_
counter_cycles = _counter_tsc2_ - _counter_tsc1_ :_
endif :_

Is intended to capture the lowest cycle count that occurs in any loop. Here is the complete header for the FreeBASIC version:

'====================================================================
'' The COUNTER_BEGIN and COUNTER_END macros provide a convenient
'' method of measuring the processor clock cycle count for a block
'' of code. These macros must be called in pairs, and the block of
'' code must be placed in between the COUNTER_BEGIN and COUNTER_END
'' macro calls. The clock cycle count for a single loop through the
'' block of code, corrected for the test loop overhead, is returned
'' in the shared variable COUNTER_CYCLES.
''
'' These macros capture the lowest cycle count that occurs in a
'' single loop through the block of code, on the assumption that
'' the lowest count is the correct count. The higher counts that
'' occur are the result of one or more context switches within the
'' loop. Context switches can occur at the end of a time slice, so
'' to minimize the possibility of the loop overlapping the time
'' slice the COUNTER_BEGIN macro starts a new time slice at the
'' beginning of the loop. If the execution time for a single loop
'' is greater than the duration of a time slice (approximately 20ms
'' under Windows), then the loop will overlap the time slice, and
'' if another thread of equal priority is ready to run, then a
'' context switch will occur.
''
'' A higher-priority thread can "preempt" a lower priority thread,
'' causing a context switch to occur before the end the time slice.
'' Raising the priority of the process (thread) can reduce the
'' frequency of these context switches. To do so, specify a higher
'' than normal priority class in the priority_class parameter of
'' the COUNTER_BEGIN macro. REALTIME_PRIORITY_CLASS specifies the
'' highest possible priority, but using it involves some risk,
'' because your process will preempt *all* other processes,
'' including critical Windows processes, and if the timed code
'' takes too long to execute then Windows may hang, even the NT
'' versions. HIGH_PRIORITY_CLASS is a safer alternative that in
'' most cases will produce the same cycle count.
''
'' You can avoid the bulk of the context switches by simply
'' inserting a few second delay in front of the first macro
'' call, to allow the system disk cache activity to subside.
''
'' For the loop_count parameter of the COUNTER_BEGIN macro, larger
'' values increase the number of samples and so tend to improve
'' the consistency of the returned cycle counts, but in most cases
'' you are unlikely to see any improvement beyond a value of about
'' 1000.
''
'' Note that the block of code must be able to withstand repeated
'' looping, and that this method will not work for timing a loop
'' where the number of loops is variable, because the lowest cycle
'' count that occurs will be when the loop falls through.
''
'' These macros require a Pentium-class processor that supports
'' the CPUID (function 0 only) and RDTSC instructions.
'====================================================================

To get an average cycle count, as is done in the first set of MASM32 macros, you would need to accumulate the total number of cycles for all loops, and then divide it by the number of loops. The FreeBASIC version is similar to the second set of MASM32 macros, and the end result is simply the value of counter_cycles. I selected odd names for the globals to reduce the possibility of naming conflicts.
eschew obfuscation

MikeT

Hello Michael,

Thank you for your continued help, but I must admit I am struggling here. I do not know ASM at all and I'm generally considered not overly bright on most forums, so if you could tell me what I should change I would be gratefull.

In reading the description, I suspect:
IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1

and

CounterCycles = CounterCycles - CounterOverhead

but I am not sure if I should just remove the subtraction of CounterOverhead or what to do?

I am also very unclear about the nops and how many to add. How do you figure that out?

Kind Regards
Mike
PS
I made all the Global names easily understandable while I grapple to understand how to make this work in PB.
I usually add a prefix to all my Globals when deploying code like this, something like TIMING_MACRO_xxxx

MichaelW

The only big problem that I can see is the CounterCycles\nLoops, where you are taking the lowest cycle count that occurs in any loop and dividing it by the number of loops, when (I assume) your goal is to display the lowest cycle count that occurs in any loop. But I doubt that this accounts for a result of zero, because I doubt that the test (FOR) loop could execute in anything close to nLoops cycles. I think you should just remove the nops and ignore any alignment considerations until you get the code working. If I were doing this I think I would set nLoops to 1 and reduce the test code to just d = d / 3 (no FOR loop and no multiply), which should take something on the order of 10 cycles to execute. Then in the macro I would add print statements to display the values of the globals as the code executes.
eschew obfuscation

MikeT

ok Done.
With the GLOBALs declared a LONG I get negative numbers(?)
1 CounterTSC1=-257241836
1 CounterTSC2=-257087332
1 CounterOverhead= 9343
-------
2 CounterTSC1=-256895800
2 CounterTSC2=-256835048
2 CounterOverhead= 9343
CounterCycles= 51409
nLoops= 1
VAL = 174.446666666667,   Clock Cycles= 51409


With the GLOBALs declared as DWORD...
1 CounterTSC1= 3695436770
1 CounterTSC2= 3695552118
1 CounterOverhead= 9370
-------
2 CounterTSC1= 3695729618
2 CounterTSC2= 3695790974
2 CounterOverhead= 9370
CounterCycles= 51986
nLoops= 1
VAL = 174.446666666667,   Clock Cycles= 51986

                         

     

#COMPILE EXE
#DIM ALL
       
#INCLUDE "WIN32API.inc"         ' Basic Win API definitions   

GLOBAL hDbg AS LONG             
               

GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS DWORD
GLOBAL LoopCounter AS INTEGER
   
       
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter. 
    ! mov CounterTSC1, eax     ' Save count.
    ! mov CounterTSC1[4], edx  ' Save count.

END SUB
     
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter.   
    ! mov CounterTSC2, eax     ' Save count.
    ! mov CounterTSC2[4], edx  ' Save count.

END SUB


'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( nLoops, priority_class ) 

    CounterOverhead = 2000000000
    CounterCycles   = 2000000000 
                         
    CALL SetPriorityClass( GetCurrentProcess, priority_class )
           
    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops
        CounterCode1 '
PRINT #hDbg, "1 CounterTSC1="+STR$(CounterTSC1)
        CounterCode2 '
PRINT #hDbg, "1 CounterTSC2="+STR$(CounterTSC2)
        IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN  CounterOverhead = CounterTSC2 - CounterTSC1 '
PRINT #hDbg, "1 CounterOverhead="+STR$(CounterOverhead)
    NEXT '         
PRINT #hDbg, "-------"
       
    ' The nops compensate for the 10-byte instruction (that initializes LoopCounter)
    ' between the alignment directive and the loop label, which ideally needs to be
    ' aligned on a 16-byte boundary.
'    !  nop
'    !  nop
'    !  nop
'    !  nop
'    !  nop
'    !  nop
    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops  '' *** Note the open FOR loop ***
        CounterCode1 '
PRINT #hDbg, "2 CounterTSC1="+STR$(CounterTSC1)
         


END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'         
MACRO COUNTER_END
                 


        CounterCode2 '
PRINT #hDbg, "2 CounterTSC2="+STR$(CounterTSC2)
        IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1 '
PRINT #hDbg, "2 CounterOverhead="+STR$(CounterOverhead)
    NEXT
    CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
    CounterCycles = CounterCycles - CounterOverhead       

END MACRO
             

           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB time_stamp_count(tick AS QUAD) ' CPU Clock count    Charles E V Pegge

  '---------------------------'
  '                           ' approx because it is not a serialised instruction
  '                           ' it may execute before or after other instructions
  '                           ' in the pipeline.
  ! mov ebx,tick              ' var address where count is to be stored.
  ! db  &h0f,&h31             ' RDTSC read time-stamp counter into edx:eax hi lo.
  ! mov [ebx],eax             ' save low order 4 bytes.
  ! mov [ebx+4],edx           ' save high order 4 bytes.
  '---------------------------'

END SUB
           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
               
  LOCAL i, Rounding, nLoops, RetVal AS LONG 
  LOCAL cBeg, cEnd AS QUAD  ' for time stamp, measuring cpu clock cycles
  LOCAL d AS DOUBLE
  LOCAL s, sTemp AS STRING

               
  hDbg = FREEFILE '
  OPEN "TMDebug.txt" FOR OUTPUT LOCK WRITE AS hDbg ' PRINT #hDbg, "MetersToFt="+STR$(MetersToFt)
             

    nLoops = 1 '00000
                           
           
    d = 523.34#
   
    COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS) ' REALTIME_PRIORITY_CLASS

      FOR i = 1 TO nLoops
'        d = d * 3 
        d = d / 3
      NEXT               

    COUNTER_END ' 
               
    s = s + "CounterCycles=" + STR$(CounterCycles) + $CRLF 
    s = s + "nLoops=" + STR$(nLoops) + $CRLF
    s = s + "VAL =" + STR$(d) + ",   Clock Cycles="+STR$( CounterCycles\nLoops ) + $CRLF + $CRLF
                 
PRINT #hDbg, s
                   
MSGBOX s,64,"All Done"  : EXIT FUNCTION


  CLOSE hDbg

END FUNCTION

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤


with nLoops set to 1000

1 CounterTSC1= 120560912
1 CounterTSC2= 120700904
1 CounterOverhead= 9415
1 CounterTSC1= 120845036
1 CounterTSC2= 120904000
1 CounterOverhead= 9415
1 CounterTSC1= 121077796
1 CounterTSC2= 121137880
1 CounterOverhead= 9415
1 CounterTSC1= 121255360
1 CounterTSC2= 121342196
1 CounterOverhead= 9415
1 CounterTSC1= 121459804
1 CounterTSC2= 121518368
1 CounterOverhead= 9415
1 CounterTSC1= 121663192
1 CounterTSC2= 121722916
1 CounterOverhead= 9415
'
'
'
1 CounterTSC2= 213222812
1 CounterOverhead= 9415
1 CounterTSC1= 213276708
1 CounterTSC2= 213303540
1 CounterOverhead= 9415
-------
2 CounterTSC1= 213388548
2 CounterTSC2= 213536432
2 CounterOverhead= 9415
2 CounterTSC1= 214828816
'
'
'
2 CounterOverhead= 9415
2 CounterTSC1= 350547788
2 CounterTSC2= 350615896
2 CounterOverhead= 9415
CounterCycles= 58241
nLoops= 1000
VAL = 0,   Clock Cycles= 58

MichaelW

The attachment contains the complete FreeBASIC source and the compiled EXE for a test app that you might find useful for testing. Running on my P3 the results are very consistent run to run:

d = d / 3 ->  10 cycles
d = d * 3 ->  6 cycles
d = d + 3 ->  4 cycles
d = d - 3 ->  4 cycles
d = sqr(d) -> 9 cycles
d = log(d) -> 58 cycles
d = sin(d) -> 87 cycles



[attachment deleted by admin]
eschew obfuscation

MikeT

On my P4 i get:
Quoted = d / 3 ->  18446744073709551612 cycles
d = d * 3 ->  18446744073709551612 cycles
d = d + 3 ->  18446744073709551612 cycles
d = d - 3 ->  18446744073709551612 cycles
d = sqr(d) -> 4 cycles
d = log(d) -> 204 cycles
d = sin(d) -> 128 cycles
Do you have an email address? I have an idea.

MichaelW

I think 18446744073709551612 is effectively -4, indicating that the reference count was larger than the test count. The problem with not being able to align the loop labels is that you can have a situation where the loop overhead in the test loop is not the same as the loop overhead in the reference loop. As I stated, I think I know of a way to minimize the alignment problems, but implementing it will require access to the compiler output, as assembly code.

Also, counting cycles on a P4 presents problems that AFAIK are unique to the P4. I have never gotten around to doing any definitive tests, but I think the problem has something to do with the Time Stamp Counter being updated in step with the external clock, which runs at 1/4 the speed of the internal clock. Using the method of these macros, taking the lowest count that occurs in any single pass through the test code the count is always a multiple of 4. So obviously, you cannot get useful counts for instructions/statements that execute in less than 4 cycles, or even several times that. For the P4 the method that measures the total cycle count for all passes and calculates an average, as implemented in the first set of MASM32 macros, is probably preferable. But since the P4 architecture is now history, and not likely to be revived, and the method of these macros works so well on the other processors, I did not bother to implement the other method for FB. These macros can be adapted to work reasonably on a P4 for instructions/statements that execute in less than, say 12-16 cycles, the same way that they can be adapted to work on the other processors for instructions/statements that execute in less than a few cycles. Basically, you repeat the test instructions/statements some number of times (by duplicating the instructions/statements in the source, not as a loop) and then divide the result by the number of repeats.
eschew obfuscation

MikeT

>I assume that the test included no debugging code.
Yes, just ran the .exe you provided

>add three instances of ! nop ahead of the first Sleep 0 statement in the COUNTER_BEGIN macro
done


#COMPILE EXE
#DIM ALL
       
#INCLUDE "WIN32API.inc"         ' Basic Win API definitions   

GLOBAL hDbg AS LONG             
               

GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS DWORD
GLOBAL LoopCounter AS INTEGER
   
       
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter. 
    ! mov CounterTSC1, eax     ' Save count.
    ! mov CounterTSC1[4], edx  ' Save count.

END SUB
     
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter.   
    ! mov CounterTSC2, eax     ' Save count.
    ! mov CounterTSC2[4], edx  ' Save count.

END SUB


'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( nLoops, priority_class ) 

    CounterOverhead = 2000000000
    CounterCycles   = 2000000000 
                         
    CALL SetPriorityClass( GetCurrentProcess, priority_class )
         
    ' The nops compensate for the 10-byte instruction (that initializes LoopCounter)
    ' between the alignment directive and the loop label, which ideally needs to be
    ' aligned on a 16-byte boundary.
    !  nop
    !  nop
    !  nop 
    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops
        CounterCode1 ' PRINT #hDbg, "1 CounterTSC1="+STR$(CounterTSC1)
        CounterCode2 ' PRINT #hDbg, "1 CounterTSC2="+STR$(CounterTSC2)
        IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN  CounterOverhead = CounterTSC2 - CounterTSC1 ' PRINT #hDbg, "1 CounterOverhead="+STR$(CounterOverhead)
    NEXT ' PRINT #hDbg, "-------"
       

    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops  '' *** Note the open FOR loop ***
        CounterCode1 ' PRINT #hDbg, "2 CounterTSC1="+STR$(CounterTSC1)
         

END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'         
MACRO COUNTER_END
                 


        CounterCode2 ' PRINT #hDbg, "2 CounterTSC2="+STR$(CounterTSC2)
        IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1 ' PRINT #hDbg, "2 CounterOverhead="+STR$(CounterOverhead)
    NEXT
    CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
    CounterCycles = CounterCycles - CounterOverhead       

END MACRO
             

           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
               
  LOCAL i, Rounding, nLoops, RetVal AS LONG 
  LOCAL cBeg, cEnd AS QUAD  ' for time stamp, measuring cpu clock cycles
  LOCAL d AS DOUBLE
  LOCAL s, sTemp AS STRING

               
    nLoops = 1000
                               
    d = 523.34#
   
    COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS) ' REALTIME_PRIORITY_CLASS

      FOR i = 1 TO nLoops
        d = d / 3
      NEXT               

    COUNTER_END ' 
               
    s = s + "CounterCycles=" + STR$(CounterCycles) + $CRLF 
    s = s + "nLoops=" + STR$(nLoops) + $CRLF
    s = s + "VAL =" + STR$(d) + ",   Clock Cycles="+STR$( CounterCycles\nLoops ) + $CRLF + $CRLF
                 
   
MSGBOX s,64,"All Done"  : EXIT FUNCTION


END FUNCTION

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤





[attachment deleted by admin]

MichaelW

For comparing the PB results to the FB results, I think a more reasonable starting point would be something like this, where the tested code will be the same for both compilers.

#COMPILE EXE
#DIM ALL

#INCLUDE "WIN32API.inc"

GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS DWORD
GLOBAL LoopCounter AS LONG   ' Use a 32-bit loop counter

SUB CounterCode1

    ! XOR eax, eax
    ! db &H0F,&HA2
    ! db &H0F,&H31
    ! mov CounterTSC1, eax
    ! mov CounterTSC1[4], edx

END SUB

SUB CounterCode2

    ! XOR eax, eax
    ! db &H0F,&HA2
    ! db &H0F,&H31
    ! mov CounterTSC2, eax
    ! mov CounterTSC2[4], edx

END SUB

MACRO COUNTER_BEGIN( nLoops, priority_class )

    CounterOverhead = 2000000000
    CounterCycles   = 2000000000

    CALL SetPriorityClass( GetCurrentProcess, priority_class )

    SLEEP 0

    !  nop
    !  nop
    !  nop

    FOR LoopCounter = 1 TO nLoops
        CounterCode1
        CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN CounterOverhead = CounterTSC2 - CounterTSC1
    NEXT

    SLEEP 0

    !  nop
    !  nop
    !  nop

    FOR LoopCounter = 1 TO nLoops
        CounterCode1

END MACRO

MACRO COUNTER_END

        CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1
    NEXT

    CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
    CounterCycles = CounterCycles - CounterOverhead

END MACRO

FUNCTION PBMAIN

    LOCAL s AS STRING

    COUNTER_BEGIN(1000, %HIGH_PRIORITY_CLASS)
        ! xor edx, edx
        ! mov eax, 1
        ! div ecx
    COUNTER_END
    s = STR$(CounterCycles) + $CRLF

    COUNTER_BEGIN(1000, %HIGH_PRIORITY_CLASS)
        ! rcr eax, 31
        ! rcl eax, 31
    COUNTER_END
    s = s + STR$(CounterCycles) + $CRLF

    COUNTER_BEGIN(1000, %HIGH_PRIORITY_CLASS)
        ! pushad
        ! popad
    COUNTER_END
    s = s + STR$(CounterCycles) + $CRLF

    MSGBOX s

    EXIT FUNCTION

END FUNCTION


The results for the FB version, again very consistent on my P3:

36
9
16


If you could compile this and post the EXE, I should be able to work out the alignment padding.
eschew obfuscation

MikeT

Compiled exactly as you wrote it:
Results:
4294930039
4294930047
4294930023



[attachment deleted by admin]

MichaelW

There is a problem in the code that I failed to see. The globals CounterTSC1, CounterTSC2, CounterOverhead, and CounterCycles should all be unsigned 64-bit integers. And I noticed in the compiled code that the compiler is using the FPU to perform what should have been 32-bit integer comparisons. This is workable, but it seems somewhat odd for a performance-oriented compiler, and makes me suspect that the declaration statement might have problems (according to the documentation the declaration is correct for version 8, but I'm not sure about earlier versions). So for testing purposes I think the declarations for the globals should be changed to:

GLOBAL CounterTSC1 AS QUAD
GLOBAL CounterTSC2 AS QUAD
GLOBAL CounterOverhead AS QUAD
GLOBAL CounterCycles AS QUAD

GLOBAL LoopCounter AS LONG   ' Use a 32-bit loop counter


And it would be helpful if in the attachment you would include the source with the EXE.
eschew obfuscation

MikeT

#27
The FPU should NOT be involved in a DWORD operation? Now PowerBASIC is famous for using the FPU for QUAD's:
http://www.powerbasic.com/support/pbforums/showthread.php?t=35768


Results:
16
32
12

Executable and code attached.
I have included the statement
#REGISTER NONE
to ensure that the compiler is not attempting to use the registers.



[attachment deleted by admin]

MichaelW

Running on my P3 the PB code returns 35, 12, 9, and the FB code 36, 9, 16, both with no run-to-run variation through multiple trials. As it happened the for loop labels were 80 bytes apart, so both of the loops were operating at the same relative (mod 16) alignment. One more problem that I failed to anticipate is that the PB compiler apparently does not align the entry point of procedures, so the entry points for the CounterCode1 and CounterCode2 procedures may not be (and probably will not be) at the optimal alignment. To get around this I eliminated the procedures and placed the essential code inline in the macros. I think I have the for loop labels aligned correctly, but I will need to examine the compiled EXE to be sure.

BTW, this code does not modify any of the four registers that must be preserved per this page.




[attachment deleted by admin]
eschew obfuscation

MikeT

fbtest.exe yeilds
12
16
16
8

on my P4 with no run to run variation.

TimingMacro09.exe yields:
20
16
16
20
mostly with the occasional
20
16
16
16


In ver10, I increased the loops to 1m and the variation got a little wider
20
16
12 - 16
12 - 20

Attached


[attachment deleted by admin]