The MASM Forum Archive 2004 to 2012

Project Support Forums => PowerBASIC Low Level Code => Topic started by: MikeT on January 04, 2008, 03:01:01 PM

Title: Timing Macros for PB
Post by: MikeT on January 04, 2008, 03:01:01 PM
Has anyone converted these timing macro's for PB by chance?
http://www.masm32.com/board/index.php?topic=770.0
Title: Re: Timing Macros for PB
Post by: hutch-- on January 04, 2008, 11:56:00 PM
Hi Mike,

Welcome on board. As far as I know no-one has ever ported Michael's macros to PB and there is the problem that the two languages are not the same in their macro capacity.

What I would be inclined to do is write a normal function or two and code the timing method then put it in a seperate file so it can be included when needed.
Title: Re: Timing Macros for PB
Post by: MikeT on January 05, 2008, 03:11:49 AM
Hi Hutch,

That would work. How hard is it to port that ASM to PB?
Are there many PB ASM members?
Title: Re: Timing Macros for PB
Post by: hutch-- on January 06, 2008, 12:06:46 AM
Mike,

the timing macros that Michael has written are very useful in comparing and optimising low level algorithms but I seriously doubt they would be much use to you in application development. If you need to benchmark different algos there are many ways to do this without the level of specialty involved in these timing macros. I still personally time most algos on a large sample with GetTickCount but there are other API functions with better resolution and of course RDTSC if you know how to use its 64 bit resolution.

Perehaps let us know what you are trying to do and we can probably help you in the direction you need.
Title: Re: Timing Macros for PB
Post by: MichaelW on January 06, 2008, 12:59:32 AM
Mike,

I'm not sure if this is doable in PB, but I did manage to create a system that consists of one macro one define and two procedures that works well for FreeBASIC. Here are the essential parts of the code:

#include once "windows.bi"

dim shared _counter_tsc1_ as ulongint, _counter_tsc2_ as ulongint
dim shared _counter_overhead_ as ulongint, counter_cycles as ulongint
dim shared _counter_loop_counter_ as uinteger

sub _counter_code1_
    asm
        '
        ' Use same CPUID input value for each call.
        '
        xor eax, eax
        '
        ' Flush pipe and wait for pending ops to finish.
        '
        cpuid
        '
        ' Read Time Stamp Counter.
        '
        rdtsc
        '
        ' Save count.
        '
        mov [_counter_tsc1_], eax
        mov [_counter_tsc1_+4], edx
    end asm
end sub

sub _counter_code2_
    asm
        xor eax, eax
        cpuid
        rdtsc
        mov [_counter_tsc2_], eax
        mov [_counter_tsc2_+4], edx
    end asm
end sub

'' Unlike the #define directive, the #macro directive
'' allows inline asm.
''
#macro COUNTER_BEGIN( loop_count, priority_class )
    _counter_overhead_ = 2000000000
    counter_cycles = 2000000000
    SetPriorityClass( GetCurrentProcess(), priority_class )
    Sleep_(0)                 '' Start a new time slice
    ''
    '' The nops compensate for the 10-byte instruction (that
    '' initializes _counter_loop_counter_) between the alignment
    '' directive and the loop label, which ideally needs to be
    '' aligned on a 16-byte boundary.
    ''
    asm
      .balign 16
      nop
      nop
      nop
      nop
      nop
      nop
    end asm
    for _counter_loop_counter_ = 1 to loop_count
        _counter_code1_
        _counter_code2_
        if (_counter_tsc2_ - _counter_tsc1_) < _counter_overhead_ then
            _counter_overhead_ = _counter_tsc2_ - _counter_tsc1_
        endif
    next
    Sleep_(0)                 '' Start a new time slice
    asm
      .balign 16
      nop
      nop
      nop
      nop
      nop
      nop
    end asm
    for _counter_loop_counter_ = 1 to loop_count
        _counter_code1_
#endmacro
''
'' *** Note the open FOR loop ***
''
#define COUNTER_END _
        _counter_code2_ :_
        if (_counter_tsc2_ - _counter_tsc1_) < counter_cycles then :_
            counter_cycles = _counter_tsc2_ - _counter_tsc1_ :_
        endif :_
    next :_
    SetPriorityClass( GetCurrentProcess(), NORMAL_PRIORITY_CLASS ) :_
    counter_cycles -= _counter_overhead_

Title: Re: Timing Macros for PB
Post by: MikeT on January 08, 2008, 06:49:44 PM
Hutch,
I am sure you are right, but I would like to explore this anyway. I learn by exploring :)

                         

     

#COMPILE EXE
#DIM ALL


GLOBAL hDbg AS LONG             
               
' #INCLUDE once "windows.bi"  ?

GLOBAL  CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS LONG
GLOBAL  LoopCounter AS INTEGER
           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1

        '
        ' Use same CPUID input value for each call.
        '
     !  XOR eax, eax
        '
        ' Flush pipe and wait for pending ops to finish.
        '
     !  cpuid
        '
        ' Read Time Stamp Counter.
        '
     !  rdtsc  ' <----------------  PB Compiler does not recognize this ***
        '
        ' Save count.
        '
     !  mov [CounterTSC1], eax
     !  mov [CounterTSC1+4], edx
END SUB
     
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2

  ! XOR eax, eax
  ! cpuid
  ! rdtsc       ' <----------------  PB Compiler does not recognize this ***
  ! mov [CounterTSC2], eax
  ! mov [CounterTSC2+4], edx

END SUB


'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( loop_count, priority_class ) 

    CounterOverhead = 2000000000
    CounterCycles   = 2000000000
    SetPriorityClass( GetCurrentProcess(), priority_class )
    Sleep_(0) '' Start a new time slice

    '' The nops compensate for the 10-byte instruction (that
    '' initializes LoopCounter) between the alignment
    '' directive and the loop label, which ideally needs to be
    '' aligned on a 16-byte boundary.


    !  .balign 16
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop

    FOR LoopCounter = 1 TO loop_count
        CounterCode1
        CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN  CounterOverhead = CounterTSC2 - CounterTSC1

    NEXT   

    Sleep_(0)                 '' Start a new time slice

    !  .balign 16
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop

    FOR LoopCounter = 1 TO loop_count  '' *** Note the open FOR loop ***
        GOSUB CounterCode1

END MACRO
           


       
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_END

        GOSUB CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1
    NEXT
    CALL SetPriorityClass( GetCurrentProcess(), NORMAL_PRIORITY_CLASS )
    CounterCycles = CounterCycles - CounterOverhead       


END MACRO
             

           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB time_stamp_count(tick AS QUAD) ' CPU Clock count    Charles E V Pegge

  '---------------------------'
  '                           ' approx because it is not a serialised instruction
  '                           ' it may execute before or after other instructions
  '                           ' in the pipeline.
  ! mov ebx,tick              ' var address where count is to be stored.
  ! db  &h0f,&h31             ' RDTSC read time-stamp counter into edx:eax hi lo.
  ! mov [ebx],eax             ' save low order 4 bytes.
  ! mov [ebx+4],edx           ' save high order 4 bytes.
  '---------------------------'

END SUB
           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
               
  LOCAL i, Rounding, nLoops, RetVal AS LONG 
  LOCAL cBeg, cEnd AS QUAD  ' for time stamp, measuring cpu clock cycles
  LOCAL d AS DOUBLE
  LOCAL s, sTemp AS STRING

               
  hDbg = FREEFILE '
  OPEN "Debug.txt" FOR OUTPUT LOCK WRITE AS hDbg ' PRINT #hDbg, "MetersToFt="+STR$(MetersToFt)
             

    nLoops = 100000
                           
           
    d = 523.34#
   
    COUNTER_BEGIN '

      FOR i = 1 TO nLoops
        d = d * 3 
        d = d / 3
      NEXT               

    COUNTER_END ' 

    s = s + "VAL =" + STR$(d) + ",   Clock Cycles="+STR$( (CounterCycles)\nLoops ) + $CRLF + $CRLF
                 
PRINT #hDbg, s
                   
MSGBOX s,64,"All Done"  : EXIT FUNCTION


  CLOSE hDbg

END FUNCTION

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤


My ASM knowledge is limited, but this will nto compile.
Title: Re: Timing Macros for PB
Post by: MichaelW on January 08, 2008, 07:35:57 PM
Synthesizing unsupported instructions with db (e.g. ! db &H0F, &H31 for RDTSC) is nothing unusual.

The:

#INCLUDE once "windows.bi"

was necessary for FreeBASIC to pull in the header files and import libraries necessary to call the SetPriorityClass, GetCurrentProcess, and Sleep functions. I have no idea how you might do this with PB.

The Sleep_(0) calls the Windows API  Sleep (http://msdn2.microsoft.com/en-us/library/ms686298.aspx) function passing a dwMilliseconds value of 0. The function was renamed to avoid a conflict with the FreeBASIC SLEEP function.

The .balign is a GAS alignment directive. Again, I have no idea how you might do this with PB.
Title: Re: Timing Macros for PB
Post by: hutch-- on January 09, 2008, 01:11:06 AM
Mike,

here are two PB macros for the two instructions.


    MACRO rdtsc = ! db &H0F,&H31

    MACRO cpuid = ! db &H0F,&HA2


Note that you use them WITHOUT the leading "!" as they ae macros, not inline assembler.
Title: Re: Timing Macros for PB
Post by: MikeT on January 09, 2008, 08:37:40 AM
Thx hutch.

I get a compiler error:
Assembler Syntax Error
for
     !  mov [CounterTSC1], eax
I suspect the square brackets. Can I replace them with round brackets?

Also, is the PB sleep interchangable with sleep in this code?
ie starting a new time slice?



                         

     

#COMPILE EXE
#DIM ALL
       
#INCLUDE "WIN32API.inc"         ' Basic Win API definitions   

GLOBAL hDbg AS LONG             
               
' #INCLUDE once "windows.bi"  ?

GLOBAL  CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS LONG
GLOBAL  LoopCounter AS INTEGER
           
                     
MACRO rdtsc = ! db &H0F,&H31

MACRO cpuid = ! db &H0F,&HA2

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1

        '
        ' Use same CPUID input value for each call.
        '
     !  XOR eax, eax
        '
        ' Flush pipe and wait for pending ops to finish.
        '
     !  cpuid
        '
        ' Read Time Stamp Counter.
        '
        rdtsc 
        '
        ' Save count.
        '
     !  mov [CounterTSC1], eax
     !  mov [CounterTSC1+4], edx
END SUB
     
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2

  ! XOR eax, eax
    cpuid
    rdtsc       
  ! mov [CounterTSC2], eax
  ! mov [CounterTSC2+4], edx

END SUB


'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( loop_count, priority_class ) 

    CounterOverhead = 2000000000
    CounterCycles   = 2000000000 
                         
    CALL SetPriorityClass( GetCurrentProcess, priority_class )

    SLEEP 0 ' Start a new time slice
             

    '' The nops compensate for the 10-byte instruction (that
    '' initializes LoopCounter) between the alignment
    '' directive and the loop label, which ideally needs to be
    '' aligned on a 16-byte boundary.


    !  .balign 16
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop

    FOR LoopCounter = 1 TO loop_count
        CounterCode1
        CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN  CounterOverhead = CounterTSC2 - CounterTSC1

    NEXT   

    SLEEP 0 ' Start a new time slice

    !  .balign 16
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop

    FOR LoopCounter = 1 TO loop_count  '' *** Note the open FOR loop ***
        GOSUB CounterCode1

END MACRO
           


       
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_END

        GOSUB CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1
    NEXT
    CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
    CounterCycles = CounterCycles - CounterOverhead       


END MACRO
             

           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB time_stamp_count(tick AS QUAD) ' CPU Clock count    Charles E V Pegge

  '---------------------------'
  '                           ' approx because it is not a serialised instruction
  '                           ' it may execute before or after other instructions
  '                           ' in the pipeline.
  ! mov ebx,tick              ' var address where count is to be stored.
  ! db  &h0f,&h31             ' RDTSC read time-stamp counter into edx:eax hi lo.
  ! mov [ebx],eax             ' save low order 4 bytes.
  ! mov [ebx+4],edx           ' save high order 4 bytes.
  '---------------------------'

END SUB
           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
               
  LOCAL i, Rounding, nLoops, RetVal AS LONG 
  LOCAL cBeg, cEnd AS QUAD  ' for time stamp, measuring cpu clock cycles
  LOCAL d AS DOUBLE
  LOCAL s, sTemp AS STRING

               
  hDbg = FREEFILE '
  OPEN "Debug.txt" FOR OUTPUT LOCK WRITE AS hDbg ' PRINT #hDbg, "MetersToFt="+STR$(MetersToFt)
             

    nLoops = 100000
                           
           
    d = 523.34#
   
    COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS) ' REALTIME_PRIORITY_CLASS

      FOR i = 1 TO nLoops
        d = d * 3 
        d = d / 3
      NEXT               

    COUNTER_END ' 

    s = s + "VAL =" + STR$(d) + ",   Clock Cycles="+STR$( (CounterCycles)\nLoops ) + $CRLF + $CRLF
                 
PRINT #hDbg, s
                   
MSGBOX s,64,"All Done"  : EXIT FUNCTION


  CLOSE hDbg

END FUNCTION

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤


Title: Re: Timing Macros for PB
Post by: MichaelW on January 09, 2008, 11:41:24 AM
http://www.powerbasic.com/support/help/pbcc/sleep_statement.htm
Title: Re: Timing Macros for PB
Post by: MikeT on January 11, 2008, 10:16:20 PM
Still trying to figure out how to convert

! mov [CounterTSC2+4], edx

to PB. The Compiler baulks at the square brackets...
Title: Re: Timing Macros for PB
Post by: hutch-- on January 11, 2008, 10:22:26 PM
Mike,

PB is a bit deviant in the notation for complex addressing mode.


replace this
! mov [CounterTSC2+4], edx
with
! mov CounterTSC2[4], edx


See if this works.
Title: Re: Timing Macros for PB
Post by: MikeT on January 13, 2008, 11:41:34 PM
Thx Hutch,
Tha seemed to keep the compler happy, now it baulks at:
    !  .balign 16

Title: Re: Timing Macros for PB
Post by: hutch-- on January 14, 2008, 03:05:18 AM
Mike,

Delete that line, PowerBASIC does not support code alignment.
Title: Re: Timing Macros for PB
Post by: MichaelW on January 14, 2008, 05:52:24 AM
And if the compiler cannot align the code there is no need for the two groups of 6 nops, and you should expect the cycle counts to vary somewhat with the memory address of the counting (FOR) loop labels. You should be able to minimize this effect by doing all of your comparison tests in the same macro at the same address, and by adding a sufficient number of nops ahead of the second Sleep(0) to ensure that both of the loop labels (the reference loop that determines the counter overhead and the test loop) are at the same relative alignment (i.e. that they are some multiple of 16 bytes apart).
Title: Re: Timing Macros for PB
Post by: MikeT on January 14, 2008, 05:03:10 PM
This will compile and run, but CounterCycles = 0
I suspect it has to do with returuning the values:
CounterTSC1
CounterTSC2

I currently use Charles E V Pegge code.
he returns values with
  ! mov [ebx],eax             ' save low order 4 bytes.
  ! mov [ebx+4],edx           ' save high order 4 bytes.

When I try
  ! mov [CounterTSC1],eax             ' save low order 4 bytes.
  ! mov [CounterTSC1+4],edx           ' save high order 4 bytes.
I get a compiler error again
so perhaps I can do:
    ! mov ebx,CounterTSC1     
    ! mov CounterTSC1, eax     ' Save count.
    ! mov CounterTSC1[4], edx  ' Save count.

(I have no idea what I am doing, just using pattern recognition!)

                         

     

#COMPILE EXE
#DIM ALL
       
#INCLUDE "WIN32API.inc"         ' Basic Win API definitions   

GLOBAL hDbg AS LONG             
               

GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS LONG
GLOBAL LoopCounter AS INTEGER
   
       
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter.
    ! mov CounterTSC1, eax     ' Save count.
    ! mov CounterTSC1[4], edx  ' Save count.

END SUB
     
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter.   
    ! mov CounterTSC2, eax     ' Save count.
    ! mov CounterTSC2[4], edx  ' Save count.

END SUB


'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( nLoops, priority_class ) 

    CounterOverhead = 2000000000
    CounterCycles   = 2000000000 
                         
    CALL SetPriorityClass( GetCurrentProcess, priority_class )
           
    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops
        CounterCode1
        CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN  CounterOverhead = CounterTSC2 - CounterTSC1
    NEXT   
         
       
    ' The nops compensate for the 10-byte instruction (that initializes LoopCounter)
    ' between the alignment directive and the loop label, which ideally needs to be
    ' aligned on a 16-byte boundary.
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop
    !  nop
    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops  '' *** Note the open FOR loop ***
        CounterCode1
         


END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'         
MACRO COUNTER_END
                 


        CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1
    NEXT
    CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
    CounterCycles = CounterCycles - CounterOverhead       

END MACRO
             

           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB time_stamp_count(tick AS QUAD) ' CPU Clock count    Charles E V Pegge

  '---------------------------'
  '                           ' approx because it is not a serialised instruction
  '                           ' it may execute before or after other instructions
  '                           ' in the pipeline.
  ! mov ebx,tick              ' var address where count is to be stored.
  ! db  &h0f,&h31             ' RDTSC read time-stamp counter into edx:eax hi lo.
  ! mov [ebx],eax             ' save low order 4 bytes.
  ! mov [ebx+4],edx           ' save high order 4 bytes.
  '---------------------------'

END SUB
           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
               
  LOCAL i, Rounding, nLoops, RetVal AS LONG 
  LOCAL cBeg, cEnd AS QUAD  ' for time stamp, measuring cpu clock cycles
  LOCAL d AS DOUBLE
  LOCAL s, sTemp AS STRING

               
  hDbg = FREEFILE '
  OPEN "Debug.txt" FOR OUTPUT LOCK WRITE AS hDbg ' PRINT #hDbg, "MetersToFt="+STR$(MetersToFt)
             

    nLoops = 100000
                           
           
    d = 523.34#
   
    COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS) ' REALTIME_PRIORITY_CLASS

      FOR i = 1 TO nLoops
        d = d * 3 
        d = d / 3
      NEXT               

    COUNTER_END ' 
               
    s = s + "CounterCycles=" + STR$(CounterCycles) + $CRLF 
    s = s + "nLoops=" + STR$(nLoops) + $CRLF
    s = s + "VAL =" + STR$(d) + ",   Clock Cycles="+STR$( CounterCycles\nLoops ) + $CRLF + $CRLF
                 
PRINT #hDbg, s
                   
MSGBOX s,64,"All Done"  : EXIT FUNCTION


  CLOSE hDbg

END FUNCTION

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Title: Re: Timing Macros for PB
Post by: MichaelW on January 14, 2008, 05:53:17 PM
In my code the statement:

if (_counter_tsc2_ - _counter_tsc1_) < counter_cycles then :_
counter_cycles = _counter_tsc2_ - _counter_tsc1_ :_
endif :_

Is intended to capture the lowest cycle count that occurs in any loop. Here is the complete header for the FreeBASIC version:

'====================================================================
'' The COUNTER_BEGIN and COUNTER_END macros provide a convenient
'' method of measuring the processor clock cycle count for a block
'' of code. These macros must be called in pairs, and the block of
'' code must be placed in between the COUNTER_BEGIN and COUNTER_END
'' macro calls. The clock cycle count for a single loop through the
'' block of code, corrected for the test loop overhead, is returned
'' in the shared variable COUNTER_CYCLES.
''
'' These macros capture the lowest cycle count that occurs in a
'' single loop through the block of code, on the assumption that
'' the lowest count is the correct count. The higher counts that
'' occur are the result of one or more context switches within the
'' loop. Context switches can occur at the end of a time slice, so
'' to minimize the possibility of the loop overlapping the time
'' slice the COUNTER_BEGIN macro starts a new time slice at the
'' beginning of the loop. If the execution time for a single loop
'' is greater than the duration of a time slice (approximately 20ms
'' under Windows), then the loop will overlap the time slice, and
'' if another thread of equal priority is ready to run, then a
'' context switch will occur.
''
'' A higher-priority thread can "preempt" a lower priority thread,
'' causing a context switch to occur before the end the time slice.
'' Raising the priority of the process (thread) can reduce the
'' frequency of these context switches. To do so, specify a higher
'' than normal priority class in the priority_class parameter of
'' the COUNTER_BEGIN macro. REALTIME_PRIORITY_CLASS specifies the
'' highest possible priority, but using it involves some risk,
'' because your process will preempt *all* other processes,
'' including critical Windows processes, and if the timed code
'' takes too long to execute then Windows may hang, even the NT
'' versions. HIGH_PRIORITY_CLASS is a safer alternative that in
'' most cases will produce the same cycle count.
''
'' You can avoid the bulk of the context switches by simply
'' inserting a few second delay in front of the first macro
'' call, to allow the system disk cache activity to subside.
''
'' For the loop_count parameter of the COUNTER_BEGIN macro, larger
'' values increase the number of samples and so tend to improve
'' the consistency of the returned cycle counts, but in most cases
'' you are unlikely to see any improvement beyond a value of about
'' 1000.
''
'' Note that the block of code must be able to withstand repeated
'' looping, and that this method will not work for timing a loop
'' where the number of loops is variable, because the lowest cycle
'' count that occurs will be when the loop falls through.
''
'' These macros require a Pentium-class processor that supports
'' the CPUID (function 0 only) and RDTSC instructions.
'====================================================================

To get an average cycle count, as is done in the first set of MASM32 macros, you would need to accumulate the total number of cycles for all loops, and then divide it by the number of loops. The FreeBASIC version is similar to the second set of MASM32 macros, and the end result is simply the value of counter_cycles. I selected odd names for the globals to reduce the possibility of naming conflicts.
Title: Re: Timing Macros for PB
Post by: MikeT on January 14, 2008, 06:48:08 PM
Hello Michael,

Thank you for your continued help, but I must admit I am struggling here. I do not know ASM at all and I'm generally considered not overly bright on most forums, so if you could tell me what I should change I would be gratefull.

In reading the description, I suspect:
IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1

and

CounterCycles = CounterCycles - CounterOverhead

but I am not sure if I should just remove the subtraction of CounterOverhead or what to do?

I am also very unclear about the nops and how many to add. How do you figure that out?

Kind Regards
Mike
PS
I made all the Global names easily understandable while I grapple to understand how to make this work in PB.
I usually add a prefix to all my Globals when deploying code like this, something like TIMING_MACRO_xxxx
Title: Re: Timing Macros for PB
Post by: MichaelW on January 14, 2008, 07:19:19 PM
The only big problem that I can see is the CounterCycles\nLoops, where you are taking the lowest cycle count that occurs in any loop and dividing it by the number of loops, when (I assume) your goal is to display the lowest cycle count that occurs in any loop. But I doubt that this accounts for a result of zero, because I doubt that the test (FOR) loop could execute in anything close to nLoops cycles. I think you should just remove the nops and ignore any alignment considerations until you get the code working. If I were doing this I think I would set nLoops to 1 and reduce the test code to just d = d / 3 (no FOR loop and no multiply), which should take something on the order of 10 cycles to execute. Then in the macro I would add print statements to display the values of the globals as the code executes.
Title: Re: Timing Macros for PB
Post by: MikeT on January 15, 2008, 08:59:13 PM
ok Done.
With the GLOBALs declared a LONG I get negative numbers(?)
1 CounterTSC1=-257241836
1 CounterTSC2=-257087332
1 CounterOverhead= 9343
-------
2 CounterTSC1=-256895800
2 CounterTSC2=-256835048
2 CounterOverhead= 9343
CounterCycles= 51409
nLoops= 1
VAL = 174.446666666667,   Clock Cycles= 51409


With the GLOBALs declared as DWORD...
1 CounterTSC1= 3695436770
1 CounterTSC2= 3695552118
1 CounterOverhead= 9370
-------
2 CounterTSC1= 3695729618
2 CounterTSC2= 3695790974
2 CounterOverhead= 9370
CounterCycles= 51986
nLoops= 1
VAL = 174.446666666667,   Clock Cycles= 51986

                         

     

#COMPILE EXE
#DIM ALL
       
#INCLUDE "WIN32API.inc"         ' Basic Win API definitions   

GLOBAL hDbg AS LONG             
               

GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS DWORD
GLOBAL LoopCounter AS INTEGER
   
       
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter. 
    ! mov CounterTSC1, eax     ' Save count.
    ! mov CounterTSC1[4], edx  ' Save count.

END SUB
     
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter.   
    ! mov CounterTSC2, eax     ' Save count.
    ! mov CounterTSC2[4], edx  ' Save count.

END SUB


'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( nLoops, priority_class ) 

    CounterOverhead = 2000000000
    CounterCycles   = 2000000000 
                         
    CALL SetPriorityClass( GetCurrentProcess, priority_class )
           
    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops
        CounterCode1 '
PRINT #hDbg, "1 CounterTSC1="+STR$(CounterTSC1)
        CounterCode2 '
PRINT #hDbg, "1 CounterTSC2="+STR$(CounterTSC2)
        IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN  CounterOverhead = CounterTSC2 - CounterTSC1 '
PRINT #hDbg, "1 CounterOverhead="+STR$(CounterOverhead)
    NEXT '         
PRINT #hDbg, "-------"
       
    ' The nops compensate for the 10-byte instruction (that initializes LoopCounter)
    ' between the alignment directive and the loop label, which ideally needs to be
    ' aligned on a 16-byte boundary.
'    !  nop
'    !  nop
'    !  nop
'    !  nop
'    !  nop
'    !  nop
    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops  '' *** Note the open FOR loop ***
        CounterCode1 '
PRINT #hDbg, "2 CounterTSC1="+STR$(CounterTSC1)
         


END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'         
MACRO COUNTER_END
                 


        CounterCode2 '
PRINT #hDbg, "2 CounterTSC2="+STR$(CounterTSC2)
        IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1 '
PRINT #hDbg, "2 CounterOverhead="+STR$(CounterOverhead)
    NEXT
    CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
    CounterCycles = CounterCycles - CounterOverhead       

END MACRO
             

           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB time_stamp_count(tick AS QUAD) ' CPU Clock count    Charles E V Pegge

  '---------------------------'
  '                           ' approx because it is not a serialised instruction
  '                           ' it may execute before or after other instructions
  '                           ' in the pipeline.
  ! mov ebx,tick              ' var address where count is to be stored.
  ! db  &h0f,&h31             ' RDTSC read time-stamp counter into edx:eax hi lo.
  ! mov [ebx],eax             ' save low order 4 bytes.
  ! mov [ebx+4],edx           ' save high order 4 bytes.
  '---------------------------'

END SUB
           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
               
  LOCAL i, Rounding, nLoops, RetVal AS LONG 
  LOCAL cBeg, cEnd AS QUAD  ' for time stamp, measuring cpu clock cycles
  LOCAL d AS DOUBLE
  LOCAL s, sTemp AS STRING

               
  hDbg = FREEFILE '
  OPEN "TMDebug.txt" FOR OUTPUT LOCK WRITE AS hDbg ' PRINT #hDbg, "MetersToFt="+STR$(MetersToFt)
             

    nLoops = 1 '00000
                           
           
    d = 523.34#
   
    COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS) ' REALTIME_PRIORITY_CLASS

      FOR i = 1 TO nLoops
'        d = d * 3 
        d = d / 3
      NEXT               

    COUNTER_END ' 
               
    s = s + "CounterCycles=" + STR$(CounterCycles) + $CRLF 
    s = s + "nLoops=" + STR$(nLoops) + $CRLF
    s = s + "VAL =" + STR$(d) + ",   Clock Cycles="+STR$( CounterCycles\nLoops ) + $CRLF + $CRLF
                 
PRINT #hDbg, s
                   
MSGBOX s,64,"All Done"  : EXIT FUNCTION


  CLOSE hDbg

END FUNCTION

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤


with nLoops set to 1000

1 CounterTSC1= 120560912
1 CounterTSC2= 120700904
1 CounterOverhead= 9415
1 CounterTSC1= 120845036
1 CounterTSC2= 120904000
1 CounterOverhead= 9415
1 CounterTSC1= 121077796
1 CounterTSC2= 121137880
1 CounterOverhead= 9415
1 CounterTSC1= 121255360
1 CounterTSC2= 121342196
1 CounterOverhead= 9415
1 CounterTSC1= 121459804
1 CounterTSC2= 121518368
1 CounterOverhead= 9415
1 CounterTSC1= 121663192
1 CounterTSC2= 121722916
1 CounterOverhead= 9415
'
'
'
1 CounterTSC2= 213222812
1 CounterOverhead= 9415
1 CounterTSC1= 213276708
1 CounterTSC2= 213303540
1 CounterOverhead= 9415
-------
2 CounterTSC1= 213388548
2 CounterTSC2= 213536432
2 CounterOverhead= 9415
2 CounterTSC1= 214828816
'
'
'
2 CounterOverhead= 9415
2 CounterTSC1= 350547788
2 CounterTSC2= 350615896
2 CounterOverhead= 9415
CounterCycles= 58241
nLoops= 1000
VAL = 0,   Clock Cycles= 58
Title: Re: Timing Macros for PB
Post by: MichaelW on January 16, 2008, 12:15:51 AM
The attachment contains the complete FreeBASIC source and the compiled EXE for a test app that you might find useful for testing. Running on my P3 the results are very consistent run to run:

d = d / 3 ->  10 cycles
d = d * 3 ->  6 cycles
d = d + 3 ->  4 cycles
d = d - 3 ->  4 cycles
d = sqr(d) -> 9 cycles
d = log(d) -> 58 cycles
d = sin(d) -> 87 cycles



[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MikeT on January 17, 2008, 07:23:02 PM
On my P4 i get:
Quoted = d / 3 ->  18446744073709551612 cycles
d = d * 3 ->  18446744073709551612 cycles
d = d + 3 ->  18446744073709551612 cycles
d = d - 3 ->  18446744073709551612 cycles
d = sqr(d) -> 4 cycles
d = log(d) -> 204 cycles
d = sin(d) -> 128 cycles
Do you have an email address? I have an idea.
Title: Re: Timing Macros for PB
Post by: MichaelW on January 17, 2008, 09:18:42 PM
I think 18446744073709551612 is effectively -4, indicating that the reference count was larger than the test count. The problem with not being able to align the loop labels is that you can have a situation where the loop overhead in the test loop is not the same as the loop overhead in the reference loop. As I stated, I think I know of a way to minimize the alignment problems, but implementing it will require access to the compiler output, as assembly code.

Also, counting cycles on a P4 presents problems that AFAIK are unique to the P4. I have never gotten around to doing any definitive tests, but I think the problem has something to do with the Time Stamp Counter being updated in step with the external clock, which runs at 1/4 the speed of the internal clock. Using the method of these macros, taking the lowest count that occurs in any single pass through the test code the count is always a multiple of 4. So obviously, you cannot get useful counts for instructions/statements that execute in less than 4 cycles, or even several times that. For the P4 the method that measures the total cycle count for all passes and calculates an average, as implemented in the first set of MASM32 macros, is probably preferable. But since the P4 architecture is now history, and not likely to be revived, and the method of these macros works so well on the other processors, I did not bother to implement the other method for FB. These macros can be adapted to work reasonably on a P4 for instructions/statements that execute in less than, say 12-16 cycles, the same way that they can be adapted to work on the other processors for instructions/statements that execute in less than a few cycles. Basically, you repeat the test instructions/statements some number of times (by duplicating the instructions/statements in the source, not as a loop) and then divide the result by the number of repeats.
Title: Re: Timing Macros for PB
Post by: MikeT on January 18, 2008, 01:55:16 AM
>I assume that the test included no debugging code.
Yes, just ran the .exe you provided

>add three instances of ! nop ahead of the first Sleep 0 statement in the COUNTER_BEGIN macro
done


#COMPILE EXE
#DIM ALL
       
#INCLUDE "WIN32API.inc"         ' Basic Win API definitions   

GLOBAL hDbg AS LONG             
               

GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS DWORD
GLOBAL LoopCounter AS INTEGER
   
       
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode1

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter. 
    ! mov CounterTSC1, eax     ' Save count.
    ! mov CounterTSC1[4], edx  ' Save count.

END SUB
     
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
SUB CounterCode2

    ! XOR eax, eax             ' Use same CPUID input value for each call.
    ! db &H0F,&HA2             ' cpuid - Flush pipe and wait for pending ops to finish.
    ! db &H0F,&H31             ' rdtsc - Read Time Stamp Counter.   
    ! mov CounterTSC2, eax     ' Save count.
    ! mov CounterTSC2[4], edx  ' Save count.

END SUB


'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'
MACRO COUNTER_BEGIN( nLoops, priority_class ) 

    CounterOverhead = 2000000000
    CounterCycles   = 2000000000 
                         
    CALL SetPriorityClass( GetCurrentProcess, priority_class )
         
    ' The nops compensate for the 10-byte instruction (that initializes LoopCounter)
    ' between the alignment directive and the loop label, which ideally needs to be
    ' aligned on a 16-byte boundary.
    !  nop
    !  nop
    !  nop 
    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops
        CounterCode1 ' PRINT #hDbg, "1 CounterTSC1="+STR$(CounterTSC1)
        CounterCode2 ' PRINT #hDbg, "1 CounterTSC2="+STR$(CounterTSC2)
        IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN  CounterOverhead = CounterTSC2 - CounterTSC1 ' PRINT #hDbg, "1 CounterOverhead="+STR$(CounterOverhead)
    NEXT ' PRINT #hDbg, "-------"
       

    SLEEP 0 ' Start a new time slice

    FOR LoopCounter = 1 TO nLoops  '' *** Note the open FOR loop ***
        CounterCode1 ' PRINT #hDbg, "2 CounterTSC1="+STR$(CounterTSC1)
         

END MACRO
'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤'         
MACRO COUNTER_END
                 


        CounterCode2 ' PRINT #hDbg, "2 CounterTSC2="+STR$(CounterTSC2)
        IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1 ' PRINT #hDbg, "2 CounterOverhead="+STR$(CounterOverhead)
    NEXT
    CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
    CounterCycles = CounterCycles - CounterOverhead       

END MACRO
             

           

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
FUNCTION PBMAIN
               
  LOCAL i, Rounding, nLoops, RetVal AS LONG 
  LOCAL cBeg, cEnd AS QUAD  ' for time stamp, measuring cpu clock cycles
  LOCAL d AS DOUBLE
  LOCAL s, sTemp AS STRING

               
    nLoops = 1000
                               
    d = 523.34#
   
    COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS) ' REALTIME_PRIORITY_CLASS

      FOR i = 1 TO nLoops
        d = d / 3
      NEXT               

    COUNTER_END ' 
               
    s = s + "CounterCycles=" + STR$(CounterCycles) + $CRLF 
    s = s + "nLoops=" + STR$(nLoops) + $CRLF
    s = s + "VAL =" + STR$(d) + ",   Clock Cycles="+STR$( CounterCycles\nLoops ) + $CRLF + $CRLF
                 
   
MSGBOX s,64,"All Done"  : EXIT FUNCTION


END FUNCTION

'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤





[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MichaelW on January 18, 2008, 06:34:00 AM
For comparing the PB results to the FB results, I think a more reasonable starting point would be something like this, where the tested code will be the same for both compilers.

#COMPILE EXE
#DIM ALL

#INCLUDE "WIN32API.inc"

GLOBAL CounterTSC1, CounterTSC2, CounterOverhead , CounterCycles AS DWORD
GLOBAL LoopCounter AS LONG   ' Use a 32-bit loop counter

SUB CounterCode1

    ! XOR eax, eax
    ! db &H0F,&HA2
    ! db &H0F,&H31
    ! mov CounterTSC1, eax
    ! mov CounterTSC1[4], edx

END SUB

SUB CounterCode2

    ! XOR eax, eax
    ! db &H0F,&HA2
    ! db &H0F,&H31
    ! mov CounterTSC2, eax
    ! mov CounterTSC2[4], edx

END SUB

MACRO COUNTER_BEGIN( nLoops, priority_class )

    CounterOverhead = 2000000000
    CounterCycles   = 2000000000

    CALL SetPriorityClass( GetCurrentProcess, priority_class )

    SLEEP 0

    !  nop
    !  nop
    !  nop

    FOR LoopCounter = 1 TO nLoops
        CounterCode1
        CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN CounterOverhead = CounterTSC2 - CounterTSC1
    NEXT

    SLEEP 0

    !  nop
    !  nop
    !  nop

    FOR LoopCounter = 1 TO nLoops
        CounterCode1

END MACRO

MACRO COUNTER_END

        CounterCode2
        IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN CounterCycles = CounterTSC2 - CounterTSC1
    NEXT

    CALL SetPriorityClass( GetCurrentProcess, %NORMAL_PRIORITY_CLASS )
    CounterCycles = CounterCycles - CounterOverhead

END MACRO

FUNCTION PBMAIN

    LOCAL s AS STRING

    COUNTER_BEGIN(1000, %HIGH_PRIORITY_CLASS)
        ! xor edx, edx
        ! mov eax, 1
        ! div ecx
    COUNTER_END
    s = STR$(CounterCycles) + $CRLF

    COUNTER_BEGIN(1000, %HIGH_PRIORITY_CLASS)
        ! rcr eax, 31
        ! rcl eax, 31
    COUNTER_END
    s = s + STR$(CounterCycles) + $CRLF

    COUNTER_BEGIN(1000, %HIGH_PRIORITY_CLASS)
        ! pushad
        ! popad
    COUNTER_END
    s = s + STR$(CounterCycles) + $CRLF

    MSGBOX s

    EXIT FUNCTION

END FUNCTION


The results for the FB version, again very consistent on my P3:

36
9
16


If you could compile this and post the EXE, I should be able to work out the alignment padding.
Title: Re: Timing Macros for PB
Post by: MikeT on January 18, 2008, 07:33:12 AM
Compiled exactly as you wrote it:
Results:
4294930039
4294930047
4294930023



[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MichaelW on January 18, 2008, 07:11:56 PM
There is a problem in the code that I failed to see. The globals CounterTSC1, CounterTSC2, CounterOverhead, and CounterCycles should all be unsigned 64-bit integers. And I noticed in the compiled code that the compiler is using the FPU to perform what should have been 32-bit integer comparisons. This is workable, but it seems somewhat odd for a performance-oriented compiler, and makes me suspect that the declaration statement might have problems (according to the documentation the declaration is correct for version 8, but I'm not sure about earlier versions). So for testing purposes I think the declarations for the globals should be changed to:

GLOBAL CounterTSC1 AS QUAD
GLOBAL CounterTSC2 AS QUAD
GLOBAL CounterOverhead AS QUAD
GLOBAL CounterCycles AS QUAD

GLOBAL LoopCounter AS LONG   ' Use a 32-bit loop counter


And it would be helpful if in the attachment you would include the source with the EXE.
Title: Re: Timing Macros for PB
Post by: MikeT on January 18, 2008, 09:14:18 PM
The FPU should NOT be involved in a DWORD operation? Now PowerBASIC is famous for using the FPU for QUAD's:
http://www.powerbasic.com/support/pbforums/showthread.php?t=35768


Results:
16
32
12

Executable and code attached.
I have included the statement
#REGISTER NONE
to ensure that the compiler is not attempting to use the registers.



[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MichaelW on January 19, 2008, 05:48:34 AM
Running on my P3 the PB code returns 35, 12, 9, and the FB code 36, 9, 16, both with no run-to-run variation through multiple trials. As it happened the for loop labels were 80 bytes apart, so both of the loops were operating at the same relative (mod 16) alignment. One more problem that I failed to anticipate is that the PB compiler apparently does not align the entry point of procedures, so the entry points for the CounterCode1 and CounterCode2 procedures may not be (and probably will not be) at the optimal alignment. To get around this I eliminated the procedures and placed the essential code inline in the macros. I think I have the for loop labels aligned correctly, but I will need to examine the compiled EXE to be sure.

BTW, this code does not modify any of the four registers that must be preserved per  this (http://www.powerbasic.com/support/help/pbwin/html/ASM_statement.htm) page.




[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MikeT on January 19, 2008, 08:36:11 PM
fbtest.exe yeilds
12
16
16
8

on my P4 with no run to run variation.

TimingMacro09.exe yields:
20
16
16
20
mostly with the occasional
20
16
16
16


In ver10, I increased the loops to 1m and the variation got a little wider
20
16
12 - 16
12 - 20

Attached


[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MichaelW on January 20, 2008, 05:29:32 AM
On my P3 the FB version consistently returns 36, 9, 16, 15, and the PB version consistently returns 38, 11, 17, 35.

For reference, the same tests with second set of MASM32 macros (counter2.asm) consistently return 38, 12, 19, 16, and the same tests with first set of MASM32 macros (timers.asm) typically return 34, 8, 11, 5.

The large difference for the PB version in the last test can be accounted for by differences in the compiler-generated code for the GetTickCount function call. Where the FB and MASM32 code calls GetTickCount through a jump table, the PB code calls GetTickCount directly, and then calls some other function after GetTickCount returns (as it seems to do for all of the API function calls), and the cycle count includes both function calls. I was attempting to test only compiler-independent code, but I made a bad choice for the last test.

The small differences in the first three tests can be accounted for by the differences in the structure of the macros. The method used by these macros has a weakness in that the timed instructions cannot be completely isolated from the timing instructions, so the timing instructions can affect the cycle count for the adjacent timed instructions. The effect can make it impossible to get meaningful cycle counts for single instructions or short sequences of instructions, but as the length of the sequence increases the effect on the cycle count quickly becomes negligible.

I can now see only two remaining problems. Somehow I made a mistake in calculating the number of alignment nops in the PB code, and instead of 9 there should be only 4. And to minimize the variations you should add a delay of several seconds before the first test (as I did with the sleep 3000 in the FB code), to allow time for the various system activities involved in launching an application to complete.

Title: Re: Timing Macros for PB
Post by: MichaelW on January 20, 2008, 11:11:06 AM
It now appears that my statement regarding this code not changing registers that must be preserved is wrong. Per  this (http://www.powerbasic.com/support/help/pbwin/html/Intermixing_ASM_and_BASIC_code.htm) page:

"compiled PowerBASIC code is designed to expect that the EBX, ESI, and EDI registers will remain unchanged between lines of BASIC code."

Even though there are no apparent problems, the CPUID instruction is changing the value of EBX, so to be safe EBX needs to be preserved around the CPUID instruction. The attachment contains the updated code, with the last test replaced with a longer block of assembly code derived from the MASM32 nrandom procedure.

Edit: I forgot to add a several second delay before the first test.

[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MikeT on January 21, 2008, 08:59:37 PM
on my P4 ver 12 (sleep 3000 added) results:
16
16
12
88

fbtest results:
12
16
16
92

Results for both are consistent

QuotePB code calls GetTickCount directly, and then calls some other function after GetTickCount returns
Let me ask our PB guru about the secondary function call...

Added:
I dont see the 4 nops... should I add those?



[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MichaelW on January 22, 2008, 12:46:38 AM
Running on my P3, the FB version returns 36, 9, 16, 91 and the PB version 38, 12, 17, 91, both with no variations observed. At this point I doubt that the core code can be further improved. I got the alignment right this time. After I added the instructions to preserve EBX around the CPUID calls, the distance between the for loop labels is 112 bytes (16*7) without any padding. The SLEEP 3000 needs to run one time after the program is launched, so it should be placed before the first COUNTER_BEGIN MACRO call, as it is in the FB code. With the SLEEP 3000 in the COUNTER_BEGIN macro it runs each time the macro is called, and while this should not cause problems, only the first run does anything useful.

If you wish to copy or copy and modify any of my comments or explanations from the MASM32 or FB macros, feel free.
Title: Re: Timing Macros for PB
Post by: MikeT on January 22, 2008, 07:19:08 AM
Thx Michael
I changed nLoops and the position of the Sleep command
Results
16
16
12
88

Last two numbers are consistant.
first two are 20 one in every 10 executions.

In general is 1000 loops enough to measure?

What would be the lowest number of loops to get consitent results do you think?


[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MichaelW on January 22, 2008, 07:42:58 PM
On my P3 I get 38, 11, 17, 93 with no variations observed. Within my experience the variations you are seeing on the short sequences of instructions are normal for a P4. These sequences are not representative of real-world code, they are just something that I threw together as a quick test. Most code that would need to be optimized would involve longer sequences. You can also expect to see variations on code were a single pass through the code takes more than, very roughly, a few hundred clock cycles, but at least on my system the variations seldom reach even one percent.

Within my experience 1000 loops is more than enough. On my P3 the FB version running the same test code produces good repeatability down to 2 loops. I use 1000 loops normally, but I commonly reduce this for code where 1000 loops would take too long to execute. For code that cannot be looped I reduce the number of loops to 1, and then average the results over multiple trials.
Title: Re: Timing Macros for PB
Post by: MikeT on January 22, 2008, 10:39:06 PM
Michael,
Thank you for developing this with me. I has been educational.
Today I wanted to put it to a real world test, in this case timing a function that converts a string representation of a number to a LONG.
http://www.powerbasic.com/support/pbforums/showthread.php?t=14373&highlight=VAL

In the end I created the attached test bed to evaluate the functions.
The timing variations I got with CPU Clock count method & the Counter Cycle MACROs were:
for 1000 loops:
420-438
279-299
77-231
&
344112-370288
226056-252064
74724-76052

For 100 Loops:
460-465
282-311
82-111
&
40816-43616
25244-27836
7724-10908

I was hoping for less variation, but I am probably just not understanding the nature of the beast.
Also, Is there any direct relationship between Clocks and CounterCycles?




[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MichaelW on January 23, 2008, 03:25:32 AM
I see now that I should have explained the macros in more detail. I was assuming that you understood what I meant by "test loop", and when I saw evidence that you did not, I failed to recognize it. The test loop is built into the macros, with the first part of it in the COUNTER_BEGIN macro, and the last part of it in the COUNTER_END macro. When the macros are expanded in memory, they look something like this:

Initialize CounterOverhead and CounterCycles
Set priority class.

FOR LoopCounter = 1 TO nLoops '' Start of empty reference loop.
    ! XOR eax, eax            '' CPUID function 0
    ! push ebx                '' preserve EBX
    ! db &H0F,&HA2            '' CPUID
    ! pop ebx                 '' recover EBX
    ! db &H0F,&H31            '' RDTSC
    ! mov CounterTSC1, eax    '' store low-order dword
    ! mov CounterTSC1[4], edx '' store high-order dword
   
    ! XOR eax, eax
    ! push ebx
    ! db &H0F,&HA2
    ! pop ebx
    ! db &H0F,&H31
    ! mov CounterTSC2, eax
    ! mov CounterTSC2[4], edx
   
    '' Capture the lowest cycle count that occurs.
   
    IF (CounterTSC2 - CounterTSC1) < CounterOverhead THEN
        CounterOverhead = CounterTSC2 - CounterTSC1
    END IF   
NEXT                          '' End of empty reference loop.

SLEEP 0                       '' Start a new timeslice.

FOR LoopCounter = 1 TO nLoops '' Start of test loop.

    ! XOR eax, eax
    ! push ebx
    ! db &H0F,&HA2
    ! pop ebx
    ! db &H0F,&H31
    ! mov CounterTSC1, eax
    ! mov CounterTSC1[4], edx

    '' Code placed between the end of the COUNTER_BEGIN
    '' macro and the start of the COUNTER_END macro
    '' code goes here.

    ! XOR eax, eax
    ! push ebx
    ! db &H0F,&HA2
    ! pop ebx
    ! db &H0F,&H31
    ! mov CounterTSC2, eax
    ! mov CounterTSC2[4], edx

    '' Capture the lowest cycle count that occurs.
   
    IF (CounterTSC2 - CounterTSC1) < CounterCycles THEN
        CounterCycles = CounterTSC2 - CounterTSC1
    END IF   
NEXT                          '' End of test loop.

Restore priority class to normal.
Calculate counter cycles, correcting it for the loop overhead.


So to pick one example, that illustrates how the macros are intended to be used, and assuming that nLoops = 1000, this code:

COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS)
  RetVal = VAL(sNum)
COUNTER_END

Will execute the RetVal = VAL(sNum) statement 1000 times, capturing the lowest count that occurs in any single pass, and correcting it for the loop overhead, so the count will include only the cycles for the RetVal = VAL(sNum) statement, and not the cycles for the test loop overhead.

This code:

COUNTER_BEGIN(nLoops, %HIGH_PRIORITY_CLASS)
    FOR k = 1 TO nLoops    ' inner loop
        RetVal = VAL(sNum)
    NEXT
COUNTER_END

Will execute the outer test loop (the test loop built into the macros) 1000 times, and the inner loop 1000 times. So the outer test loop will capture the lowest count that occurs in any single pass through the inner test loop, and this count will include 1000 executions of the RetVal = VAL(sNum) statement, as well as the loop overhead for the inner loop.

The attachment tests a FB version of Paul's code, and on my P3 it returns 88, 88, 88 with no variations observed.




[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MikeT on January 23, 2008, 08:32:47 PM
aaaah... NOW i get the concept. I couldn't see the wood for the trees... I got so focused on other things i missed this obvious piece of the puzzle.

Your fbtest on my machine returns:
132-144
132-144
128-140

Could you test the atached code to see if there are any variations.
On my machine 1000 loops:
420-438
279-299
77-231
&
416-424
300-332
68 - no variation

100 loops:
439-465
282-312
82-111
&
400-424
288-308
68 - no variation

Is it possible that my P4 is responsible for the variations?
Would it help to escalate the priority?
(I think I remember you mentioned the risk is high if the code is not stable)

What puzzles me most, is that there is a 25 Clk overhead for the orginal method yet it reports consistently lower clks for the second algo than the Macros?

The purpose of investigating these Macros is to find a more stable way of timing code at the fine level for development. The absolute number is not as relevant to me as the stability of the number. A 30 Clk variation at 300Clks is a 10% error. Thats not so great.



[attachment deleted by admin]
Title: Re: Timing Macros for PB
Post by: MichaelW on January 24, 2008, 12:19:39 AM
Running on my P3 the macro return values, for 100 or 1000 loops, are very consistent, with only an occasional 2-cycle variation in the second test. For the time_stamp_count tests, the return values are usually within a few cycles of the return values for the macros, but occasionally they are way off. The way the tests are being performed the overhead for the time_stamp_count procedure call is negligible, accounting for less than one cycle in the final results, even when nLoops = 100.

The P4 is at least part of the problem, but there could be other things going on in your system that could affect the timing. Assuming you are running Windows 2000 or XP, you might take a look at the CPU usage and time on the Processes tab in Task Manager, looking for processes where the CPU usage is variable and frequently high. You might also want to examine the base priority of the processes, but you will probably first need to add it to the displayed columns in the Select Columns dialog accessible from the View menu. The greatest effect would be from a process running at an elevated priority, where the CPU usage is variable and frequently high. Testing with REALTIME_PRIORITY_CLASS might reduce or eliminate the effects of these other things. If the code is stable I see no reason to avoid REALTIME_PRIORITY_CLASS, but before using it I would make sure that I had no applications open with unsaved data.

Another possibility might be to warm up the CPUID instruction, as I have seen done in an Intel example source. This seems to have no effect on a P3 or on the AMD processors that I have tested, but it might have an effect on a P4. The code would be added in the COUNTER_BEGIN macro before the start of the reference loop, so adding it would not disturb the relative alignment of the FOR loop labels.

...
    CounterOverhead = 2000000000
    CounterCycles   = 2000000000

    CALL SetPriorityClass( GetCurrentProcess, priority_class )

    '' Warm up the CPUID instruction

    ! push ebx
    ! XOR eax, eax
    ! db &H0F,&HA2
    ! XOR eax, eax
    ! db &H0F,&HA2
    ! XOR eax, eax
    ! db &H0F,&HA2
    ! pop ebx

    FOR LoopCounter = 1 TO nLoops
...


A more stable way of timing code at the fine level, where the consistency is more important than the actual count, was my goal in developing the second set of MASM32 macros and the FB macros that use the same basic method.

Also, with the FB macros running a P3, and assuming that I remember to add the several-second delay before the first test, if I run an empty test loop:

counter_begin( NLOOPS, HIGH_PRIORITY_CLASS )
counter_end
print counter_cycles


I get a consistent return value of zero. What do your macros do on a P4?
Title: Re: Timing Macros for PB
Post by: MikeT on January 26, 2008, 07:22:28 PM
Quotebut you will probably first need to add it to the displayed columns in the Select Columns dialog accessible from the View menu.
Ah OK. Well all I see in High proiority is
taskmgr.exe
winlogon.exe
csrss.exe
Everything else is normal priority
I keep this machine clean as a whistle.
Nothing is taking up processor time as i watch for several minutes.
When I run your test, it reports about 91% CPU usage for a moment

counter_begin( NLOOPS, HIGH_PRIORITY_CLASS )
counter_end
print counter_cycles

consitently returns zero yes

Adding the warm up code to your testing makes it a little more stable. only the third test fluctuates occasionally.
Adding the warm up code to the VAL testing makes little difference at 100 or 1000 loops on this P4 with XP Pro SP2.
But I am relieved to learn that on your P3 it is consitent.


Thank you for your help with this Michael. I learned a lot.

Title: Re: Timing Macros for PB
Post by: ecube on December 10, 2008, 09:18:13 PM
wow MichaelW you really put lot of time into this, you should tell the PB developers about it, sure they can give you recognition and use it internally to speed test things they do.