News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Timing macros rebirth

Started by Petroizki, January 22, 2006, 07:40:46 AM

Previous topic - Next topic

MichaelW

#15
[EDIT] Got this wrong the first time, these are now for NORMAL_PRIORITY_CLASS (big difference)[/EDIT]


Petroizki's macros clock counts:
lstrcmp:  3179, 3216, 3229, 3286, 3222, 3255, 3262, 3257, 3241, 3236,
szCmp:    140, 152, 134, 143, 151, 146, 146, 137, 139, 143,
multiply: 9, 19, 4, 13, 4, 8, 3, 9, 14, 8,
division: 31, 27, 35, 36, 33, 38, 34, 29, 28, 45,

MichaelW's macros clock counts:
lstrcmp:  1199, 1204, 1206, 1208, 1207, 1200, 1215, 1204, 1205, 1202,
szCmp:    50, 55, 56, 54, 51, 48, 50, 52, 50, 51,
multiply: 0, 1, 1, 0, 0, 0, 0, 1, 4294967295, 1,
division: 35, 35, 34, 36, 37, 35, 35, 34, 35, 36,

eschew obfuscation

Petroizki

New Upload.

In this version i disabled Sleep calls, and made some other changes, that seemed to steady the setup time timing in debugger. But it's very difficult to know what would be better for slower CPU. I think i have to buy some old 300MHz for testing.  :bg

I compiled the compare.exe with NORMAL_PRIORITY_CLASS.

MichaelW

Much better.

Petroizki's macros clock counts:

lstrcmp:  1125, 1149, 1149, 1126, 1126, 1149, 1126, 1149, 1149, 1126,
szCmp:    43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,

MichaelW's macros clock counts:
lstrcmp:  1133, 1148, 1141, 1147, 1132, 1134, 1136, 1132, 1143, 1138,
szCmp:    47, 49, 48, 49, 49, 47, 52, 47, 49, 48,
multiply: 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
division: 34, 34, 34, 34, 34, 34, 34, 34, 35, 41,

eschew obfuscation

Petroizki

Thanks Michael, seems like it's giving pretty much perfect values.

New Upload.

I have added a new timer based on the macros, it's entirely for timing procedures only. The reason is to get rid of the inline timing problems, and to bypass some aligning problems. The procedure timer creates it's own 8kb page, and it does increase your exe size, but who cares, no one is going to use this stuff in their programs release versions anyway. If you are timing procedures, I strongly advice you to use this timer, instead of the inline macros. Only use inline macros when you are timing some important parts of the code.

The procedure timer is as easy to use as possible, just like using invoke:
PTIMER_PROC_COUNT lstrlen, ADDR str1 ; <- calls procedure timer, single-pass
; <- result is in '[qwPROCTIMER_RESULT]'


For looping the procedure several times, use the looping call:
PTIMER_PROC_LOOP_COUNT lstrlen, ADDR str1 ; <- call procedure timing, multiple times
; <- like always, result is in '[qwPROCTIMER_RESULT]'


And again, the looping version must always behave exactly the same way on every call, to get stable results. The procedure can expect the registers, flags, and FPU to be like they were set before the timer call. Parameters are automatically repushed on every loop, so you can mess with them, however the timer does push some of it's own values to the stack, so don't expect esp-register to be the same. The timed procedure must be using calling convention 'stdcall' (procedure pops parameters from stack).

The resulting value ignores clock count of ret instruction and parameter pushing..

QuotePetroizki's inline macros clock counts:
lstrcmp:  1017, 1010, 1010, 1010, 1010, 1010, 1010, 1010, 1010, 1010,
szCmp:    31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
multiply: 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
division: 38, 38, 38, 38, 38, 38, 38, 38, 38, 38,

Petroizki's procmacros clock counts:
lstrcmp:  1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012,
szCmp:    28, 28, 28, 28, 28, 28, 28, 28, 28, 28,

zooba

Quote from: Petroizki on January 23, 2006, 09:22:22 AM
It's very difficult to know what would be better for slower CPU. I think i have to buy some old 300MHz for testing.  :bg

Have you tried getting a program which eats up CPU time? CPU Killer is one that comes to mind... :U

hutch--

Easy,

Write an app in C++ with MFC, OOP(S) and the works.  :bg
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Petroizki

:toothy

Oh, and new upload too. Some registers were messed up, should work perfectly now. :red

How does the comparison.exe perform on other machines?

zooba

Quote from: hutch-- on January 24, 2006, 02:31:43 AM
Write an app in C++ with MFC, OOP(S) and the works.  :bg

And make it loop from 1 to 10. That should be long enough to do all these tests :U

Processor:         Intel(R) Pentium(R) M processor 1400MHz

Petroizki's inline macros clock counts:
lstrcmp:  806, 821, 806, 806, 806, 806, 806, 806, 806, 806,
szCmp:    31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
multiply: 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
division: 36, 36, 36, 36, 36, 36, 36, 36, 36, 36,

Petroizki's procmacros clock counts:
lstrcmp:  796, 797, 797, 797, 797, 797, 797, 797, 797, 797,
szCmp:    25, 25, 25, 25, 25, 25, 25, 25, 25, 25,

MichaelW's macros clock counts:
lstrcmp:  811, 806, 801, 810, 817, 804, 807, 806, 821, 816,
szCmp:    22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
multiply: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
division: 35, 35, 35, 35, 35, 34, 35, 35, 36, 34,


It was actually running at 1400MHz too. Damn speedstep takes it down to 700MHz most of the time :wink

MichaelW

This is on a P3-500, but for some reason the processor does not display.

Petroizki's inline macros clock counts:
lstrcmp:  1123, 1123, 1147, 1123, 1148, 1137, 1123, 1147, 1165, 1123,
szCmp:    43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
multiply: 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
division: 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,

Petroizki's procmacros clock counts:
lstrcmp:  1123, 1128, 1128, 1128, 1128, 1128, 1128, 1128, 1128, 1128,
szCmp:    37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

MichaelW's macros clock counts:
lstrcmp:  1205, 1174, 1172, 1192, 1173, 1171, 1173, 1175, 1201, 1171,
szCmp:    49, 48, 54, 50, 49, 50, 51, 56, 53, 53,
multiply: 0, 1, 0, 0, 0, 0, 0, 4294967295, 1, 0,
division: 35, 36, 35, 35, 36, 35, 36, 35, 35, 36,


And this is on a K5-PR166:

Processor: AMD-K5(tm) Processor

Petroizki's inline macros clock counts:
lstrcmp:  162, 162, 162, 162, 162, 162, 162, 162, 162, 162,
szCmp:    40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
multiply: 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
division: 37, 37, 37, 37, 37, 37, 37, 37, 37, 37,

Petroizki's procmacros clock counts:
lstrcmp:  186, 186, 187, 187, 187, 187, 187, 187, 187, 187,
szCmp:    32, 32, 32, 32, 32, 32, 32, 32, 32, 32,

MichaelW's macros clock counts:
lstrcmp:  190, 191, 192, 192, 192, 190, 191, 191, 191, 190,
szCmp:    36, 36, 36, 36, 36, 36, 36, 36, 36, 36,
multiply: 4294967295, 4294967295, 4294967294, 4294967295, 4294967295, 4294967295
, 4294967295, 4294967295, 4294967295, 4294967295,
division: 35, 34, 34, 34, 34, 34, 34, 34, 34, 34,


And just in case anyone noticed, the K5 was a very advanced processor for its time  :8)



eschew obfuscation

Petroizki

Quote from: MichaelW on January 24, 2006, 09:08:52 AMThis is on a P3-500, but for some reason the processor does not display.
The processor probably does not support extended cpuid level 80000004h? Which is weird.. Maybe the CPU string is only in 80000002h-80000003h. I made changes to the comparison.exe, so it should print the CPU string, if it's there.

Quote from: MichaelW on January 24, 2006, 09:08:52 AMAnd just in case anyone noticed, the K5 was a very advanced processor for its time  :8)
lstrcmp sure does look impressive, what OS is it? Windows XP's lstrlen is almost ten times slower...  :eek


It seems that the procmacros always work well on szCmp, shows to be slightly faster than ptimers (no parameter pushing and ret timed), like it should be. lstrlen however seems to get slightly more clock counts sometimes, maybe it's an alignment issue.

New Upload. Procedure timer now corrects stack, if the timed procedure failed to do so (C-calling convention for example). Timing procedures with C-calling convention might not always give the exact clock count (+-1 clock cycles in some cases?), but I don't think nobody cares. However, when possible; use stdcall.

MichaelW

The K5 is running under Windows 98 SE, and the P3 under Windows 2000 SP4.

In response to CPUID function 80000000h the P3 returns 03020101h. In response to CPUID function 0 it returns 2. AFAIK the Processor Brand String is available only on the more recent Intel processors, and the extended functions were first implemented on the P4. For the older (Intel) processors you must get the Brand Index using CPUID function 1, and then look up the Processor Brand String in a table. See the IA-32 Intel Architecture Software Developer's Manual Volume 2A, Table 3-19, Mapping of Brand Indices and IA-32 Processor Brand Strings.

I think Intel started out making a mess of this, and then decided to follow AMD's lead.

eschew obfuscation

Mark_Larson

Quote from: MichaelW on January 22, 2006, 11:16:16 AM
I seem to recall an Intel document with code that "warmed up" CPUID or RDTSC or both, but after a quick search I could not find it.

CPUID is a "serializing" instruction.  It flushes the pipeline.  If I remember right the first p4's had a 21 stage pipeline.  I might be mis-remembering.  If you have a big program, and you are timing just one routine in a part of that program, then CPUID will give you more accurate timings because it'll flush the pipeline before executing the code.  If you just have a small snippet of code that runs all by itself, then you aren't going to notice a difference.

And I hear you about only having an hour of free time to do stuff.  I don't have much time to code either.

Mark
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

MichaelW

This was a quick test to determine if it is possible to get a consistent cycle count for the timing overhead, under the best of conditions. Running this code on my P3 the overhead is 139 cycles for perhaps 90-95% of the loops. The count for the first loop is out of line for obvious reasons, but other smaller inconsistencies, where the count is always > 139, seem to occur more or less randomly, even with REALTIME_PRIORITY_CLASS, and even when starting at the beginning of a time slice. I see no way to eliminate the inconsistencies without filtering of the results, and I now think the filtering should just select the shortest time, as Ratch suggested here.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
        startcount  dd 0
        counts      dd 20 dup (0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep,2000

    invoke GetCurrentProcess
    ;invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
    invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS

    mov   edi, 19*4

    align 16
  @@:
    invoke Sleep, 0

    xor   eax, eax
    cpuid
    rdtsc
    mov   startcount, eax
    ;;;;
    xor   eax, eax
    cpuid
    rdtsc

    sub   eax, startcount
    mov   [edi+counts], eax
    sub   edi, 4
    jnz   @B
   
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS

    mov   edi, 19*4
  @@:
    print ustr$([edi+counts]),13,10
    sub   edi, 4
    jnz   @B
   
    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

eschew obfuscation

Petroizki

Ok, new upload.

I now use the shortest time choosing, for the setup time overhead.

Choosing the shortest time to the actual loop counting does not work for me. For example, when I loop 32 times through 4xnop instructions in my comp; I get 31x1 clock cycles and 1x0 clock cycles. I just think the current average counting gives more stable results. I don't know how it is with other CPU's, but i just want to filter the abnormal situations on small code.

MichaelW

I have doubts that the nop results are valid because AFAIK nop is a special case instruction. This code returns very consistent and more or less reasonable counts for most of the instructions that I tested (the exceptions are noted). A result of zero seems reasonable (for this code) for an instruction that executes in less than one clock cycle and/or runs in parallel with one of the overhead instructions. Perhaps there is some simple method of isolating the test instruction(s) from the overhead instructions so the two cannot run in parallel. Perhaps by bracketing the test instructions with CPUIDs and including them in the overhead?

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
        startcount  dd 0
        overhead    dd -1
        count       dd -1
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep,2000

    invoke GetCurrentProcess
    ;invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
    invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS

    mov   esi, 10
    .WHILE( esi )
      mov   edi, 30
      align 16   
      .WHILE( edi )
        invoke Sleep, 0
        xor   eax, eax
        cpuid
        rdtsc
        mov   startcount, eax
        ;;;;
        xor   eax, eax
        cpuid
        rdtsc
        sub   eax, startcount
        .IF( eax < overhead )
            mov   overhead, eax
        .ENDIF
        sub   edi, 1
      .ENDW 

      mov   edi, 30
      align 16
      .WHILE( edi )
        invoke Sleep, 0
        xor   eax, eax
        cpuid
        rdtsc
        mov   startcount, eax
        ;;;;;;
        ;nop              ; 0
        ;nops 2           ; 1
        ;nops 3           ; 1
        nops 4           ; 1
        ;nops 5           ; 0 usually, sometimes 1 or 2
        ;ror   eax, 1     ; 1
        ;mov   eax, 1     ; 0
        ; mul   ebx       ; 1 (in parallel with overhead?)
        ;xor   edx, edx   ; 0 by itself
        ;div   ebx        ; 7 with (xor  edx, edx)
        ;jmp   @F         ; 0
      @@:
        ;push  eax        ; 0 for both
        ;pop   eax        ;
        ;fdiv             ; 77
        ;cmpxchg dl, bh   ; 2 for 5 iterations of esi loop, then 0 ??
        ;;;;;;
        xor   eax, eax
        cpuid
        rdtsc
        sub   eax, startcount
        sub   eax, overhead
        .IF( eax < count )
            mov   count, eax
        .ENDIF
        sub   edi, 1
      .ENDW

      print "overhead="
      print ustr$(overhead),13,10
      print "count="
      print ustr$(count),13,10
      sub   esi, 1
    .ENDW 
   
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
   
    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


eschew obfuscation