Print Page - Timing Tests

Title: Timing Tests
Post by: Jimg on August 26, 2005, 01:37:38 AM

I've been experimenting with timing routines to see if I could get some consistency.

For this test, I used Kuntnor's strcat routine. Each line is a copy of the routine run 15 times. The first run is on two strings of 1 byte each, the remaining 14 are two strings of 1000 bytes each. Each run's result is the minimum cycles running the proc only ten times. I'm pretty happy with the consistency considering the minimal number of times executed. Each routine, t1-t20 is the same routine duplicated 20 times to check the consistency. Here's a sample output-

Quote
Str 1 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
Str 1 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
=== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

0 byte misalignment
OVH 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t1 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t2 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t3 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t4 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t6 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t8 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t17 12 2811 2811 2811 2811 2811 2811 2811 2811 2801 2816 2974 2974 2974 2974
t18 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t19 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t20 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811

1 byte misalignment
OVH 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t1 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t2 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t3 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t4 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t6 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t8 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t17 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3069 3069 3069 3069 3069
t18 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t19 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t20 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093

2 byte misalignment
OVH 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t1 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t2 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t3 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t4 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t6 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t8 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t17 12 3093 3093 3093 3093 3093 3093 3093 3093 3069 3069 3069 3069 3069 3069
t18 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t19 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t20 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093

3 byte misalignment
OVH 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t1 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t2 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t3 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t4 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t6 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t8 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t17 12 3093 3093 3093 3093 3093 3093 3093 3069 3069 3069 3069 3069 3069 3069
t18 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t19 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t20 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093

The OVH routine is the overhead. It is simply a proc with a return. The overhead was computed by simply testing the OVH routine the same way as the rest, 10 reps. The overhead was subtracted from the results of all the remaining tests.

To get this consistency, I found I had to physically copy the code to test from each of the procs, into a fixed test area. When I tested them in place, there was a definite pattern showing that location has a large effect on how fast a proc runs. In addition, I called all the procs from the exact same code and location, as, again, the location of the calling routine had an effect on timing, even when aligned identically. Any finally, I had to copy (or create in this case) all the test strings to the same location for all tests.

The timing routines are as follows-

Code Select


TimesToLoop equ 10
PriClass textequ <REALTIME_PRIORITY_CLASS> ; priority to use for tests


BeginCounter macro  ; This macro is inserted before a block of test code
    align 16    ; just to be sure all code starts with equal advantage
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, PriClass  ; use the highest priority for all the tests
    invoke GetCurrentThread
    invoke SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
            
    mov Min,9999999999
    mov Loops,TimesToLoop
    mov ExtraLoops,10   ; will do several times first to let things settle out
    inv Sleep,0
ExtraLoopsLoop = $
    sub ExtraLoops,1
TimingLoop = $
    mov [StartingCycle],eax     ; just to make sure they are in the cache
    mov [StartingCycle+4],eax
    Repeat 6    ; Intel says do 4 times, but I've found that more works a little better
        xor eax,eax
        cpuid       ; call the EAX=0 version of CPUID to clear pipeline (serialize) (finish any pending instructions)
    Endm
    rdtsc
    mov [StartingCycle],eax
    mov [StartingCycle+4],edx
endm

EndCounter macro        ; This macro is inserted after each block of test code
    xor eax,eax
    cpuid       ; flush
    rdtsc
    sub eax,[StartingCycle] ; get result in EDX:EAX
    sbb edx,[StartingCycle+4]
    cmp ExtraLoops,0
    jne ExtraLoopsLoop  ; skip first several iterations
    cmp Min,eax         ; save the fastest one
    jb @f
    mov Min,eax
@@:
    sub Loops,1
    jnz TimingLoop
    invoke GetCurrentThread
    inv SetThreadPriority,eax,THREAD_PRIORITY_NORMAL
    invoke GetCurrentProcess
    invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS ; back to normal
    inv Sleep,0 
    mov eax,Min     ; return the minimum elapsed time
    .if DoingOverhead==0
        sub eax,OverHead
    .endif
endm

Notice that I run 10 loops through the code before saving any results (extraloops). I've found the first few timings are really flakey so I just throw them away. Also notice that I run cpuid 6 times before starting the test. I found some Intel documentation implying it would take 4 times for the code to actually settle down, and I've found six to be better on my cpu.
Also notice that I set the thread priority to the maximum. Don't do this on a machine that has to be responding to something else.

There are a few glitches in the timing. Something that takes 24 cycles occasionally runs even when I set the thread to the highest possible priority. The next step will be to run each test several times, and discard any abnormal ones.

This is all done on an Athlon, so your mileage may vary.

[attachment deleted by admin]

The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: Jimg on August 26, 2005, 01:37:38 AM