I've been experimenting with timing routines to see if I could get some consistency.
For this test, I used Kuntnor's strcat routine. Each line is a copy of the routine run 15 times. The first run is on two strings of 1 byte each, the remaining 14 are two strings of 1000 bytes each. Each run's result is the minimum cycles running the proc only ten times. I'm pretty happy with the consistency considering the minimal number of times executed. Each routine, t1-t20 is the same routine duplicated 20 times to check the consistency. Here's a sample output-
Quote
Str 1 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
Str 1 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
=== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
0 byte misalignment
OVH 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t1 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t2 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t3 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t4 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t6 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t8 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t17 12 2811 2811 2811 2811 2811 2811 2811 2811 2801 2816 2974 2974 2974 2974
t18 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t19 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
t20 12 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811 2811
1 byte misalignment
OVH 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t1 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t2 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t3 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t4 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t6 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t8 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t17 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3069 3069 3069 3069 3069
t18 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t19 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t20 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
2 byte misalignment
OVH 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t1 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t2 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t3 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t4 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t6 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t8 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t17 12 3093 3093 3093 3093 3093 3093 3093 3093 3069 3069 3069 3069 3069 3069
t18 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t19 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t20 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
3 byte misalignment
OVH 74 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t1 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t2 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t3 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t4 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t6 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t8 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t17 12 3093 3093 3093 3093 3093 3093 3093 3069 3069 3069 3069 3069 3069 3069
t18 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t19 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
t20 12 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093 3093
The OVH routine is the overhead. It is simply a proc with a return. The overhead was computed by simply testing the OVH routine the same way as the rest, 10 reps. The overhead was subtracted from the results of all the remaining tests.
To get this consistency, I found I had to physically copy the code to test from each of the procs, into a fixed test area. When I tested them in place, there was a definite pattern showing that location has a large effect on how fast a proc runs. In addition, I called all the procs from the exact same code and location, as, again, the location of the calling routine had an effect on timing, even when aligned identically. Any finally, I had to copy (or create in this case) all the test strings to the same location for all tests.
The timing routines are as follows-
TimesToLoop equ 10
PriClass textequ <REALTIME_PRIORITY_CLASS> ; priority to use for tests
BeginCounter macro ; This macro is inserted before a block of test code
align 16 ; just to be sure all code starts with equal advantage
invoke GetCurrentProcess
invoke SetPriorityClass, eax, PriClass ; use the highest priority for all the tests
invoke GetCurrentThread
invoke SetThreadPriority,eax,THREAD_PRIORITY_TIME_CRITICAL
mov Min,9999999999
mov Loops,TimesToLoop
mov ExtraLoops,10 ; will do several times first to let things settle out
inv Sleep,0
ExtraLoopsLoop = $
sub ExtraLoops,1
TimingLoop = $
mov [StartingCycle],eax ; just to make sure they are in the cache
mov [StartingCycle+4],eax
Repeat 6 ; Intel says do 4 times, but I've found that more works a little better
xor eax,eax
cpuid ; call the EAX=0 version of CPUID to clear pipeline (serialize) (finish any pending instructions)
Endm
rdtsc
mov [StartingCycle],eax
mov [StartingCycle+4],edx
endm
EndCounter macro ; This macro is inserted after each block of test code
xor eax,eax
cpuid ; flush
rdtsc
sub eax,[StartingCycle] ; get result in EDX:EAX
sbb edx,[StartingCycle+4]
cmp ExtraLoops,0
jne ExtraLoopsLoop ; skip first several iterations
cmp Min,eax ; save the fastest one
jb @f
mov Min,eax
@@:
sub Loops,1
jnz TimingLoop
invoke GetCurrentThread
inv SetThreadPriority,eax,THREAD_PRIORITY_NORMAL
invoke GetCurrentProcess
invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS ; back to normal
inv Sleep,0
mov eax,Min ; return the minimum elapsed time
.if DoingOverhead==0
sub eax,OverHead
.endif
endm
Notice that I run 10 loops through the code before saving any results (extraloops). I've found the first few timings are really flakey so I just throw them away. Also notice that I run cpuid 6 times before starting the test. I found some Intel documentation implying it would take 4 times for the code to actually settle down, and I've found six to be better on my cpu.
Also notice that I set the thread priority to the maximum. Don't do this on a machine that has to be responding to something else.
There are a few glitches in the timing. Something that takes 24 cycles occasionally runs even when I set the thread to the highest possible priority. The next step will be to run each test several times, and discard any abnormal ones.
This is all done on an Athlon, so your mileage may vary.
[attachment deleted by admin]