Which is faster?

Started by Neil, May 01, 2009, 10:56:52 AM


hutch--

 :dazzled:

It must be the silly season; everybody seems to be unhappy.  :boohoo:
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

To get back on topic-
Quote from: jj2007 on May 03, 2009, 05:21:49 AM
I have no explanation why code can suddenly, out of the blue, run 3% faster, but it happens all the time. To improve reliability, one might consider eliminating both fast and slow outliers, but that would require some overhead in \masm32\macros\timers.asm - such as 100 loops to calculate the expected average before starting the main exercise...?
Sometimes I think the words I write just stay local to my machine, echo back to me when I read a thread, but never actually go where anyone else can see them ::)

This is exactly what I have been complaining about the last few pages.

Time each execution of the code.  Throw away the slowest half, because something was obviously going on in a Windows background process.   If you do this, there is no need to run it a million times; the fastest values are the ones that weren't affected by something else.  I have found that 100 iterations is more than enough: either throw away the slowest half and average the rest, or just pick the fastest one.  Doing either of these I get rock-solid consistent results.  That's why I say it's silly to loop a million times.
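[Editor's note: the MASM timing macros discussed in this thread are not reproduced here, so the following is a minimal Python sketch of the statistical approach Jimg describes - time each run, discard the slowest half, and keep the fastest sample. The function name and iteration count are illustrative, not from any forum attachment.]

```python
import time

def time_code(fn, iterations=100):
    """Run fn() `iterations` times, then summarize the samples as
    described above: average only the fastest half (slow runs were
    likely preempted by the OS) and also report the single fastest run."""
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - t0)
    samples.sort()
    best_half = samples[:len(samples) // 2]   # throw away the slowest half
    return samples[0], sum(best_half) / len(best_half)
```

Either return value should be stable across runs; the fastest single sample is, by construction, never larger than the average of the best half.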

dedndave

I am beginning to agree with you Jim - lol
(that'll cheer him up, for sure, Steve)
I think part of the problem may be the length of time each test takes.
Let's call each pass a "burst" - not a good term for purists, perhaps, but it is descriptive.
If the burst period is brief, in terms of CPU time, the occurrence of anomalies will be minimized.
Of course, if it is too brief, the time measurements have too much overhead.
Carefully selecting the length of each burst seems important.
In fact, the iterations per burst should be adjusted until the burst period falls within a certain window.
This seems to make sense, particularly when comparing one "Instruction Sequence Under Test" to another.
If we want to compare two ISUTs, we should adjust one or both iteration counts until the burst lengths are nearly the same.
Then, run them enough times to assure acquisition of the fastest time, as Jim suggests.
Also, it should not be difficult to run time measurements on the overhead and subtract it from the results.
This overhead will vary from one platform to another, the same as any other instruction sequence.
Any time measurements we agree upon should 1) produce predictable, accurate results with known ISUTs, and
2) produce stable and repeatable results on several platforms with several ISUTs.
As far as 32-bit code is concerned, I am a novice, to be sure.
But, after 30 years in Electronics Engineering, I do have some experience devising certification/verification tests.
Certainly, there is much for me to learn about all the CPUs out there, but basics are basic and statistics still apply.
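[Editor's note: dedndave's burst-calibration idea - grow the per-burst iteration count until the burst falls in a target window - can be sketched as follows. This is illustrative Python, not forum code; the 1 ms target and the doubling strategy are arbitrary assumptions.]

```python
import time

def calibrate_iterations(fn, target_ns=1_000_000):
    """Double the per-burst iteration count until one burst takes at
    least target_ns (1 ms here, an arbitrary window), so that two
    sequences under test can be given bursts of comparable length."""
    n = 1
    while True:
        t0 = time.perf_counter_ns()
        for _ in range(n):
            fn()
        elapsed = time.perf_counter_ns() - t0
        if elapsed >= target_ns:
            return n, elapsed
        n *= 2
```

Running this once per ISUT before the real measurement gives each sequence an iteration count that makes the burst lengths nearly equal, as suggested above.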

dedndave

this subject has piqued my curiosity, for some reason - lol

Unfortunately, I don't feel qualified to develop the entire program, myself. There is too much
about 32-bit code that I have yet to learn in order to cover all the bases.

One simple question popped into my head, though. It has to do with register dependencies.
As most of us know, the NOP instruction was derived from XCHG AX,AX, or XCHG EAX,EAX in protected mode.
I think the assembler will code 90h for either. These timing routines you guys use should show results easily.

ISUT #1:

NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP

ISUT #2

XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX
XCHG EAX,ECX

ISUT #3

XCHG EAX,ECX
XCHG EAX,EDX
XCHG EAX,ECX
XCHG EAX,EDX
XCHG EAX,ECX
XCHG EAX,EDX
XCHG EAX,ECX
XCHG EAX,EDX

I wonder if NOP is dependent on the value in AX/EAX?
My guess is that the microcode is smart enough to know better.
Those guys at Intel are pretty sharp.


btw - is there a "standard" timing test program used here in the MASM32 forum,
or are the results I see posted using different code?

jj2007

Quote from: MichaelW on May 03, 2009, 07:23:08 AM
JJ,

The code is not suddenly running 3%, or whatever, faster. The problem is that the test is being interrupted, and the more it's interrupted the higher the cycle counts.

Michael,
"My" problem with this (I am a measurement specialist, too, although not in Masm) is that this is such a rare event - which would imply that 99% of the time the code is measured, say, 3% too slow. Where does this "constant" +3% error come from? See remarks on time slices below.

Quote
The second set of timing macros was an attempt to correct this problem. These macros capture the lowest cycle count that occurs in a single loop through the block of code, on the assumption that the lowest count is the correct count.

Are these the second attachment in the sticky Lab post? Celeron M results are not so clear to me ::)
HIGH_PRIORITY_CLASS
-132 cycles, empty
0 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
-108 cycles, mul ecx
0 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
36 cycles, StrLen

REALTIME_PRIORITY_CLASS
0 cycles, empty
-108 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
0 cycles, mul ecx
-120 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
24 cycles, StrLen

Quote
The higher counts that occur are the result of one or more context switches within the loop. Context switches can occur at the end of a time slice, so to minimize the possibility of the loop overlapping the time slice the ctr_begin macro starts a new time slice at the beginning of the loop. If the execution time for a single loop is greater than the duration of a time slice (approximately 20ms under Windows), then the loop will overlap the time slice, and if another thread of equal priority is ready to run, then a context switch will occur.

This would somehow imply that the loop to be timed must take more than 20 ms - quite a high number of cycles, and not typical for what we are timing here... or do I misunderstand something?

Quote
Unfortunately, these macros do not work well with a P4, typically returning cycle counts that are a multiple of 4 and frequently higher than they should be.

Indeed ;-)

@dedndave: Most who do timings here use MichaelW's macros, see first post in the Laboratory.

Mark Jones

Quote from: jj2007 on May 03, 2009, 06:37:33 PM
Michael,
"My" problem with this (I am a measurement specialist, too, although not in Masm) is that this is such a rare event - which would imply that 99% of the time the code is measured, say, 3% too slow. Where does this "constant" +3% error come from?

My take on this is that there is always going to be some difference between hardware and OS, even depending on the running apps, so a ±3% variation is not worth the hassle. Just take the fastest time, and consider it "the fastest time." For user-mode code, there are always going to be things to slow it down. The fastest time at least gives a baseline speed value.

I guess that means for real-mode code, a new set of "timers" need to be created. :bg
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Quote from: Mark Jones on May 03, 2009, 07:39:21 PM
Just take the fastest time

-120 cycles, rol ecx,32

For example?

MichaelW

Quote from: jj2007 on May 03, 2009, 06:37:33 PM
"My" problem with this (I am a measurement specialist, too, although not in Masm) is that this is such a rare event - which would imply that 99% of the time the code is measured, say, 3% too slow. Where does this "constant" +3% error come from? See remarks on time slices below.

With counts in the range of several thousand cycles, I suspect that all of the results include interruptions. I have no idea how to account for the "constant" 3%, but it seems plausible to me that the system is performing some activity that in bursts is using 3% of the processor time.

Quote
Are these the second attachment in the sticky Lab post? Celeron M results are not so clear to me ::)
HIGH_PRIORITY_CLASS
-132 cycles, empty
0 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
-108 cycles, mul ecx
0 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
36 cycles, StrLen

REALTIME_PRIORITY_CLASS
0 cycles, empty
-108 cycles, mov eax,1
0 cycles, mov eax,1 mov eax,2
0 cycles, nops 4
0 cycles, mul ecx
-120 cycles, rol ecx,32
0 cycles, rcr ecx,31
36 cycles, div ecx
24 cycles, StrLen


Yes, in counter2.zip. Your results look like P4 results, in code without any significant delay after the app starts and before entering the timing loops. Try implementing a 3-4 second delay at the start and, if you have one, try running the test on a non-NetBurst processor. Also, it's not reasonable to expect meaningful counts on any processor for an instruction that executes in less than one clock cycle, or due to the inability to isolate the timed instructions from the timing instructions, in any small number of clock cycles. I would be interested to see the results for my code from the attachment, both with and without the delay at the start.

QuoteThis would somehow imply that the loop to be timed must need more than 20ms - quite a high number of cycles, and not typical for what we are timing here... or do I misunderstand something?

Sorry, I copied most of the text from the documentation for similar macros that I implemented in another language, where I expected some of the readers to have no idea of what a clock cycle actually is.
eschew obfuscation

jj2007

Michael,
Here are the timings for counter2.zip, on a Celeron M (not a P4):
Quote
99
99
444     192
444     192
444     192
444     192
444     192
444     192
444     192
444     156
444     156
444     156
444     192
444     192
444     156
444     156
444     192
444     192
444     192
444     192
444     192
444     192

Microsoft blames rdtsc for the negative cycles and outliers, and suggests QPC plus clamping...

Jimg

Quote from: jj2007 on May 03, 2009, 07:57:58 PM
Quote from: Mark Jones on May 03, 2009, 07:39:21 PM
Just take the fastest time

-120 cycles, rol ecx,32

For example?

Clearly, the method for determining the loop overhead will occasionally itself contain Windows glitches.  Perhaps you need to test the loop overhead a hundred times, and use the smallest result.  Subtracting the smallest overhead from the fastest time should be quite consistent.
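[Editor's note: Jimg's fix - take the minimum of many overhead measurements and subtract it from the minimum of many timed runs - can be sketched like this in Python. The clamp at zero is an illustrative choice to rule out the negative results quoted earlier; the function names are hypothetical.]

```python
import time

def min_overhead(samples=100):
    # Time the empty harness `samples` times; the smallest reading is
    # the one least polluted by interrupts and context switches.
    best = float("inf")
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        best = min(best, t1 - t0)
    return best

def net_time(fn, samples=100):
    # Fastest raw time minus smallest overhead, clamped at zero so that
    # measurement jitter can never produce "negative cycles".
    best = float("inf")
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        fn()
        best = min(best, time.perf_counter_ns() - t0)
    return max(0, best - min_overhead(samples))
```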

dedndave

from what i can see, there is no good way to time these functions - lol
this is especially true for those (myself included) that have a dual-core CPU
one thing MS suggests to help is to confine the thread to a single core
now, how are you supposed to evaluate the advantage of having two cores ? - lol
it appears that the burst needs to be substantially long - which introduces anomalies
i think, at some point, you have to call it "good enough" and accept what you get

is it possible to....
switch to real mode
get a value from the motherboard counter-timer
switch to protected mode
ISUT burst sample
switch back to real mode
get another value from the motherboard counter-timer
switch back to protected mode
yield result

i realize the overhead is large, but it could be subtracted out
scratch that idea - real mode doesn't get you access to the counter-timer, either
let me dust off my stopwatch - lol

hutch--

 :bg

I wonder how long it will take for everyone to realise that real time testing on large samples is one of the few techniques that does not suffer most of these problems. Tailor the test data to the task at hand, make it BIG enough and run it LONG enough and you will get under the magic one percent.

There is yet another variation of this: allocate so much memory that it cannot fit into the cache, then copy the test data sequentially through the allocated memory in blocks larger than the cache, then do random rotation of the bits being read, and you will really see how bad some algos are.

Jimg

I tend to agree, more and more each day.  It's not nearly as much fun, but real-world testing is the way to go.

dedndave

............ and test it on a few different CPUs
i fear optimization that leans toward the author's CPU only

MichaelW

QuoteMicrosoft blames rdtsc for...

That article is not about timing code in clock cycles; it's about implementing high-resolution timers. A variable clock speed makes the TSC useless as a time base for a timer. With multiple processors/cores there are multiple TSCs, typically running asynchronously, which causes obvious problems if the timer thread is not confined to a single processor/core. For both reasons, the article recommends using the high-resolution performance counter.

For timing code in clock cycles the problem with multiple processors/cores should be solvable by restricting the code to running on a single processor/core using SetProcessAffinityMask or SetThreadAffinityMask. For timing code in clock cycles a variable clock speed does not present a problem, but it does present a serious problem for timing in units of time.
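[Editor's note: the affinity fix MichaelW names uses the Win32 calls SetProcessAffinityMask/SetThreadAffinityMask. As a portable illustration only, here is a Python sketch that pins the current process to one logical CPU where the OS exposes an affinity API; the function name is hypothetical.]

```python
import os

def pin_to_one_cpu():
    """Confine the current process to a single logical CPU so that all
    time-stamp-counter readings come from one core's TSC.
    os.sched_setaffinity exists on Linux; on Windows one would call
    SetThreadAffinityMask or SetProcessAffinityMask instead."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {0})       # pin to logical CPU 0
        return os.sched_getaffinity(0)
    return None  # affinity API not available on this platform
```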