Which is faster?

Started by Neil, May 01, 2009, 10:56:52 AM


jj2007

Quote from: Jimg on May 01, 2009, 10:48:08 PM
It really depends upon how you write the test.  This test uses repeat 1000, and only does it once.  lodsb is 3 times faster on my AMD, 4 times faster on my celeron and about 15% slower on my 1.8Ghz pentium M

Jim,
3968    cycles for lodsb
997     cycles for 100*movzx, inc esi
997     cycles for 100*movzx, add esi, 1
4017    cycles for 100*mov al
3993    cycles for lodsb


This is your test, but with LOOP_COUNT = 1000000. MichaelW wrote the timing routines, and he is in a much better position to explain what happens if you reduce the outer count to 1. I used the REPEAT 1000 because lodsb increments esi - doing that a million times in a row may have undesired side effects :wink
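
For the record, the skeleton of such a test looks roughly like this - a sketch only, assuming MichaelW's counter_begin/counter_end macros from timers.asm; TestBuffer is just a placeholder name:

LOOP_COUNT = 1000000

.data?
  TestBuffer db 1000 dup(?)              ; room for the 1000 byte reads

.code
  counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
    mov esi, offset TestBuffer           ; reset esi each pass - lodsb advances it
    REPEAT 1000
      lodsb
    ENDM
  counter_end
  print str$(eax), " cycles for 1000*lodsb", 13, 10

  counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
    mov esi, offset TestBuffer
    REPEAT 1000
      movzx eax, byte ptr [esi]
      add esi, 1
    ENDM
  counter_end
  print str$(eax), " cycles for 1000*movzx, add esi, 1", 13, 10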

You can do a rougher test by allocating a large buffer for the esi memory access and then simply using GetTickCount with a large ncnt.
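
Along these lines, for example (only a sketch - the value of ncnt, the buffer and the variable names are placeholders):

.data
  ncnt      dd 100000000                 ; placeholder: number of byte reads

.data?
  pBuffer   dd ?
  StartTime dd ?

.code
  invoke GlobalAlloc, GMEM_FIXED, ncnt   ; large buffer for the esi access
  mov pBuffer, eax
  invoke GetTickCount
  mov StartTime, eax
  mov esi, pBuffer
  mov ecx, ncnt
@@:
  lodsb
  dec ecx
  jnz @B
  invoke GetTickCount
  sub eax, StartTime                     ; elapsed milliseconds for ncnt reads
  print str$(eax), " ms", 13, 10
  invoke GlobalFree, pBuffer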

dedndave

i am looking for Mark Larson's page - lol
the one with the code optimization tips that Neil was referring to in the first post of the thread
any help ?

Neil

It's at the top right of the page, under Forum Links & Websites :U

dedndave

that page really tells me that i have much to learn - lol
i was fairly proficient at writing fast code for the 8088
in those days, we also went for small code - that is not as important now, and that simple fact alone will really change how i write code
i would say that half of what applied then still applies (per Mark's page)
the other half is totally different - sometimes even the reverse
that is the worst case for my learning curve
having to remember which half each rule falls in is half the battle - lol

Rainstorm

Quote
guess, when i do use LODSB (without the REP prefix), it is a case where speed is not critical
generally speaking, i use it in cases like parsing a command line
still, this is good info - i will have to take a look at Mark's page

I think hutch mentioned this in his help file - that the string instructions (other than the rep forms) are slower.

jj2007

Mark's page is full of really useful hints, but don't forget his advice to time the code. At the top of The Laboratory you will find the code timing macros - extremely useful.

Two minor points re Mark's page:
- Point 6, PSHUFD faster than MOVDQA: On my Celeron M, MOVDQA is a lot faster
- mmx: If you can, go for xmm instead. SSE2 is in general faster, and it avoids trashing the FPU. Not everybody cares about the FPU, but it offers high precision at reasonable speed - and the mmx instructions destroy the contents of the FPU registers, because MM0-MM7 alias the FPU stack. Combined mmx/FPU code is really, really slow (see the sketch below).
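
To illustrate the last point, a small sketch (registers and operands are arbitrary): every block of mmx code needs an emms before the FPU can be used again, and emms is not cheap.

  movq   mm0, qword ptr [esi]    ; any mmx work marks the FPU tag word as in use
  paddb  mm0, qword ptr [edi]
  movq   qword ptr [edi], mm0
  emms                           ; expensive, but required before any FPU code
  fld    real8 ptr [ebx]         ; without the emms, the FPU stack looks full
  fmul   real8 ptr [edx]         ;   and this load would just produce a NaN
  fstp   real8 ptr [ebx]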

Jimg

Precisely my point.  It depends upon how you test it.
If you loop a million times, it's a good test of looping a million times.
If you execute a million copies of the instruction once, it's a good test of the instruction.
What do you want to test, looping or instruction time?
In normal code, you are calling this proc or that proc and none of them are in the cache.  To test instruction timing, you have to test code that is not in the cache, just the way it is executed most of the time.  Looping a million times is just silly.  It's just a test of the size and speed of the cache on a particular machine, not the code.
If you're going to loop, time it properly.  The current timing macros do not.
Time each loop separately. Pick the fastest one.  That's the fastest the code can run.
Or pick the MEDIAN.
If you print out the times for each loop, you will see half a dozen of them that are hundreds or even thousands of times larger than the norm.
Doing an average that includes these - where windows goes off and does its thing - is not in any regard a test of the time it takes any particular piece of code to run.
Doing an average is just silly.
If you want to test real world, duplicate the code many times and time that once.
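
Something like this is what I mean by duplicating the code and timing it once (a rough sketch only - TestBuffer is an arbitrary buffer of at least 1000 bytes, and it assumes the counts fit in the low dword of the time stamp counter):

  mov esi, offset TestBuffer
  rdtsc
  push eax                   ; low dword of the starting time stamp
  REPEAT 1000                ; 1000 copies of the instruction under test
    lodsb
  ENDM
  rdtsc
  pop ecx
  sub eax, ecx               ; cycles for 1000 copies, executed exactly once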

dedndave

i dunno Jim

seems to me that the code that i really want to optimize is that which is inside loops - that means it is in the cache
(provided it is a short enough loop - not some long abortion)
much of the code that is executed once in a program should be written for clarity and small size - not speed

however, if you generate 1000's of copies of the same instruction and execute them, most of them get cached also
you also measure the time required to re-load the cache every so often

i agree that you do not want to time the loop itself
it seems to me a practical method is a compromise
instead of executing the same instruction inside a loop 1000 times
or generating 1000 copies of the instruction, then executing it
make a loop of 100 copies, execute it 10 times, then subtract some agreed-to standard overhead time for the loop
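
something like this, maybe (just a sketch - TestBuffer and LOOP_OVERHEAD are made-up names, and the overhead value would be whatever standard we agree on):

  rdtsc
  push eax                   ; low dword of the starting time stamp
  mov edi, 10                ; 10 passes...
outer10:
  mov esi, offset TestBuffer
  REPEAT 100                 ; ...over 100 copies of the instruction
    lodsb
  ENDM
  dec edi
  jnz outer10
  rdtsc
  pop ecx
  sub eax, ecx               ; raw cycles for 10 x 100 executions
  sub eax, LOOP_OVERHEAD     ; subtract the agreed-to cost of an empty 10-pass loop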

Jimg

Exactly.  If you want to know how fast a particular chunk of code executes, e.g. where it normally loops 100 times, then test the chunk of code looping 100 times.  If you ask "which instruction is fastest?", then test the instruction, not the looping.  And in both cases, don't do it a million times and average in the time windows spends doing housekeeping.
Do what you want to test. If that is normally a loop that executes 36 times, then time how long it takes to do the loop 36 times.
The time it takes is the time it takes.  If you want multiple samples, do the test again, timing it each time.  Pick either the first time (most realistic), the fastest, or the median.  Not the blooming average.

dedndave

also - those measured times you mentioned - the ones taken while windows is doing housecleaning and other programs are executing -
statistically, those data points should be thrown out altogether
the fastest time is not going to be perfect, either
if you have a set of data points like this.....

1  10
2  13
3  11
4  19
5  14
6  12
7  11
8  12
9  17

it is good practice to toss out points 4 and 9, and take the average of the rest
even though you may be measuring other things, it is a good PRACTICAL representation of the real world
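
for the numbers above: tossing the 19 and the 17 leaves 10+13+11+14+12+11+12 = 83, an average of about 11.9, against roughly 13.2 if you average all nine points (the median of all nine happens to be 12)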

Jimg

I don't think the average is ever a good value - it is too prone to windows effects.  If you examine real world values, the median is almost always smaller than the average, unless the code under test always triggers some kind of windows event, in which case timing it is irrelevant.   If you don't want to take the fastest, take the median.  Or throw away the slowest half and average what's left.  Do something to get rid of windows effects.

Jimg

Also, in the real world, (lodsb) is slightly slower than (movzx eax,[esi]/add esi,1); however, also in the real world, and with how the code is usually used, the difference is insignificant because the instructions end up in non-optimal alignment, or are affected by the preceding and following instructions executing simultaneously, or any of several other variables.


dedndave

I understand what you are saying, Jim.
The fastest time may well represent the most accurate measurement of the instruction, itself.

However, the problem is that it may not.
In other words, say we run instructionA and happen to hit its best time.
Then we run instructionB and do not happen to hit its best time.
We have invalidated the comparison of the two.

But if we take an average of the two instructions as mentioned above,
we are likely to obtain more useful comparison information.
Let's face it, we really do not want to know how many nanoseconds each takes,
rather, we want to know which of the two is performing the best.
By averaging a set of values, we take into account that we may or may not
have measured the best performance time of each instruction.

Jimg

Quote
By averaging a set of values, we take into account that we may or may not have measured the best performance time of each instruction.
I agree with everything you've said up to that point.  Averaging will never make a measurement better.
Take a median.  Take a standard deviation.  The average is still way too prone to other effects.  Or, if you must average, average only the lower half of the results.

dedndave

Well, at least we agree to disagree - lol

Truthfully, I doubt there would be much difference between the two methods.
Although you may see a fixed offset between them, the
comparison ratios of two instructions would be very nearly the same,
assuming we tossed out the same data points.

i.e. Even though...
my method might say 100 cycles for instructionA and 120 cycles for instructionB (20% change)
your method might yield 90 and 108 (20% change)