News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Faster loop?

Started by Eddy, May 24, 2005, 08:32:14 PM

Previous topic - Next topic

MichaelW

Quote from: Mark Jones on May 30, 2005, 05:01:17 PM
If it were possible to time single instructions with Michael's timing macros, I'd code an app to time every instruction. Then we could see timings for every processor.

On the old forum Mark Larson suggested a project like this, but it never really went anywhere. It is possible to derive a repeatable cycle count for a single instruction, but if the goal were to determine the minimum cycle count, it would be necessary to create optimal execution conditions for the instruction. I think for some instructions this would be no problem, but for others it would be difficult. If I recall correctly, Mark wanted to automate the process as much as possible, but I could not envision any workable method of doing that. It seemed to me that the task was just going to require a lot of coding and testing. But that was before any (to my knowledge) easy-to-use timing macros were available :bg
eschew obfuscation

Mark Jones

Hmm indeed. I was thinking of three aspects:

1. Use a lookup table (probably by macro) to get the next instruction(s) to test. This way a lot of redundant code could be mitigated. The table could be ordered from 386 --> 786 so that all available instructions for the current processor would be tested.

2. Loop each single instruction 100 times and subtract the accrued loop delay from the result. The various processors might have different loop timings, so the compensation value might be CPU-dependent. I tried timing a NOP (then a series of NOPs) 10M times and sometimes would get 0 clocks, or even -2. Which might be right or wrong but leads us to:

3. Better results if all CPU cache, look-ahead features, and optimizations can be turned off. IF they can be turned off, or otherwise mitigated. (Cache flush routine?)

But I agree, the semantics would be the most difficult. How would variable-length instructions be tested for instance? With a fast random number generator?

It's an intriguing idea. :)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

MichaelW

Quote from: Mark Jones on May 31, 2005, 01:19:42 PM
3. Better results if all CPU cache, look-ahead features, and optimizations can be turned off. IF they can be turned off, or otherwise mitigated. (Cache flush routine?)

I agree that this would make it easier to get consistent results, but the results would not reflect the real world where processors are used with all of these performance features enabled. I think an optimal cycle count would be more useful, because it would allow you to readily judge the level of optimization for an instruction sequence by comparing the sum of the optimal cycle counts for the sequence to the actual cycle count.
eschew obfuscation

Ian_B

Quote from: MazeGen on May 27, 2005, 03:12:06 PM

Quote from: Eddy on May 27, 2005, 02:51:48 PM
I can't really think of a way (other than by creating a lot of overhead) to avoid the ADC.

BTW, if you know how to measure the code speed, you can try the following modification. It may seem strange, but I remember that somebody said it is much quicker on P4. I don't have the proof, though.


    Add1:
    ! mov  edx, [eax+ecx*4] 'get DWORD of p1
    ! jnc over
    ! add  edx, 1 ' or inc edx
    over:
    ! add  edx, [ebx+ecx*4] 'add DWORD of p2
    ! mov  [esi+ecx*4], edx 'store result in pr
    ! inc  ecx
    ! jnz Add1


I know someone already showed that this was slower, but this may be because (depending on the range of data you are generally adding) the jump taken over the addition of the carry will more often be mispredicted. There is some of my code doing something similar on the old forum, and I worked out that for me it was faster to add all the possible carries first, then jump over code to remove them if they weren't actually needed.

MazeGen

Yes, your code can be easily found on the old forum.

For completeness' sake, I was reffering to this topic.