Fastest Absolute Function

Started by Twister, August 23, 2010, 09:31:34 PM

jj2007

Quote from: Rockoon on August 26, 2010, 08:10:19 PM
code was indeed pairing up with CPUID on my AMD.

How would that affect/not affect the reference loop? Would we need instructions that do pair up with CPUID in the ref loop?

Rockoon

Quote from: jj2007 on August 26, 2010, 09:17:06 PM
Quote from: Rockoon on August 26, 2010, 08:10:19 PM
code was indeed pairing up with CPUID on my AMD.

How would that affect/not affect the reference loop? Would we need instructions that do pair up with CPUID in the ref loop?

There is no reference loop. It's a straight...

CPUID
RDTSC
nothing here
CPUID
RDTSC

...sequence

Essentially, it's only measuring the time it takes to execute the timing overhead.

The out-of-order execution prevention probably happens in the pipeline itself, by not pushing u-ops forward into the scheduler, but unless both sides of CPUID are fenced, CPUID's u-ops can get paired with the instructions before or after it.
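
For reference, one commonly cited way to fence both sides of the timed region (e.g. in Intel's code-benchmarking whitepaper) is a CPUID/RDTSC pair at the start and an RDTSCP/CPUID pair at the end. The sketch below is illustrative rather than a drop-in replacement for the macros discussed in this thread: start_lo/start_hi/end_lo/end_hi are hypothetical DWORD variables, RDTSCP needs a CPU and assembler that support it (older MASM versions can emit the raw bytes shown in the comment), and RDTSCP additionally clobbers ECX.

        ; start of timed region
        xor   eax, eax
        cpuid                     ; fence: wait for everything already in flight
        rdtsc                     ; read start count
        mov   start_hi, edx
        mov   start_lo, eax

        ; ... code under test ...

        ; end of timed region
        rdtscp                    ; waits for the code above to finish, then reads TSC
                                  ; (db 0Fh, 01h, 0F9h if the assembler lacks the mnemonic)
        mov   end_hi, edx
        mov   end_lo, eax
        xor   eax, eax
        cpuid                     ; fence: keep later instructions out of the timed region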

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

dedndave

so um - how would you suggest we measure the reference time?

MichaelW

Quote from: Rockoon on August 26, 2010, 11:40:15 PM
There is no reference loop. It's a straight...

CPUID
RDTSC
nothing here
CPUID
RDTSC

...sequence

Essentially, it's only measuring the time it takes to execute the timing overhead.

If you are speaking of the counter_begin code, there is a reference loop:

        xor   eax, eax        ;; Use same CPUID input value for each call
        cpuid                 ;; Flush pipe & wait for pending ops to finish
        rdtsc                 ;; Read Time Stamp Counter

        push  edx             ;; Preserve high-order 32 bits of start count
        push  eax             ;; Preserve low-order 32 bits of start count
        mov   __counter__loop__counter__, loopcount
        xor   eax, eax
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      @@:                     ;; Start an empty reference loop
        sub   __counter__loop__counter__, 1
        jnz   @B

        xor   eax, eax
        cpuid                 ;; Make sure loop instructions finish
        rdtsc                 ;; Read end count
        . . .


The pairing problem has been discussed before. The cycle count for CPUID varies with the input value, so for consistency it's necessary to use the same value for each execution. Even if CPUID cannot pair with the surrounding instructions, the instruction used to set the input value can. The bottom line is that these macros cannot reasonably be used to time code that executes in only a few cycles.
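
For context, a minimal sketch of how these macros are normally used. It assumes MichaelW's timers.asm (which defines counter_begin/counter_end) sits in the build directory alongside the MASM32 SDK, and the branchless abs sequence is just a stand-in for whatever is being timed:

    include \masm32\include\masm32rt.inc   ; MASM32 runtime includes, libs and macros
    include timers.asm                     ; MichaelW's counter_begin / counter_end macros

    .code
start:
    counter_begin 1000000, HIGH_PRIORITY_CLASS
        mov eax, -12345                    ; code under test: branchless absolute value
        cdq
        xor eax, edx
        sub eax, edx
    counter_end                            ; average cycle count is returned in EAX
    print ustr$(eax), " cycles", 13, 10
    inkey
    exit
end start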
eschew obfuscation

dedndave

i suppose you could run 1 iteration of the test code for a reference measurement (or 2, perhaps)
then run 10001 iterations (10000 requested, for example) to run the test
then take the difference in times
and divide by 10000 to find the avg test time

but, i am not sure that will solve the granularity issue with the CPUID instruction
it takes, let's say, 80 clock cycles and can be +/- a few cycles
that means that you may not be able to measure times with more resolution than a few cycles

the problem seems to be that CPUID is a serializing instruction used to force in-order execution of RDTSC
that mechanism, alone, is going to wobble a few cycles - lol
Michael has tried using other serializing instructions, and has found CPUID to yield the best results
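
A minimal sketch of that subtract-and-divide idea, using the same CPUID/RDTSC framing as the macro above. Everything here is illustrative: TestCode is a hypothetical procedure standing in for the code under test (assumed to preserve registers and the stack), ref_count/test_count/loop_ctr are hypothetical DWORD variables, and only the low 32 bits of the counts are kept on the assumption that the intervals are short. Both passes run the identical loop, so the subtraction cancels the shared overhead and leaves exactly 10000 iterations.

        ; --- reference pass: timing overhead plus 1 iteration ---
        xor   eax, eax
        cpuid                     ; flush the pipe, wait for pending ops
        rdtsc
        push  eax                 ; keep low dword of start count
        mov   loop_ctr, 1
      @@:
        call  TestCode            ; hypothetical procedure under test
        sub   loop_ctr, 1
        jnz   @B
        xor   eax, eax
        cpuid
        rdtsc
        pop   ecx
        sub   eax, ecx
        mov   ref_count, eax      ; overhead + 1 iteration

        ; --- measurement pass: same code, 10001 iterations ---
        xor   eax, eax
        cpuid
        rdtsc
        push  eax
        mov   loop_ctr, 10001
      @@:
        call  TestCode
        sub   loop_ctr, 1
        jnz   @B
        xor   eax, eax
        cpuid
        rdtsc
        pop   ecx
        sub   eax, ecx
        mov   test_count, eax     ; overhead + 10001 iterations

        ; --- (test - reference) / 10000 = average cycles per iteration ---
        mov   eax, test_count
        sub   eax, ref_count
        xor   edx, edx
        mov   ecx, 10000
        div   ecx                 ; EAX = average cycles for one iteration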

jj2007

Quote from: Rockoon on August 26, 2010, 11:40:15 PM
There is no reference loop. It's a straight...
...
Essentially, it's only measuring the time it takes to execute the timing overhead.

The out-of-order execution prevention probably happens in the pipeline itself, by not pushing u-ops forward into the scheduler, but unless both sides of CPUID are fenced, CPUID's u-ops can get paired with the instructions before or after it.

Right. I had not looked at it for ages.
Usually, for suspiciously short timings, a REPEAT 100 inside the counter_ structure helps to get more reasonable values:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1769    cycles for 100*AbsLingo, pos
741     cycles for 100*AbsJJp proc, pos
304     cycles for mov arg, 100*AbsJJ(arg), pos
248     cycles for 100*SetAbsJJ arg, pos

8       cycles for AbsLingo, neg
4       cycles for AbsJJp proc, neg
0       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg
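
As an illustration of that REPEAT trick, the timing frame from the earlier sketch can be unrolled like this (again assuming MichaelW's counter_begin/counter_end macros; the branchless abs is only a placeholder for the code under test):

    counter_begin 1000, HIGH_PRIORITY_CLASS
        REPEAT 100                ; assembles 100 copies of the code under test
            mov eax, -12345       ; placeholder: branchless absolute value
            cdq
            xor eax, edx
            sub eax, edx
        ENDM
    counter_end                   ; EAX = cycles for 100 copies; divide by 100

Spreading the fixed overhead and its few-cycle jitter over 100 copies is what turns the very short timings above into something readable, at the cost of measuring throughput more than single-copy latency.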

lingo

"The thieves are always liars"[/U]
Bla..bla..bla. As you saw your test program is proven crap, hence your results too... :lol

dioxin

Michael,
from the timing macros:
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      @@:                     ;; Start an empty reference loop

..
..
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      label:                  ;; Start test loop


Does this not leave room for a few clks of error, because the CPUID finishes and then executes an unknown number of padding NOPs to reach the alignment for the next instruction? The amount of padding varies from 0 to 15 bytes for an alignment of 16, so the reference loop and the real loop may well have different padding included in the timing. As far as I can see there is no attempt to keep the padding identical for the reference loop and the timing loop.

Paul.

MichaelW

Hi Paul,

Yes it does, or at least probably does. Assembled with ML 6.15 there are 45 bytes between the labels, when it should be some multiple of 16. The alignment padding will consist of 1 or 2 relatively fast instructions, so I would think the error would be at most 2 cycles. After 24 hours on my feet I'm not thinking too clearly, but I think it should be possible to determine the distance at assembly time and control it without polluting either loop by adding an appropriate number of nops after the RDTSC following the reference loop.

Does the alignment of the labels even matter for anything after the P3?
eschew obfuscation

hutch--

How about something like this?


    align 16

    jmp lbl0    ; 0EB0Ch - short jump over the 12 nops

    nop         ; 12 nops
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop

  lbl0:

    rdtsc       ; 0F31h

  lbl_16_BYTE_align:
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

MichaelW

What I had in mind was to modify the counter_begin macro such that the distance between the loop labels would be a fixed multiple of 16 bytes, regardless of the MASM (or JWASM) version used to assemble the code. Doing it cleanly would require the distance to a forward label, and so far I have not been able to make this work. So for a quick test I changed the macro to display the number of padding bytes, so they can be inserted manually. Running on a P3, which generally produces very consistent cycle counts, in the short time I had to play with it I could not see any significant difference.

Edit:

I corrected an error in my modified macro, and moved everything into the attachment. I tested the original macro against the modified macro, at random alignments, and to maximize the consistency I used REALTIME_PRIORITY_CLASS and avoided accessing the console during the test. I used an empty timing loop to make any differences in cycle counts between the reference and timing loops obvious. Running on a P3 I cannot detect any significant differences with the empty timing loop, and only very small differences with one or a few instructions in the loop. Depending on your assembler, you may need to change the number of padding nops for the reference and/or timing loops.
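
As an illustration of the display-the-padding idea, something along these lines can be added to the macro once both loop labels have been defined. lbl_ref and lbl_test are placeholders for the macro's internal labels, the %echo/@CatStr idiom is just one way to print an assembly-time value, and the count reported is the number of single-byte nops to insert after the RDTSC that follows the reference loop:

        ;; after both loop labels exist, report how far apart they are
        distance = lbl_test - lbl_ref          ;; bytes from reference-loop label to timed-loop label
        IF distance MOD 16
            ;; not a multiple of 16: report how many nops would equalise the padding
            % echo @CatStr( <padding nops needed: >, %(16 - (distance MOD 16)) )
        ENDIF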
eschew obfuscation

Rockoon

To cover the specifics of modern chips, you really want 64-byte alignment to deal with the largest cache lines in use today.

This can be important because, even on full cache hits, most modern CPUs take an extra cycle to decode instructions that straddle a cache-line boundary. The decoder can only see one cache line at a time: its "window" onto memory is a cache line, and even with full cache hits it takes a cycle for the decoder to switch to the next window.

The simple solution is to stop trying to time such short snippets of code; frankly, they are too short to give useful results even if you could time them perfectly. In code that short, throughput is usually more important than latency.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.