The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: agner on August 14, 2006, 07:40:58 AM

Title: Optimization manual updated again. Now covers Core 2
Post by: agner on August 14, 2006, 07:40:58 AM
I have updated my manual once again. Now covering everything about the new Intel Core 2 processor including a detailed study of the pipeline and execution units and complete lists of instruction timings.

This time my manual has come before the official manuals from Intel. Their software manuals for the Core 2 are not out yet. Thank you to a friendly person who gave me remote access to a prerelease sample of the Core 2. This enabled me to test almost everything.

The execution core is more powerful than anything we have seen until now. It can do up to three full 128-bit vector calculations per clock cycle. Unfortunately, the instruction fetch and predecode stage has not been expanded enough to keep up with the rest of the pipeline, so this is a serious bottleneck in many situations.

The section on AMD microarchitecture in my manual has also been revised, thanks to help from Andreas Kaiser.

http://www.agner.org/optimize/
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: MazeGen on August 17, 2006, 06:59:59 AM
Your work is really unique. Thanks for your hard work, agner :U
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: Ghirai on August 17, 2006, 12:58:43 PM
Very nice, thanks for your work.

Would it be okay to add the asm manual to my mirror (link in my signature)?
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: drhowarddrfine on August 17, 2006, 01:28:38 PM
Thank you very much, Agner! :U
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: BogdanOntanu on August 17, 2006, 04:37:46 PM
Thank you very much ;)
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: Vortex on August 17, 2006, 05:08:00 PM
Agner,

Very nice work :U
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: stanhebben on October 02, 2006, 08:12:24 PM
I found a little error in the optimization manual:

On page 74, you state that the only way to load unaligned data into XMM registers is to use MOVDQU, MOVUPS, MOVUPD or LDDQU. But you can also use MOVLPD/MOVHPD pairs, which results in faster code.

Stan
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: gabor on October 03, 2006, 09:28:59 AM
Hello agner!

Thanks for your work! It is very nice to see people who do a lot of work in a specific area and are ready to share the gained knowledge! Thank you again!
I had a look at your website. There are interesting topics! This Cultural Selection Theory looks very exciting....

Greets, Gábor
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: gwapo on October 04, 2006, 02:27:21 AM
Great work, thanks!
I thought Mark Larson's "Code Optimization" (http://www.mark.masmcode.com/) was already the best optimization document I had. I guess multiple "best" documents are always better than having a single "best" document  :U

-chris
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: agner on October 04, 2006, 07:02:30 AM
Your link to Mark Larson doesn't work. Where is the document you are referring to?
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: PBrennick on October 04, 2006, 10:11:07 AM
Agner,
The link works just fine.  Please try it again.  The server may have been down when you last tried.

Paul
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: KSS on April 07, 2010, 08:35:29 AM
Agner, thanks you for your work! :U
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: theunknownguy on April 14, 2010, 06:10:12 AM
Agner you rulz ! thanks so much my favorite read for optimization  :dance:
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: dedndave on April 14, 2010, 10:25:23 AM
you guys have woken a 4 year old thread   :P
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: clive on April 14, 2010, 02:30:58 PM
Quote from: dedndave
you guys have woken a 4 year old thread   :P

Can we dig up the zombie P4 designers that removed the barrel shifter, and beat them to death again?

-Clive
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: jj2007 on April 14, 2010, 03:20:39 PM
Quote from: clive on April 14, 2010, 02:30:58 PM
Can we dig up the zombie P4 designers that removed the barrel shifter, and beat them to death again?

Try imul instead, it's blazingly fast on modern CPUs.
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: clive on April 14, 2010, 04:33:28 PM
Quote from: jj2007
Try imul instead, it's blazingly fast on modern CPUs.

I'm familiar with the math. I'd be willing to bet the latency is still significantly worse than a barrel shifter's. On the P4 it was 14 cycles as I recall, and Intel was recommending ADD/LEA instruction combinations. The latency is still 3-5 cycles; a barrel shifter should be 1.

IMUL doesn't work well with right shifts, rotates, rotates through carry, using/setting carry. IDIV will eat 3 registers, has similar carry and rotate issues, and even more hideous latency.

The Core and Atom have significantly better shifting performance, AMD has always been good, and some of the P4 issues were addressed in Prescott. Still, the 64-bit right shifts are dogs in the entire P4 family. The P4 Willamette was a totally awful implementation, and Northwood wasn't much better, which is why I recommended beating them around the head. Whoosh!

-Clive
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: jj2007 on April 14, 2010, 04:56:21 PM
Quote from: clive on April 14, 2010, 04:33:28 PM
Quote from: jj2007
Try imul instead, it's blazingly fast on modern CPUs.

I'm familiar with the math. I'd be willing to bet the latency is still significantly worse than a barrel shifter.

We aren't into betting and guessing here, Clive. Celeron M:

145     cycles for 100*imul 8
95      cycles for 100*shl 3
145     cycles for 100*imul 10
149     cycles for 100*lea mul10

Title: Re: Optimization manual updated again. Now covers Core 2
Post by: clive on April 14, 2010, 05:38:51 PM
Quote from: jj2007
We aren't into betting and guessing here, Clive. Celeron M:

145     cycles for 100*imul 8
95      cycles for 100*shl 3
145     cycles for 100*imul 10
149     cycles for 100*lea mul10


Then do me a favour and measure the latency. I bet, because it beats wasting a lot of time benchmarking crap when I understand how to build barrel shifters and multipliers in hardware. I'd rather bet and get the right/better results, than measure and get the wrong one.

Presler (Pentium D)


96      cycles for 100*imul 8
81      cycles for 100*shl 3
98      cycles for 100*imul 10
167     cycles for 100*lea mul10


10 Tests, x30000
IMUL eax, 2
186555 Ticks
186555 Ticks
186555 Ticks
186585 Ticks
186555 Ticks
186555 Ticks
186570 Ticks
186555 Ticks
186555 Ticks
243900 Ticks
186555 Ticks (min), 6.218500

SHL eax, 1
37065 Ticks
37050 Ticks
37170 Ticks
37065 Ticks
37050 Ticks
37035 Ticks
37035 Ticks
37035 Ticks
37035 Ticks
37020 Ticks
37020 Ticks (min), 1.234000

SHL eax, 8
68985 Ticks
31770 Ticks
31695 Ticks
31650 Ticks
31830 Ticks
31635 Ticks
31740 Ticks
31890 Ticks
31635 Ticks
31680 Ticks
31635 Ticks (min), 1.054500


Your test suggests that IMUL is a single cycle, when I'm pretty damn sure it has a 6-cycle latency (i.e. resources actually consumed, time to get a result). On a Prescott it's 5, compared to ~1 for shifts; the divide clocks in at 79 cycles. Blazingly fast? I'm not convinced.

-Clive
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: jj2007 on April 14, 2010, 08:01:31 PM
Quote from: clive on April 14, 2010, 05:38:51 PM
Then do me a favour and measure the latency.

Version 2 - I added a mov esi, eax behind each multiplication to create a dependency chain. If you have a different definition of latency, post code please.

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
mov eax, 1000
imul eax, eax, 8
mov esi, eax
ENDM
counter_end


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
173     cycles for 100*imul 8
287     cycles for 100*mul 8
149     cycles for 100*shl 3
157     cycles for 100*shl 1
173     cycles for 100*imul 10
288     cycles for 100*mul 10
218     cycles for 100*lea mul10

172     ticks for 100*imul 8
287     ticks for 100*mul 8
150     ticks for 100*shl 3
157     ticks for 100*shl 1
172     ticks for 100*imul 10
287     ticks for 100*mul 10
218     ticks for 100*lea mul10
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: clive on April 14, 2010, 08:37:56 PM
That's hardly a dependency chain: you throw away the result and light off another IMUL into the pipeline.

This should give you a measurement of latency, rather than dispatch/throughput. Pretty much zero effort applied to aligning or optimizing; designed to give ballpark numbers.

-Clive

DWORD imultest(void)
{
  DWORD ticks;

  _asm
  {
    rdtsc
    push eax

    mov ecx,1000

    mov eax,1

loop_mul:

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    dec ecx
    jnz loop_mul

    rdtsc
    pop ecx
    sub eax,ecx
    mov ticks,eax
  }

  return(ticks);
}


DWORD idivtest(void)
{
  DWORD ticks;

  _asm
  {
    rdtsc
    push eax

    mov ecx,1000

    mov eax,1
    xor edx,edx

loop_div:

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    dec ecx
    jnz loop_div

    rdtsc
    pop ecx
    sub eax,ecx
    mov ticks,eax
  }

  return(ticks);
}
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: jj2007 on April 14, 2010, 08:48:25 PM
Quote from: clive on April 14, 2010, 08:37:56 PM
That's hardly a dependency chain, you throw away the result, and light off another IMUL into the pipeline.

Interesting. So the processor "knows" that I don't really need the result when I put mov esi, eax there?
Can you please post your code in assembler, so that everybody can test it? Thanks.
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: clive on April 14, 2010, 09:10:54 PM
The processor uses register renaming and out-of-order execution. The value moved to ESI is simply retired when the IMUL completes; since no one cares about the value, nothing holds up the processor when that happens. Loading EAX breaks the chain for the next IMUL. Using ADD ESI,EAX would have more consequence.

-Clive

Here in MASM assembler.

        .586

        .MODEL FLAT, C

        .CODE

        ; ml -c -Fl imultest.asm

        ; Time 30000 instructions, returns estimated clock ticks in EAX (divide by 30000.0 to get estimated clocks)
        ; The looping code accounts for 3%, if that, of the total

imultest PROC

    rdtsc
    push eax

    mov ecx,1000

    mov eax,1

    ALIGN 16

loop_mul:

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    dec ecx
    jnz loop_mul

    rdtsc
    pop ecx
    sub eax,ecx

    ret

imultest ENDP

idivtest PROC

    rdtsc
    push eax

    mov ecx,1000

    mov eax,1
    xor edx,edx

    ALIGN 16

loop_div:

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    dec ecx
    jnz loop_div

    rdtsc
    pop ecx
    sub eax,ecx

    ret

idivtest ENDP

        END
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: clive on April 14, 2010, 09:42:31 PM
In your HLA

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1; eax = eax * 1, EAX remains one, but the processor doesn't know this
REPEAT 100 ; I'd probably do more, to make the dependency chain maximal
imul eax, eax, 1
ENDM
counter_end


counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1 ; clever constant selection eax = edx:eax / eax, edx = edx:eax % eax. EDX remains zero, EAX remains one, but the processor doesn't know this
xor edx,edx
REPEAT 100
idiv eax
ENDM
counter_end


I use C as my framework rather than just calling CRT routines. I'll post C source and executables if they're helpful, but this should be a simple cut-n-paste exercise. Basically I'm trying to refute that IMUL is a single-cycle instruction.

-Clive

Edit: Attached C/EXE

Prescott P4 instruction latency
Empty
      1612 Ticks (min),  0.054 cycles

IMUL eax, 1
    298478 Ticks (min),  9.949 cycles

IDIV eax, 1
   2398350 Ticks (min), 79.945 cycles

SHL eax, 1 / SHR eax, 1
     28448 Ticks (min),  0.948 cycles

SHL eax, 8 / SHR eax, 8
     28448 Ticks (min),  0.948 cycles

XCHG eax, [mem]
   2752365 Ticks (min), 91.746 cycles


Presler P4
Empty
      2025 Ticks (min),  0.068 cycles

IMUL eax, 1
    373095 Ticks (min), 12.437 cycles

IDIV eax, 1
   2397930 Ticks (min), 79.931 cycles

SHL eax, 1 / SHR eax, 1
     28035 Ticks (min),  0.935 cycles

SHL eax, 8 / SHR eax, 8
     28035 Ticks (min),  0.935 cycles

XCHG eax, [mem]
   2744310 Ticks (min), 91.477 cycles


Atom
Empty
      4056 Ticks (min),  0.135 cycles

IMUL eax, 1
    145992 Ticks (min),  4.866 cycles

IDIV eax, 1
   1828008 Ticks (min), 60.934 cycles

SHL eax, 1 / SHR eax, 1
     25992 Ticks (min),  0.866 cycles

SHL eax, 8 / SHR eax, 8
     27888 Ticks (min),  0.930 cycles

XCHG eax, [mem]
    177120 Ticks (min),  5.904 cycles


I don't have a Willamette P4 to hand, but the IMUL should have latency ~14, and the SHR/SHL 8 probably ~8. I'll dig one up at the weekend.
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: jj2007 on April 15, 2010, 08:05:13 AM
Quote from: clive on April 14, 2010, 09:10:54 PMThe value moved to ESI is just retired when the IMUL completes. As no one cares about the value, when this happens nothing is holding up the processor. Loading EAX breaks the chain for the next IMUL. Using ADD ESI,EAX would have more consequence.

There is no evidence that ADD ESI,EAX has "more consequence": On my Celeron, timings are identical (taking into account the extra half cycle of add vs mov); on my P4, the ADD ESI,EAX version is even faster.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
48      cycles for 100*mov esi
95      cycles for 100*add esi
172     cycles for 100*imul 8, mov esi
223     cycles for 100*imul 8, add esi

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
45      cycles for 100*mov esi
96      cycles for 100*add esi
223     cycles for 100*imul 8, mov esi
211     cycles for 100*imul 8, add esi
282     cycles for 100*mul 8
150     cycles for 100*shl 3
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: hutch-- on April 15, 2010, 09:47:47 AM
The comments on PIVs sound right. I had PIVs from an early 1.5 to the last 3.8, over 3 core types, and they were all quirky in comparison to earlier and later processors. LEA slowed down by a long way, which impacted fixed multiplications, and this more or less matched the Intel PIV data that recommended using ADD instead up to about 3 before LEA was faster. The Core quad has a fast LEA, but one of the few things the last PIV was faster with was shifts and rotates. The late Prescott PIVs had a long pipeline that made stalls very expensive; the Northwood core was shorter and recovered from stalls faster.

The Core2 quad is much more like a PIII in its preferred style of code, just faster. I have only just got the new Nehalem going, so I have yet to do any serious timing on it, but a quick play earlier today says it has different characteristics to the Core series. There appears to be a shift to SSE instructions in the Core series, as comparisons to the PIVs I have showed much bigger improvements with SSE code on the Quad than on any of the PIVs. I would imagine that the finer tracking in the very recent Intel cores allows a bit more room for hardcoded instructions, and it appears that the extra space is going into SSE instructions rather than the older integer instructions.

The main thing I have learnt about core design is that I don't really want to know: each core is substantially different, tends to be RISC under the hood, and depending on the core model has a different preferred instruction set which follows from the RISC primitives below the published interface.
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: clive on April 15, 2010, 04:06:25 PM
Prescott P4
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
232     cycles for 100*imul 8
312     cycles for 100*mul 8
138     cycles for 100*shl 3
149     cycles for 100*shl 1
255     cycles for 100*imul 10
303     cycles for 100*mul 10
249     cycles for 100*lea mul10
85      cycles for 100*or
1030    cycles for 100*imul 1 - dependent
1242    cycles for 100*mul 1 - dependent
8172    cycles for 100*idiv 1 - dependent
101     cycles for 100*or - dependent
102     cycles for 100*rol 1 - dependent
104     cycles for 100*rol 8 - dependent

118     ticks for 100*imul 8
160     ticks for 100*mul 8
78      ticks for 100*shl 3
72      ticks for 100*shl 1
120     ticks for 100*imul 10
162     ticks for 100*mul 10
136     ticks for 100*lea mul10
45      ticks for 100*or
540     ticks for 100*imul 1 - dependent
654     ticks for 100*mul 1 - dependent
4338    ticks for 100*idiv 1 - dependent
53      ticks for 100*or - dependent
54      ticks for 100*rol 1 - dependent
55      ticks for 100*rol 8 - dependent


NewCastle
AMD Athlon(tm) 64 Processor 3200+ (SSE2)
100     cycles for 100*imul 8
200     cycles for 100*mul 8
97      cycles for 100*shl 3
97      cycles for 100*shl 1
99      cycles for 100*imul 10
198     cycles for 100*mul 10
140     cycles for 100*lea mul10
62      cycles for 100*or
301     cycles for 100*imul 1 - dependent
300     cycles for 100*mul 1 - dependent
4051    cycles for 100*idiv 1 - dependent
96      cycles for 100*or - dependent
96      cycles for 100*rol 1 - dependent
97      cycles for 100*rol 8 - dependent

80      ticks for 100*imul 8
160     ticks for 100*mul 8
78      ticks for 100*shl 3
78      ticks for 100*shl 1
80      ticks for 100*imul 10
159     ticks for 100*mul 10
112     ticks for 100*lea mul10
47      ticks for 100*or
243     ticks for 100*imul 1 - dependent
238     ticks for 100*mul 1 - dependent
3272    ticks for 100*idiv 1 - dependent
75      ticks for 100*or - dependent
75      ticks for 100*rol 1 - dependent
79      ticks for 100*rol 8 - dependent
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: jj2007 on April 15, 2010, 11:04:29 PM
Quote from: clive on April 14, 2010, 04:33:28 PM
Quote from: jj2007
Try imul instead, it's blazingly fast on modern CPUs.

I'm familiar with the math. I'd be willing to bet the latency is still significantly worse than a barrel shifter. On the P4 it was 14 cycles as I recall, and Intel was recommending ADD/LEA instruction combination. The latency is still 3-5 cycles,

Quote from: clive on April 14, 2010, 05:38:51 PM
Then do me a favour and measure the latency. I bet, because it beats wasting a lot of time benchmarking crap when I understand how to build barrel shifters and multipliers in hardware. I'd rather bet and get the right/better results, than measure and get the wrong one.

Quote from: clive on April 15, 2010, 04:06:25 PM
Prescott P4
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
138     cycles for 100*shl 3
255     cycles for 100*imul 10
303     cycles for 100*mul 10
249     cycles for 100*lea mul10
1030    cycles for 100*imul 1 - dependent
1242    cycles for 100*mul 1 - dependent
...
AMD Athlon(tm) 64 Processor 3200+ (SSE2)
99      cycles for 100*imul 10
301     cycles for 100*imul 1 - dependent


I see we are converging a little bit - imul latency is down from over 14 to 10 for a P4 and 3 for the NewCastle. Good.

Now the interesting question is perhaps whether latency is more relevant than throughput. Of course, since you are programming in C, you depend very much on what your compiler knows about it. Assembler coders can optimise their code.
Imagine this (rather useless) sequence:

Quote   mov eax, esi
   REPEAT 100
      imul eax, eax, 10
   ENDM
   mov esi, eax

Overall speed depends, as you rightly assume, on latency: the next imul can only start when the previous one has finished.
However, nobody in Masmland stops you from doing other things in parallel:

Quote   mov eax, esi
   REPEAT 100
      imul eax, eax, 10
      mov edi, ebx
      mov ebx, esi
      inc ecx
      dec edx
      dec ebx
   ENDM
   mov esi, eax

And, surprise surprise, both sequences run in four cycles on my slow cheap Celeron M - and imul is the best overall choice...:

QuoteIntel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
150     cycles for 100*shl 3
410     cycles for 100*shl 3 with five extra instructions
173     cycles for 100*imul 10
287     cycles for 100*mul 10
219     cycles for 100*lea mul10
471     cycles for 100*lea mul10 with five extra instructions
397     cycles for 100*imul 10 - dependent
410     cycles for 100*imul 10 - dependent with five extra instructions
496     cycles for 100*mul eax - dependent
576     cycles for 100*mul eax - dependent with five extra instructions
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: hutch-- on April 15, 2010, 11:42:09 PM
The problem with MUL / IMUL is with blended model programming: everything from the i486 up to the late PIVs was expensive with multiplication operations. Later cores are supposed to handle muls differently, but if you want code that runs OK on most things you need to address how it works on PIVs and below. One of the few blessings of 64-bit OS versions is that the hardware must be late enough to run them, which means a reasonably late model processor, so you are safe with many later, faster instructions.

Code you write with some certainty about the lowest processor that will run it is a lot easier to write and should be faster than code that must run on everything.
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: clive on April 16, 2010, 02:56:25 AM
Quote from: jj2007
I see we are converging a little bit - imul latency is down from over 14 to 10 for a P4 and 3 for the NewCastle. Good.

Yes, and the 14 is a Willamette (1st Pentium 4, and source of much derision) number, from the notes I have. I'll demonstrate if you insist. The box is at the out-laws, so not readily accessible, I could get P3 and PPro numbers more readily.

This paper has some good analysis
http://gmplib.org/~tege/x86-timing.pdf

"When compiling with /G7, we produce the faster (but longer) sequence that avoids using the imul instruction, which has a 14-cycle latency on the Intel Pentium 4:"
http://msdn.microsoft.com/en-us/library/aa290055%28VS.71%29.aspx

The problem with optimizing for one architecture is that the installed base is so divergent that, unless the application warrants multiple code paths and the hideous amount of regression testing required to exercise all of them, one generally focuses on a reasonable compromise. It makes sense for things like folding, Seti, encryption cracking, etc., but I wouldn't classify these as commercial applications.

-Clive
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: clive on April 16, 2010, 04:39:50 AM
Quote from: jj2007
Imagine this (rather useless) sequence:

mov eax, esi
REPEAT 100
imul eax, eax, 10
ENDM
mov esi, eax

You do understand that your whole REPEAT 100 construct is totally contrived, right? Unless the fragment you test is self-contained and dependent, all you end up doing is slip-streaming the whole repeated sequence into the pipeline, and the timing of the actual sequence gets lost. You'd get a better real-world timing using a single iteration. In any situation where your timing suggests the execution speed of a sequence containing an IMUL is faster than the demonstrated speed of the instruction, you either a) do not have a dependency on the result, or b) are not getting an accurate timing.

The idea of the 100 iterations is to push the looping instruction times into the ~1% of the total measurement, but unless the fragments are carefully crafted this is not what is happening.

I code in C because it's convenient; I've been coding in about a dozen assembly languages for as long as you have.

QuoteOverall speed depends, as you rightly assume, on latency

I haven't assumed, I know this, and don't need to benchmark everything to recognize when it might occur. Many algorithms in C and ASM can be discarded with static analysis.

-Clive
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: jj2007 on April 16, 2010, 09:57:36 AM
Quote from: clive on April 16, 2010, 02:56:25 AM
This paper has some good analysis
http://gmplib.org/~tege/x86-timing.pdf
And it confirms that for a modern CPU, e.g. a Core 2, the latency is 3 cycles, while throughput is 1 cycle for imul.

Quote
"When compiling with /G7, we produce the faster (but longer) sequence that avoids using the imul instruction, which has a 14-cycle latency on the Intel Pentium 4:"
http://msdn.microsoft.com/en-us/library/aa290055%28VS.71%29.aspx
That was written 8 years ago...

Quote from: clive on April 16, 2010, 04:39:50 AM

QuoteOverall speed depends, as you rightly assume, on latency

I haven't assumed, I know this, and don't need to benchmark everything

You quote rather selectively. My "overall" related to your isolated 100*imul eax, eax, 1 example, and the text continues as follows:

QuoteOverall speed depends, as you rightly assume, on latency: the next imul can only start when the previous one has finished.
However, nobody in Masmland stops you from doing other things in parallel

Which implies that latency is pretty irrelevant in an optimised Masm app. Throughput is the interesting variable.
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: hutch-- on April 16, 2010, 12:07:39 PM
Pick your Intel core, pick a result.



PIV 3.8 gig.

220
220
220
220
328 MS leamul10
406 MS imul10
516 MS shl10
312 MS add10
328 MS leamul10
407 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
531 MS shl10
313 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
344 MS leamul10
406 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
532 MS shl10
328 MS add10
312 MS leamul10
407 MS imul10
484 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
Average algo timing
328 MS leamul10
406 MS imul10
511 MS shl10
324 MS add10
Press any key to continue ...

Core2 Quad 3.0 gig

220
220
220
265 MS leamul10
235 MS imul10
437 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
437 MS shl10
235 MS add10
265 MS leamul10
235 MS imul10
437 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
Average algo timing
265 MS leamul10
234 MS imul10
437 MS shl10
236 MS add10
Press any key to continue ...

2.8 i7 860 quad

220
220
234 MS leamul10
249 MS imul10
390 MS shl10
312 MS add10
234 MS leamul10
250 MS imul10
390 MS shl10
296 MS add10
250 MS leamul10
234 MS imul10
390 MS shl10
312 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
297 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
297 MS add10
218 MS leamul10
218 MS imul10
390 MS shl10
281 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
297 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
296 MS add10
Average algo timing
234 MS leamul10
235 MS imul10
382 MS shl10
298 MS add10
Press any key to continue ...


This is the test piece.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    leamul10 PROTO :DWORD
    imul10   PROTO :DWORD
    shl10    PROTO :DWORD
    add10    PROTO :DWORD

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL tim1  :DWORD
    LOCAL tim2  :DWORD
    LOCAL tim3  :DWORD
    LOCAL tim4  :DWORD

    mov tim1, 0
    mov tim2, 0
    mov tim3, 0
    mov tim4, 0

    iteration equ <100000000>

    push esi

    mov eax, 22
    invoke leamul10,eax
    print sstr$(eax),13,10


    mov eax, 22
    invoke imul10,eax
    print sstr$(eax),13,10

    mov eax, 22
    invoke shl10,eax
    print sstr$(eax),13,10

    mov eax, 22
    invoke add10,eax
    print sstr$(eax),13,10

    REPEAT 8

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke leamul10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim1, eax
    print str$(eax)," MS leamul10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke imul10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim2, eax
    print str$(eax)," MS imul10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke shl10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim3, eax
    print str$(eax)," MS shl10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke add10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim4, eax
    print str$(eax)," MS add10",13,10

; ----------------------------------------------

    ENDM


    print "Average algo timing",13,10

    shr tim1, 3
    print str$(tim1)," MS leamul10",13,10

    shr tim2, 3
    print str$(tim2)," MS imul10",13,10

    shr tim3, 3
    print str$(tim3)," MS shl10",13,10

    shr tim4, 3
    print str$(tim4)," MS add10",13,10

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

leamul10 proc num:DWORD

    mov eax, [esp+4]
    lea eax, [eax+eax*4]
    add eax, eax

    ret 4

leamul10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

imul10 proc num:DWORD

    mov eax, [esp+4]
    imul eax, 10

    ret 4

imul10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

shl10 proc num:DWORD

    mov eax, [esp+4]
    shl DWORD PTR [esp+4], 3
    add eax, eax
    add eax, [esp+4]

    ret 4

shl10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

add10 proc num:DWORD

    mov eax, [esp+4]
    mov ecx, [esp+4]

    add ecx, ecx
    add eax, eax
    add eax, eax
    add eax, eax
    add eax, ecx

    ret 4

add10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: jj2007 on April 16, 2010, 12:29:20 PM
Quote from: hutch-- on April 16, 2010, 12:07:39 PM
Pick your Intel core, pick a result.

Hutch, the ABI!!!! :naughty:

    mov eax, [esp+4]
    shl DWORD PTR [esp+4], 3  ; <<<<<<
    add eax, eax
    add eax, [esp+4]

:bg

Prescott:
Average algo timing
545 MS leamul10
568 MS imul10
767 MS shl10
599 MS add10


(now let's do the same for mul123 :toothy)
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: hutch-- on April 16, 2010, 12:37:03 PM
JJ,

> Hutch, the ABI!!!!  ???? [esp] is a memory operand.  :bg

You were too fast finding the test; it had 2 blunders in it, two memory reads each for the shift and add algos. Now each does one read, which more closely matches the first two. Mainly this demo shows core differences.



PIV Prescott 3.8 gig.

220
220
220
220
328 MS leamul10
406 MS imul10
516 MS shl10
312 MS add10
328 MS leamul10
407 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
531 MS shl10
313 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
344 MS leamul10
406 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
532 MS shl10
328 MS add10
312 MS leamul10
407 MS imul10
484 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
Average algo timing
328 MS leamul10
406 MS imul10
511 MS shl10
324 MS add10
Press any key to continue ...

Core2 Quad 3.0 gig

220
220
220
220
265 MS leamul10
235 MS imul10
265 MS shl10
266 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
265 MS leamul10
235 MS imul10
265 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
265 MS leamul10
235 MS imul10
265 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
265 MS leamul10
235 MS imul10
265 MS shl10
235 MS add10
Average algo timing
265 MS leamul10
234 MS imul10
265 MS shl10
250 MS add10
Press any key to continue ...

Nehalem core i7 quad 2.8 gig

220
220
220
220
250 MS leamul10
234 MS imul10
281 MS shl10
327 MS add10
234 MS leamul10
250 MS imul10
250 MS shl10
327 MS add10
265 MS leamul10
234 MS imul10
266 MS shl10
327 MS add10
234 MS leamul10
234 MS imul10
234 MS shl10
312 MS add10
234 MS leamul10
234 MS imul10
249 MS shl10
312 MS add10
234 MS leamul10
234 MS imul10
250 MS shl10
328 MS add10
218 MS leamul10
218 MS imul10
250 MS shl10
327 MS add10
234 MS leamul10
234 MS imul10
234 MS shl10
328 MS add10
Average algo timing
237 MS leamul10
234 MS imul10
251 MS shl10
323 MS add10
Press any key to continue ...


The test piece.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    leamul10 PROTO :DWORD
    imul10   PROTO :DWORD
    shl10    PROTO :DWORD
    add10    PROTO :DWORD

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL tim1  :DWORD
    LOCAL tim2  :DWORD
    LOCAL tim3  :DWORD
    LOCAL tim4  :DWORD

    mov tim1, 0
    mov tim2, 0
    mov tim3, 0
    mov tim4, 0

    iteration equ <100000000>

    push esi

    mov eax, 22
    invoke leamul10,eax
    print sstr$(eax),13,10


    mov eax, 22
    invoke imul10,eax
    print sstr$(eax),13,10

    mov eax, 22
    invoke shl10,eax
    print sstr$(eax),13,10

    mov eax, 22
    invoke add10,eax
    print sstr$(eax),13,10

    REPEAT 8

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke leamul10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim1, eax
    print str$(eax)," MS leamul10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke imul10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim2, eax
    print str$(eax)," MS imul10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke shl10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim3, eax
    print str$(eax)," MS shl10",13,10

; ----------------------------------------------

    invoke GetTickCount
    push eax

    mov esi, iteration

  @@:
    invoke add10,eax
    sub esi, 1
    jnz @B

    invoke GetTickCount
    pop ecx
    sub eax, ecx
    add tim4, eax
    print str$(eax)," MS add10",13,10

; ----------------------------------------------

    ENDM


    print "Average algo timing",13,10

    shr tim1, 3
    print str$(tim1)," MS leamul10",13,10

    shr tim2, 3
    print str$(tim2)," MS imul10",13,10

    shr tim3, 3
    print str$(tim3)," MS shl10",13,10

    shr tim4, 3
    print str$(tim4)," MS add10",13,10

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

leamul10 proc num:DWORD

    mov eax, [esp+4]
    lea eax, [eax+eax*4]
    add eax, eax

    ret 4

leamul10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

imul10 proc num:DWORD

    mov eax, [esp+4]
    imul eax, 10

    ret 4

imul10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

shl10 proc num:DWORD

    mov eax, [esp+4]
    mov ecx, eax
    shl eax, 3
    add eax, ecx
    add eax, ecx

    ret 4

shl10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

add10 proc num:DWORD

    mov eax, [esp+4]
    mov ecx, eax

    add ecx, ecx
    add eax, eax
    add eax, eax
    add eax, eax
    add eax, ecx

    ret 4

add10 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: jj2007 on April 16, 2010, 02:11:26 PM
Yep, for a P4 and mul10, the imul is not the best solution; it's 25% slower. Size-wise it wins for everything other than powers of 2, of course.

Average algo timing

Prescott
365 MS leamul10
449 MS imul10
361 MS shl10
361 MS add10

Celeron M
536 MS leamul10
542 MS imul10
576 MS shl10
839 MS add10

Code size, mul10 only:
12      bytes for add10
9       bytes for shl10
5       bytes for leamul10
3       bytes for imul10
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: clive on April 16, 2010, 05:16:48 PM
Let's examine the benchmarking of code fragments.

Method A

L = Loop time in cycles
T = Test unit in cycles

t = measured cycles per unit

Testing 100 units

  |  T  |  T  |  T  | .. |  T  |  T  | L |

  |  t  |

t = ((T * 100) + L) / 100

If T = 10, L = 2, then t = 10.02 cycles, error from looping is 0.2%

Method B

L = Loop time in cycles
X = Test unit in cycles billed
Y = Test unit in cycles deferred (ie pipelined)

T = X + Y Test unit in cycles actual

t = measure cycles per unit

Testing 100 units

  |    T    |
  | X |  Y  |
      | X |  Y  |
          | X |  Y  |
              | X |  Y  |
                ..
                             | X |  Y  |
                                 | X |  Y  | L |

  | t |

t = ((X * 100) + Y + L) / 100

If X = 4, Y = 6, L = 2, then T = 10, yet t = 4.08 cycles, error 59.2% (nominally (Y / T) * 100)


Method B is not how to benchmark a code fragment; you'd get better results by doing 1 iteration. It is only useful for loop unwinding, and much code that processes streams of data has dependencies.

The purpose of doing 100 iterations is to drop the looping overhead to 1% or below.

Latency is important: dependent operations don't really happen before the answer is provided. Throughput only defines how fast you can stuff something into the pipeline. You can usually issue a new operation every cycle (with multiple execution units, more than one can start per cycle), but it will take multiple cycles for the answer to appear.

-Clive
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: hutch-- on April 16, 2010, 09:45:49 PM
Clive,

I find the idea interesting, as I understand the difference between throughput and latency, but I don't know a good way to test it. The benchmark I posted above puts each method in a separate procedure and they all have the same stack and data load overhead, but it's a test of throughput, not instruction latency.

Michael has developed a set of timing macros that address small instruction counts, and within the limits of ring 3 access they work reasonably well, but the only method I know of that will handle small instruction counts with high precision is the one Agner Fog developed a few years ago, which ran from a boot disk after switching the processor into protected mode.

I would certainly be interested in a technique that can accurately time instruction latency over small instruction counts. Where I have had to deal with this in code, I usually set up a test piece as close to the real-world requirement as I can get, then time it against something else, usually an original algo against a modified one, so I can measure any difference with the modified version.
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: clive on April 25, 2010, 08:04:19 PM
Quote from: clive
I don't have a Willamette P4 to hand, but the IMUL should have latency ~14, and the SHR/SHL 8 probably ~8. I'll dig one up at the weekend.

Ok, so here with a Willamette P4 (Celeron 1.8 GHz): note the 14-cycle IMUL performance, the 4-cycle shifts, and the totally egregious processor-stalling XCHG on memory. AMD made a lot of money on the back of this brain-numbingly defective P4 design.

Empty
      1596 Ticks (min),  0.053 cycles

IMUL eax, 1
    418528 Ticks (min), 13.951 cycles

IDIV eax, 1
   1708472 Ticks (min), 56.949 cycles

SHL eax, 1 / SHR eax, 1
    118532 Ticks (min),  3.951 cycles

SHL eax, 8 / SHR eax, 8
    118532 Ticks (min),  3.951 cycles

XCHG eax, [mem]
   3727636 Ticks (min), 124.255 cycles


Intel(R) Celeron(R) CPU 1.80GHz (SSE2)
464     cycles for 100*imul 8
1011    cycles for 100*mul 8
99      cycles for 100*shl 3
93      cycles for 100*shl 1
464     cycles for 100*imul 10
1081    cycles for 100*mul 10
202     cycles for 100*lea mul10
57      cycles for 100*or
1482    cycles for 100*imul 1 - dependent
1616    cycles for 100*mul 1 - dependent
5860    cycles for 100*idiv 1 - dependent
37      cycles for 100*or - dependent
403     cycles for 100*rol 1 - dependent
394     cycles for 100*rol 8 - dependent

419     ticks for 100*imul 8
986     ticks for 100*mul 8
82      ticks for 100*shl 3
79      ticks for 100*shl 1
413     ticks for 100*imul 10
911     ticks for 100*mul 10
170     ticks for 100*lea mul10
49      ticks for 100*or
1255    ticks for 100*imul 1 - dependent
1455    ticks for 100*mul 1 - dependent
5224    ticks for 100*idiv 1 - dependent
33      ticks for 100*or - dependent
356     ticks for 100*rol 1 - dependent
355     ticks for 100*rol 8 - dependent
Title: Re: Optimization manual updated again. Now covers Core 2
Post by: MichaelW on April 26, 2010, 12:29:13 AM
Quote from: clive
AMD made a lot of money on the back of this brain numbingly defective P4 design.

It surprised me that Intel failed to anticipate the "wall" they hit on clock speed increases, and that they stuck with the plan until AMD was kicking their butt in almost every benchmark.