Optimization manual updated again. Now covers Core 2

Started by agner, August 14, 2006, 07:40:58 AM

jj2007

Quote from: clive on April 14, 2010, 02:30:58 PM
Can we dig up the zombie P4 designers that removed the barrel shifter, and beat them to death again?

Try imul instead, it's blazingly fast on modern CPUs.

clive

Quote from: jj2007
Try imul instead, it's blazingly fast on modern CPUs.

I'm familiar with the math. I'd be willing to bet the latency is still significantly worse than a barrel shifter. On the P4 it was 14 cycles as I recall, and Intel was recommending an ADD/LEA instruction combination instead. The latency is still 3-5 cycles; a barrel shifter should be 1.

IMUL is no substitute for right shifts, rotates, rotates through carry, or anything using/setting the carry flag. IDIV will eat three registers (EDX:EAX plus the divisor), has similar carry and rotate issues, and even more hideous latency.

The Core and Atom have significantly better shifting performance, AMD has always been good, and some of the P4 issues were addressed in Prescott. Still, the 64-bit right shifts are dogs across the entire P4 family. The P4 Willamette was a totally awful implementation, and Northwood wasn't much better, which is why I recommended beating their designers around the head. Whoosh!
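
For reference, a minimal sketch of the ADD/LEA style Intel was recommending, here multiplying EAX by 10 (the constant is just an illustration):

    lea eax, [eax+eax*4]    ; eax = x*5, single cycle, no multiplier needed
    add eax, eax            ; eax = x*10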

-Clive

jj2007

Quote from: clive on April 14, 2010, 04:33:28 PM
Quote from: jj2007
Try imul instead, it's blazingly fast on modern CPUs.

I'm familiar with the math. I'd be willing to bet the latency is still significantly worse than a barrel shifter.

We aren't into betting and guessing here, Clive. Celeron M:

145     cycles for 100*imul 8
95      cycles for 100*shl 3
145     cycles for 100*imul 10
149     cycles for 100*lea mul10


clive

Quote from: jj2007
We aren't into betting and guessing here, Clive. Celeron M:

145     cycles for 100*imul 8
95      cycles for 100*shl 3
145     cycles for 100*imul 10
149     cycles for 100*lea mul10


Then do me a favour and measure the latency. I bet because it beats wasting a lot of time benchmarking crap when I understand how to build barrel shifters and multipliers in hardware. I'd rather bet and get the right/better result than measure and get the wrong one.

Presler (Pentium D)


96      cycles for 100*imul 8
81      cycles for 100*shl 3
98      cycles for 100*imul 10
167     cycles for 100*lea mul10


10 Tests, x30000
IMUL eax, 2
186555 Ticks
186555 Ticks
186555 Ticks
186585 Ticks
186555 Ticks
186555 Ticks
186570 Ticks
186555 Ticks
186555 Ticks
243900 Ticks
186555 Ticks (min), 6.218500

SHL eax, 1
37065 Ticks
37050 Ticks
37170 Ticks
37065 Ticks
37050 Ticks
37035 Ticks
37035 Ticks
37035 Ticks
37035 Ticks
37020 Ticks
37020 Ticks (min), 1.234000

SHL eax, 8
68985 Ticks
31770 Ticks
31695 Ticks
31650 Ticks
31830 Ticks
31635 Ticks
31740 Ticks
31890 Ticks
31635 Ticks
31680 Ticks
31635 Ticks (min), 1.054500


Your test suggests that IMUL is a single cycle, when I'm pretty damn sure it has a 6-cycle latency (i.e. resources actually consumed, time to get a result). On a Prescott it's 5, compared to ~1 for shifts, and the divide clocks in at 79 cycles. Blazingly fast? I'm not convinced.
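
The distinction, sketched (assuming a pipelined multiplier with roughly 6-cycle latency):

    ; throughput: each IMUL gets a fresh input, so several are in
    ; flight at once and each appears to cost only a cycle or so
    mov eax, 1000
    imul eax, eax, 8
    mov eax, 1000       ; chain broken, next IMUL starts immediately
    imul eax, eax, 8

    ; latency: each IMUL must wait for the previous result,
    ; so each one costs the full ~6 cycles
    imul eax, eax, 1
    imul eax, eax, 1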

-Clive

jj2007

Quote from: clive on April 14, 2010, 05:38:51 PM
Then do me a favour and measure the latency.

Version 2 - I added a mov esi, eax behind each multiplication to create a dependency chain. If you have a different definition of latency, post code please.

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
mov eax, 1000
imul eax, eax, 8
mov esi, eax
ENDM
counter_end


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
173     cycles for 100*imul 8
287     cycles for 100*mul 8
149     cycles for 100*shl 3
157     cycles for 100*shl 1
173     cycles for 100*imul 10
288     cycles for 100*mul 10
218     cycles for 100*lea mul10

172     ticks for 100*imul 8
287     ticks for 100*mul 8
150     ticks for 100*shl 3
157     ticks for 100*shl 1
172     ticks for 100*imul 10
287     ticks for 100*mul 10
218     ticks for 100*lea mul10

clive

That's hardly a dependency chain: you throw away the result and light off another IMUL into the pipeline.

This should give you a measurement of latency rather than dispatch/throughput. Pretty much zero effort has been applied to aligning or optimizing; it's designed to give ballpark numbers.

-Clive

DWORD imultest(void)
{
  DWORD ticks;

  _asm
  {
    rdtsc
    push eax

    mov ecx,1000

    mov eax,1

loop_mul:
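
    ; each IMUL below reads the EAX written by the previous one, so the
    ; chain exposes latency, not throughput; EAX*1 never changes the
    ; value, but the hardware still performs the full multiply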

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    dec ecx
    jnz loop_mul

    rdtsc
    pop ecx
    sub eax,ecx
    mov ticks,eax
  }

  return(ticks);
}


DWORD idivtest(void)
{
  DWORD ticks;

  _asm
  {
    rdtsc
    push eax

    mov ecx,1000

    mov eax,1
    xor edx,edx

loop_div:
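
    ; EDX:EAX = 0:1 divided by EAX = 1 leaves quotient 1 in EAX and
    ; remainder 0 in EDX every time, so the values never change, but
    ; each IDIV still depends on the result of the previous one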

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    dec ecx
    jnz loop_div

    rdtsc
    pop ecx
    sub eax,ecx
    mov ticks,eax
  }

  return(ticks);
}

jj2007

Quote from: clive on April 14, 2010, 08:37:56 PM
That's hardly a dependency chain: you throw away the result and light off another IMUL into the pipeline.

Interesting. So the processor "knows" that I don't really need the result when I put mov esi, eax there?
Can you please post your code in assembler, so that everybody can test it? Thanks.

clive

The processor uses register renaming and out-of-order execution. The value moved to ESI is simply retired when the IMUL completes; since nothing downstream cares about that value, it never holds up the processor. Reloading EAX then breaks the chain for the next IMUL. Using ADD ESI,EAX would have more consequence.
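
Schematically (a sketch of why the MOV doesn't serialize anything):

    mov eax, 1000       ; renamed to a fresh physical register, so the
                        ; next IMUL need not wait for the previous one
    imul eax, eax, 8
    mov esi, eax        ; retires when the IMUL completes; nothing reads
                        ; ESI afterwards, so nothing stalls
    ; whereas ADD ESI,EAX would also have to read the previous ESI,
    ; creating a second, serial dependency chain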

-Clive

Here in MASM assembler.

        .586

        .MODEL FLAT, C

        .CODE

        ; ml -c -Fl imultest.asm

        ; Time 30000 instructions, returns estimated clock ticks in EAX (divide by 30000.0 to get estimated clocks)
        ; The looping code accounts for 3%, if that, of the total

imultest PROC

    rdtsc
    push eax

    mov ecx,1000

    mov eax,1

    ALIGN 16

loop_mul:

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1
    imul eax,1

    dec ecx
    jnz loop_mul

    rdtsc
    pop ecx
    sub eax,ecx

    ret

imultest ENDP

idivtest PROC

    rdtsc
    push eax

    mov ecx,1000

    mov eax,1
    xor edx,edx

    ALIGN 16

loop_div:

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax
    idiv eax

    dec ecx
    jnz loop_div

    rdtsc
    pop ecx
    sub eax,ecx

    ret

idivtest ENDP

        END

clive

In your HLA

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1 ; each imul below computes eax = eax * 1, so EAX remains one, but the processor doesn't know this
REPEAT 100 ; I'd probably do more, to make the dependency chain maximal
imul eax, eax, 1
ENDM
counter_end


counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1 ; clever constant selection: eax = edx:eax / eax, edx = edx:eax % eax. EDX remains zero, EAX remains one, but the processor doesn't know this
xor edx,edx
REPEAT 100
idiv eax
ENDM
counter_end


I use C as my framework rather than just calling CRT routines. I'll post C source and executables if they're helpful, but this should be a simple cut-and-paste exercise. Basically I'm trying to refute that IMUL is a single-cycle instruction.

-Clive

Edit: Attached C/EXE

Prescott P4 instruction latency
Empty
      1612 Ticks (min),  0.054 cycles

IMUL eax, 1
    298478 Ticks (min),  9.949 cycles

IDIV eax, 1
   2398350 Ticks (min), 79.945 cycles

SHL eax, 1 / SHR eax, 1
     28448 Ticks (min),  0.948 cycles

SHL eax, 8 / SHR eax, 8
     28448 Ticks (min),  0.948 cycles

XCHG eax, [mem]
   2752365 Ticks (min), 91.746 cycles


Presler P4
Empty
      2025 Ticks (min),  0.068 cycles

IMUL eax, 1
    373095 Ticks (min), 12.437 cycles

IDIV eax, 1
   2397930 Ticks (min), 79.931 cycles

SHL eax, 1 / SHR eax, 1
     28035 Ticks (min),  0.935 cycles

SHL eax, 8 / SHR eax, 8
     28035 Ticks (min),  0.935 cycles

XCHG eax, [mem]
   2744310 Ticks (min), 91.477 cycles


Atom
Empty
      4056 Ticks (min),  0.135 cycles

IMUL eax, 1
    145992 Ticks (min),  4.866 cycles

IDIV eax, 1
   1828008 Ticks (min), 60.934 cycles

SHL eax, 1 / SHR eax, 1
     25992 Ticks (min),  0.866 cycles

SHL eax, 8 / SHR eax, 8
     27888 Ticks (min),  0.930 cycles

XCHG eax, [mem]
    177120 Ticks (min),  5.904 cycles


I don't have a Willamette P4 to hand, but the IMUL should have latency ~14, and the SHR/SHL 8 probably ~8. I'll dig one up at the weekend.

jj2007

Quote from: clive on April 14, 2010, 09:10:54 PM
The value moved to ESI is simply retired when the IMUL completes; since nothing downstream cares about that value, it never holds up the processor. Reloading EAX then breaks the chain for the next IMUL. Using ADD ESI,EAX would have more consequence.

There is no evidence that ADD ESI,EAX has "more consequence": on my Celeron, timings are identical (taking into account the extra half cycle of add vs mov); on my P4, the ADD ESI,EAX version is even faster.
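
The two variants measured below, following the counter_begin pattern posted earlier (a sketch):

REPEAT 100
mov eax, 1000
imul eax, eax, 8
mov esi, eax   ; variant 1: plain copy of the result
ENDM

REPEAT 100
mov eax, 1000
imul eax, eax, 8
add esi, eax   ; variant 2: ESI accumulates, adding a second chain
ENDM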

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
48      cycles for 100*mov esi
95      cycles for 100*add esi
172     cycles for 100*imul 8, mov esi
223     cycles for 100*imul 8, add esi

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
45      cycles for 100*mov esi
96      cycles for 100*add esi
223     cycles for 100*imul 8, mov esi
211     cycles for 100*imul 8, add esi
282     cycles for 100*mul 8
150     cycles for 100*shl 3

hutch--

The comments on PIVs sound right. I had PIVs from an early 1.5 GHz to the last 3.8 GHz across three core types, and they were all quirky in comparison to earlier and later processors. LEA slowed down by a long way, which impacted fixed multiplications; this more or less matched the Intel PIV data that recommended using ADD instead for factors up to about 3, beyond which LEA was faster. The Core quad has a fast LEA, but one of the few things the last PIV was faster at was shifts and rotates. The late Prescott PIVs had a long pipeline that made stalls very expensive; the Northwood core was shorter and recovered from stalls faster.

The Core2 quad is much more like a PIII in its preferred style of code, just faster. I have only just got the new Nehalem going, so I have yet to do any serious timing on it, but a quick play earlier today suggests it has different characteristics to the Core series. There appears to be a shift towards SSE instructions in the Core series, as comparisons to the PIVs I have showed much bigger improvements with SSE code on the quad than on any of the PIVs. I would imagine the finer process geometry in the very recent Intel cores allows a bit more room for hardwired instructions, and it appears that the extra space is going into SSE instructions rather than the older integer instructions.

This much I have learnt about core design: I don't really want to know the details, as each core is substantially different, they tend to be RISC under the hood, and depending on the core model they have different preferred instruction sets which follow from the RISC primitives below the published interface.

clive

Prescott P4
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
232     cycles for 100*imul 8
312     cycles for 100*mul 8
138     cycles for 100*shl 3
149     cycles for 100*shl 1
255     cycles for 100*imul 10
303     cycles for 100*mul 10
249     cycles for 100*lea mul10
85      cycles for 100*or
1030    cycles for 100*imul 1 - dependent
1242    cycles for 100*mul 1 - dependent
8172    cycles for 100*idiv 1 - dependent
101     cycles for 100*or - dependent
102     cycles for 100*rol 1 - dependent
104     cycles for 100*rol 8 - dependent

118     ticks for 100*imul 8
160     ticks for 100*mul 8
78      ticks for 100*shl 3
72      ticks for 100*shl 1
120     ticks for 100*imul 10
162     ticks for 100*mul 10
136     ticks for 100*lea mul10
45      ticks for 100*or
540     ticks for 100*imul 1 - dependent
654     ticks for 100*mul 1 - dependent
4338    ticks for 100*idiv 1 - dependent
53      ticks for 100*or - dependent
54      ticks for 100*rol 1 - dependent
55      ticks for 100*rol 8 - dependent


NewCastle
AMD Athlon(tm) 64 Processor 3200+ (SSE2)
100     cycles for 100*imul 8
200     cycles for 100*mul 8
97      cycles for 100*shl 3
97      cycles for 100*shl 1
99      cycles for 100*imul 10
198     cycles for 100*mul 10
140     cycles for 100*lea mul10
62      cycles for 100*or
301     cycles for 100*imul 1 - dependent
300     cycles for 100*mul 1 - dependent
4051    cycles for 100*idiv 1 - dependent
96      cycles for 100*or - dependent
96      cycles for 100*rol 1 - dependent
97      cycles for 100*rol 8 - dependent

80      ticks for 100*imul 8
160     ticks for 100*mul 8
78      ticks for 100*shl 3
78      ticks for 100*shl 1
80      ticks for 100*imul 10
159     ticks for 100*mul 10
112     ticks for 100*lea mul10
47      ticks for 100*or
243     ticks for 100*imul 1 - dependent
238     ticks for 100*mul 1 - dependent
3272    ticks for 100*idiv 1 - dependent
75      ticks for 100*or - dependent
75      ticks for 100*rol 1 - dependent
79      ticks for 100*rol 8 - dependent

jj2007

Quote from: clive on April 14, 2010, 04:33:28 PM
Quote from: jj2007
Try imul instead, it's blazingly fast on modern CPUs.

I'm familiar with the math. I'd be willing to bet the latency is still significantly worse than a barrel shifter. On the P4 it was 14 cycles as I recall, and Intel was recommending an ADD/LEA instruction combination instead. The latency is still 3-5 cycles,

Quote from: clive on April 14, 2010, 05:38:51 PM
Then do me a favour and measure the latency. I bet because it beats wasting a lot of time benchmarking crap when I understand how to build barrel shifters and multipliers in hardware. I'd rather bet and get the right/better result than measure and get the wrong one.

Quote from: clive on April 15, 2010, 04:06:25 PM
Prescott P4
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
138     cycles for 100*shl 3
255     cycles for 100*imul 10
303     cycles for 100*mul 10
249     cycles for 100*lea mul10
1030    cycles for 100*imul 1 - dependent
1242    cycles for 100*mul 1 - dependent
...
AMD Athlon(tm) 64 Processor 3200+ (SSE2)
99      cycles for 100*imul 10
301     cycles for 100*imul 1 - dependent


I see we are converging a little bit - imul latency is down from over 14 to 10 for a P4 and 3 for the NewCastle. Good.

Now the interesting question is perhaps whether latency is more relevant than throughput. Of course, since you are programming in C, you depend very much on what your compiler knows about it. Assembler coders can optimise their code.
Imagine this (rather useless) sequence:

Quote   mov eax, esi
   REPEAT 100
      imul eax, eax, 10
   ENDM
   mov esi, eax

Overall speed depends, as you rightly assume, on latency: the next imul can only start when the previous one has finished.
However, nobody in Masmland stops you from doing other things in parallel:

Quote   mov eax, esi
   REPEAT 100
      imul eax, eax, 10
      mov edi, ebx
      mov ebx, esi
      inc ecx
      dec edx
      dec ebx
   ENDM
   mov esi, eax

And, surprise surprise, both sequences run in about four cycles per iteration on my slow cheap Celeron M - and imul is the best overall choice:

QuoteIntel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
150     cycles for 100*shl 3
410     cycles for 100*shl 3 with five extra instructions
173     cycles for 100*imul 10
287     cycles for 100*mul 10
219     cycles for 100*lea mul10
471     cycles for 100*lea mul10 with five extra instructions
397     cycles for 100*imul 10 - dependent
410     cycles for 100*imul 10 - dependent with five extra instructions
496     cycles for 100*mul eax - dependent
576     cycles for 100*mul eax - dependent with five extra instructions

hutch--

The problem with MUL / IMUL is with blended-model programming: everything from the i486 up to the late PIVs was expensive with multiplication operations. Later cores are supposed to handle muls differently, but if you want code that runs OK on most things, you need to address how it works on PIVs and below. One of the few blessings of 64-bit OS versions is that the hardware must be recent enough to run them, which means a reasonably late-model processor, so you are safe with many of the later, faster instructions.

Code is a lot easier to write when you have some certainty about the lowest processor that will run it, and it should be faster than code that must run on everything.

clive

Quote from: jj2007
I see we are converging a little bit - imul latency is down from over 14 to 10 for a P4 and 3 for the NewCastle. Good.

Yes, and the 14 is a Willamette (1st Pentium 4, and source of much derision) number, from the notes I have. I'll demonstrate if you insist. The box is at the out-laws, so it's not readily accessible; I could get P3 and PPro numbers more readily.

This paper has some good analysis
http://gmplib.org/~tege/x86-timing.pdf

"When compiling with /G7, we produce the faster (but longer) sequence that avoids using the imul instruction, which has a 14-cycle latency on the Intel Pentium 4:"
http://msdn.microsoft.com/en-us/library/aa290055%28VS.71%29.aspx
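
For illustration, the sort of IMUL-free sequence such a compiler can emit; a sketch, using a multiply by 100 as an arbitrary example constant:

    lea eax, [eax+eax*4]    ; eax = x*5
    lea eax, [eax+eax*4]    ; eax = x*25
    shl eax, 2              ; eax = x*100, i.e. ((x*5)*5)*4 without IMUL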

The problem with optimizing for one architecture is that the installed base is so divergent that, unless the application warrants multiple code paths and the hideous amount of regression testing required to exercise all of them, one generally focuses on a reasonable solution. It makes sense for things like folding, SETI, encryption cracking, etc., but I wouldn't classify those as commercial applications.

-Clive