I have updated my manual once again. Now covering everything about the new Intel Core 2 processor including a detailed study of the pipeline and execution units and complete lists of instruction timings.
This time my manual has come before the official manuals from Intel. Their software manuals for the Core 2 are not out yet. Thank you to a friendly person who gave me remote access to a prerelease sample of the Core 2. This enabled me to test almost everything.
The execution core is more powerful than anything we have seen until now. It can do up to three full 128-bit vector calculations per clock cycle. Unfortunately, the instruction fetch and predecode stage has not been expanded enough to keep up with the rest of the pipeline, so this is a serious bottleneck in many situations.
The section on AMD microarchitecture in my manual has also been revised, thanks to help from Andreas Kaiser.
http://www.agner.org/optimize/
Your work is really unique. Thanks for your hard work, agner :U
Very nice, thanks for your work.
Would it be okay to add the asm manual to my mirror (link in my signature)?
Thank you very much, Agner! :U
Thank you very much ;)
Agner,
Very nice work :U
I found a little error in the optimization manual:
At page 74, you state that the only way to load unaligned data into XMM registers is to use MOVDQU, MOVUPS, MOVUPD or LDDQU. You can also use MOVLPD/MOVHPD pairs, which results in faster code.
Stan
Hello agner!
Thanks for your work! It is very nice to see people who do a lot of work in a specific area and are ready to share the gained knowledge! Thank you again!
I had a look at your website. There are interesting topics! This Cultural Selection Theory looks very exciting...
Greets, Gábor
Great work, thanks!
I thought Mark Larson's "Code Optimization" (http://www.mark.masmcode.com/) was already the best optimization document I'd got. I guess multiple "the best" documents are always better than having a single "the best" document :U
-chris
Your link to Mark Larson doesn't work. Where is the document you are referring to?
Agner,
The link works just fine. Please try it again. The server may have been down when you last tried.
Paul
Agner, thank you for your work! :U
Agner you rulz ! thanks so much my favorite read for optimization :dance:
you guys have woken a 4 year old thread :P
Quote from: dedndave
you guys have woken a 4 year old thread :P
Can we dig up the zombie P4 designers that removed the barrel shifter, and beat them to death again?
-Clive
Quote from: clive on April 14, 2010, 02:30:58 PM
Can we dig up the zombie P4 designers that removed the barrel shifter, and beat them to death again?
Try imul instead, it's blazingly fast on modern CPUs.
Quote from: jj2007
Try imul instead, it's blazingly fast on modern CPUs.
I'm familiar with the math. I'd be willing to bet the latency is still significantly worse than a barrel shifter. On the P4 it was 14 cycles as I recall, and Intel was recommending ADD/LEA instruction combinations. The latency is still 3-5 cycles; a barrel shifter should be 1.
IMUL doesn't work well with right shifts, rotates, rotates through carry, using/setting carry. IDIV will eat 3 registers, has similar carry and rotate issues, and even more hideous latency.
The Core and Atom have significantly better shifting performance, AMD always has been good, some of the P4 issues were addressed in Prescott. Still the 64-bit right shifts are dogs in the entire P4 family. The P4 Willamette was a totally awful implementation, and Northwood wasn't much better, which is why I recommended beating them around the head. Whoosh!
-Clive
Quote from: clive on April 14, 2010, 04:33:28 PM
Quote from: jj2007
Try imul instead, it's blazingly fast on modern CPUs.
I'm familiar with the math. I'd be willing to bet the latency is still significantly worse than a barrel shifter.
We aren't into betting and guessing here, Clive. Celeron M:
145 cycles for 100*imul 8
95 cycles for 100*shl 3
145 cycles for 100*imul 10
149 cycles for 100*lea mul10
Quote from: jj2007
We aren't into betting and guessing here, Clive. Celeron M:
145 cycles for 100*imul 8
95 cycles for 100*shl 3
145 cycles for 100*imul 10
149 cycles for 100*lea mul10
Then do me a favour and measure the latency. I bet, because it beats wasting a lot of time benchmarking crap when I understand how to build barrel shifters and multipliers in hardware. I'd rather bet and get the right/better results, than measure and get the wrong one.
Presler (Pentium D)
96 cycles for 100*imul 8
81 cycles for 100*shl 3
98 cycles for 100*imul 10
167 cycles for 100*lea mul10
10 Tests, x30000
IMUL eax, 2
186555 Ticks
186555 Ticks
186555 Ticks
186585 Ticks
186555 Ticks
186555 Ticks
186570 Ticks
186555 Ticks
186555 Ticks
243900 Ticks
186555 Ticks (min), 6.218500
SHL eax, 1
37065 Ticks
37050 Ticks
37170 Ticks
37065 Ticks
37050 Ticks
37035 Ticks
37035 Ticks
37035 Ticks
37035 Ticks
37020 Ticks
37020 Ticks (min), 1.234000
SHL eax, 8
68985 Ticks
31770 Ticks
31695 Ticks
31650 Ticks
31830 Ticks
31635 Ticks
31740 Ticks
31890 Ticks
31635 Ticks
31680 Ticks
31635 Ticks (min), 1.054500
Your test suggests that IMUL is a single cycle, when I'm pretty damn sure it has a 6-cycle latency (i.e. resources actually consumed, time to get a result). On a Prescott it's 5 compared to ~1 for shifts, and the divide clocks in at 79 cycles. Blazingly fast? I'm not convinced.
-Clive
Quote from: clive on April 14, 2010, 05:38:51 PM
Then do me a favour and measure the latency.
Version 2 - I added a mov esi, eax after each multiplication to create a dependency chain. If you have a different definition of latency, please post code.
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
mov eax, 1000
imul eax, eax, 8
mov esi, eax
ENDM
counter_end
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
173 cycles for 100*imul 8
287 cycles for 100*mul 8
149 cycles for 100*shl 3
157 cycles for 100*shl 1
173 cycles for 100*imul 10
288 cycles for 100*mul 10
218 cycles for 100*lea mul10
172 ticks for 100*imul 8
287 ticks for 100*mul 8
150 ticks for 100*shl 3
157 ticks for 100*shl 1
172 ticks for 100*imul 10
287 ticks for 100*mul 10
218 ticks for 100*lea mul10
That's hardly a dependency chain, you throw away the result, and light off another IMUL into the pipeline.
This should give you a measurement of latency, rather than dispatch/throughput. Pretty much zero effort has been applied to aligning or optimizing; it's designed to give ballpark numbers.
-Clive
DWORD imultest(void)
{
DWORD ticks;
_asm
{
rdtsc
push eax
mov ecx,1000
mov eax,1
loop_mul:
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
dec ecx
jnz loop_mul
rdtsc
pop ecx
sub eax,ecx
mov ticks,eax
}
return(ticks);
}
DWORD idivtest(void)
{
DWORD ticks;
_asm
{
rdtsc
push eax
mov ecx,1000
mov eax,1
xor edx,edx
loop_div:
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
dec ecx
jnz loop_div
rdtsc
pop ecx
sub eax,ecx
mov ticks,eax
}
return(ticks);
}
Quote from: clive on April 14, 2010, 08:37:56 PM
That's hardly a dependency chain, you throw away the result, and light off another IMUL into the pipeline.
Interesting. So the processor "knows" that I don't really need the result when I put mov esi, eax there?
Can you please post your code in assembler, so that everybody can test it? Thanks.
The processor uses register renaming, and out-of-order execution. The value moved to ESI is just retired when the IMUL completes. As no one cares about the value, when this happens nothing is holding up the processor. Loading EAX breaks the chain for the next IMUL. Using ADD ESI,EAX would have more consequence.
-Clive
Here in MASM assembler.
.586
.MODEL FLAT, C
.CODE
; ml -c -Fl imultest.asm
; Time 30000 instructions, returns estimated clock ticks in EAX (divide by 30000.0 to get estimated clocks)
; The looping code accounts for 3%, if that, of the total
imultest PROC
rdtsc
push eax
mov ecx,1000
mov eax,1
ALIGN 16
loop_mul:
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
imul eax,1
dec ecx
jnz loop_mul
rdtsc
pop ecx
sub eax,ecx
ret
imultest ENDP
idivtest PROC
rdtsc
push eax
mov ecx,1000
mov eax,1
xor edx,edx
ALIGN 16
loop_div:
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
idiv eax
dec ecx
jnz loop_div
rdtsc
pop ecx
sub eax,ecx
ret
idivtest ENDP
END
In your HLA
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1 ; the imul below computes eax = eax * 1, so EAX remains one, but the processor doesn't know this
REPEAT 100 ; I'd probably do more, to make the dependency chain maximal
imul eax, eax, 1
ENDM
counter_end
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1 ; clever constant selection eax = edx:eax / eax, edx = edx:eax % eax. EDX remains zero, EAX remains one, but the processor doesn't know this
xor edx,edx
REPEAT 100
idiv eax
ENDM
counter_end
I use C as my framework rather than just calling CRT routines. I'll post C source and executables if they're helpful, but this should be a simple cut-and-paste exercise. Basically I am trying to refute that IMUL is a single-cycle instruction.
-Clive
Edit: Attached C/EXE
Prescott P4 instruction latency
Empty
1612 Ticks (min), 0.054 cycles
IMUL eax, 1
298478 Ticks (min), 9.949 cycles
IDIV eax, 1
2398350 Ticks (min), 79.945 cycles
SHL eax, 1 / SHR eax, 1
28448 Ticks (min), 0.948 cycles
SHL eax, 8 / SHR eax, 8
28448 Ticks (min), 0.948 cycles
XCHG eax, [mem]
2752365 Ticks (min), 91.746 cycles
Presler P4
Empty
2025 Ticks (min), 0.068 cycles
IMUL eax, 1
373095 Ticks (min), 12.437 cycles
IDIV eax, 1
2397930 Ticks (min), 79.931 cycles
SHL eax, 1 / SHR eax, 1
28035 Ticks (min), 0.935 cycles
SHL eax, 8 / SHR eax, 8
28035 Ticks (min), 0.935 cycles
XCHG eax, [mem]
2744310 Ticks (min), 91.477 cycles
Atom
Empty
4056 Ticks (min), 0.135 cycles
IMUL eax, 1
145992 Ticks (min), 4.866 cycles
IDIV eax, 1
1828008 Ticks (min), 60.934 cycles
SHL eax, 1 / SHR eax, 1
25992 Ticks (min), 0.866 cycles
SHL eax, 8 / SHR eax, 8
27888 Ticks (min), 0.930 cycles
XCHG eax, [mem]
177120 Ticks (min), 5.904 cycles
I don't have a Willamette P4 to hand, but the IMUL should have latency ~14, and the SHR/SHL 8 probably ~8. I'll dig one up at the weekend.
Quote from: clive on April 14, 2010, 09:10:54 PM
The value moved to ESI is just retired when the IMUL completes. As no one cares about the value, when this happens nothing is holding up the processor. Loading EAX breaks the chain for the next IMUL. Using ADD ESI,EAX would have more consequence.
There is no evidence that ADD ESI,EAX has "more consequence": On my Celeron, timings are identical (taking into account the extra half cycle of add vs mov); on my P4, the ADD ESI,EAX version is even faster.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
48 cycles for 100*mov esi
95 cycles for 100*add esi
172 cycles for 100*imul 8, mov esi
223 cycles for 100*imul 8, add esi
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
45 cycles for 100*mov esi
96 cycles for 100*add esi
223 cycles for 100*imul 8, mov esi
211 cycles for 100*imul 8, add esi
282 cycles for 100*mul 8
150 cycles for 100*shl 3
The comments on PIVs sound right. I had PIVs from an early 1.5 up to the last 3.8, across three core types, and they were all quirky in comparison to earlier and later processors. LEA slowed down by a long way, which impacted fixed multiplications, and this more or less matched the Intel PIV data that recommended using ADD instead up to about a factor of 3 before LEA was faster. The Core quad has a fast LEA, but one of the few things the last PIV was faster with was shifts and rotates. The late Prescott PIVs had a long pipeline that made stalls very expensive; the Northwood core was shorter and recovered from stalls faster.
The Core2 quad is much more like a PIII in its preferred style of code, just faster. I have only just got the new Nehalem going, so I have yet to do any serious timing on it, but a quick play earlier today says it has different characteristics to the Core series. There appears to be a shift to SSE instructions in the Core series, as comparisons to the PIVs I have showed much bigger improvements with SSE code on the Quad than on any of the PIVs. I would imagine that the finer tracking in the very recent Intel cores allows a bit more room for hardcoded instructions, and it appears that the extra space is going into SSE instructions rather than the older integer instructions.
This much I have learnt about core design: I don't really want to know the details, as each core is substantially different, they tend to be RISC under the hood, and depending on the core model they have different preferred instruction sets which follow from the RISC primitives below the published interface.
Prescott P4
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
232 cycles for 100*imul 8
312 cycles for 100*mul 8
138 cycles for 100*shl 3
149 cycles for 100*shl 1
255 cycles for 100*imul 10
303 cycles for 100*mul 10
249 cycles for 100*lea mul10
85 cycles for 100*or
1030 cycles for 100*imul 1 - dependent
1242 cycles for 100*mul 1 - dependent
8172 cycles for 100*idiv 1 - dependent
101 cycles for 100*or - dependent
102 cycles for 100*rol 1 - dependent
104 cycles for 100*rol 8 - dependent
118 ticks for 100*imul 8
160 ticks for 100*mul 8
78 ticks for 100*shl 3
72 ticks for 100*shl 1
120 ticks for 100*imul 10
162 ticks for 100*mul 10
136 ticks for 100*lea mul10
45 ticks for 100*or
540 ticks for 100*imul 1 - dependent
654 ticks for 100*mul 1 - dependent
4338 ticks for 100*idiv 1 - dependent
53 ticks for 100*or - dependent
54 ticks for 100*rol 1 - dependent
55 ticks for 100*rol 8 - dependent
NewCastle
AMD Athlon(tm) 64 Processor 3200+ (SSE2)
100 cycles for 100*imul 8
200 cycles for 100*mul 8
97 cycles for 100*shl 3
97 cycles for 100*shl 1
99 cycles for 100*imul 10
198 cycles for 100*mul 10
140 cycles for 100*lea mul10
62 cycles for 100*or
301 cycles for 100*imul 1 - dependent
300 cycles for 100*mul 1 - dependent
4051 cycles for 100*idiv 1 - dependent
96 cycles for 100*or - dependent
96 cycles for 100*rol 1 - dependent
97 cycles for 100*rol 8 - dependent
80 ticks for 100*imul 8
160 ticks for 100*mul 8
78 ticks for 100*shl 3
78 ticks for 100*shl 1
80 ticks for 100*imul 10
159 ticks for 100*mul 10
112 ticks for 100*lea mul10
47 ticks for 100*or
243 ticks for 100*imul 1 - dependent
238 ticks for 100*mul 1 - dependent
3272 ticks for 100*idiv 1 - dependent
75 ticks for 100*or - dependent
75 ticks for 100*rol 1 - dependent
79 ticks for 100*rol 8 - dependent
Quote from: clive on April 14, 2010, 04:33:28 PM
Quote from: jj2007
Try imul instead, it's blazingly fast on modern CPUs.
I'm familiar with the math. I'd be willing to bet the latency is still significantly worse than a barrel shifter. On the P4 it was 14 cycles as I recall, and Intel was recommending ADD/LEA instruction combinations. The latency is still 3-5 cycles,
Quote from: clive on April 14, 2010, 05:38:51 PM
Then do me a favour and measure the latency. I bet, because it beats wasting a lot of time benchmarking crap when I understand how to build barrel shifters and multipliers in hardware. I'd rather bet and get the right/better results, than measure and get the wrong one.
Quote from: clive on April 15, 2010, 04:06:25 PM
Prescott P4
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
138 cycles for 100*shl 3
255 cycles for 100*imul 10
303 cycles for 100*mul 10
249 cycles for 100*lea mul10
1030 cycles for 100*imul 1 - dependent
1242 cycles for 100*mul 1 - dependent
...
AMD Athlon(tm) 64 Processor 3200+ (SSE2)
99 cycles for 100*imul 10
301 cycles for 100*imul 1 - dependent
I see we are converging a little bit - imul latency is down from over 14 to 10 for a P4 and 3 for the NewCastle. Good.
Now the interesting question is perhaps whether latency is more relevant than throughput. Of course, since you are programming in C, you depend very much on what your compiler knows about it. Assembler coders can optimise their code.
Imagine this (rather useless) sequence:
Quote
mov eax, esi
REPEAT 100
imul eax, eax, 10
ENDM
mov esi, eax
Overall speed depends, as you rightly assume, on latency: the next imul can only start when the previous one has finished.
However, nobody in Masmland stops you from doing other things in parallel:
Quote
mov eax, esi
REPEAT 100
imul eax, eax, 10
mov edi, ebx
mov ebx, esi
inc ecx
dec edx
dec ebx
ENDM
mov esi, eax
And, surprise surprise, both sequences run in four cycles on my slow cheap Celeron M - and imul is the best overall choice...:
Quote
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
150 cycles for 100*shl 3
410 cycles for 100*shl 3 with five extra instructions
173 cycles for 100*imul 10
287 cycles for 100*mul 10
219 cycles for 100*lea mul10
471 cycles for 100*lea mul10 with five extra instructions
397 cycles for 100*imul 10 - dependent
410 cycles for 100*imul 10 - dependent with five extra instructions
496 cycles for 100*mul eax - dependent
576 cycles for 100*mul eax - dependent with five extra instructions
The problem with MUL / IMUL is with blended-model programming: everything from the i486 up to the late PIVs was expensive with multiplication operations. Later cores are supposed to handle muls differently, but if you want code that runs OK on most things you need to address how it works on PIVs and below. One of the few blessings of 64-bit OS versions is that the hardware must be late enough to run it, which means a reasonably late model processor, so you are safe with many later, faster instructions.
Code you write with some certainty about the lowest processor that will run it is a lot easier to write and should be faster than code that must run on everything.
Quote from: jj2007
I see we are converging a little bit - imul latency is down from over 14 to 10 for a P4 and 3 for the NewCastle. Good.
Yes, and the 14 is a Willamette (1st Pentium 4, and source of much derision) number, from the notes I have. I'll demonstrate if you insist. The box is at the out-laws, so not readily accessible, I could get P3 and PPro numbers more readily.
This paper has some good analysis
http://gmplib.org/~tege/x86-timing.pdf
"When compiling with /G7, we produce the faster (but longer) sequence that avoids using the imul instruction, which has a 14-cycle latency on the Intel Pentium 4:"
http://msdn.microsoft.com/en-us/library/aa290055%28VS.71%29.aspx
The problem with optimizing for one architecture is that the installed base is so divergent, so unless the application warrants multiple code paths, and the hideous amount of regression testing required to exercise all paths, one generally focuses on a reasonable solution. It makes sense for things like folding, Seti, encryption cracking, etc., but I wouldn't classify these as commercial applications.
-Clive
Quote from: jj2007
Imagine this (rather useless) sequence:
mov eax, esi
REPEAT 100
imul eax, eax, 10
ENDM
mov esi, eax
You do understand that your whole REPEAT 100 construct is totally contrived, right? Unless the fragment you test is self-contained and dependent, all you end up doing is slipstreaming the whole repeated sequence into the pipeline, and the timing of the actual sequence gets lost. You'd get a better real-world timing using a single iteration. In any situation where your timing suggests the execution speed of a sequence containing an IMUL is faster than the demonstrated speed of the instruction, you either a) do not have a dependency on the result, or b) are not getting an accurate timing.
The idea of the 100 iterations is to push the looping instruction times into the ~1% of the total measurement, but unless the fragments are carefully crafted this is not what is happening.
I code in C because it's convenient, I've been coding in about a dozen assembly languages for as long as you have.
Quote
Overall speed depends, as you rightly assume, on latency
I haven't assumed, I know this, and don't need to benchmark everything to recognize when it might occur. Many algorithms in C and ASM can be discarded with static analysis.
-Clive
Quote from: clive on April 16, 2010, 02:56:25 AM
This paper has some good analysis
http://gmplib.org/~tege/x86-timing.pdf
And it confirms that for a modern CPU, e.g. a Core 2, the latency is 3 cycles, while the throughput is 1 cycle for imul.
Quote
"When compiling with /G7, we produce the faster (but longer) sequence that avoids using the imul instruction, which has a 14-cycle latency on the Intel Pentium 4:"
http://msdn.microsoft.com/en-us/library/aa290055%28VS.71%29.aspx
That was written 8 years ago...
Quote from: clive on April 16, 2010, 04:39:50 AM
Quote
Overall speed depends, as you rightly assume, on latency
I haven't assumed, I know this, and don't need to benchmark everything
You quote rather selectively. My "overall" related to your isolated 100*imul eax, eax, 1 example, and the text continues as follows:
Quote
Overall speed depends, as you rightly assume, on latency: the next imul can only start when the previous one has finished.
However, nobody in Masmland stops you from doing other things in parallel
Which implies that latency is pretty irrelevant in an optimised Masm app. Throughput is the interesting variable.
Pick your Intel core, pick a result.
PIV 3.8 gig.
220
220
220
220
328 MS leamul10
406 MS imul10
516 MS shl10
312 MS add10
328 MS leamul10
407 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
531 MS shl10
313 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
344 MS leamul10
406 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
532 MS shl10
328 MS add10
312 MS leamul10
407 MS imul10
484 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
Average algo timing
328 MS leamul10
406 MS imul10
511 MS shl10
324 MS add10
Press any key to continue ...
Core2 Quad 3.0 gig
220
220
220
265 MS leamul10
235 MS imul10
437 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
266 MS leamul10
234 MS imul10
437 MS shl10
235 MS add10
265 MS leamul10
235 MS imul10
437 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
438 MS shl10
234 MS add10
Average algo timing
265 MS leamul10
234 MS imul10
437 MS shl10
236 MS add10
Press any key to continue ...
2.8 i7 860 quad
220
220
234 MS leamul10
249 MS imul10
390 MS shl10
312 MS add10
234 MS leamul10
250 MS imul10
390 MS shl10
296 MS add10
250 MS leamul10
234 MS imul10
390 MS shl10
312 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
297 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
297 MS add10
218 MS leamul10
218 MS imul10
390 MS shl10
281 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
297 MS add10
234 MS leamul10
234 MS imul10
374 MS shl10
296 MS add10
Average algo timing
234 MS leamul10
235 MS imul10
382 MS shl10
298 MS add10
Press any key to continue ...
This is the test piece.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
leamul10 PROTO :DWORD
imul10 PROTO :DWORD
shl10 PROTO :DWORD
add10 PROTO :DWORD
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL tim1 :DWORD
LOCAL tim2 :DWORD
LOCAL tim3 :DWORD
LOCAL tim4 :DWORD
mov tim1, 0
mov tim2, 0
mov tim3, 0
mov tim4, 0
iteration equ <100000000>
push esi
mov eax, 22
invoke leamul10,eax
print sstr$(eax),13,10
mov eax, 22
invoke imul10,eax
print sstr$(eax),13,10
mov eax, 22
invoke shl10,eax
print sstr$(eax),13,10
mov eax, 22
invoke add10,eax
print sstr$(eax),13,10
REPEAT 8
; ----------------------------------------------
invoke GetTickCount
push eax
mov esi, iteration
@@:
invoke leamul10,eax
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
add tim1, eax
print str$(eax)," MS leamul10",13,10
; ----------------------------------------------
invoke GetTickCount
push eax
mov esi, iteration
@@:
invoke imul10,eax
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
add tim2, eax
print str$(eax)," MS imul10",13,10
; ----------------------------------------------
invoke GetTickCount
push eax
mov esi, iteration
@@:
invoke shl10,eax
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
add tim3, eax
print str$(eax)," MS shl10",13,10
; ----------------------------------------------
invoke GetTickCount
push eax
mov esi, iteration
@@:
invoke add10,eax
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
add tim4, eax
print str$(eax)," MS add10",13,10
; ----------------------------------------------
ENDM
print "Average algo timing",13,10
shr tim1, 3
print str$(tim1)," MS leamul10",13,10
shr tim2, 3
print str$(tim2)," MS imul10",13,10
shr tim3, 3
print str$(tim3)," MS shl10",13,10
shr tim4, 3
print str$(tim4)," MS add10",13,10
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
leamul10 proc num:DWORD
mov eax, [esp+4]
lea eax, [eax+eax*4]
add eax, eax
ret 4
leamul10 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
imul10 proc num:DWORD
mov eax, [esp+4]
imul eax, 10
ret 4
imul10 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
shl10 proc num:DWORD
mov eax, [esp+4]
shl DWORD PTR [esp+4], 3
add eax, eax
add eax, [esp+4]
ret 4
shl10 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
add10 proc num:DWORD
mov eax, [esp+4]
mov ecx, [esp+4]
add ecx, ecx
add eax, eax
add eax, eax
add eax, eax
add eax, ecx
ret 4
add10 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Quote from: hutch-- on April 16, 2010, 12:07:39 PM
Pick your Intel core, pick a result.
Hutch, the ABI!!!! :naughty:
mov eax, [esp+4]
shl DWORD PTR [esp+4], 3 ; <<<<<<
add eax, eax
add eax, [esp+4]
:bg
Prescott:
Average algo timing
545 MS leamul10
568 MS imul10
767 MS shl10
599 MS add10
(now let's do the same for mul123 :toothy)
JJ,
> Hutch, the ABI!!!! [esp] is a memory operand. :bg
You were too quick finding the test; it had 2 blunders in it, 2 memory reads in the shift and add algos. Now it is one read, and it more closely matches the first two. Mainly this demo shows core differences.
PIV Prescott 3.8 gig.
220
220
220
220
328 MS leamul10
406 MS imul10
516 MS shl10
312 MS add10
328 MS leamul10
407 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
531 MS shl10
313 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
344 MS leamul10
406 MS imul10
500 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
532 MS shl10
328 MS add10
312 MS leamul10
407 MS imul10
484 MS shl10
328 MS add10
328 MS leamul10
406 MS imul10
516 MS shl10
328 MS add10
Average algo timing
328 MS leamul10
406 MS imul10
511 MS shl10
324 MS add10
Press any key to continue ...
Core2 Quad 3.0 gig
220
220
220
220
265 MS leamul10
235 MS imul10
265 MS shl10
266 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
265 MS leamul10
235 MS imul10
265 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
265 MS leamul10
235 MS imul10
265 MS shl10
250 MS add10
266 MS leamul10
234 MS imul10
266 MS shl10
250 MS add10
265 MS leamul10
235 MS imul10
265 MS shl10
235 MS add10
Average algo timing
265 MS leamul10
234 MS imul10
265 MS shl10
250 MS add10
Press any key to continue ...
Nehalem core i7 quad 2.8 gig
220
220
220
220
250 MS leamul10
234 MS imul10
281 MS shl10
327 MS add10
234 MS leamul10
250 MS imul10
250 MS shl10
327 MS add10
265 MS leamul10
234 MS imul10
266 MS shl10
327 MS add10
234 MS leamul10
234 MS imul10
234 MS shl10
312 MS add10
234 MS leamul10
234 MS imul10
249 MS shl10
312 MS add10
234 MS leamul10
234 MS imul10
250 MS shl10
328 MS add10
218 MS leamul10
218 MS imul10
250 MS shl10
327 MS add10
234 MS leamul10
234 MS imul10
234 MS shl10
328 MS add10
Average algo timing
237 MS leamul10
234 MS imul10
251 MS shl10
323 MS add10
Press any key to continue ...
The test piece.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
leamul10 PROTO :DWORD
imul10 PROTO :DWORD
shl10 PROTO :DWORD
add10 PROTO :DWORD
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL tim1 :DWORD
LOCAL tim2 :DWORD
LOCAL tim3 :DWORD
LOCAL tim4 :DWORD
mov tim1, 0
mov tim2, 0
mov tim3, 0
mov tim4, 0
iteration equ <100000000>
push esi
mov eax, 22
invoke leamul10,eax
print sstr$(eax),13,10
mov eax, 22
invoke imul10,eax
print sstr$(eax),13,10
mov eax, 22
invoke shl10,eax
print sstr$(eax),13,10
mov eax, 22
invoke add10,eax
print sstr$(eax),13,10
REPEAT 8
; ----------------------------------------------
invoke GetTickCount
push eax
mov esi, iteration
@@:
invoke leamul10,eax
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
add tim1, eax
print str$(eax)," MS leamul10",13,10
; ----------------------------------------------
invoke GetTickCount
push eax
mov esi, iteration
@@:
invoke imul10,eax
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
add tim2, eax
print str$(eax)," MS imul10",13,10
; ----------------------------------------------
invoke GetTickCount
push eax
mov esi, iteration
@@:
invoke shl10,eax
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
add tim3, eax
print str$(eax)," MS shl10",13,10
; ----------------------------------------------
invoke GetTickCount
push eax
mov esi, iteration
@@:
invoke add10,eax
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
add tim4, eax
print str$(eax)," MS add10",13,10
; ----------------------------------------------
ENDM
print "Average algo timing",13,10
shr tim1, 3
print str$(tim1)," MS leamul10",13,10
shr tim2, 3
print str$(tim2)," MS imul10",13,10
shr tim3, 3
print str$(tim3)," MS shl10",13,10
shr tim4, 3
print str$(tim4)," MS add10",13,10
pop esi
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
leamul10 proc num:DWORD
mov eax, [esp+4]
lea eax, [eax+eax*4]
add eax, eax
ret 4
leamul10 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
imul10 proc num:DWORD
mov eax, [esp+4]
imul eax, 10
ret 4
imul10 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
shl10 proc num:DWORD
mov eax, [esp+4]
mov ecx, eax
shl eax, 3
add eax, ecx
add eax, ecx
ret 4
shl10 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
add10 proc num:DWORD
mov eax, [esp+4]
mov ecx, eax
add ecx, ecx
add eax, eax
add eax, eax
add eax, eax
add eax, ecx
ret 4
add10 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Yep, for a P4 and mul10, imul is not the best solution; it's 25% slower. Size-wise it wins for everything other than powers of 2, of course.
Average algo timing
Prescott
365 MS leamul10
449 MS imul10
361 MS shl10
361 MS add10
Celeron M
536 MS leamul10
542 MS imul10
576 MS shl10
839 MS add10
Code size, mul10 only:
12 bytes for add10
9 bytes for shl10
5 bytes for leamul10
3 bytes for imul10
Let's examine the benchmarking of code fragments.
Method A
L = Loop time in cycles
T = Test unit in cycles
t = measured cycles per unit
Testing 100 units
| T | T | T | .. | T | T | L |
| t |
t = ((T * 100) + L) / 100
If T = 10, L = 2, then t = 10.02 cycles, error from looping is 0.2%
Method B
L = Loop time in cycles
X = Test unit in cycles billed
Y = Test unit in cycles deferred (ie pipelined)
T = X + Y Test unit in cycles actual
t = measured cycles per unit
Testing 100 units
| T |
| X | Y |
| X | Y |
| X | Y |
| X | Y |
..
| X | Y |
| X | Y | L |
| t |
t = ((X * 100) + Y + L) / 100
If X = 4, Y = 6, L = 2, then T = 10, yet t = 4.08 cycles, error 59.2% (nominally (Y / T) * 100)
Method B is not how to benchmark a code fragment; you would get better results by doing a single iteration. It is only useful for loop unwinding, and much code that processes streams of data has dependencies anyway.
The purpose of doing 100 iterations is to drop the looping overhead to 1% or below.
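The arithmetic behind Methods A and B above can be checked with a short script (function names are mine, chosen to mirror the text):

```python
def method_a(T, L, n=100):
    """n back-to-back units share one loop overhead L: t = (T*n + L) / n."""
    return (T * n + L) / n

def method_b(X, Y, L, n=100):
    """Only the billed portion X accumulates per unit; the deferred portion Y
    overlaps the next unit, so it is paid only once: t = (X*n + Y + L) / n."""
    return (X * n + Y + L) / n

# Figures from the text:
print(method_a(10, 2))                   # 10.02 -> looping error 0.2%
t_b = method_b(4, 6, 2)
print(t_b)                               # 4.08
print(round((10 - t_b) / 10 * 100, 1))   # 59.2 (% error versus T = 10)
```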
Latency is important because dependent operations cannot really happen before the answer is provided. Throughput only defines how fast you can stuff something into a pipeline: you can usually stuff a new thing in every cycle (with multiple execution units, more than one operation can start per cycle), but it will still take multiple cycles for each answer to appear.
-Clive
Clive,
I find the idea interesting, as I understand the difference between throughput and latency, but I don't know a good way to test it. The benchmark I posted above puts each method in a separate procedure and they all have the same stack and data load overhead, but it's a test of throughput, not instruction latency.
Michael has developed a set of timing macros that address small instruction counts, and within the limits of ring-3 access they work reasonably well. The only method I know that will handle small instruction counts with high precision is the one Agner Fog developed a few years ago, which ran from a boot disk after switching the processor into protected mode.
I would certainly be interested in a technique that can accurately time instruction latency over small instruction counts. Where I have had to deal with this in code, I usually set up a test piece as close to the real-world requirement as I can get, then time it against something else, usually the original algo against a modified one, so I can measure any difference with the modified version.
Quote from: clive
I don't have a Willamette P4 to hand, but the IMUL should have latency ~14, and the SHR/SHL 8 probably ~8. I'll dig one up at the weekend.
Ok, so here are results from a Willamette P4 (Celeron 1.8 GHz). Note the 14-cycle IMUL, the 4-cycle shifts, and the totally egregious processor-stalling XCHG with memory. AMD made a lot of money on the back of this brain-numbingly defective P4 design.
Empty
1596 Ticks (min), 0.053 cycles
IMUL eax, 1
418528 Ticks (min), 13.951 cycles
IDIV eax, 1
1708472 Ticks (min), 56.949 cycles
SHL eax, 1 / SHR eax, 1
118532 Ticks (min), 3.951 cycles
SHL eax, 8 / SHR eax, 8
118532 Ticks (min), 3.951 cycles
XCHG eax, [mem]
3727636 Ticks (min), 124.255 cycles
Intel(R) Celeron(R) CPU 1.80GHz (SSE2)
464 cycles for 100*imul 8
1011 cycles for 100*mul 8
99 cycles for 100*shl 3
93 cycles for 100*shl 1
464 cycles for 100*imul 10
1081 cycles for 100*mul 10
202 cycles for 100*lea mul10
57 cycles for 100*or
1482 cycles for 100*imul 1 - dependent
1616 cycles for 100*mul 1 - dependent
5860 cycles for 100*idiv 1 - dependent
37 cycles for 100*or - dependent
403 cycles for 100*rol 1 - dependent
394 cycles for 100*rol 8 - dependent
419 ticks for 100*imul 8
986 ticks for 100*mul 8
82 ticks for 100*shl 3
79 ticks for 100*shl 1
413 ticks for 100*imul 10
911 ticks for 100*mul 10
170 ticks for 100*lea mul10
49 ticks for 100*or
1255 ticks for 100*imul 1 - dependent
1455 ticks for 100*mul 1 - dependent
5224 ticks for 100*idiv 1 - dependent
33 ticks for 100*or - dependent
356 ticks for 100*rol 1 - dependent
355 ticks for 100*rol 8 - dependent
Quote from: clive
AMD made a lot of money on the back of this brain-numbingly defective P4 design.
It surprised me that Intel failed to anticipate the "wall" they would hit on clock-speed increases, and that they stuck with the plan until AMD was kicking their butt in almost every benchmark.