I was doing some simple timing tests and noticed that movaps, and most of the SSE floating-point instructions, were running slowly.
My code is below. I ran one test with a bunch of NOPs, another with various integer register additions, another with SSE integer instructions, and another with SSE floating-point instructions.
The results are in comments above each block of code.
My computer is an Intel Core 2, so it can execute multiple instructions at once (it's superscalar).
For NOPs and most other instructions I can get 3 instructions through per cycle (timing with rdtsc),
but with the SSE floating-point operations I can only get 1 through per cycle.
Can anyone make sense of this?
NOPRetireTest_A PROC
; retires 10 billion instructions (10,000,000 iterations x 1000 unrolled instructions)
mov rax, 10000000
@@:
CNTR = 0
WHILE CNTR LT 250
;; retires 3 per clock
;;nop
;;nop
;;nop
;;nop
;; retires 3 per clock
;;inc r8
;;inc r9
;;inc r10
;;inc r11
;; retires 1 per clock - makes sense because of heavy dependencies
;;inc r8
;;inc r8
;;inc r8
;;inc r8
;; retires 3 per clock
;;pxor xmm0, xmm4
;;pxor xmm1, xmm4
;;pxor xmm2, xmm4
;;pxor xmm3, xmm4
;; retires 3 per clock
;;movdqa xmm0, xmm4
;;movdqa xmm1, xmm4
;;movdqa xmm2, xmm4
;;movdqa xmm3, xmm4
;; only retires 1 per clock. WHY???
;;xorps xmm0, xmm4
;;xorps xmm1, xmm4
;;xorps xmm2, xmm4
;;xorps xmm3, xmm4
;; only retires 1 per clock. WHY???
movaps xmm0, xmm4
movaps xmm1, xmm4
movaps xmm2, xmm4
movaps xmm3, xmm4
CNTR = CNTR + 1
ENDM
dec rax
jnz @b
RET
NOPRetireTest_A ENDP
END
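The timing caller isn't shown above, so here is a minimal sketch of how the measurement might look with rdtsc. The TimeIt name and the cpuid serialization are illustrative assumptions, not the exact code from the post:
TimeIt PROC
    push rdi                 ; preserve rdi (non-volatile in Win64)
    xor eax, eax
    cpuid                    ; serialize before reading the counter
    rdtsc                    ; start count in EDX:EAX
    shl rdx, 32
    or rax, rdx              ; combine into a 64-bit start count
    mov rdi, rax             ; save start (rdi survives the test routine)
    call NOPRetireTest_A     ; run the 10-billion-instruction loop above
    rdtsc                    ; end count in EDX:EAX
    shl rdx, 32
    or rax, rdx
    sub rax, rdi             ; rax = elapsed cycles
    pop rdi
    RET                      ; instructions/cycle = 10,000,000,000 / rax
TimeIt ENDP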
i think the question is - how much work is getting done in 1 cycle :P
Not all operations/instructions have multiple execution units available, which is what parallel execution requires. The latency of the execution units also differs.
You should also note that the CPU keeps track of the data types used in the XMM registers, and a mismatch can cause an enormous performance loss.
Have you read Intel's and/or AMD's optimization manuals?
qWord
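To illustrate the latency point, a minimal sketch (not from the thread; per Agner Fog's tables, mulps on Core 2 has roughly 4-cycle latency but ~1-per-cycle throughput):
;; dependent chain: each mulps waits for the previous result,
;; so it completes about 1 mulps per 4 cycles
mulps xmm0, xmm0
mulps xmm0, xmm0
mulps xmm0, xmm0
mulps xmm0, xmm0
;; independent operations: the multiplier pipelines them,
;; approaching 1 mulps per cycle
mulps xmm0, xmm4
mulps xmm1, xmm4
mulps xmm2, xmm4
mulps xmm3, xmm4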
Shouldn't they have multiple execution units for floating point? Especially something as simple as independent mov operations?
hmm, maybe not. I just ran the following test interleaving movaps and addps, and it ran at almost 1.25 operations per cycle:
movaps xmm0, xmm4
addps xmm1, xmm4
movaps xmm2, xmm4
addps xmm3, xmm4
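A variant worth timing (sketch only, not measured here) is each stream on its own, to see which instruction is the limiter:
;; movaps alone - observed above at 1 per clock
movaps xmm0, xmm4
movaps xmm1, xmm4
movaps xmm2, xmm4
movaps xmm3, xmm4
;; addps alone - likely ~1 per clock, since there is a single FP add unit
addps xmm0, xmm4
addps xmm1, xmm4
addps xmm2, xmm4
addps xmm3, xmm4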
Is this the optimization manual you are talking about?
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
I think that the difference between movdqa and movaps has something to do with the data types: probably movaps does some check on the data/values for validity.
i think Agner Fog said something about this...
use movdqa for integer data
use movaps for floating point data
something to do with domain crossing penalties :P
i haven't figured out what the hell he's talking about, yet - lol
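Roughly, the idea looks like this (my sketch of Agner Fog's point; the exact penalties vary by microarchitecture):
;; integer data kept in the integer domain - no bypass delay
movdqa xmm0, xmm4
pxor   xmm0, xmm5
;; floating-point data kept in the FP domain - no bypass delay
movaps xmm1, xmm4
addps  xmm1, xmm5
;; crossing domains: feeding an FP result into an integer operation
;; can cost an extra cycle or two of bypass latency on some CPUs
addps  xmm2, xmm4
pxor   xmm2, xmm5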
On my AMD Phenom II I get 3 per clk (or 2 per clk) as stated in the AMD Optimization Guide.
What is the expected throughput of your CPU? The document you link to has this information. Check Tables C-10 and C-10a for your version of the processor. Some only expect 1 per clk.
Paul.
From Table C-10a, my processor is in the column labeled 06_25/2C/1A/1E/1F/2E.
(It's actually an Intel Xeon, not a Core 2 as I said above, but all the columns are the same for this instruction.)
It says that movaps has a latency of 1 and a throughput of 0.33 cycles, i.e., up to 3 per clock. That's what I expected, but I still only observe a throughput of 1 per clock.
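To put numbers on that (my arithmetic, using the 10-billion-instruction count from the first post): at 0.33 cycles per movaps the whole test should take about 10,000,000,000 x 0.33 = 3.3 billion cycles, but at 1 per clock it takes roughly 10 billion cycles — a 3x gap.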