
movaps running very slow

Started by ASMManiac, March 13, 2012, 05:15:20 AM


ASMManiac

I was doing some simple timing tests and noticed that movaps, and most of the SSE floating-point instructions, were running slowly.

My code is below. I tested a run of NOPs, then various integer register additions, then SSE integer instructions, and finally SSE floating-point instructions.
I wrote the results in comments above each block of code below.
My computer is an Intel Core 2, so it can execute multiple instructions at once (superscalar processing).
For NOPs and most other instructions I can get 3 instructions through per cycle (timing with rdtsc),
but with the SSE floating-point operations I can only get 1 through per cycle.
Can anyone make sense of this?

NOPRetireTest_A PROC

    ; retires 10 billion NOPS
    mov rax, 10000000
@@:
   
    CNTR = 0
    WHILE CNTR LT 250
        ;; retires 3 per clock
        ;;nop
        ;;nop
        ;;nop
        ;;nop

        ;; retires 3 per clock
        ;;inc r8
        ;;inc r9
        ;;inc r10
        ;;inc r11

        ;; retires 1 per clock - makes sense because of heavy dependencies
        ;;inc r8
        ;;inc r8
        ;;inc r8
        ;;inc r8

        ;; retires 3 per clock
        ;;pxor xmm0, xmm4
        ;;pxor xmm1, xmm4
        ;;pxor xmm2, xmm4
        ;;pxor xmm3, xmm4

        ;; retires 3 per clock
        ;;movdqa xmm0, xmm4
        ;;movdqa xmm1, xmm4
        ;;movdqa xmm2, xmm4
        ;;movdqa xmm3, xmm4

        ;; only retires 1 per clock.  WHY???
        ;xorps xmm0, xmm4
        ;xorps xmm1, xmm4
        ;xorps xmm2, xmm4
        ;xorps xmm3, xmm4

        ;; only retires 1 per clock.  WHY???
        movaps xmm0, xmm4
        movaps xmm1, xmm4
        movaps xmm2, xmm4
        movaps xmm3, xmm4

        CNTR = CNTR + 1
    ENDM
    dec rax
    jnz @b
    RET
NOPRetireTest_A ENDP
END
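The post mentions timing with rdtsc but does not show the harness. A minimal sketch of one way to do it is below, assuming ml64 and the Windows x64 calling convention. TimeRetireTest is a hypothetical name that is not in the original code, and the procedure would have to sit in the same module before the END directive. Note that on CPUs with a constant-rate time-stamp counter, TSC ticks are not necessarily core clock cycles when the clock frequency changes.

```asm
; Hypothetical timing wrapper (an assumption, not the poster's harness).
; Returns in rax the number of TSC ticks spent inside NOPRetireTest_A.
TimeRetireTest PROC
    push rbx                ; cpuid clobbers rbx, which is callee-saved
    push rsi                ; rsi holds the start count across the call
    sub  rsp, 40            ; shadow space plus 16-byte stack alignment

    xor  eax, eax
    cpuid                   ; serialize: earlier instructions finish first
    rdtsc                   ; edx:eax = time-stamp counter
    shl  rdx, 32
    or   rax, rdx
    mov  rsi, rax           ; rsi = start count

    call NOPRetireTest_A

    xor  eax, eax
    cpuid                   ; serialize again before the second read
    rdtsc
    shl  rdx, 32
    or   rax, rdx
    sub  rax, rsi           ; rax = elapsed ticks

    add  rsp, 40
    pop  rsi
    pop  rbx
    ret
TimeRetireTest ENDP
```

Dividing the known instruction count of the test by the returned tick count gives the instructions-per-tick figures quoted in the comments.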

dedndave

i think the question is - how much work is getting done in 1 cycle   :P

qWord

Multiple execution units, which are required for parallel execution, are not available for all operations/instructions. The latency of the execution units also differs.
You should also note that the CPU keeps track of the data types used in the XMM registers, and a mismatch can cause an enormous performance loss.
Have you read Intel's and/or AMD's optimization manuals?

qWord
FPU in a trice: SmplMath
It's that simple!

ASMManiac

Shouldn't they have multiple execution units for floating point?  Especially something as simple as independent mov operations?

Hmm, maybe not. I just ran the following test interleaving movaps and addps, and it ran at almost 1.25 operations per cycle:
        movaps xmm0, xmm4
        addps xmm1, xmm4
        movaps xmm2, xmm4
        addps xmm3, xmm4

Is this the optimization manual you are talking about?
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

qWord

I think that the difference between movdqa and movaps has something to do with the data types: probably movaps does some check on the data/values for validity.

dedndave

i think Agner Fog said something about this...
use movdqa for integer data
use movaps for floating point data
something to do with domain crossing penalties   :P
i haven't figured out what the hell he's talking about, yet - lol
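As a rough illustration of the advice above, the sketch below pairs each move with instructions from the same domain and then shows the mixed case the advice says to avoid. The register choices are arbitrary and the fragment is not from the thread. Whether a mismatch costs anything, and how much, depends on the microarchitecture, so treat it as something to measure rather than a guaranteed penalty.

```asm
    ; Integer data: move and operate with integer-domain instructions.
    movdqa xmm0, xmm4       ; integer move
    paddd  xmm0, xmm5       ; packed 32-bit integer add
    pxor   xmm0, xmm6       ; integer XOR

    ; Floating-point data: move and operate with float-domain instructions.
    movaps xmm1, xmm4       ; single-precision move
    addps  xmm1, xmm5       ; packed single-precision add
    xorps  xmm1, xmm6       ; float-domain XOR

    ; Mixed domains: a float-domain move feeding an integer add.
    ; The result is the same bits, but the data crosses domains.
    movaps xmm2, xmm4
    paddd  xmm2, xmm5
```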

dioxin

On my AMD Phenom II I get 3 per clk (or 2 per clk) as stated in the AMD Optimization Guide.

What is the expected throughput of your CPU? The document you link to has this information. Check tables C10 and C10a for your version of processor. Some only expect 1 per clk.

Paul.


ASMManiac

From table C10a, my processor is in the column labeled:

06_25/
2C/1A/
1E/1F/
2E

(It's actually an Intel Xeon, not a Core 2 as I said above, but all columns are the same for this instruction.)
It says that movaps has a latency of 1, but a throughput of 0.33 cycles per instruction, i.e. 3 per clock.  That's what I expected, but I still only observe a throughput of 1.
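To compare a measurement against the table, it helps to convert ticks into instructions per tick. NOPRetireTest_A retires 10,000,000 x 250 x 4 = 10^10 instructions in the unrolled body (the dec/jnz pair is ignored here), so a table value of 0.33 would predict roughly 3.3 x 10^9 ticks, while 1 per clock would give about 10^10. The helper below is a hypothetical sketch of that conversion, assuming ml64; InstrPerTickX100 is not a name from the thread.

```asm
; Hypothetical helper (an assumption, not from the thread).
; In:  rcx = elapsed ticks measured with rdtsc around NOPRetireTest_A
; Out: rax = instructions per tick, scaled by 100
;      (a value near 300 means about 3 per tick, near 100 about 1)
InstrPerTickX100 PROC
    xor  eax, eax           ; return 0 if no ticks were measured
    test rcx, rcx
    jz   done               ; avoid dividing by zero
    mov  rax, 1000000000000 ; 10^10 instructions times 100
    xor  edx, edx           ; rdx:rax is the 128-bit dividend
    div  rcx                ; rax = (10^10 * 100) / ticks
done:
    ret
InstrPerTickX100 ENDP
```

Scaling by 100 keeps everything in integer registers while still showing two decimal places of resolution.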

