
stalling a register?

Started by thomas_remkus, August 26, 2007, 01:42:25 AM


thomas_remkus

I'm reading about optimizations and they are mentioning that you should avoid stalling a register. What is this? And what's the punishment for a stall? Loss of cycles?

u

A stall happens when you request data from a register immediately after it has been loaded with data:
mov eax,[myVar]
add ecx,eax ;<- stall

This makes the older CPUs skip cycles. The newest could skip cycles too, if the surrounding code can't be rearranged for out-of-order execution.
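
One way to avoid the stall in the snippet above is to put an unrelated instruction between the load and the first use of the loaded register. The following is only a minimal sketch in MASM32-style syntax: the procedure name LoadThenUse, the variable otherVar and the data values are made up for illustration, and how much (if anything) the reordering gains depends on the CPU.

```asm
.386
.model flat, stdcall
option casemap:none

.data
myVar    dd 5               ; placeholder value for the variable in the snippet
otherVar dd 7               ; hypothetical second variable, for illustration only

.code

; LoadThenUse is a hypothetical name for this sketch.
; It returns myVar + otherVar in eax.
LoadThenUse proc
    xor ecx, ecx            ; running total
    xor edx, edx            ; scratch register for the unrelated work
    mov eax, [myVar]        ; the load is issued here
    add edx, [otherVar]     ; independent work: does not read eax
    add ecx, eax            ; first use of eax, one instruction later
    add ecx, edx            ; fold in the unrelated result
    mov eax, ecx            ; return value
    ret
LoadThenUse endp

end
```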

thomas_remkus

I think I understand. So the processor can attempt to put two instructions on the chip at the same time and do two instructions.

Ian_B

Quote from: thomas_remkus on August 26, 2007, 02:42:09 AM
I think I understand. So the processor can attempt to put two instructions on the chip at the same time and do two instructions.
Kind of. I guess you're reading Agner Fog's optimisation literature from the questions you're asking.  :U

The Pentium line of processors has two Arithmetic/Logic Units on-chip. They are not entirely identical: one is more "general-purpose" and some instructions can only go through that ALU, but most instructions will go through either. So much of the time both ALUs will be processing simultaneously. The Pentium instruction pipeline is built so that instructions can be reordered by a scheduler to take advantage of this, to some extent regardless of the order you wrote them in. In particular (and related to your other query), the scheduler is smart enough to know if an instruction relies on a result from a register that hasn't finished being processed yet, which would cause the ALU to stall waiting for that result.

In general, you really needn't worry too much about those things unless you are writing highly speed-critical loops and are prepared to analyse your code in minute detail and perform uop counts. Even then you would have to write separate versions of that code for every minute variation of P4/Pentium/Core2 that counts those uops differently for the same instruction or has a different size of instruction cache and data cache, and then do the same for all the flavours of AMD processors you might encounter out there. Just ensure that you don't use a register immediately after you've updated a value in it. If you can, insert other instructions that don't depend on that value before the use, to get the advantage of that "simultaneous" processing.

This is good:

XOR EAX, EAX
MOV ECX, 1
MOV EDX, 2      ; something else to do before using ECX
ADD EAX, ECX    ; uses ECX, and is something else to do before using EDX
ADD EDX, EDX

But this is not:

XOR EAX, EAX
MOV ECX, 1
ADD EAX, ECX    ; stalls waiting for the update to the value in ECX
MOV EDX, 2
ADD EDX, EDX    ; stalls waiting for the update to the value in EDX


Also try to keep in mind the latency (time to finish) of instructions. Tiny arithmetic instructions like those complete very fast indeed (1/2 a clock cycle, which is why you may see references to "quad-pumping" of instructions, where you have 2 ALUs performing 2 instructions per cycle). If you are doing slower instructions such as shifts/rotates or memory accesses, then pad the space between the slow instruction and the use of its result with more instructions that use other registers. Another good method is to deliberately pair that instruction with another slow one that can complete in the same time on the other ALU, so that overall the code doesn't wait unnecessarily.

Good:

MOV EAX, 1
MOV ECX, 1
MOV EDX, 1
SHL EAX, 5      ; slow instruction, use of EAX will have to wait
ROL ECX, 5      ; slow paired instruction: both ALUs complete at the same time, no waiting
ADD EAX, EDX
ADD ECX, EDX


That may require more thought about how you design and arrange your code and use registers effectively to pair instructions where possible, but that is the real art involved in good ASM programming that makes it so much fun...   :green2
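
As a slightly larger illustration of keeping independent work in separate registers, here is a rough sketch of summing an array with two accumulators, so that consecutive ADDs do not depend on each other. The procedure name SumPairs and its parameters are hypothetical, the sketch assumes the element count is even and non-zero, and whether it beats a single-accumulator loop on a given CPU is something you would have to time yourself.

```asm
.386
.model flat, stdcall
option casemap:none

.code

; SumPairs is a hypothetical helper for this sketch.
; pArr    = address of an array of DWORDs
; nDwords = number of elements (assumed even and non-zero)
; Returns the 32-bit sum in eax.
SumPairs proc uses esi pArr:DWORD, nDwords:DWORD
    mov esi, pArr
    mov ecx, nDwords
    xor eax, eax            ; accumulator 1
    xor edx, edx            ; accumulator 2
@@:
    add eax, [esi]          ; chain 1: depends only on eax
    add edx, [esi+4]        ; chain 2: depends only on edx
    add esi, 8              ; advance past the two elements just read
    sub ecx, 2              ; two elements consumed per iteration
    jnz @B
    add eax, edx            ; combine the two partial sums
    ret
SumPairs endp

end
```

The two ADDs inside the loop write different registers, so neither has to wait for the other's result; only the final ADD joins the two chains.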

IanB

Rockoon

Also note that the shift instructions cannot be out-of-order executed prior to any instruction that writes to the flags register, because they break register renaming of the flags register (shift instructions perform a read/modify/write of the flags).

Most integer instructions do write to the flags register (all of the math and bitwise instructions do), so shifts should usually be frowned upon when an alternative exists (such as using add eax, eax instead of shl eax, 1).
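
For illustration only, the sketch below multiplies a value by 8 without any shift instruction. It uses the add eax, eax substitution mentioned above plus LEA with a scaled index, which does not modify the flags at all. The procedure name TimesEight is made up, and whether these forms are actually faster than a shift depends on the processor, so treat it as something to measure rather than a rule.

```asm
.386
.model flat, stdcall
option casemap:none

.code

; TimesEight is a hypothetical helper for this sketch.
; Returns value * 8 in eax without using SHL.
TimesEight proc value:DWORD
    mov eax, value
    add eax, eax            ; *2, same result as shl eax, 1
    lea eax, [eax*4]        ; *4 on top of that; LEA leaves the flags untouched
    ret
TimesEight endp

end
```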
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

thomas_remkus

Yes. I'm reading Agner Fog's papers. They are over my head but I'm trying to work through them anyways.

So, by using one register I'm not able to split to the other ALU. I understand that, thanks. Are there cases where a CPU has more than just two ALUs? And do the MMX/floating registers complete in the same way?


Rockoon

Yes, there are CPUs with more than 2 "ALUs".

On the Core 2, execution ports 0, 1, and 5 can all handle simple integer operations (adds, subs, and bitwise ops), but only port 1 can handle integer multiplication.

AMD64s also have 3 execution units that can handle the simple integer operations (more inclusive than the Core 2), but only ALU0 can handle integer multiplication.
