I'm reading about optimizations and they are mentioning that you should avoid stalling a register. What is this? And what's the punishment for a stall? Loss of cycles?
Reading a register immediately after it has been loaded with data:
mov eax,[myVar]
add ecx,eax ;<- stall
makes older CPUs skip cycles. Newer ones can lose cycles too, if the surrounding code can't be rearranged for out-of-order execution.
I think I understand. So the processor can attempt to put two instructions on the chip at the same time and do two instructions.
Quote from: thomas_remkus on August 26, 2007, 02:42:09 AM
I think I understand. So the processor can attempt to put two instructions on the chip at the same time and do two instructions.
Kind of. I guess from the questions you're asking that you're reading Agner Fog's optimisation literature.
The Pentium line of processors has two arithmetic/logic units (ALUs) on-chip. They are not identical: one is more general-purpose and some instructions can only go through that ALU, but most instructions can go through either, so much of the time both ALUs are working simultaneously. On the original Pentium the two pipes (U and V) pair adjacent instructions in program order; from the Pentium Pro onwards, instructions can also be reordered by a scheduler to take advantage of this, to some extent regardless of the order you wrote them in. In particular, the scheduler knows (related to your other query) when an instruction relies on a result in a register that hasn't finished being computed yet, which would otherwise leave the ALU stalled waiting for that result.
In general, you really needn't worry too much about those things unless you are writing highly speed-critical loops and are prepared to analyse your code in minute detail, count uops, and then write separate versions of that code for every variation of Pentium/P4/Core 2 that counts those uops differently for the same instruction or has a different instruction-cache and data-cache size, and then do the same for all the flavours of AMD processor you might encounter. Just make sure that you don't use a register immediately after you've updated its value; if you can, put other instructions that don't depend on that value in between, to get the advantage of that "simultaneous" processing.
This is good:
XOR EAX, EAX
MOV ECX, 1
MOV EDX, 2      ; do something else before using ECX
ADD EAX, ECX    ; do something else before using EDX
ADD EDX, EDX
But this is not:
XOR EAX, EAX
MOV ECX, 1
ADD EAX, ECX    ; stalls, waiting for the update to ECX
MOV EDX, 2
ADD EDX, EDX    ; stalls, waiting for the update to EDX
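The same idea shows up one level higher. Here is a minimal sketch in C (not from this thread; the function names `sum_one_chain` and `sum_two_chains` are made up for illustration) that sums an array once with a single accumulator and once with two independent ones. Whether the second form is actually faster depends on your compiler and CPU, so treat it as an illustration of dependency chains rather than a guaranteed speed-up.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add needs the result of the previous add,
   so the adds form a single dependency chain. */
uint32_t sum_one_chain(const uint32_t *data, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Two accumulators: the adds into a and b do not depend on each
   other, so a CPU with two or more ALUs is free to overlap them. */
uint32_t sum_two_chains(const uint32_t *data, size_t n)
{
    uint32_t a = 0, b = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a += data[i];
        b += data[i + 1];
    }
    if (i < n)              /* odd element left over */
        a += data[i];
    return a + b;
}
```

Both functions return the same value for the same input (unsigned arithmetic wraps identically in each); only the shape of the dependency chain differs.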
Also try to keep in mind the latency (time to finish) of instructions. Tiny, fast arithmetic instructions like those complete very quickly indeed (half a clock cycle on the Pentium 4, whose two simple ALUs run at double speed; that is why you may see references to "quad-pumping" of instructions, with 2 ALUs each performing 2 instructions per cycle). If you are using slower instructions such as shifts/rotates or memory accesses, pad the gap between the slow instruction and the use of its result with more instructions that use other registers. Another good method is to deliberately pair the slow instruction with another slow one that can complete in the same time on the other ALU, so that overall the code doesn't wait unnecessarily.
Good:
MOV EAX, 1
MOV ECX, 1
MOV EDX, 1
SHL EAX, 5      ; slow instruction, use of EAX will have to wait
ROL ECX, 5      ; slow paired instruction: both ALUs complete at the same time, no waiting
ADD EAX, EDX
ADD ECX, EDX
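If it helps to see what that snippet computes, here is a rough C trace of it. This is only a sketch: `rotl32` and `paired_shift_rotate` are hypothetical helper names, and a C compiler will pick its own instructions and ordering, so it shows the data flow (two independent chains through EAX and ECX), not the timing.

```c
#include <stdint.h>

/* Rotate left by r bits; masking keeps both shift counts in 0..31. */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31u;
    return (x << r) | (x >> ((32u - r) & 31u));
}

/* Hypothetical helper mirroring the snippet:
   out[0] plays the role of EAX, out[1] the role of ECX. */
void paired_shift_rotate(uint32_t out[2])
{
    uint32_t eax = 1, ecx = 1, edx = 1;

    eax <<= 5;              /* SHL EAX, 5 */
    ecx = rotl32(ecx, 5);   /* ROL ECX, 5: does not depend on EAX */
    eax += edx;             /* ADD EAX, EDX */
    ecx += edx;             /* ADD ECX, EDX */

    out[0] = eax;
    out[1] = ecx;
}
```

Neither the shift nor the rotate reads the other's result, which is what lets the two slow instructions overlap in the assembly version.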
That may require more thought about how you design and arrange your code and use registers effectively to pair instructions where possible, but that is the real art involved in good ASM programming that makes it so much fun...
IanB
Also note that shift instructions cannot be executed out of order ahead of an instruction that writes to the flags register, because they break register renaming of the flags register (a shift performs a read/modify/write of the flags).
Most integer instructions write to the flags register (all of the math and bitwise instructions do), so shifts are usually best avoided when an alternative exists (such as using add eax, eax instead of shl eax, 1).
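As a quick sanity check that the two forms really are interchangeable, here is a small C sketch (the function names are made up for illustration). It only demonstrates that the results match for 32-bit unsigned values; which instruction the compiler actually emits for either function is up to the compiler.

```c
#include <stdint.h>

/* Doubling written as an add (the "add eax, eax" form). */
uint32_t double_with_add(uint32_t x)
{
    return x + x;
}

/* Doubling written as a shift (the "shl eax, 1" form). */
uint32_t double_with_shift(uint32_t x)
{
    return x << 1;
}
```

Both wrap the same way when the top bit is set, so the substitution is safe for any 32-bit value.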
Yes. I'm reading Agner Fog's papers. They are over my head but I'm trying to work through them anyways.
So, by using one register I'm not able to split to the other ALU. I understand that, thanks. Are there cases where a CPU has more than just two ALUs? And do the MMX/floating registers complete in the same way?
Yes, there are CPUs with more than 2 "ALUs".
On the Core 2, execution ports 0, 1, and 5 can all handle simple integer operations (adds, subs, and bitwise ops), but only port 1 can handle integer multiplication.
AMD64 processors also have 3 execution units that can handle simple integer operations (a broader set than on the Core 2), but only ALU0 can handle integer multiplication.