peculiar result

Started by denise_amiga, June 05, 2005, 09:35:55 PM


MichaelW

Aero,

I am getting 0.5 instructions per cycle with the dependency. Even though each instruction can go through either of two ports, because of the dependency only one port will be active at a time, and the instructions will execute sequentially.

Agner Fog's optimization manual and other related stuff is available here:

http://www.agner.org/assem/


Paul,

When I paired the ADC instructions using different registers I was expecting this to eliminate the dependency, but it clearly did not. I now think the dependency is due to the carry flag, where one instruction cannot complete until the previous instruction has set/cleared the flag.

I tried pairing an ADC reg,reg with an independent LEA reg,mem to test this.

time "adc ecx,ebx            "
time "db 8Dh,5,0,50h,40h,0,13h,0cbh" ; lea eax,[405000h] adc ecx,ebx

adc ecx,ebx             : 196 cycles
db 8Dh,5,0,50h,40h,0,13h,0cbh : 202 cycles

As you can see, the LEA appears to be running in parallel with the ADC.
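To make the carry-flag point concrete, here is a minimal sketch, not taken from the original tests. The procedure name carry_chain is hypothetical and the module layout is only an assumption about how one might assemble it with ML. The destination registers alternate, yet every ADC still reads the carry flag written by the ADC before it, so the chain should remain.

```asm
.686
.model flat, stdcall
option casemap:none

.code
; Hypothetical illustration: different destination registers,
; but each ADC consumes CF produced by the previous ADC.
carry_chain proc
    adc eax, edx        ; reads CF, writes eax and CF
    adc ecx, edx        ; different destination, but waits on CF from above
    adc eax, edx        ; waits on CF (and eax) again
    adc ecx, edx        ; and so on
    ret
carry_chain endp
end
```

By contrast, LEA neither reads nor writes the flags, which is consistent with it running alongside the ADC in the timing above.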

QvasiModo,

It was not my intention to imply that the instructions must be identical to form a dependency chain, just that the repeated instructions in these tests are identical, and for that reason they do form a dependency chain. I have not thoroughly analyzed this, but it seems that non-identical instructions may form a dependency chain, and identical instructions will form a dependency chain.
eschew obfuscation

QvasiModo

What I actually mean is that dependency chains are formed by any instructions operating on the results of previous instructions. Repeating instructions does not by itself form a dependency chain... at least as far as I know, a series of NOPs will never form one, for example.

Also, some instructions are optimized: at least on the Pentium 4, a XOR of a register with itself doesn't form a dependency chain (the Intel manual actually recommends it for breaking chains).

Maybe I'm missing something here? :dazzled:
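A small sketch of the distinction, assuming MASM syntax; the procedure name chain_demo is made up for illustration. The first group repeats one instruction and chains through eax, the second group uses different destinations and has no data dependency between its instructions, and the NOPs have no operands at all.

```asm
.686
.model flat, stdcall
option casemap:none

.code
; Hypothetical illustration of dependent vs. independent sequences.
chain_demo proc
    ; dependent: each ADD needs the eax produced by the one before
    add eax, 1
    add eax, 1
    add eax, 1

    ; independent: no instruction reads another's result
    add eax, 1
    add ecx, 1
    add edx, 1

    ; repeated, but nothing to depend on
    nop
    nop
    nop
    ret
chain_demo endp
end
```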

MichaelW

AFAIK NOP and XORing a register with itself are special cases that are handled as special cases. Most instructions that take two operands have a destination operand, so if you have identical instructions you have identical destination operands, and by my, admittedly simple, reasoning you have a dependency chain. Why would the processors have been designed otherwise, given that repeated identical instructions are not common in real-world code? But before I insert my foot too far into my mouth, I need to do some more testing :lol




QvasiModo

This is what I mean:

Quote from: denise_amiga on June 05, 2005, 09:35:55 PM

  xor ecx,ecx    sub ecx,ecx     mov ecx,0      and ecx,0
      109 ms         67 ms          66 ms          110 ms


xor ecx, ecx - no dependency chain (optimized internally).
sub ecx, ecx - not sure, maybe it's optimized too?
mov ecx, 0 - no dependency chain (register renaming).
and ecx, 0 - dependency on the previous value of ecx (to set the flags).

I find it strange that xor was slower than sub and mov though.

The second set of instructions tested does form a dependency chain, so we were discussing different things it seems (man I hate it when that happens! :bg)

Another factor to consider: how large is the code? The larger, the more memory has to be read.
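For reference, a sketch of the four zeroing forms with their usual 32-bit encoding sizes noted in the comments (MASM syntax assumed; the procedure name zero_idioms is hypothetical). The dependency notes just restate the reasoning above and are not measurements.

```asm
.686
.model flat, stdcall
option casemap:none

.code
; Hypothetical illustration: four ways to zero ecx.
zero_idioms proc
    xor ecx, ecx        ; 2 bytes, writes flags
    sub ecx, ecx        ; 2 bytes, writes flags
    mov ecx, 0          ; 5 bytes (imm32), never reads ecx, flags untouched
    and ecx, 0          ; 3 bytes (imm8), reads the old ecx
    ret
zero_idioms endp
end
```

So in a long unrolled test the mov form occupies more than twice the code bytes of xor or sub.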

Mark_Larson

Quote from: AeroASM on June 06, 2005, 07:32:34 PM
Quote from: MichaelW on June 06, 2005, 06:32:28 PM
And for the ADC reg,reg:

On a P3 it generates 2 uops that can go through port 0 or 1, so with a dependency chain the throughput should be 0.5 instructions per cycle, which is what I am getting.

Surely 2 instructions per cycle, or 0.5 cycles per instruction?


He's talking about the "average" instruction rate. On a P3 the ALUs can issue two 1-cycle instructions per cycle in PARALLEL, which averages out to 0.5 cycles per ALU instruction. The instructions themselves still take 1 cycle each, but since two can execute at the same time through port 0 and port 1, you get the effect of a 0.5-cycle average. The rate is usually quoted the other way round as IPC (instructions per cycle), which here is 2. On a P4 it's even better, since you can issue up to 4 ALU instructions in parallel assuming no dependencies, for an effective rate of 0.25 cycles per ALU instruction.

Quote from: AeroASM on June 06, 2005, 07:32:34 PM
Where can I find out about uops? Also what are port 0 and port 1? Are they the U and V pipelines?

Instructions get fetched from memory and decoded into micro-ops (uops). Each instruction is composed of one or more micro-ops, and instructions that take fewer micro-ops generally run faster. There is a trick on the P3: arranging instructions in a 4-1-1 micro-op pattern is the fastest way to execute code. That is one 4 micro-op instruction followed by two 1 micro-op instructions. If I remember right they all run in parallel. This is no longer the fastest way on a P4, but it still applies to the Pentium M.
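A rough sketch of what a 4-1-1 group might look like, assuming MASM syntax. The variable counter and the procedure name pattern_411 are hypothetical, and the micro-op counts in the comments are assumed from published P6-family instruction tables such as Agner Fog's, not measured here.

```asm
.686
.model flat, stdcall
option casemap:none

.data
counter dd 0                ; hypothetical memory operand

.code
; Hypothetical 4-1-1 group for a P6-family core.
pattern_411 proc
    add counter, eax        ; read-modify-write on memory: assumed 4 uops
    add ecx, edx            ; register-register ALU op: assumed 1 uop
    mov eax, edx            ; register move: assumed 1 uop
    ret
pattern_411 endp
end
```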

Processors have multiple ports that allow them to execute stuff in parallel. On both the P3 and the P4, ALU instructions can go through Port 0 or Port 1. That is how you can execute two instructions in parallel on a P3: each ALU instruction goes to a different port. That is the only way you can truly execute stuff in parallel. This is also how the old optimization trick of mixing FP and ALU code works: the FP code goes through Port 1 and the ALU code goes through Port 0, so they can execute in parallel. The same thing applies to mixing ALU and MMX/SSE/SSE2 code, since MMX/SSE/SSE2 all go through Port 1. There are 4 ports on a P4, and once per cycle the core can dispatch micro-ops to up to 4 of them. Since the ALUs are double pumped, you can issue two ALU instructions per cycle on each of Port 0 and Port 1. The downside is that not all ALU instructions can be double pumped, something Intel didn't really say a lot about during their whole marketing campaign for the double-pumped ALU.

Here are the ALU instructions that support double pumping on a P4:

mov, movsb, movxb, movzw, neg, not, nop, add, sub, and, or, xor, cmp, test

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm