The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: OceanJeff32 on February 28, 2005, 06:23:52 AM

Title: Latency and Throughput WOW!!!
Post by: OceanJeff32 on February 28, 2005, 06:23:52 AM
Oh man, look at the timings... I've got Appendix C of the IA-32 Architecture Optimization Reference Manual here at work tonight, and geez, some of these instructions are as fast as lightning, whereas the multiplies and divides really are slow.

You are definitely right, Mark: MMX and SSE/SSE2 instructions carry a lot of latency (i.e., clock cycles).

Amazing to actually know this information.

Anybody know where to get latency and programming information for the Nintendo GameCube? (That is, information that doesn't cost $5,000 or more.)

Correct me if I'm wrong, but it looks like it's quicker to MOV data from memory whenever you need it than it is to store it in the MMX/XMM registers.

The book says MOV has a latency of 1 or 0.5, and only suffers a +1 penalty when loading from memory. (That's still 4 to 4.5 fewer clock cycles than moving data to and from an MMX register.)

Oh well,

Later guys (and gals)

Jeff C
:U
Title: Re: Latency and Throughput WOW!!!
Post by: hutch-- on February 28, 2005, 08:29:07 AM
Jeff,

On later hardware MMX has just about had it, so if you have access to SSE, use the XMM versions, as they are not shared with the x87 FP registers either. SIMD has its advantages in parallel processing and is probably best used for that task. If you want integer instructions (the normal stuff), try to use the preferred instruction set, which on later hardware averages half a cycle per instruction if you can schedule it correctly without stalls.

I cannot help you with the GameCube as I don't know what is in it. If you know the processor, you MAY be able to get some data from the manufacturer.
Title: Re: Latency and Throughput WOW!!!
Post by: Mirno on February 28, 2005, 02:21:09 PM
A quick Google search turned up this on the GameCube:
http://www.gc-linux.org/docs/yagcd/

Mirno
Title: Re: Latency and Throughput WOW!!!
Post by: Mark_Larson on February 28, 2005, 08:05:48 PM
Quote from: OceanJeff32 on February 28, 2005, 06:23:52 AM
Oh man, look at the timings... I've got Appendix C of the IA-32 Architecture Optimization Reference Manual here at work tonight, and geez, some of these instructions are as fast as lightning, whereas the multiplies and divides really are slow.

You are definitely right, Mark: MMX and SSE/SSE2 instructions carry a lot of latency (i.e., clock cycles).

Amazing to actually know this information.

Anybody know where to get latency and programming information for the Nintendo GameCube? (That is, information that doesn't cost $5,000 or more.)

Correct me if I'm wrong, but it looks like it's quicker to MOV data from memory whenever you need it than it is to store it in the MMX/XMM registers.

The book says MOV has a latency of 1 or 0.5, and only suffers a +1 penalty when loading from memory. (That's still 4 to 4.5 fewer clock cycles than moving data to and from an MMX register.)

Oh well,

Later guys (and gals)

Jeff C
:U


Well, to make things even more confusing, Intel introduced the trace cache on the P4. It replaces the L1 instruction cache, and it holds already-decoded micro-ops. So if something is already in the trace cache, then since it is already decoded, the latency and throughput are lower for that instruction. And there is one more twist: it can issue 3 micro-ops per cycle.

So look at the following code. Assuming the XMM registers don't force a stall from a read-after-write or a write-after-write dependency, the 16 lines of code run in 16 processor cycles! If you look up the latency and throughput for MULPS and ADDPS, they are 6/2 and 4/2 respectively, yet because of the trace cache they each execute in 1 processor cycle. Did that make sense at all? I am working on other things, so I wasn't sure how well I explained it.

Because the trace cache can issue 3 micro-ops per cycle, you can actually get lower effective latency on instructions. That is why timing your code is so important. Just looking up the cycle count for each instruction and then adding them all up is not going to tell you how fast your code runs; I saw farabi doing that earlier. There are so many things that affect how fast code runs: register renaming, out-of-order execution, pipelines, stalls, cache hits, cache misses, the trace cache, etc.



addps xmmreg, xmmreg
addps xmmreg, xmmreg
mulps xmmreg, xmmreg
mulps xmmreg, xmmreg

addps xmmreg, xmmreg
addps xmmreg, xmmreg
mulps xmmreg, xmmreg
mulps xmmreg, xmmreg

addps xmmreg, xmmreg
addps xmmreg, xmmreg
mulps xmmreg, xmmreg
mulps xmmreg, xmmreg

addps xmmreg, xmmreg
addps xmmreg, xmmreg
mulps xmmreg, xmmreg
mulps xmmreg, xmmreg