Latency and Throughput WOW!!!

Started by OceanJeff32, February 28, 2005, 06:23:52 AM


OceanJeff32

Oh man, look at the timings... I've got Appendix C of the IA-32 Architecture Optimization Manual here at work tonight, and geez, some of these instructions are just fast as lightning, whereas the multiplies and divides really are slow.

You are definitely right Mark, MMX and SSE/SSE2 instructions carry a lot of latency (i.e., clock cycles).

Amazing to actually know this information.

Anybody know where to get latency and programming information for the Nintendo GameCube? (That is, information that doesn't cost $5000 or more.)

Correct me if I'm wrong, but it looks like it's quicker to MOV data from memory whenever you need it than it is to keep it in the MMX/XMM registers.

The book says MOV has a latency of 0.5 to 1 cycle, and only suffers a +1 hit when loading from memory. (That's still 4 to 4.5 clock cycles less than moving data to and from an MMX register.)
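A rough MASM-style sketch of that comparison (the labels and the exact cycle counts are assumptions from the table, not measured here):

```asm
; Direct: load straight from memory each time it is needed.
mov  eax, dword ptr [src]     ; ~0.5-1 cycle latency, +1 for the load
mov  ebx, dword ptr [src+4]

; Round trip: park the data in an MMX register first.
movd mm0, dword ptr [src]     ; load into MMX
movd eax, mm0                 ; several cycles of latency to get it back out
```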

Oh well,

Later guys (and gals)

Jeff C
:U
Any good programmer knows, every large and/or small job, is equally large, to the programmer!

hutch--

Jeff,

On later hardware MMX has just about had it, so if you have access, use the XMM versions, as they are not shared with the FP capacity either. SIMD has its advantages in parallel processing and is probably best used for that task, whereas if you want integer instructions (the normal stuff), try to use the preferred instruction set, which on later hardware averages a half cycle per instruction if you can schedule them correctly without stalls.
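A minimal sketch of that sharing point (label names assumed): the MMX registers alias the x87 FP stack, so MMX code must fence itself off with EMMS before any FP code runs, while the XMM registers are a separate file and need no such cleanup.

```asm
; MMX version - aliases the x87 stack, so EMMS is required afterwards.
movq  mm0, qword ptr [srcA]
paddd mm0, qword ptr [srcB]
movq  qword ptr [dst], mm0
emms                               ; restore the FP state for later FP code

; XMM version - independent of the x87 stack, no EMMS needed.
movdqa xmm0, xmmword ptr [srcA16]  ; 16-byte aligned operands assumed
paddd  xmm0, xmmword ptr [srcB16]
movdqa xmmword ptr [dst16], xmm0
```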

I cannot help you with the GameCube as I don't know what is in it. If you know the processor, you MAY be able to get some data from the manufacturer.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php


Mark_Larson

Quote from: OceanJeff32 on February 28, 2005, 06:23:52 AM
Oh man, look at the timings... I've got Appendix C of the IA-32 Architecture Optimization Manual here at work tonight, and geez, some of these instructions are just fast as lightning, whereas the multiplies and divides really are slow.

You are definitely right Mark, MMX and SSE/SSE2 instructions carry a lot of latency (i.e., clock cycles).

Amazing to actually know this information.

Anybody know where to get latency and programming information for the Nintendo GameCube? (That is, information that doesn't cost $5000 or more.)

Correct me if I'm wrong, but it looks like it's quicker to MOV data from memory whenever you need it than it is to keep it in the MMX/XMM registers.

The book says MOV has a latency of 0.5 to 1 cycle, and only suffers a +1 hit when loading from memory. (That's still 4 to 4.5 clock cycles less than moving data to and from an MMX register.)

Oh well,

Later guys (and gals)

Jeff C
:U


Well, to make things even more confusing, Intel introduced the trace cache on the P4. It replaces the L1 instruction cache. The trace cache holds already-decoded micro-ops, so if an instruction is already in the trace cache, it skips decoding and its effective latency and throughput are lower. And there is one more twist: the trace cache can also issue 3 micro-ops a cycle.

So look at the following code. Assuming the XMM registers don't force a stall from a read-after-write or write-after-write dependency, the 16 lines of code run in 16 processor cycles! If you look up the latency and throughput for MULPS and ADDPS, they are 6/2 and 4/2 respectively, yet because of the trace cache they each execute in 1 processor cycle. Did that make sense at all? I am working on other things, so I wasn't sure how well I explained it.

Because the trace cache can issue 3 micro-ops a cycle, you can actually see lower latency on instructions. That is why timing your code is so important. Just looking up the cycle count for each instruction and adding them all up is not going to tell you how fast your code runs. I saw farabi doing that earlier. There are so many things that affect how fast code runs: register renaming, out-of-order execution, pipelines, stalls, cache hits, cache misses, the trace cache, etc.



addps xmmreg, xmmreg
addps xmmreg, xmmreg
mulps xmmreg, xmmreg
mulps xmmreg, xmmreg

addps xmmreg, xmmreg
addps xmmreg, xmmreg
mulps xmmreg, xmmreg
mulps xmmreg, xmmreg

addps xmmreg, xmmreg
addps xmmreg, xmmreg
mulps xmmreg, xmmreg
mulps xmmreg, xmmreg

addps xmmreg, xmmreg
addps xmmreg, xmmreg
mulps xmmreg, xmmreg
mulps xmmreg, xmmreg
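A minimal RDTSC timing sketch along the lines Mark describes (the loop count and label are assumptions; run it several times and keep the lowest reading to dampen cache and scheduling noise):

```asm
mov   esi, 1000000        ; iteration count (assumed)
rdtsc                     ; read the time-stamp counter
mov   ebx, eax            ; save low dword of the start count
timeloop:
    addps xmm0, xmm1      ; code under test
    mulps xmm2, xmm3
    dec   esi
    jnz   timeloop
rdtsc
sub   eax, ebx            ; elapsed cycles (low dword), divide by iterations
```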

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm