Instruction Timing Ticks Available Anywhere?

Started by tbandrow, September 07, 2009, 06:43:15 PM


tbandrow

Does there exist documentation that gives tick counts for various CPU instructions?  I know this is hard with pipelining, but is there anything out there that gives us a general idea of how long the various addressing modes take to decode and evaluate, on top of the instruction itself?

dedndave

first - they vary (in a few cases, drastically) from one processor to another
i have a link for a PDF that has old timings that can be used for general purpose
also, you can go to the first post of the first thread in the Laboratory Forum and get MichaelW's timing macros
that way, you can measure times

http://homepage.mac.com/randyhyde/webster.cs.ucr.edu/www.artofasm.com/DOS/pdf/apndxd.pdf

welcome to the forum, by the way

redskull

#2
IMHO Agner Fog's stuff is the best available, being based on actual experiments and not just Intel documentation.  However, as you stated, it's rarely simple enough to just 'time' the instruction and call it a day.

http://www.agner.org/optimize/instruction_tables.pdf

I just tweaked the link to make it work properly. hutch
Strange women, lying in ponds, distributing swords, is no basis for a system of government


hutch--

tbandrow,

The basic answer to your question is NO. As you mentioned, with multiple pipelines the very old notion of a cycle count for each instruction is long gone, and what has replaced it is the notion of sequence testing using instruction combinations. Think of something simple like a short loop,


    mov esi, 1000
  @@:
    add eax, ecx
    sub esi, 1
    jnz @B


What you present to the processor is a sequence of instructions that occur something like,


    add eax, ecx
    sub esi, 1
    "jump" back taken
    add eax, ecx
    sub esi, 1
    "jump" back taken
    add eax, ecx
    sub esi, 1
    "jump" back taken
    add eax, ecx
    sub esi, 1
    "jump" back taken etc ....

  ; until the counter reduces to 0

    add eax, ecx
    sub esi, 1
    "jump" not taken


The processor then schedules this sequence and streams it through the multiple pipelines as best as it can fit the instructions.

If you need to test different combinations for speed, the most reliable way is to write small test pieces where the test fits your task then compare different methods to see which clocks up faster.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

tbandrow

Quote from: redskull on September 07, 2009, 08:02:09 PM
IMHO Agner Fog's stuff is the best available, being based on actual experiments and not just Intel documentation.  However, as you stated, it's rarely simple enough to just 'time' the instruction and call it a day.


http://www.agner.org/optimize/instruction_tables.pdf


This is excellent.  Thank you.   It almost looks like it could be said that on Core2, all the average stuff takes about a tick or two, status registers are slower, and long jumps will just kill you.

tbandrow

Quote from: hutch-- on September 07, 2009, 11:54:35 PM
The basic answer to your question is NO. As you mentioned, with multiple pipelines the very old notion of a cycle count for each instruction is long gone, and what has replaced it is the notion of sequence testing using instruction combinations. Think of something simple like a short loop,

So it's really as if what we call assembly language is, in fact, a higher-level language, and the processor is not just an execution engine like it was when I took assembly in the 1980s (I date myself), but is almost an interpreter in its own right, spreading each instruction across multiple pipelines.  The complexity boggles my mind.


hutch--

I try to keep track of the technical data, but with a transistor count in excess of 125 million I have given up on the detailed technical data and try to keep up with the abstract design instead. The instruction set we see at the interface level is constructed from a lower-level set of primitives of varying speeds, with some special cases and old instructions implemented in microcode.

The speed increases were impressive, though, before the thermal limit kicked in. The Core series processors are getting more instructions through for a given clock cycle, and the coming Nehalem core stuff is supposed to get more through again.

dedndave

it is mind-boggling - i gave up trying to keep up with it years ago - lol
the one that gets me is out-of-order execution
it must be some pretty impressive code that rifles through the stream to find something to do
seems like the code would catch it, first - lol
i had a basic grasp of the 386, and was starting to work on the 486
then the pentium came out and i tossed the book in the corner - lol

Mirno

It's not so much the "assembly language" that's higher level, as the processor itself has a high and a low level.
CISC lost the war, and RISC is the champion - to that end Intel implemented a RISC core, and wrapped a converter layer around it.
The converter layer does some re-ordering itself to optimise things (the out of order stuff).

Further to this, Intel's latest offering added another "layer" between the two (RISC core, and converter) that tries to join sequences together, replacing them with a more complicated version (so CISC didn't lose after all!). It's all a bit crazy, but the basics of it are:
#1 instruction timings on a per instruction basis are largely useless.
#2 the processor is way more complicated than you think. But on an abstract basis it's still pretty easy to think about.
#3 the rules can, and probably will change again in the future. But the processor architects try to insulate the programmers from that as much as possible.

It is #3 that makes the ARM core so impressive for embedded stuff: without that consideration they can design a very small chip, but the emphasis then falls on the compiler and assembler coders instead, which in the embedded arena is the right way to go.

Mirno