Micro ops and latencies?

Started by JPlayer, December 15, 2006, 06:50:36 PM

JPlayer

Hi. I have a function that I'm trying to optimize to beat the compiler-generated code, and I'm confused about how I'm supposed to know which instructions are faster than others. Optimization websites say things like: blah takes 2 micro ops and blah2 takes 3 micro ops, so it's better to use blah. Where did they get that information? I have all the Intel processor specification manuals by my side and I can't seem to find it, even in the Intel optimization manual. Am I just not searching hard enough, are they guessing, or are they getting that information from some other Intel source? Thanks in advance. My function is less than a microsecond slower than the compiler-generated function, so I'm REALLY close to getting a speedup. I just need to know which instructions I should be using for optimal results.

dsouza123

Because of speculative execution and register renaming, it is more useful
to write different versions and time them.
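A minimal sketch of that "write two versions and time them" approach, in C rather than MASM so it stays portable. The function names (`sum_simple`, `sum_unrolled`, `best_time`) are made up for illustration, and `clock()` is only a coarse stand-in for the cycle counters people usually use for this kind of work, so treat any numbers it produces as rough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Version A (hypothetical): plain byte sum, one element per iteration. */
uint32_t sum_simple(const uint8_t *p, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Version B (hypothetical): four independent accumulators, which
   shortens the dependency chain between successive additions. */
uint32_t sum_unrolled(const uint8_t *p, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += p[i];
    return s0 + s1 + s2 + s3;
}

typedef uint32_t (*sum_fn)(const uint8_t *, size_t);

/* Call fn `reps` times and return the lowest CPU time (in seconds)
   seen for a single call. The last computed sum is stored in *result
   so the two versions can be checked against each other. */
double best_time(sum_fn fn, const uint8_t *p, size_t n, int reps,
                 uint32_t *result)
{
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        clock_t t0 = clock();
        volatile uint32_t v = fn(p, n);   /* volatile: keep the call */
        clock_t t1 = clock();
        double dt = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || dt < best)
            best = dt;
        *result = v;
    }
    return best;
}
```

Taking the best of several runs, rather than the average, is one common way to reduce noise from interrupts and cache warm-up. Always check that both versions return the same result before comparing their times.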

Alignment (code and data), the number of instructions in a 16 byte block,
(partial/full register) stalls due to dependencies, careful covering of memory access latency,
cache friendly data, using registers when possible, and algorithm optimizations
all contribute more to a speedup than the micro op timings.

Agner's optimization work may give you some ideas.

dsouza123

The level below assembly instructions differs greatly between CPU generations
and manufacturers, so a change may be a positive optimization for a particular CPU or CPU generation
but a negative one for others.

Also, using SSE2, SSE, MMX, or 3DNow!, if available, can give a speedup
because one instruction operates on several data elements in parallel.
For example SSE2 can work on 16 bytes in parallel.
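As a rough illustration of the 16-bytes-at-a-time idea, here is a sketch in C using the SSE2 intrinsics from `<emmintrin.h>`. The function name `add_bytes` is made up for this example, and it falls back to a plain loop when the compiler does not report SSE2 support. `_mm_add_epi8` corresponds to the `paddb` instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n bytes, with the
   usual wrap-around modulo 256. */
void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if HAVE_SSE2
    /* One packed add handles 16 byte lanes per iteration. Unaligned
       loads and stores are used so the buffers need no special alignment. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when SSE2 is not available). */
    for (; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```

With 16-byte-aligned buffers the aligned load/store forms could be used instead. Whether that is measurably faster depends on the CPU, so it is worth timing.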

Mark_Larson

I also have a webpage of 60 assembly language optimization tricks broken up into Beginner, Intermediate, and Advanced.

http://www.mark.masmcode.com/
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

stanhebben

You could also use a profiler to see which parts of the code are the bottlenecks, and you can run simulations to see which instructions take a lot of time.

AMD has a free profiler called CodeAnalyst. Intel has one too, though I don't know if it is free. Of course, the AMD profiler is better suited to optimizing for AMD processors, and the Intel one for Intel processors.

Vortex

JPlayer,

You can download Agner Fog's optimization manuals from :

http://www.agner.org/optimize/

hutch--

JPlayer,

The reference material you have been pointed at is all good stuff, and if you can take it in it will help you write faster code. We also have a laboratory for tuning code, so if you can break what you are after into digestible pieces and post them there, you may get some assistance from the members. There are a lot of very skilled people floating around who know a lot about making code faster, so it will probably work for you as long as you make it understandable.

JPlayer

Hi. Thanks for the replies. I HAVE been doing the stuff you guys mentioned, however. I've been trying to pay close attention to memory accesses to prevent stalls, I've been using SSE2, I've been looking at manuals (mainly yours, Mark), and I've been testing ideas. I cleaned up my code a little more and my function is now about 70 microseconds faster than what the compiler generates, which makes me happy (especially because now I have at least SOMETHING I can show my bosses to prove that this effort is worthwhile :) lol). However, I still need it to be A LOT faster. They say that ideally the runtime should be around 0.1 or 0.2 milliseconds. I'm at about 1.96 milliseconds right now. I can't get to the goal with this function alone, but I can still drastically improve the speed. Hutch, I think I probably will submit some of my code to the laboratory soon. However, I would appreciate it if someone could answer my original question: how do you know the speeds of certain instructions, so you know which ones to use? People are getting this information somewhere and I'd really like to know where, because it WILL be of the utmost importance for this particular program since the runtime has to be so insanely fast. If it makes a difference, we are running on an Intel Xeon processor. Thanks.

dsouza123

XEON ?

Which type ?
The new Core 2 microarchitecture or the previous NetBurst?

They are completely different, there is almost nothing in common.
The areas to target for optimization have changed dramatically.

As already suggested,
Agner's microarchitecture.pdf and instruction_tables.pdf files show uops for various instructions.
http://www.agner.org/optimize/

The laboratory is the way to go. Sometimes people come up with amazingly fast alternatives;
using a lookup table instead of calculations comes to mind, as do alternative instructions that replace
a piece of code of three or four instructions with one or two.
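To make the lookup-table idea concrete, here is a small sketch in C that counts the set bits in a 32-bit value two ways: by calculation (a shift-and-test loop) and with a 256-entry table. The names `bit_count_loop`, `bit_count_lut`, and `init_bit_count_table` are invented for this example and do not come from any code posted in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Calculation: test one bit per iteration, up to 32 iterations. */
unsigned bit_count_loop(uint32_t x)
{
    unsigned count = 0;
    while (x) {
        count += x & 1u;
        x >>= 1;
    }
    return count;
}

/* Table of bit counts for every byte value; filled in once at startup. */
static uint8_t bit_count_table[256];

void init_bit_count_table(void)
{
    /* bits(v) = lowest bit of v + bits(v >> 1); entry 0 is already 0
       and v >> 1 is always an entry that has been filled in. */
    for (int v = 1; v < 256; v++)
        bit_count_table[v] = (uint8_t)((v & 1) + bit_count_table[v >> 1]);
}

/* Lookup: four table reads and three adds, with no loop or branches.
   init_bit_count_table() must have been called first. */
unsigned bit_count_lut(uint32_t x)
{
    return (unsigned)bit_count_table[x & 0xFF]
         + bit_count_table[(x >> 8) & 0xFF]
         + bit_count_table[(x >> 16) & 0xFF]
         + bit_count_table[x >> 24];
}
```

The table costs 256 bytes of cache, so whether it actually wins depends on how hot it stays in the surrounding code. That is another reason to time both versions in place.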

When you are blinded by the code, others with fresh eyes can see the possibilities.

Many times from a completely different perspective.