Proving Performance ...

James Ladd · July 22, 2010, 10:03:23 PM

Hi All,

It has been a while since I have visited this forum and I hope everyone is happy and well ? :bg

I want to prove the performance of branch hints so I can see if making a modification to a well-known tool
is worthwhile, specifically for fixed loops (from 1 to 10 do something). My instinct suggests that branch
hints are a worthwhile optimization because they were added by Intel to the instruction set and because some
research papers also suggest this optimization is helpful
(citation: Importance of Explicit Vectorization for CPU and GPU Software Performance, Neil G. Dickson, Kamran Karimi, Firas Hamze).

I don't have a great deal of time as I am working full time and I also have a side project building a Smalltalk compiler for the
Java Virtual Machine (http://redline.st)

What I want is to pay someone to do this investigation and provide the code here for others to learn from.
I will be providing a detailed set of requirements, but I am also interested in opinions from you on how to best prove the
impact of branch hints.

In essence I would like a simple application that does the following:

1. Accepts input of how many threads to create
2. Accepts input of how many cycles (loops) each thread should do
3. Creates a start-gate barrier (counted semaphore?) that each thread will wait on before executing.
(So when a thread is created and starts, it waits at this barrier for all other threads to be created.)
4. Creates a stop-gate barrier (counted semaphore?) that each thread will wait on when the loop
is completed.
5. When the start-gate barrier in #3 is released (all threads waiting) then all threads will go into
the loop for 'n' cycles specified in #2.
Do you think there should be a set of instructions executed here as some busy work, or just the
loop?
6. When each thread has completed the looping it will wait at the stop-gate for all other threads.
7. When the stop-gate barrier in #4 is released all threads will output the time taken to process
the loop and exit.
8. The program will exit.

*** a version of the application will use branch hints in the loop and a version will not. ***
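
A minimal sketch of how steps 1 to 8 could fit together, assuming C with POSIX threads and barriers (the thread does not fix a language or platform). The names run_benchmark, worker and worker_arg are hypothetical, clock_gettime stands in for whatever timing library is finally chosen, and the loop body is placeholder busy work where the hinted or unhinted variant would go.

```c
/* Sketch only: run_benchmark, worker and worker_arg are hypothetical names.
   Assumes POSIX threads with barrier support (e.g. Linux); build with -pthread. */
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    pthread_barrier_t *start_gate;  /* step 3 */
    pthread_barrier_t *stop_gate;   /* step 4 */
    uint64_t cycles;                /* step 2 */
    uint64_t elapsed_ns;            /* filled in by the worker */
} worker_arg;

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

static void *worker(void *p) {
    worker_arg *a = p;
    volatile uint64_t sink = 0;           /* volatile keeps the loop from being optimised away */
    pthread_barrier_wait(a->start_gate);  /* step 5: released once every thread has arrived */
    uint64_t t0 = now_ns();
    for (uint64_t i = 0; i < a->cycles; i++)
        sink += i;                        /* placeholder work; the variant under test goes here */
    a->elapsed_ns = now_ns() - t0;
    pthread_barrier_wait(a->stop_gate);   /* step 6: wait for all other threads to finish */
    return NULL;
}

/* Runs nthreads workers for `cycles` iterations each and stores one duration
   per thread in elapsed_ns[0..nthreads-1]. Returns 0 on success, -1 on bad input. */
int run_benchmark(unsigned nthreads, uint64_t cycles, uint64_t *elapsed_ns) {
    if (nthreads == 0 || elapsed_ns == NULL)
        return -1;
    pthread_t *tids = malloc(nthreads * sizeof *tids);
    worker_arg *args = malloc(nthreads * sizeof *args);
    if (!tids || !args) {
        free(tids);
        free(args);
        return -1;
    }
    pthread_barrier_t start_gate, stop_gate;
    pthread_barrier_init(&start_gate, NULL, nthreads);
    pthread_barrier_init(&stop_gate, NULL, nthreads);
    for (unsigned i = 0; i < nthreads; i++) {
        args[i] = (worker_arg){ &start_gate, &stop_gate, cycles, 0 };
        /* A failed create is fatal here: the other threads could never pass the gate. */
        if (pthread_create(&tids[i], NULL, worker, &args[i]) != 0)
            abort();
    }
    for (unsigned i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
        elapsed_ns[i] = args[i].elapsed_ns;  /* step 7: the caller reports the times */
    }
    pthread_barrier_destroy(&start_gate);
    pthread_barrier_destroy(&stop_gate);
    free(tids);
    free(args);
    return 0;
}
```

A caller would read the thread count and cycle count from the command line, call run_benchmark once per variant, and print the per-thread times. In this sketch the caller reports the times after joining rather than each thread printing after the stop gate, which keeps output from interleaving.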

I'm guessing something like Agner Fog's timing library would be used in the loop to produce accurate timing
of the loop, given that it will probably be too fast for ordinary high-resolution timers.

Anyone up for this paid challenge ? I'll be posting the results and the code here for the community
afterwards.

Rgs, James.

hutch-- · July 23, 2010, 12:42:59 AM

James,

I am hard to get work out of, but I can pass you a comment or two on the PIV era branch hints: design your loop code so that they are correctly predicted as far as possible and forget about branch hinting as it varies from one processor to another. Use your Intel specified preferred instruction set as far as possible, and if you are forced by code design to use some of the older and slower instructions, minimise their usage to reduce the impact on performance.

James Ladd · July 23, 2010, 01:01:25 AM

Quote: design your loop code so that they are correctly predicted as far as possible and forget about branch hinting as it varies from one processor to another.

Sagely advice.

Assume I have optimized the loop as best as possible, are you saying that branch hints will not help
performance?

Quote: forget about branch hinting as it varies from one processor to another

Varies how? For example, on one processor it doesn't work?

Quote: Intel specified preferred instruction

So Intel have a set of instructions they prefer you use for looping?
Can you point these out?

I really appreciate the advice and I know you are reliable with your information, but I'm still going to need to prove this theory.

Rgs, James.

dedndave · July 23, 2010, 01:11:22 AM

Quote: Assume I have optimized the loop as best as possible,
are you saying that branch hints will not help performance?

if the first statement is true, hints will only slow you down :P

James Ladd · July 23, 2010, 01:11:59 AM

Two interesting pieces of information about branch hints:

http://software.intel.com/en-us/articles/quantify-the-penalty-of-branch-misprediction-on-64-bit-architecture/
http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/

James Ladd · July 23, 2010, 01:33:16 AM

Quote: if the first statement is true, hints will only slow you down :P

Ah - so true.

Maybe for what I am doing the best approach is to determine how to avoid branching rather than
providing a hint, as suggested by Hutch and the articles.
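
As one illustration of avoiding a branch rather than hinting it, here is a small sketch in C. The function names and the "sum the values at or above a threshold" task are invented for the example and are not from the linked articles. The second version replaces the data-dependent `if` with a mask, so there is nothing in the loop body for the predictor to get wrong.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: the `if` depends on the data, so random input mispredicts often. */
int64_t sum_at_least_branchy(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= threshold)
            sum += v[i];
    }
    return sum;
}

/* Branch-free version: the comparison yields 0 or 1, negation turns that
   into 0 or all-ones, and the AND keeps or discards the value. */
int64_t sum_at_least_branchless(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t mask = -(int64_t)(v[i] >= threshold);
        sum += (int64_t)v[i] & mask;
    }
    return sum;
}
```

An optimising compiler may already turn the first form into a conditional move or vector code, so any difference has to be measured rather than assumed.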

I'd still be interested in the example code being done.

Rgs, James.

hutch-- · July 23, 2010, 01:37:28 AM

James,

You are stuck with having to read the current Intel manual set for the i7 series; it applies to the i5 and i3 as well. The manuals also contain a lot of data on legacy processors, Core2 series, PIV and even some earlier design work. Jumps, either conditional or unconditional, are predicted backwards unless they have been used in the forward direction for a number of iterations. Read up on the theory of branch prediction buffers and how to optimise for them.

Branch hints are old PIV technology. Even if you can get them to work on a PIV, which is usually doubtful, the technique will not work on either later or earlier hardware, or for that matter on AMD hardware. General-purpose code optimisation is a massive set of compromises to work across an arbitrarily selected range of processors. You tend to address the most common hardware available at any given time, which includes at least some legacy hardware, and produce the type of design that runs OK on most of it.

Alternatively you write code for different families of processors and use a CPUID detection routine to pick which is which.
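
A rough sketch of that dispatch idea in C, assuming GCC or Clang on x86. It uses the compiler built-ins `__builtin_cpu_init` and `__builtin_cpu_supports`, which wrap CPUID, and checks a single feature (POPCNT) as a stand-in for a full processor-family check. The function names are hypothetical and bit counting is only a convenient placeholder routine.

```c
#include <stdint.h>

typedef unsigned (*popcount_fn)(uint64_t);

/* Portable fallback: clears the lowest set bit on each pass. */
static unsigned popcount_generic(uint64_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Built for CPUs that have POPCNT; only reached after the runtime check below. */
__attribute__((target("popcnt")))
static unsigned popcount_hw(uint64_t x) {
    return (unsigned)__builtin_popcountll(x);
}
#endif

/* Picks an implementation once, based on what the running CPU reports. */
popcount_fn select_popcount(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("popcnt"))
        return popcount_hw;
#endif
    return popcount_generic;
}
```

The selection would normally run once at start-up, with the returned pointer stored and reused, so the detection cost stays out of the hot loop.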

You have some reading to do.

James Ladd · July 23, 2010, 02:18:55 AM

I'm reading :)

jj2007 · July 23, 2010, 06:10:13 AM

James,

Quote: GCC has support for this feature, but it has turned out to not gain
anything and was disabled by default, since branch reordering streamlines
code well enough to match the default predictor behaviour.
Same conclusion was done by other compiler teams too, ICC is not
generating the hints either.

(source)

If compiler developers abandon them, you might as well...
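
For reference, the source-level relative of this in GCC and Clang is `__builtin_expect`. By default it does not emit hint prefixes; it tells the compiler which way a branch usually goes so the likely path can be laid out as the fall-through, which is the branch reordering the quote refers to. A minimal sketch, with hypothetical LIKELY/UNLIKELY macros and an invented checked_sum routine:

```c
#include <stddef.h>

#if defined(__GNUC__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)   /* other compilers: no layout hint, same semantics */
#define UNLIKELY(x) (x)
#endif

/* Sums the array, returning -1 as soon as a negative entry is seen.
   The negative case is marked as the rare path. */
long checked_sum(const int *v, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(v[i] < 0))
            return -1;
        sum += v[i];
    }
    return sum;
}
```

The macro changes only the expected direction, never the result, so timing the same function with and without it is a fair comparison.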

James Ladd · July 23, 2010, 06:21:57 AM

I think I'm agreeing with you.

The key to the performance gain I want is to analyse the code and restructure it in accordance
with the optimization suggestions for pipelining and stall elimination.

Thanks everyone.
