Meta branch predictor Core2?

Started by theunknownguy, June 24, 2010, 08:14:28 AM

theunknownguy

Does anybody know how Core2 recognises whether it is in a loop or not?

From my point of view:

It records the conditional jump and stores it in its own personal buffer (not the global one), and does the same for the next conditional jumps. It makes a simple prediction based on trial and error (I don't know how many times). If the branch is strongly taken many times (N-1) to the same address within a "delta", then it would "assume" that the branch is part of a loop.

How it knows whether it's a real loop or not is, I think, done by "delta" prediction.

A loop could contain many conditional jumps, but only one of them repeats N times at the highest address. So in this case:

.Repeat
   add edx, 1
   .Repeat
     add eax, 1
   .Until (eax == 5)
.Until (edx == 10)


It's a nested loop. It would check the first branch (in this case the nested loop), save its jump address, do the trial-and-error prediction and save the result. Later it would check the next branch (the main loop) and do the same process. Now the magic should come when it notices that branch 1 is inside the "range" of branch 2, both with a 100% prediction (N times taken, less one). Then we can assume it's a loop nested in another loop (which really doesn't matter, it's just a loop in the end).
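
A toy model of the "trial and error" idea, as a Python sketch and not a description of what Core2 really does: a per-branch entry counts how many times the branch is taken before it falls through, and once it has seen the same count twice in a row it predicts the exit. The class name `LoopPredictor`, the "seen twice" rule and the 4-taken-then-exit trace are all assumptions made for illustration.

```python
class LoopPredictor:
    """Toy per-branch loop predictor (hypothetical, for illustration only)."""

    def __init__(self):
        self.current = 0        # taken outcomes seen since the last not-taken
        self.learned = None     # trip count observed on the last completed run
        self.confident = False  # True once the same trip count was seen twice in a row

    def predict(self):
        # Predict the exit only when confident and the learned count is reached;
        # otherwise assume the loop branch is taken.
        if self.confident and self.current == self.learned:
            return False
        return True

    def update(self, taken):
        if taken:
            self.current += 1
        else:
            self.confident = (self.learned == self.current)
            self.learned = self.current
            self.current = 0


def count_mispredictions(predictor, outcomes):
    """Feed a list of taken/not-taken outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Assumed trace of an inner loop branch: taken 4 times, then not taken,
# with the whole inner loop entered 10 times.
inner_branch = ([True] * 4 + [False]) * 10
```

With this trace, `count_mispredictions(LoopPredictor(), inner_branch)` misses only the first two exits, while a predictor that always says "taken" would miss all ten.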

But:


.Repeat
   add edx, 1
   .Repeat
     add eax, 1
     jnz @2
   .Until (eax == 5)
.Until (edx == 10)
@2:


It should read the first branch, check its address and find out that the first branch doesn't point inside the range of the "strongly taken" prediction that the next 2 branches have, meaning it's not considered part of a loop.
(Note that the first branch is taken, and that helps to predict whether it's part of the loop or not.)
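
The "range" check in this theory can be sketched in Python as below. It only models the idea in this post: a backward branch closes a loop body, and another branch inside that body either stays inside or jumps out. The addresses and the helper names (`loop_bodies`, `is_nested`, `leaves_body`) are hypothetical, not anything taken from Intel documentation.

```python
def loop_bodies(branches):
    """Return (start, end) address ranges closed by backward branches.

    branches is a list of (branch_address, target_address) pairs.
    """
    return [(target, address) for address, target in branches if target <= address]


def is_nested(inner, outer):
    """True if the inner (start, end) range lies entirely within the outer one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def leaves_body(branch, body):
    """True if the branch sits inside the body but jumps to an address outside it."""
    address, target = branch
    start, end = body
    return start <= address <= end and not (start <= target <= end)


# Hypothetical addresses for the second example in the post.
INNER_UNTIL = (0x10, 0x08)   # .Until (eax == 5), jumps back to the inner .Repeat
OUTER_UNTIL = (0x18, 0x00)   # .Until (edx == 10), jumps back to the outer .Repeat
JNZ_EXIT = (0x0C, 0x20)      # jnz @2, jumps past both loops
```

Here `loop_bodies([JNZ_EXIT, INNER_UNTIL, OUTER_UNTIL])` yields the two loop ranges, `is_nested` confirms the inner one sits inside the outer one, and `leaves_body(JNZ_EXIT, ...)` is true for both, which is the "not considered a loop" case.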

Replace JNZ with JE and it would be recognised as a loop.



But really I don't know. If anybody has another good theory or opinion, all are welcome to discuss.

hutch--

Knowing the specific guts of each core type is a very complex thing to learn. On Intel processors since the PIII era, conditional jumps are predicted as NOT TAKEN until they ARE TAKEN and are usually predicted to jump backwards. This gives you a loop design much like this.


  mov counter, number
label:
  ; more code
  sub counter, 1
  jnz label                ; mispredicted 1st time then predicted to jump backwards.


Now forward jumps are both not predicted AND jump the wrong way, so unless it's a bypass within a loop it will not be predicted well, but then it may not matter if it only executes once.
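
For reference, the static rule usually described for this situation is "backward taken, forward not taken". A minimal Python sketch of that rule follows, assuming the predictor looks only at the direction of the jump; the function name and the addresses are made up for illustration, and how a given core applies the rule (if at all) should be checked against timings.

```python
def static_prediction(branch_address, target_address):
    """Return True (predict taken) for a backward branch, False for a forward one.

    Assumption: the static predictor looks only at the jump direction.
    """
    return target_address <= branch_address


# Hypothetical addresses: the loop branch jumps back to its label,
# the bypass jumps forward over some code.
loop_branch_taken = static_prediction(0x401020, 0x401000)
bypass_taken = static_prediction(0x401010, 0x401030)
```

Under that assumption the `jnz label` at the bottom of the loop is guessed as taken, and a forward bypass is guessed as not taken.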
Download site for MASM32: https://masm32.com
New MASM Forum: https://masm32.com/board/index.php

clive

Wouldn't it be easier just to mark the branches with the appropriate TAKEN/NOT TAKEN prefix opcodes that Intel created for this purpose?

http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/
It could be a random act of randomness. Those happen a lot as well.

jj2007

Quote from: clive on June 24, 2010, 03:17:53 PM
Wouldn't it be easier just to mark the branches with the appropriate TAKEN/NOT TAKEN prefix opcodes that Intel created for this purpose?

http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/


Would be easier if it worked. Most compiler developers have given up on them - no effect, sometimes even negative.

MichaelW

And in the linked document:
Quote
It is not recommended that a programmer use these instructions, as they add slightly to the size of the code and are static hints only. It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints.
eschew obfuscation

theunknownguy

Quote from: clive on June 24, 2010, 03:17:53 PM
Wouldn't it be easier just to mark the branches with the appropriate TAKEN/NOT TAKEN prefix opcodes that Intel created for this purpose?

http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/


Core2 doesn't support such prefix opcodes to help the branch predictor... (not sure, though)

But on P1 to P4 they work quite well...

Quote
It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints

Strongly disagree with that. If you could help the predictor start with TAKEN in a loop that requires it, then I guess you save many mispredictions in the worst case, where you start in the BTB at Weakly Taken / Weakly Not Taken.
But I guess Intel put it there as a precaution. Not knowing how to handle that prefix could make the branch predictor even worse... apart from adding 1 byte to the predecode...
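
To make the "Weakly Taken / Weakly Not Taken" starting point concrete, here is a small Python sketch of a textbook 2-bit saturating counter. It is an assumption that this is a fair model of one BTB entry, and the 9-taken-then-exit loop trace is invented for the example.

```python
STATES = ["strongly not taken", "weakly not taken", "weakly taken", "strongly taken"]


def run_two_bit_counter(outcomes, state):
    """Run a 2-bit saturating counter over taken/not-taken outcomes.

    state is an index into STATES (0..3). Returns (mispredictions, final_state).
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step towards the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses, state


# Assumed loop branch: taken 9 times, then not taken on exit.
loop_branch = [True] * 9 + [False]
misses_by_start = [run_two_bit_counter(loop_branch, s)[0] for s in range(4)]
```

In this model the starting state only changes the count by one or two mispredictions per cold start, which gives a feel for how much a static hint could gain at best.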

jj2007

Quote from: theunknownguy on June 24, 2010, 06:30:28 PM
Quote
It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints

Strongly disagree with that

So you strongly disagree with Intel engineers. On which basis? Can you quote own experience/timings etc, or point us to articles or sites that support your strong statement?

theunknownguy

Quote from: jj2007 on June 24, 2010, 08:08:48 PM
Quote from: theunknownguy on June 24, 2010, 06:30:28 PM
Quote
It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints

Strongly disagree with that

So you strongly disagree with Intel engineers. On which basis? Can you quote own experience/timings etc, or point us to articles or sites that support your strong statement?


If you read it more carefully, as I said, it's more like a "precaution", the same way Microsoft warns you about many things.

Not knowing how to use the prefix will cause mispredictions on a branch that could have been predicted without any alteration. But as I said, in a loop it could be needed.

Experience? Well, on P1 to P4 they work quite well in nested loops where 1 or 2 conditional branches are not taken N times and are taken only the last time (the inverse of what the loop branch does).

Anger fog:

A backward branch that alternates would have to be organized so that it is not taken the first time, to obtain the same effect. Instead of swapping the two branches, we may insert a 3EH prediction hint prefix immediately before the JNZ X1 to change the static prediction to "taken" (see p. 30). This will have the same effect.

While this method of controlling the initial state of the local predictor solves the problem in most cases, it is not completely reliable. It may not work if the first time the branch is seen is after a mispredicted preceding branch. Furthermore, the sequence may be broken by a task switch or other event that pushes the branch out of the BTB. We have no way of predicting whether the branch will be taken or not taken the first time it is seen after such an event. Fortunately, it appears that the designers have been aware of this problem and implemented a way to solve it. While researching these mechanisms, I discovered an undocumented prefix, 64H, which does the trick on the P4. This prefix doesn't change the static prediction, but it controls the state of the local predictor after the first event


Regarding P4...

It is rarely worth the effort to take static prediction into account. Almost any branch that is executed sufficiently often for its timing to have any significant effect is likely to stay in the BTB so that only the dynamic prediction counts. Static prediction only has a significant effect if context switches or task switches occur very often.


The Intel Quote:

Quote
It is not recommended that a programmer use these instructions, as they add slightly to the size of the code and are static hints only. It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints.

It is "not recommended".

BUT !:

In the event that a branch hint is necessary, the following instruction prefixes can be added before a branch instruction to change the way the static predictor behaves

And as I said, I am fairly sure the prefix doesn't have the same effect on Core2 (but not 100% sure), which is of course not the topic question.

PS: It's still a pointless debate about using a prefix on a branch to change the local BTB. My question was about meta branch prediction on Core2Duo.

jj2007

Quote from: theunknownguy on June 24, 2010, 08:17:51 PM
Anger fog

Anger and fog, exactly. Post your code and your timings, please.

theunknownguy

Quote from: jj2007 on June 24, 2010, 08:36:12 PM
Quote from: theunknownguy on June 24, 2010, 08:17:51 PM
Anger fog

Anger and fog, exactly. Post your code and your timings, please.


I need to switch to my P4 at home, I am at the office... on a Core2Duo, trying to use the branch hints without any effect at all...

So now back to my real question, leaving aside the branch prefixes, which don't have much to do with it at all.

Does somebody have a more interesting theory about how the Core2Duo uses the meta branch predictor?

PS: It's not only Agner, it's also Intel telling you with this simple quote: "In the event that a branch hint is necessary".
So it's pointless to deny it or to paint those prefixes as "bad usage"; it's just a "recommendation".


theunknownguy

Tomorrow I'll post some code for predicting loops, at least to try to understand loop prediction better.

Can't now, just too tired. And thanks for the answers.

hutch--

Something you learn after writing code for processors from i486 upwards, do not lock your code design into one hardware as the next may not work well with it. I know the prefixes that Clive mentioned but I never saw code run faster with them and they are not fully supported with earlier or later hardware. With current processors I work on a Core2 quad and an i7 quad and they both respond to conventional Intel design specs like the example I posted above.

Theory is fine but like the old motor racing comment, when the flag drops the bullsh*t stops, clock the difference and make your decisions that way.

theunknownguy

Quote from: hutch-- on June 25, 2010, 12:59:26 AM
Something you learn after writing code for processors from i486 upwards, do not lock your code design into one hardware as the next may not work well with it. I know the prefixes that Clive mentioned but I never saw code run faster with them and they are not fully supported with earlier or later hardware. With current processors I work on a Core2 quad and an i7 quad and they both respond to conventional Intel design specs like the example I posted above.

Theory is fine but like the old motor racing comment, when the flag drops the bullsh*t stops, clock the difference and make your decisions that way.

Thanks hutch for the advice. I tried the prefixes on Core2Duo as you said, with no effect, so I guess they are just not supported or have no effect on the BTB local buffer.

But I like the idea of predicting a loop; it would surely help improve my knowledge of AI, so I'll keep going with it. Also, an interesting paper on neural branch prediction:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.7.3023&rep=rep1&type=pdf

Also, another question: does Align work better with prefixes or with NOPs?


hutch--

Just keep in mind that AI and processor circuitry are at opposite ends of the spectrum. The logic behind processor design is in fact very sophisticated, but it has been evolving for many years and later versions are rarely ever compatible with older stuff. In recent processor families, the Core2 series processors are, relatively, a lot faster than a PIV with SSE instructions. The PIV was slow with LEA, which was fast on both earlier and later Intel hardware. Bit manipulation is still very ordinary, and this probably will not change as the demand is not high enough.

Now branch prediction is still related very closely to loop code design, and it's generally in the innermost part of the loop that it matters the most. If a branch is regularly taken in one direction it will remain in the BTB, but the worst case is a branch that is randomly taken OR not taken depending on the previous data. When it is predicted correctly it's fast, but if the most recent OP went one way and the next one goes the other, it is predicted incorrectly and you usually end up with a pipeline stall.
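
The difference between a regular and an irregular branch can be sketched with the simplest possible model, a predictor that just repeats the most recent outcome. This Python snippet is an illustration under that assumption only (real BTB logic is more elaborate), and both traces are invented.

```python
def last_outcome_mispredictions(outcomes, last=False):
    """Count mispredictions when each prediction repeats the previous outcome.

    last is the assumed outcome before the first branch is seen.
    """
    misses = 0
    for taken in outcomes:
        if taken != last:
            misses += 1
        last = taken
    return misses


steady = [True] * 100            # branch regularly taken in one direction
alternating = [True, False] * 50  # direction flips every time

steady_misses = last_outcome_mispredictions(steady)
alternating_misses = last_outcome_mispredictions(alternating)
```

In this model the steady branch misses once and the alternating one misses every time; a data-dependent random branch would land somewhere in between.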

Across almost all processors over the last 10 years or so, branch reduction works for you, and where you cannot avoid a branch, laying out code so that it is taken backwards in most cases is still the fastest way to write code of that type. There is no magic solution to getting branch prediction right: minimise branching and lay code out for the most predictable options and you usually cannot do it any faster.

hutch--

Alignment in data is a direct read speed issue: put data misaligned across the data size boundaries and the processor must make 2 reads to get it, and that makes your read speeds slow. Code alignment is another matter. For all of the theory, it's useful to have, but it still tends to work on a "suck it and see" approach. Sometimes you see speed gains and sometimes you see the code go slower by aligning it. You can nearly always align a label by 4 bytes, and if it's a jump target it's safe enough to do, but be aware that there are cases where fast code gets slower by aligning the leading labels.
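
The "2 reads" point for data can be checked with simple arithmetic. The Python sketch below just counts how many aligned chunks an access overlaps; the function name, the addresses and the 64-byte figure used as a second boundary are assumptions for illustration, and the real cost depends on the processor.

```python
def aligned_chunks_touched(address, size, boundary=None):
    """Number of boundary-sized aligned chunks that [address, address + size) overlaps.

    With boundary=None the data's own size is used (natural alignment).
    """
    if boundary is None:
        boundary = size
    first = address // boundary
    last = (address + size - 1) // boundary
    return last - first + 1


aligned_dword = aligned_chunks_touched(0x1000, 4)         # DWORD on a 4-byte boundary
misaligned_dword = aligned_chunks_touched(0x1002, 4)      # DWORD straddling a boundary
straddles_64 = aligned_chunks_touched(0x103E, 4, 64)      # DWORD across an assumed 64-byte boundary
```

A result of 1 means the access fits in one aligned chunk; 2 is the case where the data straddles a boundary.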