The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: theunknownguy on June 24, 2010, 08:14:28 AM

Title: Meta branch predictor Core2?
Post by: theunknownguy on June 24, 2010, 08:14:28 AM
Does anybody know how the Core2 recognises whether it's in a loop or not?

From my point of view:

It records the conditional jump and stores it in its own private buffer (not the global one), does the same with the next conditional jumps, and makes a simple prediction based on trial and error (I don't know over how many runs). If a branch is strongly taken enough times (N-1) to the same address within a "delta", then it would "assume" that the branch is part of a loop.

As for knowing whether it's a real loop or not, I think that is done by "delta" prediction.

In a loop you could have many conditional jumps, but only one repeats itself N times at the highest address. So in this case:

.Repeat                      ; outer loop: its branch at .Until jumps backwards
   add edx, 1
   .Repeat                   ; inner (nested) loop
     add eax, 1
   .Until (eax == 5)         ; inner branch: taken to the same address N-1 times
.Until (edx == 10)


It's a nested loop. The predictor would check the first branch (in this case the nested loop's), save its jump address, do the trial-and-error prediction and store it. Later it checks the next branch (the main loop's) and goes through the same process. Now the magic should come when it notices that branch 1 lies inside the "range" of branch 2, both with a 100% prediction (taken N times, less one). Then we can assume it's a loop nested inside another loop (which really doesn't matter; it's just a loop in the end).

But:


.Repeat
   add edx, 1
   .Repeat
     add eax, 1
     jnz @2                  ; extra conditional branch pointing outside the loop
   .Until (eax == 5)
.Until (edx == 10)
@2:


Here it should read the first branch, check its target address, and find that the first branch doesn't point within the "strongly taken" range that the next two branches have, meaning it is not considered a loop.
(Note that the first branch is taken, and that helps predict whether it is part of the loop or not.)

Replace the JNZ with a JE and it would recognise it as a loop again, like this:
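The same snippet with that one change (after the ADD, EAX is never zero here, so the JE is never taken and the branch pattern matches a loop again):

.Repeat
   add edx, 1
   .Repeat
     add eax, 1
     je @2                   ; JE instead of JNZ: never taken here
   .Until (eax == 5)
.Until (edx == 10)
@2: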



But really I don't know; if anybody has any other good theory or opinion, all are welcome to discuss.
Title: Re: Meta branch predictor Core2?
Post by: hutch-- on June 24, 2010, 02:31:30 PM
Knowing the specific guts of each core type is a very complex thing to learn. On Intel processors since the PIII era, conditional jumps are predicted as NOT TAKEN until they ARE TAKEN, and are usually then predicted to jump backwards. This gives you a loop design much like this:


  mov counter, number
label:
  :more code
  sub counter, 1
  jnz label                ; mispredicted 1st time then predicted to jump backwards.


Forward jumps are both not predicted AND jump the wrong way, so unless it's a bypass within a loop it will not be predicted well; but then it may not matter if it only executes once.
Title: Re: Meta branch predictor Core2?
Post by: clive on June 24, 2010, 03:17:53 PM
Wouldn't it be easier just to mark the branches with the appropriate TAKEN/NOT TAKEN prefix opcodes that Intel created for this purpose?

http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/
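For example, a minimal sketch (2Eh and 3Eh are the hint prefixes from that article; the labels here are made up):

    db 3Eh                  ; static hint: predict TAKEN
    jnz loop_top            ; the hint applies to this conditional jump

    db 2Eh                  ; static hint: predict NOT TAKEN
    jz  rare_case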
Title: Re: Meta branch predictor Core2?
Post by: jj2007 on June 24, 2010, 04:59:28 PM
Quote from: clive on June 24, 2010, 03:17:53 PM
Wouldn't it be easier just to mark the branches with the appropriate TAKEN/NOT TAKEN prefix opcodes that Intel created for this purpose?

http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/


Would be easier if it worked. Most compiler developers have given up on them - no effect, sometimes even negative.
Title: Re: Meta branch predictor Core2?
Post by: MichaelW on June 24, 2010, 05:13:10 PM
And in the linked document:
Quote
It is not recommended that a programmer use these instructions, as they add slightly to the size of the code and are static hints only. It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints.
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 24, 2010, 06:30:28 PM
Quote from: clive on June 24, 2010, 03:17:53 PM
Wouldn't it be easier just to mark the branches with the appropriate TAKEN/NOT TAKEN prefix opcodes that Intel created for this purpose?

http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/


The Core2 doesn't support such prefix opcodes to help the branch predictor... (not 100% sure).

But on the P1 through P4 they work quite well...

Quote
It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints

Strongly disagree with that. If you could make the predictor start with TAKEN in a loop that requires it, then I guess you'd save a lot of mispredictions compared with the worst case, where you start in the BTB at Weakly Taken / Weakly Not Taken.
But I guess Intel put it there as a precaution: not knowing how to handle that prefix could make branch prediction even worse... Apart from it adding 1 byte to predecode...
Title: Re: Meta branch predictor Core2?
Post by: jj2007 on June 24, 2010, 08:08:48 PM
Quote from: theunknownguy on June 24, 2010, 06:30:28 PM
Quote
It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints

Strongly disagree with that

So you strongly disagree with Intel engineers. On what basis? Can you cite your own experience/timings etc., or point us to articles or sites that support your strong statement?
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 24, 2010, 08:17:51 PM
Quote from: jj2007 on June 24, 2010, 08:08:48 PM
Quote from: theunknownguy on June 24, 2010, 06:30:28 PM
Quote
It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints

Strongly disagree with that

So you strongly disagree with Intel engineers. On what basis? Can you cite your own experience/timings etc., or point us to articles or sites that support your strong statement?


If you read it again, like I say it's more of a "precaution", the same way Microsoft warns you about many things.

Not knowing how to use the prefix will cause mispredictions on a branch that could have been predicted without any alteration. But like I say, in a loop it could be needed.

Experience? Well, on the P1 to P4 they work quite well in nested loops where one or two conditional branches are not taken N times and taken only the last time (the inverse of the loop's usual pattern).

Anger fog:

A backward branch that alternates would have to be organized so that it is not taken the first time, to obtain the same effect. Instead of swapping the two branches, we may insert a 3EH prediction hint prefix immediately before the JNZ X1 to change the static prediction to "taken" (see p. 30). This will have the same effect.

While this method of controlling the initial state of the local predictor solves the problem in most cases, it is not completely reliable. It may not work if the first time the branch is seen is after a mispredicted preceding branch. Furthermore, the sequence may be broken by a task switch or other event that pushes the branch out of the BTB. We have no way of predicting whether the branch will be taken or not taken the first time it is seen after such an event. Fortunately, it appears that the designers have been aware of this problem and implemented a way to solve it. While researching these mechanisms, I discovered an undocumented prefix, 64H, which does the trick on the P4. This prefix doesn't change the static prediction, but it controls the state of the local predictor after the first event


Regarding the P4...

It is rarely worth the effort to take static prediction into account. Almost any branch that is executed sufficiently often for its timing to have any significant effect is likely to stay in the BTB so that only the dynamic prediction counts. Static prediction only has a significant effect if context switches or task switches occur very often.


The Intel quote:

Quote
It is not recommended that a programmer use these instructions, as they add slightly to the size of the code and are static hints only. It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints.

It is not "recommended".

BUT !:

In the event that a branch hint is necessary, the following instruction prefixes can be added before a branch instruction to change the way the static predictor behaves

And as I say, I'm fairly sure the prefixes don't have the same effect on the Core2 (but not 100% sure). Which is of course not the topic question.

PS: It's still a pointless debate about prefixing a branch to change the local BTB. My question was about the meta branch predictor on the Core2 Duo.
Title: Re: Meta branch predictor Core2?
Post by: jj2007 on June 24, 2010, 08:36:12 PM
Quote from: theunknownguy on June 24, 2010, 08:17:51 PM
Anger fog

Anger and fog, exactly. Post your code and your timings, please.
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 24, 2010, 08:39:54 PM
Quote from: jj2007 on June 24, 2010, 08:36:12 PM
Quote from: theunknownguy on June 24, 2010, 08:17:51 PM
Anger fog

Anger and fog, exactly. Post your code and your timings, please.


I'd need to switch to my P4 at home; I'm at the office on a Core2 Duo, and trying to use the branch hints has no effect at all...

So now, back to my real question, leaving aside the branch prefixes, which don't have much to do with it.

Does somebody have a more interesting theory about how the Core2 Duo uses the meta branch predictor?

PS: It's not only Agner; Intel also tells you with this simple quote: "In the event that a branch hint is necessary".
So it's pointless to deny it or to paint those prefixes as "bad usage"; it's just a "recommendation".

Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 25, 2010, 12:45:06 AM
Tomorrow I'll post some code for predicting loops, at least to try to understand loop prediction better.

Can't right now, just too tired. And thanks for the answers.
Title: Re: Meta branch predictor Core2?
Post by: hutch-- on June 25, 2010, 12:59:26 AM
Something you learn after writing code for processors from the i486 upwards: do not lock your code design into one piece of hardware, as the next may not work well with it. I know the prefixes that Clive mentioned, but I never saw code run faster with them, and they are not fully supported on earlier or later hardware. With current processors I work on a Core2 quad and an i7 quad, and they both respond to conventional Intel design specs like the example I posted above.

Theory is fine but like the old motor racing comment, when the flag drops the bullsh*t stops, clock the difference and make your decisions that way.
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 25, 2010, 01:04:54 AM
Quote from: hutch-- on June 25, 2010, 12:59:26 AM
Something you learn after writing code for processors from the i486 upwards: do not lock your code design into one piece of hardware, as the next may not work well with it. I know the prefixes that Clive mentioned, but I never saw code run faster with them, and they are not fully supported on earlier or later hardware. With current processors I work on a Core2 quad and an i7 quad, and they both respond to conventional Intel design specs like the example I posted above.

Theory is fine but like the old motor racing comment, when the flag drops the bullsh*t stops, clock the difference and make your decisions that way.

Thanks for the advice, hutch. I tried the prefixes on the Core2 Duo as you say and saw no effect, so I guess they are just not supported, or have no effect on the local BTB buffer.

But I like the idea of predicting a loop; it would surely help improve my knowledge of AI, so I'll keep going with it. Also, an interesting paper on neural branch prediction:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.7.3023&rep=rep1&type=pdf

Also, another question: does ALIGN work better with prefixes or with NOPs?

Title: Re: Meta branch predictor Core2?
Post by: hutch-- on June 25, 2010, 03:13:47 AM
Just keep in mind that AI and processor circuitry are at opposite ends of the spectrum. The logic behind processor design is in fact very sophisticated, but it has been evolving for many years, and later versions are rarely ever compatible with older stuff. In recent processor families, the Core2 series processors are a lot faster than a PIV, relatively speaking, with SSE instructions; the PIV was slow with LEA, which was fast on both earlier and later Intel hardware; bit manipulation is still very ordinary, and this probably will not change as the demand is not high enough.

Branch prediction is still related very closely to loop code design, and it's generally in the innermost part of the loop that it matters most. If a branch is regularly taken in one direction it will remain in the BTB, but the worst case is a branch that is randomly taken OR not taken depending on the previous data: when it is predicted correctly it's fast, but if the most recent pass went one way and the next goes the other, it is predicted incorrectly and you usually end up with a pipeline stall.

Across almost all processors of the last 10 years or so, branch reduction works for you, and where you cannot avoid a branch, laying out the code so that the branch is taken backwards in most cases is still the fastest way to write code of that type. There is no magic solution to getting branch prediction right: minimise branching, lay the code out for the most predictable options, and you usually cannot do it any faster.
Title: Re: Meta branch predictor Core2?
Post by: hutch-- on June 25, 2010, 11:00:11 AM
Alignment in data is a direct read-speed issue: put data misaligned across the data-size boundaries and the processor must make 2 reads to get it, and that makes your read speeds slow. Code alignment is another matter; for all of the theory, it's useful to have, but it still tends to work on a "suck it and see" approach. Sometimes you see speed gains and sometimes you see the code go slower by aligning it. You can nearly always align a label by 4 bytes, and if it's a jump target it's safe enough to do, but be aware that there are cases where fast code gets slower by aligning the leading labels.
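For example, a minimal sketch of aligning a jump target (the count and labels are made up):

    mov ecx, 1000000
    align 16                ; the assembler pads up to the boundary with NOP-style filler
loop_top:                   ; the jump target now starts on a 16-byte boundary
    add eax, 1
    sub ecx, 1
    jnz loop_top            ; the backward jump lands on the aligned label, no junk to execute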
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 26, 2010, 01:44:10 AM
Quote from: hutch-- on June 25, 2010, 11:00:11 AM
Alignment in data is a direct read-speed issue: put data misaligned across the data-size boundaries and the processor must make 2 reads to get it, and that makes your read speeds slow. Code alignment is another matter; for all of the theory, it's useful to have, but it still tends to work on a "suck it and see" approach. Sometimes you see speed gains and sometimes you see the code go slower by aligning it. You can nearly always align a label by 4 bytes, and if it's a jump target it's safe enough to do, but be aware that there are cases where fast code gets slower by aligning the leading labels.

Thanks hutch. Yes, I understand alignment is down to the prefetch; at least on the Core2 Duo it's still 16 bytes per read.
But could predecoding a NOP versus a prefix be different, or not?

The align option fills gaps with NOPs or redundant opcodes. But reading a little further, prefixes like 64h or 3Eh seem to be ignored right away. Are NOPs or redundant opcodes ignored that way? That could avoid wasting a little speed.

(I also don't know whether compiler alignment ever uses branch prefixes, but I guess it doesn't.)

Also, I work on AI algorithms plus security. I don't expect to reinvent the wheel; I do it because it seems smart to predict whether you are running in a loop. Currently I use my own predecode algorithm to analyze code blocks (branches, instructions, etc.). I think loop prediction is a good thing to keep an eye on, at least for any AI algorithm.
Title: Re: Meta branch predictor Core2?
Post by: hutch-- on June 26, 2010, 03:03:19 AM
The trick with code alignment is to understand what you are doing with it. The gains from code alignment are generally small and determined by benchmarking, but the cost can be high for no gain at all. If you have a label that code falls through to and you align it to 16 bytes, you have to process the no-operation opcodes that fill this space, and that can be any combination of db 90h (nop), mov eax, eax or whatever puts the least number of instructions into the required byte space. It means you can have up to 15 bytes of junk to plough through, and that takes time you don't have to waste.

When you align code labels, ALWAYS benchmark to see whether your code is slower or faster. Unless you get a timing gain from it, don't waste the code space aligning it. Code alignment works best on jump targets where you don't have to plough through junk to get there.
Title: Re: Meta branch predictor Core2?
Post by: Rockoon on June 26, 2010, 12:04:09 PM
I was under the impression that all modern predictors assume that, given no recent history, any backward branch is always taken (how often are backward branches not taken?).

In effect, the misprediction is always on the last iteration of the loop, not on the first iteration.
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 26, 2010, 06:17:30 PM
In my predecode algo, most simple opcodes take just 1 clock cycle to read, in cases such as:

04 ADD AL, imm8
05 ADD EAX, imm16/32
06 PUSH ES
07 POP  ES
...
14 ADC AL, imm8
15 ADC EAX, imm16/32
16 PUSH SS
17 POP  SS
etc.


NOP is one of those cases...

Any opcode carrying more than one prefix (including the branch-hint ones) takes longer.

Of course my algo will never match the one implemented in the microprocessor, but from the predecode logic, NOP seems the best way to align code (in terms of helping the predecoder).

PS: According to Agner's paper, all opcodes can be read by the predecoder in 1 clock cycle... how? I can't figure it out. I can only manage the ones with no prefix in 1 clock cycle.
PS2: Also, the opcodes that take my algo longest to read are these:


F6 /0  TEST r/m8, imm8                        Logical Compare
F6 /1  TEST r/m8, imm8 (alias)                Logical Compare
F6 /2  NOT  r/m8                              One's Complement Negation
F6 /3  NEG  r/m8                              Two's Complement Negation
F6 /4  MUL  r/m8  (AX = AL * r/m8)            Unsigned Multiply
F6 /5  IMUL r/m8  (AX = AL * r/m8)            Signed Multiply
F6 /6  DIV  r/m8  (AL, AH = AX / r/m8)        Unsigned Divide
F6 /7  IDIV r/m8  (AL, AH = AX / r/m8)        Signed Divide
F7 /0  TEST r/m16/32, imm16/32                Logical Compare
F7 /1  TEST r/m16/32, imm16/32 (alias)        Logical Compare
F7 /2  NOT  r/m16/32                          One's Complement Negation
F7 /3  NEG  r/m16/32                          Two's Complement Negation
F7 /4  MUL  r/m16/32  (eDX:eAX = eAX * r/m)   Unsigned Multiply
F7 /5  IMUL r/m16/32  (eDX:eAX = eAX * r/m)   Signed Multiply
F7 /6  DIV  r/m16/32  (eDX:eAX / r/m)         Unsigned Divide
F7 /7  IDIV r/m16/32  (eDX:eAX / r/m)         Signed Divide


Those need their own calculation, different from the one I use for other kinds of opcodes. Also, TEST r/m8, imm8 is assembled differently from TEST AL, imm8. I can't find a way to read them in 1 clock.

I guess it's 1 clock because it's the processor doing the calculation; from a software point of view I find it pretty much impossible to read all opcodes in 1 clock. One solution could be a table with every possible size for each opcode and every possible combination of that opcode.

Which would be a list of:

(Opcode * PossibleCombinations) * TotalOpcodes

Guess that's too big a list. I already killed myself making a long list for each opcode, including SSE(n) support...
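A minimal sketch of that table idea (hypothetical layout and labels; only prefix-free opcodes with fixed lengths are this simple):

.data
    len_table db 256 dup(0)     ; one length per first opcode byte; 0 = complex case

.code
    ; fill in a few of the trivial one-byte entries, e.g. NOP (90h), PUSH ES (06h)
    mov byte ptr len_table[090h], 1
    mov byte ptr len_table[006h], 1

    ; esi = pointer to the code being analyzed
next_insn:
    movzx eax, byte ptr [esi]           ; first opcode byte
    movzx ecx, byte ptr len_table[eax]  ; table lookup: total instruction length
    jecxz complex_decode                ; 0 -> F6h/F7h etc. need the ModRM byte inspected
    add esi, ecx                        ; step to the next instruction
    jmp next_insn

complex_decode:
    ; stub: read the ModRM byte here (its reg field selects TEST/NOT/NEG/MUL/...)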

Title: Re: Meta branch predictor Core2?
Post by: MichaelW on June 26, 2010, 10:07:36 PM
While I have no good idea of what is going on at the lower levels, in my experience timing NOPs (opcode 90h), effectively they execute in much less than 1 clock cycle each.

;==============================================================================
    include \masm32\include\masm32rt.inc
    .586
    include \masm32\macros\timers.asm
;==============================================================================

;------------------------------------------------------------------
; This macro expands to an inline implementation of a generator by
; George Marsaglia, with the specified number of NOPs inserted.
;
; "CONG is a congruential generator with the widely used 69069 as
; multiplier: x(n)=69069x(n-1)+1234567. It has period 2^32.
; The leading half of its 32 bits seem to pass all tests, but bits
; in the last half are too regular."
;------------------------------------------------------------------

cong MACRO cnt
    mov eax, cong_seed
    mov ecx, 69069
    mul ecx
    add eax, 1234567
    nops cnt
    mov cong_seed, eax
    xor edx, edx
    div DWORD PTR [esp+4]
    mov eax, edx
ENDM

;==============================================================================
    .data
        cong_seed dd 1234567
    .code
;==============================================================================
start:
;==============================================================================

    invoke Sleep, 3000

    FOR nopcnt,<0,1,2,3,4,8,16,64>
      counter_begin 1000, HIGH_PRIORITY_CLASS
        cong nopcnt
      counter_end
      push eax
      print str$(nopcnt),9
      pop eax
      print str$(eax)," cycles",13,10
    ENDM

    inkey "Press any key to exit..."
    exit
;==============================================================================
end start


Running on a P3:

0       30 cycles
1       30 cycles
2       30 cycles
3       30 cycles
4       30 cycles
8       30 cycles
16      30 cycles
64      57 cycles

Title: Re: Meta branch predictor Core2?
Post by: clive on June 26, 2010, 11:02:54 PM
Quote from: MichaelW on June 26, 2010, 10:07:36 PM
While I have no good idea of what is going on at the lower levels, in my experience timing NOPs (opcode 90h), effectively they execute in much less than 1 clock cycle each.

Indeed, it is definitely less than 1. There is multiple dispatch of instructions, the pipelines are pretty deep, and instructions which have no consequence can be retired in a much lazier fashion, so the actual throughput/latency will be hard to measure effectively.

I don't think one can model anything useful by looking at individual opcodes and the lengths of opcode sequences; the interaction of everything else going on is orders of magnitude more complex than that. Still, if you are going to try to map the whole opcode space, you'd want to automate it; no one in their right mind would do it entirely manually.
Title: Re: Meta branch predictor Core2?
Post by: hutch-- on June 27, 2010, 02:29:36 AM
The assumed relevance of cycle counting has been tenuous since the i486 introduced its pipeline. Since that technology, the action has been in SCHEDULING instructions so they proceed through the pipeline with the minimum number of pipeline stalls. Nor is there any effective correlation between small samples and the real-world times of working code: you always need to produce an algorithm that is something like the end task, then test it in a context similar to the end task, to get any reasonable relationship between the test piece and the algorithm performing the end result.

There is no substitute for appropriate benchmarking in a context that has a known correlation to the end task, the rest is basically hot air. It is a mistake to keep wheeling out the assumptions of the pre-386 era with single execution units in sequence when for years you have had multiple pipelines with various methods of out of order execution, different techniques for level 1,2,3 caches and various configurations of microcode for different instructions.

I still have 2 PIVs running, a 2.8 gig Northwood and a 3.8 gig Prescott and they behave differently, the Northwood has a much shorter pipeline and does not suffer the same penalty under stalls, the Prescott has a much longer pipeline and while it pairs preferred instructions better, it pays a larger penalty when you stall the longer pipelines.

The Core2 Quad I use has a higher instruction throughput for a given clock speed than the faster PIV and it does not suffer the same problems with opcodes like LEA that the entire PIV family had. Shifts and rotates are still slow on the Core2 Quad but you have the advantage of faster DDR3 memory. The i7 Quad I have is faster with rotates and shifts and for the same memory speed and very close to the same clock speed under load it has a higher instruction throughput than the Core2 Quad.

The point of addressing the last 10 years or so of popular processors is that there are not that many common assumptions across those processor families. Prefixed branch prediction is effectively a waste of time; careful instruction scheduling for higher sustainable throughput is the only place where the action is, and it works across the range of recent processors reasonably well. Branch reduction in the basic design works for you; occasionally using alternative instructions like CMOVxx and SETxx will help, but there is no substitute for laying out the concept with the minimum number of branches in the first place.
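For example, a minimal sketch of branch reduction with CMOVxx, computing the unsigned minimum of EAX and ECX:

    ; branching version: mispredicts whenever the outcome flips with the data
    cmp eax, ecx
    jbe @F
    mov eax, ecx
  @@:

    ; branchless version: CMOVA replaces the jump, so there is nothing to mispredict
    cmp eax, ecx
    cmova eax, ecx          ; if eax is above ecx (unsigned), eax = ecx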

NOTE that there is a limit to what you can achieve with instruction twiddling and it is based on the speed difference between memory reads and writes and register reads and writes. One of the most effective optimisation methods is to arrange your code with the minimum number of memory reads and writes and in this context you can usually see decent speed gains if you can get the count down.
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 27, 2010, 02:55:55 AM
Well, talking about redundant or alignment opcodes, NOP is the best one. BUT note that other redundant opcodes can use more than 1 byte and still be "read" in 1 clock or less.

I say "read" because the predecode unit has to work out the opcode length to calculate the next instruction's position.

While redundant opcodes such as:

lea eax, [eax]

will, in theory, pass to the memory read unit (after predecode; port 2) and add a delay (which is not perceptible).

I have searched for documents or papers that explain whether the microprocessor avoids processing those redundant opcodes in any of the port units, but I can't find any info on it (or I'm bad at searching). So I guess they are processed.

In the end there seems to be no measurable gain between a NOP, a redundant opcode, or a branch prefix.

Quote
One of the most effective optimisation methods is to arrange your code with the minimum number of memory reads and writes and in this context you can usually see decent speed gains if you can get the count down.

You can't do that very often, hutch, but when you do have many memory reads and writes you can always arrange the code, at least on the Core2 through the i7, to produce micro-ops. Also, for a little more speedup, eliminating the stall dependencies, exploiting out-of-order execution and using all the processing ports will speed the code up.

Still, after everything you can optimise, there will always be people trying to make things faster (not my case). But I do wonder which things are processed faster than others.

About the predecoder: it reads up to 16 bytes, at least on the Core2 Duo, splits them into µops and calculates the next addresses in 1 clock (or less), which seems impossible to match in software. The fastest I could manage was 1 clock for some opcodes.
Title: Re: Meta branch predictor Core2?
Post by: MichaelW on June 27, 2010, 05:47:50 AM
Quote from: theunknownguy on June 27, 2010, 02:55:55 AM
While redundant opcodes such as:

lea eax, [eax]

will, in theory, pass to the memory read unit (after predecode; port 2) and add a delay (which is not perceptible).

In my code above if I substitute that instruction for NOP, running on a P3 I get:

0       30 cycles
1       30 cycles
2       30 cycles
3       30 cycles
4       30 cycles
8       30 cycles
16      46 cycles
64      124 cycles


Or for the 7-byte NOP that MASM uses, lea esp,[esp]:

0       30 cycles
1       30 cycles
2       30 cycles
3       30 cycles
4       30 cycles
8       30 cycles
16      49 cycles
64      127 cycles


So it would seem that Clive is right regarding "instructions which have no consequence".
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 27, 2010, 05:58:22 AM
Quote from: MichaelW on June 27, 2010, 05:47:50 AM
Quote from: theunknownguy on June 27, 2010, 02:55:55 AM
While redundant opcodes such as:

lea eax, [eax]

will, in theory, pass to the memory read unit (after predecode; port 2) and add a delay (which is not perceptible).

In my code above if I substitute that instruction for NOP, running on a P3 I get:

0       30 cycles
1       30 cycles
2       30 cycles
3       30 cycles
4       30 cycles
8       30 cycles
16      46 cycles
64      124 cycles


Or for the 7-byte NOP that MASM uses lea esp,[esp]:

0       30 cycles
1       30 cycles
2       30 cycles
3       30 cycles
4       30 cycles
8       30 cycles
16      49 cycles
64      127 cycles


So it would seem that Clive is right regarding "instructions which have no consequence".


Yes, like I said:

Quote
In the end there seems to be no measurable gain between a NOP, a redundant opcode, or a branch prefix.

But I can't find a paper that describes whether redundant opcodes are skipped. So consider that even if they are passed to the memory read unit (they should be, at least the LEA opcode), you won't see any difference in clock cycles, since it isn't measurable. But passing through the memory read unit would still waste time compared with the skipped NOP. Time that's worth nothing, but worth a look for the curious mind (and no, I am not a speed freak)  :lol

PS: If redundant opcodes are skipped, I guess NOP would stop being my favourite opcode...  :(
PS2: I think redundant opcodes such as LEA REG32, [SAME REG32] are passed to the memory unit, since adding a check for the operand matching the destination would waste more time than just letting the instruction be processed instead of checking and skipping it.
Title: Re: Meta branch predictor Core2?
Post by: MichaelW on June 27, 2010, 06:08:55 AM
I have doubts that NOPs, of any length, are "executed". The increased cycles at the higher counts could just be from caching effects.
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 27, 2010, 06:11:05 AM
Quote from: MichaelW on June 27, 2010, 06:08:55 AM
I have doubts that NOPs are "executed". The increased cycles at the higher counts could just be from caching effects.


They are not "executed", just read by the predecode unit (get the length... lol, 1 byte), while other redundant opcodes need to be executed (predecode, decode, execution units).

Getting the NOP size in my predecode algo takes 1 clock in software; imagine how little delay that is in the microprocessor (less than 1 clock)  :lol

A good test would be to check the pipeline on 16 bytes of NOPs against 16-byte alignment with redundant opcodes (but I guess nothing measurable, again)...
Title: Re: Meta branch predictor Core2?
Post by: hutch-- on June 27, 2010, 07:02:42 AM
db 90h IS an instruction, so it in fact DOES get executed and it DOES take time; it's just that later processors are smart enough to know that there is no dependency, so it goes through the pipeline with no problems. But as a simple test, write a time-intensive loop then start adding NOPs before it, and after a few that are swallowed by other delays you will see the code start to slow down.

This is the case with any NO OPERATION instruction.
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 27, 2010, 07:22:45 AM
Quote from: hutch-- on June 27, 2010, 07:02:42 AM
db 90h IS an instruction, so it in fact DOES get executed and it DOES take time; it's just that later processors are smart enough to know that there is no dependency, so it goes through the pipeline with no problems. But as a simple test, write a time-intensive loop then start adding NOPs before it, and after a few that are swallowed by other delays you will see the code start to slow down.

This is the case with any NO OPERATION instruction.

Well, it should take time, but the whole question here is whether the processor is smart enough to ignore the other redundant opcodes...

By "executed" I guess you don't mean the NOP being decoded, split into uops, or even dispatched to an execution unit. I am about 99% sure (one can never be sure) that a NOP is just predecoded and nothing else (and that adds its time, like you say), but "executed"? I doubt it a lot.

Also, searching with my friend Google, look who was asking about multi-byte NOPs:

Quote
Multi-byte NOP opcode made official
The latest version of IA-32 Intel® Architecture Software Developer's Manual Volume 2B: Instruction Set Reference, N-Z
(ftp://download.intel.com/design/Pentium4/manuals/25366719.pdf) contains the opcode for a multi-byte NOP instruction. The opcode is
0F 1F mod-000-rm
The multi-byte NOP can have any length up to 9 bytes. Quite useful for alignment.

The manual says that this multi-byte NOP works on all processors with family number 6 or F, which is all Intel processors back to Pentium Pro (except Pentium MMX). I have verified that it works. I was surprised to discover that it works also on an AMD64 processor, although it is not documented in AMD manuals. I didn't find it on any website of undocumented opcodes.

How come that this opcode has been kept secret for so many years? Why is it made official now? How come it works on AMD processors when noone else has discovered it, and AMD recommends the opcode 66 66 66 90 for multibyte NOP?

I guess this is not the right place to ask about AMD processors, but how do I safely detect whether the multi-byte NOP is supported on an AMD processor? There is no use for this opcode unless you have an absolutely safe method of detecting whether it is supported, and this detection method works on all brands.

www.agner.org

Multi-byte NOPs? So a 9-byte NOP, according to my theory, should be even faster than 9 separate NOPs. Why?

Well, it's assembled differently, and the predecoder picks up the size of the multi-byte NOP in one go (in this case 9 bytes, for example), which takes less time than reading 9 separate NOPs (great, Intel!).
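For reference, the recommended encodings from the Intel manual assemble like this (db bytes on the left, effective instruction in the comment):

    db 0Fh,1Fh,00h                          ; 3 bytes: nop dword ptr [eax]
    db 0Fh,1Fh,40h,00h                      ; 4 bytes: nop dword ptr [eax+0]
    db 0Fh,1Fh,44h,00h,00h                  ; 5 bytes: nop dword ptr [eax+eax+0]
    db 66h,0Fh,1Fh,44h,00h,00h              ; 6 bytes: 66h prefix on the 5-byte form
    db 0Fh,1Fh,80h,00h,00h,00h,00h          ; 7 bytes: nop dword ptr [eax+0], dword disp
    db 0Fh,1Fh,84h,00h,00h,00h,00h,00h      ; 8 bytes: nop dword ptr [eax+eax+0], dword disp
    db 66h,0Fh,1Fh,84h,00h,00h,00h,00h,00h  ; 9 bytes: 66h prefix on the 8-byte form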

A paper that supports the theory, I guess:

http://www.ragestorm.net/blogs/?p=14

Finally, Peter Ferrie (yes, the guy that does the security papers at Microsoft) wrote this:

Quote
Peter Ferrie says:
June 28, 2007 at 11:00 am

Multi-byte NOPs have existed since at least the Pentium 3, maybe even earlier. They were documented by Intel only recently, though. Since AMD and Intel CPUs share code, AMD CPUs support these instructions, too.

They are useful for when the code flow will reach a loop. The loop should be cache-line aligned for best fetch performance.

LEA is horribly slow on newer CPUs. It also contains a register write. You might not want that write.

The NOPs are fast because MODR/M resolution happens in parallel to the fetch itself. See here for a more detailed description of how that can go wrong ;-) -
http://www.symantec.com/enterprise/security_response/weblog/2007/02/x86_fetchdecode_anomalies.html

The NOPs are fast because MODR/M resolution happens in parallel to the fetch itself

I guess that's enough to say that LEA and the other redundant opcodes are slower than my beloved NOP (now the multi-byte NOP)  :lol

PS: I guess I was not the only guy with questions like "NOP or LEA REG32/REG32" in my head, when both are redundant with almost no measurable gain... But curiosity always wins.
Title: Re: Meta branch predictor Core2?
Post by: MichaelW on June 27, 2010, 08:48:45 AM
Quote from: hutch-- on June 27, 2010, 07:02:42 AM
as a simple test, write a time-intensive loop then start adding NOPs before it, and after a few that are swallowed by other delays you will see the code start to slow down.

I've gone through multiple versions of this code, and I still don't see that effect on a P3.

;==============================================================================
    include \masm32\include\masm32rt.inc
    .586
    include \masm32\macros\timers.asm
;==============================================================================
;------------------------------------------------------------------
; This macro expands to an inline implementation of a generator by
; George Marsaglia, with the specified number of NOPs inserted.
;
; "CONG is a congruential generator with the widely used 69069 as
; multiplier: x(n)=69069x(n-1)+1234567. It has period 2^32.
; The leading half of its 32 bits seem to pass all tests, but bits
; in the last half are too regular."
;------------------------------------------------------------------

cong MACRO cnt
    mov eax, cong_seed
    mov ecx, 69069
    mul ecx
    add eax, 1234567
    nops cnt
    mov cong_seed, eax
    xor edx, edx
    mov ecx, 10
    div ecx
    mov eax, edx
ENDM

;==============================================================================
    .data
      cong_seed dd 0
    .code
;==============================================================================
start:
;==============================================================================

    invoke Sleep, 10000

    REPEAT 3

      FOR nopcnt,<0,1,2,3,4,8,16,24,32>

        timer_begin 1, HIGH_PRIORITY_CLASS

          mov ebx, 100000000
          align 16
        @@:
          cong nopcnt
          sub ebx, 1
          jnz @B

        timer_end

        push eax
        print str$(nopcnt),9
        pop eax
        print str$(eax)," ms",13,10

      ENDM

      print chr$(13,10)

    ENDM

    inkey "Press any key to exit..."
    exit
;==============================================================================
end start


0       7927 ms
1       7966 ms
2       7805 ms
3       7204 ms
4       7193 ms
8       7191 ms
16      7193 ms
24      8802 ms
32      9396 ms

0       7924 ms
1       7963 ms
2       7797 ms
3       7199 ms
4       7191 ms
8       7213 ms
16      7199 ms
24      8818 ms
32      9412 ms

0       7926 ms
1       7972 ms
2       7827 ms
3       7199 ms
4       7200 ms
8       7199 ms
16      7214 ms
24      8808 ms
32      9411 ms


As you can see there is no detectable slowdown up through 16 NOPs. On a P3 with only two ports available, if the NOPs are being (fully) executed, it seems to me that the slowdown should start after only a few NOPs are added.
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 27, 2010, 09:03:06 AM
That is a good test there, Michael. I don't remember the fetch limit on a P3, but if it is 16 bytes, here is the explanation:

- 16 bytes prefetched in 1 clock or less (according to Agner's microarchitecture paper)
- 0 to 16 bytes: around 7175 ms to 7838 ms (the leftover bytes of the 16-byte read make the read a little harder, but imperceptibly so for any clocker)
- 16 bytes of NOPs (the perfect length for a full fetch, nothing left over) should get the best time
- 24 bytes (2 cache reads) takes much longer, 2 clock cycles (or less) on the read at the processor level: 8835 ms
- 32 bytes (3 cache reads): 9601 ms

Around 1000 ms, approximately, for each extra round of cache reads (including the predecode action).

With LEA REG32, [REG32] it should take a little longer (testing in milliseconds): 16 bytes prefetched as well, but all of them executed by the memory unit.
A multi-byte NOP should take even less than in the current test you made.

As I keep saying, NOPs are not "executed", just prefetched/predecoded in a 16-byte round (at least on the Core2), while LEA REG, REG and almost all other redundant opcodes should be executed as well.

The only good thing about LEA REG32, [REG32] is that it uses 2 bytes; the ESP version uses more bytes (likewise for other regs).
This could help the predecoder speed things up, but again it's not measurable. The only gain I think anybody could see is with multi-byte NOPs for alignment.

At least it's good to know such in-depth things about the processor, for the sake of knowledge.
Title: Re: Meta branch predictor Core2?
Post by: hutch-- on June 27, 2010, 10:26:45 AM
It's very easy to prove that NOPs take time. Here are the results for the following test piece, on a 3 gig Core2 Quad.


343 no nops
328 1 nop
500 2 nops
672 3 nops
657 4 nops
1109 8 nops
344 no nops
328 1 nop
500 2 nops
672 3 nops
656 4 nops
1125 8 nops
328 no nops
328 1 nop
485 2 nops
671 3 nops
672 4 nops
1110 8 nops
328 no nops
344 1 nop
500 2 nops
656 3 nops
672 4 nops
1109 8 nops
328 no nops
344 1 nop
484 2 nops
672 3 nops
672 4 nops
1109 8 nops
329 no nops
343 1 nop
469 2 nops
672 3 nops
672 4 nops
1109 8 nops
328 no nops
344 1 nop
484 2 nops
657 3 nops
672 4 nops
1109 8 nops
344 no nops
328 1 nop
500 2 nops
672 3 nops
656 4 nops
1109 8 nops
Press any key to continue ...


Here is the test piece.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .data?
      value dd ?

    .data
      item dd 0

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    push esi

    REPEAT 8


    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    ; nop
    ; nop
    ; nop
    ; nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," no nops",13,10

    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    nop
    ; nop
    ; nop
    ; nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," 1 nop",13,10

    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    nop
    nop
    ; nop
    ; nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," 2 nops",13,10

    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    nop
    nop
    nop
    ; nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," 3 nops",13,10

    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    nop
    nop
    nop
    nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," 4 nops",13,10

    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," 8 nops",13,10



    ENDM

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Title: Re: Meta branch predictor Core2?
Post by: Rockoon on June 27, 2010, 12:24:07 PM
theunknownguy, I think you mean 'µs' (microsecond, 1 millionth of a second), not 'ms' (millisecond, 1 thousandth of a second)

A millisecond is an insanely long amount of time in terms of semi-modern computers (my current computer can execute something like 10 million instructions per millisecond, per core)
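(For example, assuming a core near 3.3 GHz sustaining about 3 instructions per clock: 3.3e9 clocks/s x 3 = ~1e10 instructions per second, i.e. roughly 10^7, or 10 million, per millisecond.)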

Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 27, 2010, 06:16:17 PM
Quote from: Rockoon on June 27, 2010, 12:24:07 PM
theunknownguy, I think you mean 'µs' (microsecond, 1 millionth of a second), not 'ms' (millisecond, 1 thousandth of a second)

A millisecond is an insanely long amount of time in terms of semi-modern computers (my current computer can execute something like 10 million instructions per millisecond, per core)



He is running it in a loop of N^X iterations; it can't be microseconds... We're talking about the test in software, not the time of the NOP itself on the processor.

Also, I've never said a NOP doesn't take time. Predecoding any opcode takes its time, of course. But NOPs being executed? Wrong...

LEA is slow? Yes. And the multi-byte NOP the best method to align (it supports up to 9 bytes)? Yes...

PS: In case you wanted to know, Rockoon: the prefetch delay in the pipeline is around 1 clock (or less) per 16 bytes (at least on the Core2 Duo).
Title: Re: Meta branch predictor Core2?
Post by: MichaelW on June 28, 2010, 01:33:34 AM
It turns out that what I was seeing was the NOPs running in parallel with the other slower instructions. After modifying my code to avoid dependencies, at least for a P3 NOP (90h) appears to have the same timing as other fast instructions, even when those other instructions do have a "consequence".
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 29, 2010, 06:32:39 PM
Quote from: MichaelW on June 28, 2010, 01:33:34 AM
It turns out that what I was seeing was the NOPs running in parallel with the other slower instructions. After modifying my code to avoid dependencies, at least for a P3 NOP (90h) appears to have the same timing as other fast instructions, even when those other instructions do have a "consequence".

How can you time, Michael, whether something passes through the memory unit port or not?

Well, it should run in parallel:

Quote
The maximum throughput of the predecoders is 16 bytes or 6 instructions per clock cycle, whichever is smallest. The throughput of the rest of the pipeline is typically 4 instructions per clock cycle, or 5 in case of macro-op fusion

I will say again that there is no measurable gain between a LEA and a NOP. But there is a (measurable) gain in using the multi-byte NOP.

I'll stick for now with Peter Ferrie:

The NOPs are fast because MODR/M resolution happens in parallel to the fetch itself.
Title: Re: Meta branch predictor Core2?
Post by: redskull on June 29, 2010, 07:10:23 PM
An 0x90 NOP generates the same style of uOps as an r/r MOV; that is, a single uOp that can go to any of the basic ALU ports (on the Core2, that means 0, 1, or 5).  So, best case, if you are fetching and decoding instructions at full speed, you can execute 3 NOPs per clock cycle (again, on a Core2).

-r
Title: Re: Meta branch predictor Core2?
Post by: Rockoon on June 29, 2010, 07:11:13 PM
The concept of "timing" single instructions is outdated. If you want any reasonable measure of single-instruction performance, you need at least 2 values: latency and throughput. I would argue that you could also include its duration within the pipeline, and other factors related to specific stages of the pipeline.

I am pointing this out because the concept of "latency" doesn't really apply to instructions that have no side effects.

In practice, you only measure in-practice code. And ideally, you use a tool like CodeAnalyst or VTune to do so, not the RDTSC(P) instruction.
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 29, 2010, 07:37:22 PM
Quote from: redskull on June 29, 2010, 07:10:23 PM
An 0x90 NOP generates the same style of uOps as an r/r MOV; that is, a single uOp that can go to any of the basic ALU ports (on the Core2, that means 0, 1, or 5).  So, best case, if you are fetching and decoding instructions at full speed, you can execute 3 NOPs per clock cycle (again, on a Core2).

-r

Does a NOP generate uops? What would be the point of that... (they should be skipped and just predecoded) (not sure). Also, the ALU ports are used for arithmetic functions; other ports come into play in the case of the LEA opcode, like port 2 (the memory unit on the Core2).

I can't find where the NOP opcode falls in the uops description here:

http://www.ptlsim.org/Documentation/html/node7.html

Still, I think the question is already answered (by Peter Ferrie): NOP is faster than LEA, at a non-measurable level.

And I think you meant 3 NOPs (reading Agner), but that doesn't seem to be quite a "rule":

Quote
The throughput of the predecoders is obviously less than 4 instructions per clock cycle if there are less than 4 instructions in each 16-byte block of code. The average instruction length should therefore preferably be less than 4 bytes on average.

If there are less than 4 instructions in each 16-byte block of code...

So for alignment, the multi-byte NOP seems the best option for helping the predecoder.

Quote
The concept of "timing" single instructions is outdated. If you want any reasonable measure of single-instruction performance, you need at least 2 values: latency and throughput. I would argue that you could also include its duration within the pipeline, and other factors related to specific stages of the pipeline.

I am pointing this out because the concept of "latency" doesn't really apply to instructions that have no side effects.

In practice, you only measure in-practice code. And ideally, you use a tool like CodeAnalyst or VTune to do so, not the RDTSC(P) instruction.

Man, I think you are not getting the point. I repeat in every single one of my posts that there is NO measurable gain between using NOP or LEA for alignment (Michael's test proves it).
But my question is far from something you can measure. In terms of logic and the information collected from Agner's papers, Intel's, and other quotes (since I am not the only guy with this stupid question): NOP seems better than other redundant opcodes for alignment (in this case LEA REG32/REG32), and the multi-byte NOP the best way to align of all.

PS: I am not a speed maniac; it's just good to know in depth which things are better than others. Just a curious mind...
Title: Re: Meta branch predictor Core2?
Post by: redskull on June 29, 2010, 08:16:52 PM
First off, "measurable gain" is not really what this is about; rarely will swapping instructions produce a measurable gain, especially when you are timing under protected mode.

Quote from: theunknownguy on June 29, 2010, 07:37:22 PM
Does a NOP generate uops? What would be the point of that...

NOP's still have to execute; if not, how would the computer "stall"?  Informally, whenever the CPU has to wait around for data to become available, it executes a NOP; if the NOP took no time at all, then stalling would do no good either.  "Doing nothing" from a CPU standpoint is not the same as "not doing anything"

In either case...

LEA generates one uOp that MUST go into port 0.  NOP generates one uOp that can go to port 0, 1, or 5.  So again, assuming maximum fetching and decoding, the EU can churn through three NOPs a cycle (throughput of 1/3), whereas LEA uOps must go through port 0 sequentially (throughput of 1).  In a real-world situation, there will be many, many more uOps at the reservation station that must go through port 1 as well, so it will wait even more.  Basically, since fewer uOps must go through port 5 (do any?), NOPs can execute in parallel, while LEAs can't.

-r
Title: Re: Meta branch predictor Core2?
Post by: Rockoon on June 29, 2010, 08:30:21 PM
Quote from: redskull on June 29, 2010, 08:16:52 PM
NOP's still have to execute; if not, how would the computer "stall"?  Informally, whenever the CPU has to wait around for data to become available, it executes a NOP; if the NOP took no time at all, then stalling would do no good either.  "Doing nothing" from a CPU standpoint is not the same as "not doing anything"

The CPU does not issue NOP's during "stalls" .. that would make stalls even worse by putting pressure on register renaming resources.

Title: Re: Meta branch predictor Core2?
Post by: redskull on June 29, 2010, 08:46:04 PM
Quote from: redskull on June 29, 2010, 08:16:52 PM
Informally,
  :bg

Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 29, 2010, 09:23:12 PM
Quote from: redskull on June 29, 2010, 08:16:52 PM
First off, "measurable gain" is not really what this is about; rarely will swapping instructions produce a measurable gain, especially when you are timing under protected mode.

Quote from: theunknownguy on June 29, 2010, 07:37:22 PM
Does a NOP generate uops? What would be the point of that...

NOP's still have to execute; if not, how would the computer "stall"?  Informally, whenever the CPU has to wait around for data to become available, it executes a NOP; if the NOP took no time at all, then stalling would do no good either.  "Doing nothing" from a CPU standpoint is not the same as "not doing anything"

In either case...

LEA generates one uOp that MUST go into port 0.  NOP generates one uOp that can go to port 0, 1, or 5.  So again, assuming maximum fetching and decoding, the EU can churn through three NOPs a cycle (throughput of 1/3), whereas LEA uOps must go through port 0 sequentially (throughput of 1).  In a real-world situation, there will be many, many more uOps at the reservation station that must go through port 1 as well, so it will wait even more.  Basically, since fewer uOps must go through port 5 (do any?), NOPs can execute in parallel, while LEAs can't.

-r


First off, "measurable gain" is not really what this is about.

Of course; that's what I've been saying all over the place (damn, I think I've put that like 10 times in this thread). It's just about curiosity and using logic to determine the best alignment method (even if it has no measurable gain).

LEA generates one uOp, that MUST go into port 0.

Why to the ALU? (The uOp would explain it.) But indeed LEA has to do with memory reads, so port 2 is used too...

NOP generates one uOp that can go to port 0, 1, or 5

Where did you get this info? (Please give a source: quote, paper, etc.)

Thanks  :U




Title: Re: Meta branch predictor Core2?
Post by: qWord on June 29, 2010, 09:33:04 PM
Quote from: theunknownguy on June 29, 2010, 09:23:12 PM
... indeed LEA has to do with memory reads, so port 2 is used too...
lea is an arithmetic instruction with registers or immediate values as operands - there is no memory access.
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 29, 2010, 09:34:15 PM
Quote from: qWord on June 29, 2010, 09:33:04 PM
Quote from: theunknownguy on June 29, 2010, 09:23:12 PM
... indeed LEA has to do with memory reads, so port 2 is used too...
lea is an arithmetic instruction with registers or immediate values as operands - there is no memory access.

You got me there. Then how does it calculate:

LEA EAX, [EDX]

without memory access? (Performing just a simple register copy?)
Title: Re: Meta branch predictor Core2?
Post by: qWord on June 29, 2010, 09:36:51 PM
Quote from: theunknownguy on June 29, 2010, 09:34:15 PM
You got me there. Then how does it calculate:

LEA EAX, [EDX]

without memory access?
this is the same as:
mov eax,edx
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 29, 2010, 09:38:29 PM
Quote from: qWord on June 29, 2010, 09:36:51 PM
Quote from: theunknownguy on June 29, 2010, 09:34:15 PM
You got me there. Then how does it calculate:

LEA EAX, [EDX]

without memory access?
this is the same as:
mov eax,edx

What is the real point of having LEA then? (Regarding alignment, apart from filling more bytes?)
Title: Re: Meta branch predictor Core2?
Post by: qWord on June 29, 2010, 09:41:24 PM
Quote from: theunknownguy on June 29, 2010, 09:38:29 PM
What is the real point of having LEA then?
Computing the effective address of memory locations (using the ModRM and SIB bytes). A typical example is local variables, which are addressed relative to EBP.
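For example, a minimal sketch with a hypothetical local at [ebp-8]:

    lea eax, [ebp-8]        ; eax = EBP-8, the ADDRESS of the local; no memory access
    mov ecx, [ebp-8]        ; ecx = the VALUE of the local; this one does read memory
    lea edx, [eax+ecx*4]    ; base + index*scale in one instruction, still no memory access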
Title: Re: Meta branch predictor Core2?
Post by: theunknownguy on June 29, 2010, 09:43:21 PM
Quote from: qWord on June 29, 2010, 09:41:24 PM
Quote from: theunknownguy on June 29, 2010, 09:38:29 PM
What is the real point of having LEA then?
Computing the effective address of memory locations. A typical example is local variables, which are addressed relative to EBP.

Yes, sorry, I always correct my posts a little after you answer  :lol

But regarding the alignment:

00401000 >   8D00           LEA EAX,DWORD PTR DS:[EAX]
00401002     8BC0           MOV EAX,EAX

What would be the difference?...
Title: Re: Meta branch predictor Core2?
Post by: qWord on June 29, 2010, 09:50:04 PM
Quote from: theunknownguy on June 29, 2010, 09:43:21 PM
What would be the difference?...
other opcode, same operation -> nop.
Title: Re: Meta branch predictor Core2?
Post by: redskull on June 29, 2010, 11:03:40 PM
The Agner Fog instruction timing manual, under Core2 (65nm), pages 31-34.

-r
Title: Re: Meta branch predictor Core2?
Post by: clive on June 30, 2010, 01:51:57 AM
Quote from: theunknownguy on June 29, 2010, 09:43:21 PM
00401000 >   8D00           LEA EAX,DWORD PTR DS:[EAX]
00401002     8BC0           MOV EAX,EAX

What would be the difference?...

There are byte- and dword-displacement forms of lea reg,[reg+0]:

00000000 8D4000                 lea     eax,[eax]
00000003 8D642400               lea     esp,[esp]
00000007 8D5200                 lea     edx,[edx]
0000000A 8D8000000000           lea     eax,[eax]
00000010 8DA42400000000         lea     esp,[esp]
00000017 8D9200000000           lea     edx,[edx]
Title: Re: Meta branch predictor Core2?
Post by: clive on June 30, 2010, 03:54:34 AM
Quote from: theunknownguy
Got me on that, then how it could calculate:

LEA EAX, [EDX]

Without memory access?

what is the real intention of having LEA then? (Regarding to aligment, apart of filling more bytes?)

There are at least 3 forms of LEA that can be used to expand the size of the opcodes without materially affecting the speed of execution. I'd recommend using a register you are not using, so as not to create a dependency.

It is computing the address that would be accessed, without performing the access. The computation is done by the address computation logic, using simple adders and barrel shifters (for 1x, 2x, 4x, 8x). Given the simplicity of pipelining this computation it's hard to imagine it being sent to a complex ALU.

LEA is less relevant these days as the computational costs are fairly minimal, but if we look at the 8086 the costs of recomputing the assorted index modes was higher. As I recall [si+bx] was 2 cycles more than [bx] for the 8086, but the same on a 386, and [si+bx+1] was only one cycle longer on the 386.

For example in C, calculating a pointer address that is used later.

char *ptr;
char Buffer[0x200];
ptr = (char *)&Buffer[0x100];


would convert to
lea eax,byte ptr [edx + 0100h]     ; assuming edx holds the base of Buffer

char *ptr;
long LongBuffer[0x200];
int i = 0x100;
ptr = (char *)&LongBuffer[i];


mov eax,0100h                      ; i = 0x100
lea eax,dword ptr [edx + eax*4]    ; assuming edx holds the base of LongBuffer


It also has the benefit of NOT changing the flags.


  xor ecx,ecx ; clc
  mov ecx,10 ; 40 byte multi precision addition
@@:
  mov eax,[esi]
  lea esi,[esi+4]
  adc eax,[edi]
  mov [edi],eax
  lea edi,[edi+4]
  dec ecx ; NOT sub ecx,1 which destroys carry
  jnz @B