Meta branch predictor Core2?

Started by theunknownguy, June 24, 2010, 08:14:28 AM


theunknownguy

Quote from: hutch-- on June 25, 2010, 11:00:11 AM
Alignment in data is a direct read speed issue, put data misaligned across the data size boundaries and the processor must make 2 reads to get it and that makes your read speeds slow. Code alignment is another matter, for all of the theory its useful to have but it still tends to work on a "suck it and see" approach. Sometimes you see speed gains and sometimes you see the code go slower by aligning it. You can nearly always align a label by 4 bytes and if its a jump target its safe enough to do but be aware that there are cases where fast code gets slower by aligning the leading labels.

Thanks hutch, yes, I understand alignment; it comes down to the prefetch, which at least on the Core2 Duo is still 16 bytes per read. But would predecoding a NOP differ from predecoding a prefix, or not?

The align option will fill gaps with NOPs or redundant opcodes. But reading a little further, prefixes like 64h or 32h seem to be ignored right away. Are NOPs and redundant opcodes ignored the same way? That could avoid wasting a little speed.

(I also don't know whether compiler alignment ever emits branch prefixes, but I guess it doesn't.)

Also, I work on AI algorithms plus security. I don't expect to reinvent the wheel; I do this because it seems smart to predict whether you are running in a loop. Currently I use my own predecode algorithm to analyze code blocks (branches, instructions, etc.). I think loop prediction is a good thing to keep an eye on, at least for any AI algorithm.

hutch--

The trick with code alignment is to understand what you are doing with it. The gains from code alignment are generally small and determined by benchmarking, but the cost can be high for no gain at all. If you have a label that code falls through to and you align it to 16 bytes, the processor has to process the no-operation opcodes that fill the gap. The filler can be any combination of db 90h (nop), mov eax, eax, or whatever puts the least number of instructions into the required byte space, but it means you can have up to 15 bytes of junk to plough through, and that takes time you don't have to waste.

When you align code labels, ALWAYS benchmark to see whether your code is slower or faster. Unless you get a timing gain from it, don't waste the code space aligning it. Code alignment works best on jump targets, where you don't have to plough through junk to get there.
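The arithmetic above can be sketched in a few lines: a label at address A needs (-A) mod 16 bytes of filler to reach the next 16-byte boundary, and the assembler covers that span with as few no-operation instructions as it can. A Python sketch (the two filler encodings are illustrative choices, not what MASM actually emits):

```python
# Sketch: filler needed to align a label, and one hypothetical way
# to cover it with no-operation instructions.

def padding_needed(address: int, alignment: int = 16) -> int:
    """Bytes of filler required so the next label starts on an
    alignment boundary."""
    return (-address) % alignment

# Filler encodings by length: 90h is NOP, 8B C0 is mov eax, eax.
FILLERS = {1: b"\x90", 2: b"\x8b\xc0"}

def filler_bytes(count: int) -> bytes:
    """Cover `count` bytes with as few instructions as possible."""
    out = b""
    while count:
        size = 2 if count >= 2 else 1
        out += FILLERS[size]
        count -= size
    return out

# A label at address 0Bh needs 5 bytes of filler to reach 10h,
# all of which fall-through code must chew through.
print(padding_needed(0x0B))        # 5
print(filler_bytes(5).hex(" "))    # 8b c0 8b c0 90
```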
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Rockoon

I was under the impression that all modern predictors assume, given no recent history, that any backward branch is always taken (how often are backward branches not taken?).

In effect, the misprediction is always on the last iteration of the loop, not on the first.
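That rule can be simulated directly. A Python sketch of static backward-taken prediction applied to a counted loop (a hypothetical model for illustration, not a description of any specific predictor):

```python
# Static "backward taken" prediction applied to a counted loop: with
# no history, the backward branch at the loop bottom is always
# predicted taken, so the single misprediction lands on the final,
# fall-through iteration.

def simulate_static_btfnt(iterations: int) -> int:
    """Mispredictions for a loop whose backward branch is taken
    (iterations - 1) times and falls through once at exit."""
    mispredictions = 0
    for i in range(iterations):
        actually_taken = i < iterations - 1  # not taken on loop exit
        predicted_taken = True               # backward => assume taken
        if predicted_taken != actually_taken:
            mispredictions += 1
    return mispredictions

print(simulate_static_btfnt(1000))  # 1: only the exit iteration
```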
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

theunknownguy

Under my predecode algorithm, most single-byte opcodes without prefixes take just 1 clock cycle to read; such cases include:

04  ADD AL, imm8
05  ADD EAX, imm16/32
06  PUSH ES
07  POP ES
...
14  ADC AL, imm8
15  ADC EAX, imm16/32
16  PUSH SS
17  POP SS
etc.


NOP is one of those cases...

Any other opcode with more than one prefix attached (including the branch-hint prefixes) will take longer.

Of course my algorithm will never match the original one used in the microprocessor, but from the predecode logic, NOP seems the best way to align code (in terms of helping the predecoder).

PS: According to Agner's paper, all opcodes can be read by the predecoder in 1 clock cycle... how? I can't figure it out. I can only manage the prefix-free ones in 1 clock cycle.
PS2: Also, the worst opcodes for my algorithm to read are these:


F6 /0  TEST  Eb, Ib          Logical Compare
F6 /1  TEST  Eb, Ib (alias)  Logical Compare
F6 /2  NOT   Eb              One's Complement Negation
F6 /3  NEG   Eb              Two's Complement Negation
F6 /4  MUL   AX, AL, Eb      Unsigned Multiply
F6 /5  IMUL  AX, AL, Eb      Signed Multiply
F6 /6  DIV   AL, AH, AX, Eb  Unsigned Divide
F6 /7  IDIV  AL, AH, AX, Eb  Signed Divide
F7 /0  TEST  Ev, Iv          Logical Compare
F7 /1  TEST  Ev, Iv (alias)  Logical Compare
F7 /2  NOT   Ev              One's Complement Negation
F7 /3  NEG   Ev              Two's Complement Negation
F7 /4  MUL   eDX, eAX, Ev    Unsigned Multiply
F7 /5  IMUL  eDX, eAX, Ev    Signed Multiply
F7 /6  DIV   eDX, eAX, Ev    Unsigned Divide
F7 /7  IDIV  eDX, eAX, Ev    Signed Divide


Those need their own calculation, different from the one I use for other kinds of opcodes. Also, TEST reg8, imm8 is assembled differently from TEST AL, imm8. I can't find a way to read them in 1 clock.

I guess the processor manages 1 clock because it does the calculation in hardware; from a software point of view I find it nearly impossible to read all opcodes in 1 clock. One solution could be a list with every possible size for each opcode and every possible combination of that opcode.

Which would be a list of:

(opcode × possible combinations) × total opcodes.

I guess that's too big a list. I already killed myself making a long list for each opcode, including SSE(n) support...
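The awkwardness of F6h/F7h can be made concrete: their total length depends on the ModRM reg field (the /digit), because only the TEST forms carry an immediate. A toy Python length decoder for a handful of opcodes (register-only ModRM forms assumed, no SIB or displacement handling; the table is illustrative, not complete):

```python
# Toy table-driven length "predecoder". For most one-byte opcodes the
# total length is known from the opcode alone; for F6h/F7h the
# immediate exists only when ModRM selects /0 or /1 (TEST), so the
# ModRM byte must be examined as well.

SIMPLE = {
    0x90: 1,   # NOP
    0x04: 2,   # ADD AL, imm8
    0x05: 5,   # ADD EAX, imm32 (32-bit operand size)
    0x14: 2,   # ADC AL, imm8
    0x15: 5,   # ADC EAX, imm32
}

def insn_length(code: bytes, pos: int = 0) -> int:
    op = code[pos]
    if op in SIMPLE:
        return SIMPLE[op]
    if op in (0xF6, 0xF7):
        reg = (code[pos + 1] >> 3) & 7   # ModRM reg field = /digit
        imm = 0
        if reg in (0, 1):                # only TEST carries an immediate
            imm = 1 if op == 0xF6 else 4
        return 2 + imm                   # opcode + ModRM + immediate
    raise ValueError(f"opcode {op:02X}h not in this toy table")

print(insn_length(bytes([0xF6, 0xD0])))        # F6 /2 = NOT AL     -> 2
print(insn_length(bytes([0xF6, 0xC0, 0x01])))  # F6 /0 = TEST AL, 1 -> 3
```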


MichaelW

While I have no good idea of what is going on at the lower levels, in my experience timing NOPs (opcode 90h), they effectively execute in much less than 1 clock cycle each.

;==============================================================================
    include \masm32\include\masm32rt.inc
    .586
    include \masm32\macros\timers.asm
;==============================================================================

;------------------------------------------------------------------
; This macro expands to an inline implementation of a generator by
; George Marsaglia, with the specified number of NOPs inserted.
;
; "CONG is a congruential generator with the widely used 69069 as
; multiplier: x(n)=69069x(n-1)+1234567. It has period 2^32.
; The leading half of its 32 bits seem to pass all tests, but bits
; in the last half are too regular."
;------------------------------------------------------------------

cong MACRO cnt
    mov eax, cong_seed
    mov ecx, 69069
    mul ecx
    add eax, 1234567
    nops cnt
    mov cong_seed, eax
    xor edx, edx
    div DWORD PTR [esp+4]
    mov eax, edx
ENDM

;==============================================================================
    .data
        cong_seed dd 1234567
    .code
;==============================================================================
start:
;==============================================================================

    invoke Sleep, 3000

    FOR nopcnt,<0,1,2,3,4,8,16,64>
      counter_begin 1000, HIGH_PRIORITY_CLASS
        cong nopcnt
      counter_end
      push eax
      print str$(nopcnt),9
      pop eax
      print str$(eax)," cycles",13,10
    ENDM

    inkey "Press any key to exit..."
    exit
;==============================================================================
end start


Running on a P3:

0       30 cycles
1       30 cycles
2       30 cycles
3       30 cycles
4       30 cycles
8       30 cycles
16      30 cycles
64      57 cycles

eschew obfuscation

clive

Quote from: MichaelW on June 26, 2010, 10:07:36 PM
While I have no good idea of what is going on at the lower levels, within my experience timing NOPs (opcode 90h), at least effectively they execute in much less than 1 clock cycle each.

Indeed, it is definitely less than 1. There is multiple dispatch of instructions, the pipelines are pretty deep, and instructions which have no consequence can be retired in a much lazier fashion, so the actual throughput/latency will be hard to measure effectively.

I don't think one can model anything useful by looking at individual opcodes and the lengths of opcode sequences; the interactions of everything else going on are orders of magnitude more complex than that. Still, if you are going to try to map the whole opcode space, you'd want to automate it; no one in their right mind would do it entirely manually.
It could be a random act of randomness. Those happen a lot as well.

hutch--

The assumed relevance of cycle counting has been tenuous since the i486 with its pipeline. Since that technology, the action has been in SCHEDULING instructions so they proceed through the pipeline with the minimum number of stalls. Nor is there any effective correlation between small samples and the real-world times of working code; you always need to produce an algorithm that resembles the end task, then test it in a context similar to the end task, to get any reasonable relationship between the test piece and the final result.

There is no substitute for appropriate benchmarking in a context that has a known correlation to the end task, the rest is basically hot air. It is a mistake to keep wheeling out the assumptions of the pre-386 era with single execution units in sequence when for years you have had multiple pipelines with various methods of out of order execution, different techniques for level 1,2,3 caches and various configurations of microcode for different instructions.

I still have 2 PIVs running, a 2.8 gig Northwood and a 3.8 gig Prescott and they behave differently, the Northwood has a much shorter pipeline and does not suffer the same penalty under stalls, the Prescott has a much longer pipeline and while it pairs preferred instructions better, it pays a larger penalty when you stall the longer pipelines.

The Core2 Quad I use has a higher instruction throughput for a given clock speed than the faster PIV and it does not suffer the same problems with opcodes like LEA that the entire PIV family had. Shifts and rotates are still slow on the Core2 Quad but you have the advantage of faster DDR3 memory. The i7 Quad I have is faster with rotates and shifts and for the same memory speed and very close to the same clock speed under load it has a higher instruction throughput than the Core2 Quad.

The point of addressing the last 10 years or so of popular processors is that there are not that many common assumptions across those processor families, prefixed branch prediction is effectively a waste of time where careful instruction scheduling for higher sustainable throughput is the only place where the action is and it works across the range of recent processors reasonably well. Branch reduction in the basic design works for you, occasionally using alternative instructions like CMOVxx and SETxx will help but there is no substitute for laying out the concept with the minimum number of branches in the first place.

NOTE that there is a limit to what you can achieve with instruction twiddling and it is based on the speed difference between memory reads and writes and register reads and writes. One of the most effective optimisation methods is to arrange your code with the minimum number of memory reads and writes and in this context you can usually see decent speed gains if you can get the count down.
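That last point translates to any language: keep a hot value in a register (or, in this Python sketch, a local variable) rather than re-reading and re-writing memory every iteration. The sketch only illustrates the transformation; Python's interpreter costs are of course nothing like machine-level loads and stores:

```python
# Same arithmetic, different memory traffic: the first version loads
# and stores a memory cell on every iteration, the second keeps the
# running value "in a register" and stores once at the end.

def checksum_memory_heavy(data: list[int]) -> int:
    acc = [0]                    # one load + one store per element
    for x in data:
        acc[0] = acc[0] + x
    return acc[0]

def checksum_register_style(data: list[int]) -> int:
    acc = 0                      # value stays in a local the whole loop
    for x in data:
        acc += x
    return acc

data = list(range(100))
assert checksum_memory_heavy(data) == checksum_register_style(data)
print(checksum_register_style(data))  # 4950
```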

theunknownguy

Well, talking about redundant opcodes and alignment opcodes, NOP is the best one. But note that other redundant opcodes can use more than 1 byte and still be "read" in 1 clock or less.

I say "read" because the predecode unit only tries to figure out the opcode length in order to calculate the next position.

Redundant opcodes such as:

lea eax, [eax]

will, in theory, pass to the memory read unit (after predecode, on port 2) and cause a delay (which is not perceptible).

I have searched for documents or papers explaining whether the microprocessor avoids processing such redundant opcodes in any of the port units, but I can't find any info on it (or I'm bad at searching). So I guess they are processed.

In the end, there seems to be no measurable difference between a NOP, a redundant opcode, or a branch prefix.

Quote from: hutch--
One of the most effective optimisation methods is to arrange your code with the minimum number of memory reads and writes and in this context you can usually see decent speed gains if you can get the count down.

You can't do that very often, hutch, but if you have many memory reads and writes you can always arrange the code, at least on the Core2 through Core i7, so that it produces fused micro-ops. For a little more speedup, eliminating stall dependencies, exploiting out-of-order execution, and using all the processing ports would also speed up the code.

Still, after everything you can optimise, there will always be people trying to make things faster (not my case). But I do wonder which things are processed faster than others.

About the predecode: it reads up to 16 bytes, at least on the Core2 Duo, splits them into µops, and calculates the next address in 1 clock (or less). That seems impossible to match in software. The fastest I could manage was 1 clock for some opcodes.
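The boundary-marking part of that predecode step can be sketched in software: given one fetch block and a length for each opcode, mark where every instruction starts. A toy Python model with a few fixed-length opcodes (real predecoders also handle prefixes, ModRM/SIB, and instructions that span fetch blocks):

```python
# Sketch of predecode boundary marking: scan one fetch block and
# record the start offset of each instruction, so later stages know
# where to pick up the next one. Lengths come from a tiny toy table.

LENGTHS = {
    0x90: 1,  # NOP
    0x40: 1,  # INC EAX
    0x04: 2,  # ADD AL, imm8
    0x05: 5,  # ADD EAX, imm32
    0xC3: 1,  # RET
}

def mark_boundaries(block: bytes) -> list[int]:
    """Return the start offset of each instruction in one block."""
    starts, pos = [], 0
    while pos < len(block):
        starts.append(pos)
        pos += LENGTHS[block[pos]]
    return starts

# NOP; ADD AL, 7; ADD EAX, 1; INC EAX; RET
fetch = bytes([0x90, 0x04, 0x07, 0x05, 1, 0, 0, 0, 0x40, 0xC3])
print(mark_boundaries(fetch))  # [0, 1, 3, 8, 9]
```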

MichaelW

Quote from: theunknownguy on June 27, 2010, 02:55:55 AM
While redundant opcodes such has:

lea eax, [eax]

Will pass in theory to the memory read unit (after the predecode/port 2) and make a delay (which is not perceptible).

In my code above, if I substitute that instruction for the NOP, running on a P3 I get:

0       30 cycles
1       30 cycles
2       30 cycles
3       30 cycles
4       30 cycles
8       30 cycles
16      46 cycles
64      124 cycles


Or for the 7-byte NOP that MASM uses lea esp,[esp]:

0       30 cycles
1       30 cycles
2       30 cycles
3       30 cycles
4       30 cycles
8       30 cycles
16      49 cycles
64      127 cycles


So it would seem that Clive is right regarding "instructions which have no consequence".

theunknownguy

Quote from: MichaelW on June 27, 2010, 05:47:50 AM
Quote from: theunknownguy on June 27, 2010, 02:55:55 AM
While redundant opcodes such has:

lea eax, [eax]

Will pass in theory to the memory read unit (after the predecode/port 2) and make a delay (which is not perceptible).

In my code above if I substitute that instruction for NOP, running on a P3 I get:

0       30 cycles
1       30 cycles
2       30 cycles
3       30 cycles
4       30 cycles
8       30 cycles
16      46 cycles
64      124 cycles


Or for the 7-byte NOP that MASM uses lea esp,[esp]:

0       30 cycles
1       30 cycles
2       30 cycles
3       30 cycles
4       30 cycles
8       30 cycles
16      49 cycles
64      127 cycles


So it would seem that Clive is right regarding "instructions which have no consequence".


Yes, as I said:

Quote
In the end, there seems to be no measurable difference between a NOP, a redundant opcode, or a branch prefix.

But I can't find a paper that describes whether redundant opcodes are avoided. So even if they are passed to the memory read unit (the LEA opcode should be, at least), you won't see any difference in clock cycles, since it isn't measurable. But passing through the memory read unit would still waste time compared with an avoided NOP. Time worth nothing, but worth an eye for the curious mind (and no, I am not a speed freak)  :lol

PS: If redundant opcodes are avoided, I guess NOP would stop being my favorite opcode...  :(
PS2: I think redundant opcodes such as LEA reg32, [same reg32] are passed to the memory unit, since adding a check that the source operand matches the destination would waste more time than just letting the instruction be processed.

MichaelW

I have doubts that NOPs, of any length, are "executed". The increased cycle counts at the higher NOP counts could just be from caching effects.

theunknownguy

Quote from: MichaelW on June 27, 2010, 06:08:55 AM
I have doubts that NOPs are "executed". The increased cycles at the higher counts could just be from caching effects.


They are not "executed", just read by the predecode unit (get the length... lol, 1 byte), while other redundant opcodes need to be executed (predecode, decode, execution units).

Getting the NOP size in my predecode algorithm takes 1 clock in software; imagine how small the delay is in the microprocessor (less than 1 clock)  :lol

A good test would be to compare the pipeline cache with 16 bytes of NOPs against 16-byte alignment done with redundant opcodes (but I guess nothing measurable again)...

hutch--

db 90h IS an instruction, so in fact it DOES get executed and it DOES take time; it's just that later processors are smart enough to know there is no dependency, so it goes through the pipeline with no problems. As a simple test, write a time-intensive loop, then start adding NOPs before it; after a few that are swallowed by other delays, you will see the code start to slow down.

This is the case with any NO OPERATION instruction.

theunknownguy

Quote from: hutch-- on June 27, 2010, 07:02:42 AM
db 90h IS an instruction so it in fact DOES get executed and it DOES take time, its just that later processors are smart enough to know that there is no dependency so it goes through the pipeline with no problems but as a simple test, write a time intensive loop then start adding nops before it and after a few that are swallowed by other delays you will see the code start to slow down.

This is the case with any NO OPERATION instruction.

Well, it should cost time, but the whole question here is whether the processor is smart enough to ignore the other redundant opcodes...

What do you mean by "executed"? I guess you don't mean the NOP being decoded, split into µops, or even dispatched to an execution unit. I am about 99% sure (one can never be certain) that the NOP is only predecoded and nothing else (and that would add its time, like you say), but "executed" I doubt very much.

Also, searching with my friend Google, look who was asking about the multi-byte NOP:

Quote
Multi-byte NOP opcode made official
The latest version of IA-32 Intel® Architecture Software Developer's Manual Volume 2B: Instruction Set Reference, N-Z
(ftp://download.intel.com/design/Pentium4/manuals/25366719.pdf) contains the opcode for a multi-byte NOP instruction. The opcode is
0F 1F mod-000-rm
The multi-byte NOP can have any length up to 9 bytes. Quite useful for alignment.

The manual says that this multi-byte NOP works on all processors with family number 6 or F, which is all Intel processors back to Pentium Pro (except Pentium MMX). I have verified that it works. I was surprised to discover that it works also on an AMD64 processor, although it is not documented in AMD manuals. I didn't find it on any website of undocumented opcodes.

How come that this opcode has been kept secret for so many years? Why is it made official now? How come it works on AMD processors when noone else has discovered it, and AMD recommends the opcode 66 66 66 90 for multibyte NOP?

I guess this is not the right place to ask about AMD processors, but how do I safely detect whether the multi-byte NOP is supported on an AMD processor? There is no use for this opcode unless you have an absolutely safe method of detecting whether it is supported, and this detection method works on all brands.

www.agner.org

Multi-byte NOPs? So a 9-byte NOP, according to my theory, should be even faster than 9 separate NOPs. Why?

Well, it's assembled differently, and the predecoder will pick up the size of the multi-byte NOP (9 bytes in this example) in a single step, which takes less time than reading 9 separate NOPs (great, Intel!).
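For reference, these are the recommended multi-byte NOP encodings (90h, 66 90h, and the 0F 1F forms documented in Intel's manual), tabulated in a Python sketch that uses them as padding:

```python
# Recommended multi-byte NOP encodings: 0F 1F with varying ModRM and
# displacement, plus 66 operand-size prefixes, covering 1-9 bytes
# with a single instruction instead of a run of 90h NOPs.

MULTI_NOP = {
    1: bytes.fromhex("90"),                  # NOP
    2: bytes.fromhex("6690"),                # 66 NOP
    3: bytes.fromhex("0f1f00"),              # NOP DWORD [eax]
    4: bytes.fromhex("0f1f4000"),            # NOP DWORD [eax+00h]
    5: bytes.fromhex("0f1f440000"),          # NOP DWORD [eax+eax+00h]
    6: bytes.fromhex("660f1f440000"),
    7: bytes.fromhex("0f1f8000000000"),      # NOP DWORD [eax+disp32]
    8: bytes.fromhex("0f1f840000000000"),    # NOP DWORD [eax+eax+disp32]
    9: bytes.fromhex("660f1f840000000000"),
}

def nop_pad(count: int) -> bytes:
    """Fill `count` bytes using as few (multi-byte) NOPs as possible."""
    out = b""
    while count:
        n = min(count, 9)
        out += MULTI_NOP[n]
        count -= n
    return out

print(nop_pad(13).hex())   # one 9-byte NOP followed by one 4-byte NOP
```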

A paper that supports the theory, I guess:

http://www.ragestorm.net/blogs/?p=14

Finally, Peter Ferrie (yes, the guy who writes security papers at Microsoft) wrote this:

QuotePeter Ferrie says:
June 28, 2007 at 11:00 am

Multi-byte NOPs have existed since at least the Pentium 3, maybe even earlier. They were documented by Intel only recently, though. Since AMD and Intel CPUs share code, AMD CPUs support these instructions, too.

They are useful for when the code flow will reach a loop. The loop should be cache-line aligned for best fetch performance.

LEA is horribly slow on newer CPUs. It also contains a register write. You might not want that write.

The NOPs are fast because MODR/M resolution happens in parallel to the fetch itself. See here for a more detailed description of how that can go wrong ;-) -
http://www.symantec.com/enterprise/security_response/weblog/2007/02/x86_fetchdecode_anomalies.html

The NOPs are fast because MODR/M resolution happens in parallel to the fetch itself

I guess that's enough to say that LEA and other redundant opcodes are slower than my beloved NOP (now the multi-byte NOP)  :lol

PS: I guess I was not the only guy with questions like "NOP or LEA reg32, [reg32]?" in my head, when both are redundant with almost no measurable difference... But curiosity always wins.

MichaelW

Quote from: hutch-- on June 27, 2010, 07:02:42 AM
as a simple test, write a time intensive loop then start adding nops before it and after a few that are swallowed by other delays you will see the code start to slow down.

I've gone through multiple versions of this code, and I still don't see that effect on a P3.

;==============================================================================
    include \masm32\include\masm32rt.inc
    .586
    include \masm32\macros\timers.asm
;==============================================================================
;------------------------------------------------------------------
; This macro expands to an inline implementation of a generator by
; George Marsaglia, with the specified number of NOPs inserted.
;
; "CONG is a congruential generator with the widely used 69069 as
; multiplier: x(n)=69069x(n-1)+1234567. It has period 2^32.
; The leading half of its 32 bits seem to pass all tests, but bits
; in the last half are too regular."
;------------------------------------------------------------------

cong MACRO cnt
    mov eax, cong_seed
    mov ecx, 69069
    mul ecx
    add eax, 1234567
    nops cnt
    mov cong_seed, eax
    xor edx, edx
    mov ecx, 10
    div ecx
    mov eax, edx
ENDM

;==============================================================================
    .data
      cong_seed dd 0
    .code
;==============================================================================
start:
;==============================================================================

    invoke Sleep, 10000

    REPEAT 3

      FOR nopcnt,<0,1,2,3,4,8,16,24,32>

        timer_begin 1, HIGH_PRIORITY_CLASS

          mov ebx, 100000000
          align 16
        @@:
          cong nopcnt
          sub ebx, 1
          jnz @B

        timer_end

        push eax
        print str$(nopcnt),9
        pop eax
        print str$(eax)," ms",13,10

      ENDM

      print chr$(13,10)

    ENDM

    inkey "Press any key to exit..."
    exit
;==============================================================================
end start


0       7927 ms
1       7966 ms
2       7805 ms
3       7204 ms
4       7193 ms
8       7191 ms
16      7193 ms
24      8802 ms
32      9396 ms

0       7924 ms
1       7963 ms
2       7797 ms
3       7199 ms
4       7191 ms
8       7213 ms
16      7199 ms
24      8818 ms
32      9412 ms

0       7926 ms
1       7972 ms
2       7827 ms
3       7199 ms
4       7200 ms
8       7199 ms
16      7214 ms
24      8808 ms
32      9411 ms


As you can see, there is no detectable slowdown up through 16 NOPs. On a P3, with only two execution ports available, if the NOPs were being (fully) executed it seems to me the slowdown should start after only a few NOPs are added.