
Meta branch predictor Core2?

Started by theunknownguy, June 24, 2010, 08:14:28 AM


theunknownguy

That is a good test there, Michael. I don't remember the fetch-block size on a P3, but if it is 16 bytes, here is the explanation:

- 16 bytes are prefetched in 1 clock or less (according to Agner's microarchitecture manual)
- 0-16 bytes: around 7175 ms to 7838 ms (the extra NULL bytes within the 16-byte fetch would make the read a little harder, but imperceptibly so for any timing tool)
- 16 bytes of NOPs: the perfect length for one full cache read, with no bytes left over (should get the best time)
- 24 bytes (2 cache reads): noticeably more milliseconds, and 2 clock cycles (or less) for the read at the processor level: 8835 ms
- 32 bytes (3 cache reads): 9601 ms

Around 1000 ms, approximately, for each extra round of cache reading (including the predecode step).

With LEA REG32, REG32 it should take a little longer (in the milliseconds of the test): 16 bytes are still prefetched, but all of them also have to be executed, through the memory unit. A multi-byte NOP should take even less than the current test you made.

As I keep saying, NOPs are not "executed", just prefetched and predecoded in 16-byte rounds (at least on Core2), while LEA REG, REG and almost all other redundant opcodes have to be executed as well.

The only good thing about LEA REG32, REG32 is that it uses 2 bytes, while the ESP version (and versions with other registers) uses more. This could help the predecoder speed things up, but again it is not measurable. The only gain I think anybody could experience is with multi-byte NOPs for alignment.

At least it is good to know such in-depth things about the processor, for the sake of knowledge.
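The fetch-block arithmetic above can be sketched in C. This is an illustration, not a measurement: it assumes the Core2-style aligned 16-byte fetch blocks described in Agner Fog's microarchitecture manual, and `blocks_touched` is a name of my own invention:

```c
#include <assert.h>

/* How many aligned 16-byte fetch blocks a run of n code bytes touches,
   given its starting offset (0..15) inside a block. A body that fits in
   one block costs a single fetch; every extra block costs another fetch
   (plus its predecode pass). */
static int blocks_touched(int offset, int n)
{
    if (n == 0)
        return 0;
    return (offset + n + 15) / 16;   /* ceiling of (offset + n) / 16 */
}
```

With the body aligned (offset 0), 16 bytes of NOPs fit in one fetch block while 17 bytes spill into a second; an unaligned body can straddle a block boundary even when it is short, which is the shape of the timing steps described above.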

hutch--

It's very easy to prove that NOPs take time. Here are the results for the following test piece, run on a 3 GHz Core2 Quad.


343 no nops
328 1 nop
500 2 nops
672 3 nops
657 4 nops
1109 8 nops
344 no nops
328 1 nop
500 2 nops
672 3 nops
656 4 nops
1125 8 nops
328 no nops
328 1 nop
485 2 nops
671 3 nops
672 4 nops
1110 8 nops
328 no nops
344 1 nop
500 2 nops
656 3 nops
672 4 nops
1109 8 nops
328 no nops
344 1 nop
484 2 nops
672 3 nops
672 4 nops
1109 8 nops
329 no nops
343 1 nop
469 2 nops
672 3 nops
672 4 nops
1109 8 nops
328 no nops
344 1 nop
484 2 nops
657 3 nops
672 4 nops
1109 8 nops
344 no nops
328 1 nop
500 2 nops
672 3 nops
656 4 nops
1109 8 nops
Press any key to continue ...


Here is the test piece.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .data?
      value dd ?

    .data
      item dd 0

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    push esi

    ; run the six timed loops (0, 1, 2, 3, 4, 8 NOPs) eight times over
    REPEAT 8


    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    ; nop
    ; nop
    ; nop
    ; nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," no nops",13,10

    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    nop
    ; nop
    ; nop
    ; nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," 1 nop",13,10

    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    nop
    nop
    ; nop
    ; nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," 2 nops",13,10

    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    nop
    nop
    nop
    ; nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," 3 nops",13,10

    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    nop
    nop
    nop
    nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," 4 nops",13,10

    invoke GetTickCount
    push eax
    mov esi, 1000000000
  align 4
  @@:
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    sub esi, 1
    jnz @B
    invoke GetTickCount
    pop ecx
    sub eax, ecx
    print str$(eax)," 8 nops",13,10



    ENDM

    pop esi

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
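For readers without MASM handy, here is a rough C analogue of the test piece. It is a sketch under stated assumptions: GCC/Clang inline `asm` for the NOPs, `clock()` in place of `GetTickCount`, and a reduced iteration count so it finishes quickly; `time_loop` is my own helper name.

```c
#include <time.h>

#define NOP __asm__ volatile ("nop");

/* Time a countdown loop with either an empty body or 8 NOPs in the
   body, returning elapsed milliseconds, as in hutch's test piece.
   The volatile counter keeps the compiler from deleting the loop. */
static long time_loop(int with_nops)
{
    volatile long i;
    clock_t t0 = clock();
    for (i = 100000000L; i > 0; i--) {   /* 10^8 iterations */
        if (with_nops) {
            NOP NOP NOP NOP NOP NOP NOP NOP
        }
    }
    return (long)((clock() - t0) * 1000L / CLOCKS_PER_SEC);
}
```

On hutch's 3 GHz Core2 the MASM version showed roughly 330 ms with no NOPs and about 1100 ms with 8; this sketch should show the same shape, though the absolute numbers depend on the CPU and compiler flags.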
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Rockoon

theunknownguy, I think you mean 'µs' (microsecond, 1 millionth of a second), not 'ms' (millisecond, 1 thousandth of a second)

A millisecond is an insanely long amount of time in terms of semi-modern computers (my current computer can execute something like 10 million instructions per millisecond, per core)

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

theunknownguy

Quote from: Rockoon on June 27, 2010, 12:24:07 PM
theunknownguy, I think you mean 'µs' (microsecond, 1 millionth of a second), not 'ms' (millisecond, 1 thousandth of a second)

A millisecond is an insanely long amount of time in terms of semi-modern computers (my current computer can execute something like 10 million instructions per millisecond, per core)



He is running it in a loop of 10^9 iterations; it can't be microseconds... We are talking about the timing of the test in software land, not the time of the NOP itself on the processor.

Also, I have never said NOP doesn't take time. Predecoding any opcode takes its time, of course. But are NOPs executed? Wrong...

Is LEA slow? Yes. And is the multi-byte NOP the best method to align (it supports at least 9 bytes)? Yes...

PS: In case you wanted to know, Rockoon: the prefetch delay in the pipeline is around 1 clock (or less) per 16 bytes (at least on a Core2 Duo).

MichaelW

It turns out that what I was seeing was the NOPs running in parallel with the other slower instructions. After modifying my code to avoid dependencies, at least for a P3 NOP (90h) appears to have the same timing as other fast instructions, even when those other instructions do have a "consequence".
eschew obfuscation
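MichaelW's point about dependencies is the classic instruction-level-parallelism pitfall: a chain where each operation needs the previous result serializes, while independent operations can overlap in the pipeline. A C sketch of the two shapes (illustration only; the helper names are mine, and both functions compute the same sum):

```c
#include <assert.h>

/* One accumulator: every add depends on the previous add, so the adds
   retire serially no matter how many ALU ports the CPU has. */
static long sum_one_chain(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two accumulators: the two chains are independent, so their adds can
   issue in parallel -- the effect MichaelW had to remove before the
   NOP timings made sense. */
static long sum_two_chains(const long *a, int n)
{
    long s0 = 0, s1 = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];   /* odd element left over */
    return s0 + s1;
}
```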

theunknownguy

Quote from: MichaelW on June 28, 2010, 01:33:34 AM
It turns out that what I was seeing was the NOPs running in parallel with the other slower instructions. After modifying my code to avoid dependencies, at least for a P3 NOP (90h) appears to have the same timing as other fast instructions, even when those other instructions do have a "consequence".

How can you tell from a timing, Michael, whether something passes through the memory-unit port or not?

Well, it should run in parallel:

Quote: The maximum throughput of the predecoders is 16 bytes or 6 instructions per clock cycle,
whichever is smallest. The throughput of the rest of the pipeline is typically 4 instructions per
clock cycle, or 5 in the case of macro-op fusion.

I will say again that there is no measurable gain between a LEA and a NOP. But there is a (measurable) gain in using the multi-byte NOP.

I'll stick for now with Peter Ferrie:

Quote: The NOPs are fast because ModR/M resolution happens in parallel with the fetch itself.
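The predecoder limit quoted above (16 bytes or 6 instructions per clock, whichever is smaller) can be written out as a one-liner, using the Core2 numbers from Agner Fog's manual as quoted; `predecode_ipc` is my own helper name:

```c
#include <assert.h>

/* Instructions the Core2 predecoders can handle per clock, per the
   quoted limits: capped at 6 instructions and at 16 code bytes. */
static double predecode_ipc(double avg_instr_len)
{
    double by_bytes = 16.0 / avg_instr_len;  /* 16-byte fetch limit */
    return by_bytes < 6.0 ? by_bytes : 6.0;  /* 6-instruction limit */
}
```

At an average of 4 bytes per instruction this gives 4 per clock, matching the rest of the pipeline's typical 4-per-clock throughput; shorter instructions hit the 6-instruction cap instead.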

redskull

A 0x90 NOP generates the same style of uOp as a reg/reg MOV; that is, a single uOp that can go to any of the basic ALU ports (on Core2, that means 0, 1, or 5).  So, best case, if you are fetching and decoding instructions at full speed, you can execute 3 NOPs per clock cycle (again, on a Core2)

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government
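redskull's port argument reduces to simple arithmetic. A sketch, taking the port counts as he states them for Core2 (three general ALU ports available to a NOP uOp, a single port for LEA); `best_case_cycles` is my own name:

```c
#include <assert.h>

/* Best-case cycles to issue n single-uOp instructions when each uOp
   may use any of `ports` execution ports, capped by the 4-wide
   decode/rename width. */
static long best_case_cycles(long n, int ports)
{
    int per_cycle = ports < 4 ? ports : 4;
    return (n + per_cycle - 1) / per_cycle;  /* ceiling division */
}
```

Nine NOPs (3 eligible ports) need 3 cycles at best; nine port-0-bound LEAs need 9: the 1/3 versus 1 throughput figures redskull describes.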

Rockoon

The concept of "timing" single instructions is outdated. If you want any reasonable measure of single-instruction performance, you need at least 2 values: latency and throughput. I would argue that you could also include its duration within the pipeline, and other factors related to specific stages of the pipeline.

I am pointing this out because the concept of "latency" doesn't really apply to instructions that have no side effects.

In practice, you only measure in-practice code. And ideally, you use a tool like CodeAnalyst or VTune to do so, not the RDTSC(P) instruction.

theunknownguy

Quote from: redskull on June 29, 2010, 07:10:23 PM
An 0x90 NOP generates the same style uOps as a r/r MOV; that is, a single uOp that can go to any of the basic ALU ports (on Core2, that means 0, 1, or 5).  So, best case, if you are fetching and decoding instruction at full speed, you can execute 3 NOPs per clock cycle (again, on a Core2)

-r

Does a NOP generate uOps? What would be the intention of that... (they should be avoided and just predecoded) (not sure). Also, the ALU ports are used for arithmetic functions; you have other ports as well, such as port 2 (the memory unit on Core2) in the case of the LEA opcode.

I can't find where the NOP opcode falls in the uOp description here:

http://www.ptlsim.org/Documentation/html/node7.html

Still, I think the question is already answered (by Peter Ferrie): NOP is faster than LEA, at a non-measurable level.

And I think you meant 3 NOPs (reading Agner), but that doesn't seem quite a "rule":

Quote: The throughput of the predecoders is obviously less than 4 instructions per clock cycle if there are less than 4 instructions in each 16-byte block of code. The average instruction length should therefore preferably be less than 4 bytes on average.

"If there are less than 4 instructions in each 16-byte block of code"...

Meanwhile, for alignment, the multi-byte NOP seems the best option for helping the predecoder.

Quote: The concept of "timing" single instructions is outdated. If you want any reasonable measure of single-instruction performance, you need at least 2 values: latency and throughput. I would argue that you could also include its duration within the pipeline, and other factors related to specific stages of the pipeline.

I am pointing this out because the concept of "latency" doesn't really apply to instructions that have no side effects.

In practice, you only measure in-practice code. And ideally, you use a tool like CodeAnalyst or VTune to do so, not the RDTSC(P) instruction.

Man, I think you are not getting the point. I repeat in every single one of my posts that there is NO measurable gain between using NOP or LEA for alignment (Michael's test proves it).
But my question is far away from something you can measure. In terms of logic, and of the information collected from Agner's papers, Intel's, and other quotes (since I am not the only guy with this stupid question), NOP seems to be better than other redundant opcodes for alignment (in this case LEA REG32, REG32), and the multi-byte NOP the best way to align at all.

PS: I am not a speed maniac; it's just that it is good to know which things are better than others, in depth. Just a curious mind...

redskull

First off, "measurable gain" is not really what this is about; rarely will swapping instructions produce a measurable gain, especially when you are timing under protected mode.

Quote from: theunknownguy on June 29, 2010, 07:37:22 PM
Does NOP generates uops? what would be the intention of that...

NOPs still have to execute; if not, how would the computer "stall"?  Informally, whenever the CPU has to wait around for data to become available, it executes a NOP; if the NOP took no time at all, then stalling would do no good either.  "Doing nothing" from a CPU standpoint is not the same as "not doing anything".

In either case...

LEA generates one uOp that MUST go into port 0.  NOP generates one uOp that can go to port 0, 1, or 5.  So again, assuming maximum fetching and decoding, the EU can churn through three NOPs a cycle (throughput of 1/3), whereas LEA uOps must go through port 0 sequentially (throughput of 1).  In a real-world situation, there will be many, many more uOps at the reservation station that must go through port 0 as well, so it will wait even more.  Basically, since fewer uOps must go through port 5 (do any?), NOPs can execute in parallel, while LEAs can't.

-r

Rockoon

Quote from: redskull on June 29, 2010, 08:16:52 PM
NOP's still have to execute; if not, how would the computer "stall"?  Informally, whenever the CPU has to wait around for data to become available, it executes a NOP; if the NOP took no time at all, then stalling would do no good either.  "Doing nothing" from a CPU standpoint is not the same as "not doing anything"

The CPU does not issue NOPs during "stalls"... that would make stalls even worse by putting pressure on register-renaming resources.



theunknownguy

Quote from: redskull on June 29, 2010, 08:16:52 PM
First off, "measurable gain" is not really what this is about; rarely will swapping instructions produce a measurable gain, especially when you are timing under protected mode.

Quote from: theunknownguy on June 29, 2010, 07:37:22 PM
Does NOP generates uops? what would be the intention of that...

NOP's still have to execute; if not, how would the computer "stall"?  Informally, whenever the CPU has to wait around for data to become available, it executes a NOP; if the NOP took no time at all, then stalling would do no good either.  "Doing nothing" from a CPU standpoint is not the same as "not doing anything"

In either case...

LEA generates one uOp that MUST go into port 0.  NOP generates one uOp that can go to port 0, 1, or 5.  So again, assuming maximum fetching and decoding, the EU can churn through three NOPs a cycle (throughput of 1/3), whereas LEA uOps must go through port 0 sequentially (throughput of 1).  In a real-world situation, there will be many, many more uOps at the reservation station that must go through port 0 as well, so it will wait even more.  Basically, since fewer uOps must go through port 5 (do any?), NOPs can execute in parallel, while LEAs can't.

-r


Quote: First off, "measurable gain" is not really what this is about.

Of course; that is what I am saying all over the place (damn, I think I've put that like 10 times in this thread). It's about curiosity, and about using logic to determine what the best alignment method is (even if it has no measurable gain).

Quote: LEA generates one uOp that MUST go into port 0.

Why to the ALU? (the uOp would explain it). But indeed LEA has to do with memory addresses, so port 2 is used too...

Quote: NOP generates one uOp that can go to port 0, 1, or 5

Where do you get this info? (please give a source: quote, paper, etc.)

Thanks  :U





qWord

Quote from: theunknownguy on June 29, 2010, 09:23:12 PM... indeed LEA have to do with memory read, so port 2 is used too...
lea is an arithmetic instruction with registers or immediate values as operands - there is no memory access.
FPU in a trice: SmplMath
It's that simple!

theunknownguy

Quote from: qWord on June 29, 2010, 09:33:04 PM
Quote from: theunknownguy on June 29, 2010, 09:23:12 PM... indeed LEA have to do with memory read, so port 2 is used too...
lea is an arithmetic instruction with registers or immediate values as operands - there is no memory access.

You got me on that. Then how could it calculate:

LEA EAX, [EDX]

without memory access? (Is it performing just a simple register copy?)
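qWord's point, that LEA does address *arithmetic* rather than a memory *access*, maps directly onto computing an address in C without ever dereferencing it. A minimal sketch (`lea_like` and `lea_reg` are my own names):

```c
#include <assert.h>
#include <stdint.h>

/* lea eax, [edx+ecx*4+8]: the address unit computes edx + ecx*4 + 8
   and writes the result to eax; memory is never read. */
static uint32_t lea_like(uint32_t base, uint32_t index)
{
    return base + index * 4 + 8;   /* pure arithmetic, no load */
}

/* The degenerate lea eax, [edx] has nothing to add or scale, so it
   behaves like mov eax, edx: a plain register copy. */
static uint32_t lea_reg(uint32_t reg)
{
    return reg;
}
```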