Fastest Absolute Function

Started by Twister, August 23, 2010, 09:31:34 PM


dioxin

Lingo (and others),
   
Quote
But I haven't RETURN instruction in my code, hence I haven't a misprediction penalty too
You miss the point.

CALL, JMP and RETURN are all branch instructions so they are subject to branch prediction optimisations.
CALL 1234567 is 100% predictable, the address to jump to is in the instruction.
JMP edx is 100% unpredictable. There is no way for the CPU to know in advance what the address will be.
RET is also 100% unpredictable. There is no way for the CPU to know in advance what the address will be on top of the stack until the stack is popped.

As an optimisation, the branch prediction logic maintains a private stack of return addresses.
Each CALL has the appropriate return address stacked and then, when the RET instruction is encountered, the known address is popped off the private stack and the RET becomes a 100% predictable branch.
This works until someone fails to match the CALLs and RETs; then the private stack gets out of sync and all returns revert to being 100% mispredicted. Not just the last return but all stacked returns are now mispredicted. The stack is typically 16 items deep, allowing the last 16 CALL-RETURN pairs to be predicted correctly.
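To make the pairing issue concrete, here is a minimal sketch of the two patterns being discussed (the labels are mine, purely for illustration):

; matched pair: the CALL pushes the return address onto the
; predictor's private stack and the RET consumes it, so the
; RET is predicted from that private stack
    call MyProc
    ...
MyProc:
    ...
    ret

; unmatched: the CALL still pushes onto the private stack, but
; nothing ever consumes that entry, so the private stack drifts
; out of step with the real stack and later RETs up the call
; chain are predicted with stale addresses
    call MyProc2
    ...
MyProc2:
    pop ecx         ; take the return address off the real stack
    ...
    jmp ecx         ; indirect jump instead of RET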

Paul.


lingo

NightWare,

"i don't know where this stupidity come from, but there is NO special mecanism."

"This stupidity come from"  Intel 64 and IA-32 Architectures Optimization Reference Manual from November 2009.

Hutch and dioxin,

"you mess up the CALL RET pairing."
and
"Not just the last return but all stacked returns are now mispredicted."

I agree, but I don't care what will happen AFTER the FINISH of my program.



dioxin

Lingo,
Quote
but I don't care what will happen AFTER the FINISH of my program.
Your program isn't finished until your JMP is complete. RET will complete quicker.

Paul.

MichaelW

This is an attempt to determine if the problem is with the jmp ecx being mispredicted. I couldn't see any good way to make the destination operand a constant, so I settled for a uniform return address instead of one that changes on each call.

;====================================================================
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
;====================================================================
    .data
    .code
;====================================================================

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
abs proc dwNum:DWORD
    mov eax, [esp+4]    ; eax = dwNum
    cdq                 ; edx = 0 if eax >= 0, 0FFFFFFFFh if eax < 0
    xor eax,edx         ; branchless abs: (x xor mask) - mask
    sub eax,edx
    ret 4
abs endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

;====================================================================

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
absL proc dwNum:DWORD
    pop ecx             ; pop the return address instead of using RET
    pop eax             ; eax = dwNum
    cdq                 ; edx = 0 if eax >= 0, 0FFFFFFFFh if eax < 0
    xor eax,edx         ; branchless abs: (x xor mask) - mask
    sub eax,edx
    jmp ecx             ; return via indirect jump, as in lingo's version
absL endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

;====================================================================
start:
;====================================================================

    invoke abs, 0
    print str$(eax),13,10
    invoke abs, 1
    print str$(eax),13,10
    invoke abs, -1
    print str$(eax),13,10
    invoke abs, 2147483647
    print str$(eax),13,10
    invoke abs, -2147483648
    print str$(eax),13,10,13,10

    invoke absL, 0
    print str$(eax),13,10
    invoke absL, 1
    print str$(eax),13,10
    invoke absL, -1
    print str$(eax),13,10
    invoke absL, 2147483647
    print str$(eax),13,10
    invoke absL, -2147483648
    print str$(eax),13,10,13,10

    invoke Sleep, 3000

    REPEAT 3

      counter_begin 1000, HIGH_PRIORITY_CLASS
          REPEAT 4
              invoke abs, -1
          ENDM
      counter_end
      print str$(eax)," cycles",13,10

      counter_begin 1000, HIGH_PRIORITY_CLASS
          REPEAT 4
              invoke absL, -1
          ENDM
      counter_end
      print str$(eax)," cycles",13,10

      counter_begin 1000, HIGH_PRIORITY_CLASS
          mov ebx, 4
        @@:
          invoke abs, -1
          dec ebx
          jnz @B
      counter_end
      print str$(eax)," cycles",13,10

      counter_begin 1000, HIGH_PRIORITY_CLASS
          mov ebx, 4
        @@:
          invoke absL, -1
          dec ebx
          jnz @B
      counter_end
      print str$(eax)," cycles",13,10

      print chr$(13,10)

    ENDM

    inkey "Press any key to exit..."
    exit

;====================================================================
end start


Running on a P3:

22 cycles
60 cycles
25 cycles
30 cycles

21 cycles
60 cycles
24 cycles
30 cycles

21 cycles
60 cycles
25 cycles
30 cycles

eschew obfuscation

NightWare

Quote from: dioxin on August 24, 2010, 11:10:32 AM
CALL 1234567 is 100% predictable, the address to jump to is in the instruction.
JMP edx is 100% unpredictable. There is no way for the CPU to know in advance what the address will be.
RET is also 100% unpredictable. There is no way for the CPU to know in advance what the address will be on top of the stack until the stack is popped.
:bdg it seems you have a sense of humour... call 1234567 is NOT PREDICTABLE at 100% (or nearly so, unless it has been used recently...). Prediction is an attempt by the cpu to fetch the needed page of code ahead of time, nothing else... here your 100% prediction is just the EXECUTION of the instruction... a misprediction costs ~80 clock cycles, not the reading of the address from the cache or the stack.
jmp (here) and ret are 100% predictable (or nearly so) because the return address HAS BEEN STORED (full physical location) in the cache by the call instruction.

MichaelW, with a loop you can't see the mispredictions, they're absorbed by the loop... the cache reads it once, or twice (here the mispredictions could exist), and once it's in the cache you can't see them again; the cost is divided by the number of loop iterations and disappears far after the decimal point...

dedndave

for the value 80000000h, it returns 80000000h
that is correct - it is now an UNSIGNED INTEGER VALUE
if you like, you can zero the EDX register and call it a 33-bit or 64-bit signed integer value or whatever - lol
but the result in EAX will be the same

there is no way to represent +2147483648 with a SIGNED 32 bit integer
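For reference, this is what the cdq/xor/sub sequence from the abs proc above does with that value, traced step by step:

    mov eax, 80000000h   ; -2147483648
    cdq                  ; edx = 0FFFFFFFFh (sign bit of eax replicated)
    xor eax, edx         ; eax = 7FFFFFFFh
    sub eax, edx         ; 7FFFFFFFh - 0FFFFFFFFh wraps to 80000000h

So the sequence returns 80000000h unchanged, exactly as described above.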

hutch--

This is what Intel have to say about mismatching CALL and RET.

Quote
3.4.1.4 Inlining, Calls and Returns
The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs. If there is a chain of more than 16 nested calls and more than 16 returns in rapid succession, performance may degrade.
The trace cache in Intel NetBurst microarchitecture maintains branch prediction information for calls and returns. As long as the trace with the call or return remains in the trace cache and the call and return targets remain unchanged, the depth limit of the return address stack described above will not impede performance.
To enable the use of the return stack mechanism, calls and returns must be matched in pairs. If this is done, the likelihood of exceeding the stack depth in a manner that will impact performance is very low.
Download site for MASM32: https://masm32.com
New MASM Forum: https://masm32.com/board/index.php

dedndave

the impact of the messed up trace stack should be measurable
although, that is really a subject for a different thread

lingo

Hutch and Paul,
I know that, but all these negative events may appear when the CPU executes the NEXT RET instruction, hence a long time after my last JMP.

daydreamer

what about an SSE parallel ABS function?
.data
align 16
zero        real4 0.0,0.0,0.0,0.0
one         real4 1.0,1.0,1.0,1.0
two         real4 2.0,2.0,2.0,2.0
.code
;ABSPS XMM0
movaps xmm2,xmm0    ;backup xmm0
cmpltps xmm0,zero   ;mask = all ones where x is less than zero
andps xmm0,two      ;2.0 where negative, 0.0 elsewhere
movaps xmm1,one
subps xmm1,xmm0     ;1.0 - (2.0 if less than zero) = -1.0 or 1.0
movaps xmm0,xmm2    ;restore xmm0
mulps xmm0,xmm1     ;multiply by -1.0 where negative, so the result is positive

Rockoon

Quote from: lingo on August 24, 2010, 03:41:10 PM
Hutch and Paul,
I know that, but all these negative events may appear when the CPU executes the NEXT RET instruction, hence a long time after my last JMP.

Let's take your argument to the extreme and suppose that the performance penalty is a billion clock cycles (a good fraction of an entire second, even on modern gear).

Would you still maintain that it does not matter because it happens a 'long time after my last JMP?'

If it matters in the billion-cycle case, how can we ignore the few-cycle case without justification?
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Rockoon

Quote from: daydreamer on August 24, 2010, 04:23:22 PM
what about a SSE parallel ABS function?

Why can't you just clear the high bit of the floats with an ANDPS against the value 7FFFFFFFh?

Am I missing something to do with IEEE encodings of special values (NaN, etc.)?
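Something along these lines, a sketch only (the mask name and layout are mine, not from the post above):

.data
align 16
absmask dd 7FFFFFFFh, 7FFFFFFFh, 7FFFFFFFh, 7FFFFFFFh   ; all bits set except the sign bits
.code
;ABSPS XMM0 - absolute value of 4 packed singles in one instruction
    andps xmm0, absmask     ; clear the sign bit of each float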
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

dioxin

Lingo,
Quote
may appear when the CPU executes the NEXT RET instruction
No, they appear on your JMP, not on the next RET. Your JMP is not predictable on its first execution. RET is.


Nightware,
Quote
call 1234567 IS NOT PREDICTABLE
Of course it's predictable. I can predict it now, even if you haven't yet written the program to use it.
That instruction will CALL the address 1234567. Try it and see if I predicted it correctly. The CPU can do just as good a job at predicting it.

Quote
jmp (here) and ret are 100% predictable (or nearly so) because the return address HAS BEEN STORED (full physical location)
Not true. JMP ecx is NOT predictable because the CPU does not have advance knowledge of the contents of ecx.
We're not talking about cache misses here; the code might well be already cached. We're talking about how the CPU knows which instruction to run next.
JMP ecx is not resolvable until the jump instruction is executed, so all the speculative execution the CPU can do will fail.
RET is resolvable well in advance because the private stack of RET addresses in the branch prediction circuitry has a copy of the address.

Quote
There is no way for the CPU to know in advance what the address will be on top of the stack until the stack is popped.
Did I not say that earlier? That's why the CPU keeps a private stack of the last 16 return addresses needed from the last 16 CALLs. The address IS known because it's taken from the private CALL-RETURN stack, not the CPU stack.
This mechanism works provided you match CALL and RET. As soon as you stop pairing CALL and RET, the private CALL-RETURN stack gets out of sync and mispredicts the return.

Michael's program demonstrates the problem. He gets considerably quicker execution with RET than with pop and jmp.
You can also try something like this:

call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc


; version 1 - matched CALL/RET
TestFunc:
RET

; version 2 - replace version 1 with this: POP/JMP instead of RET
TestFunc:
pop eax
jmp eax


Time how long it takes to do the 10 CALLs to the first version of TestFunc, then change it to the 2nd version and try again.
You can't do the 10 CALLs in a loop because the return address is then constant, so the normal branch prediction logic follows the previous path and always gets it right after the first go.
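A minimal way to time it, reusing the counter macros from MichaelW's program above (the iteration count and priority class are my choice, not taken from this post):

    counter_begin 1000, HIGH_PRIORITY_CLASS
        REPEAT 10
            call TestFunc       ; unrolled: 10 distinct call sites, 10 distinct return addresses
        ENDM
    counter_end
    print str$(eax)," cycles",13,10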

On this simple code I get the POP-JMP version taking 5 times longer than the RET version (106 vs 22 clocks).

Michael's code gives a 3 to 1 difference. Either way, they show JMP instead of RET wastes a lot of time.

Paul.





lingo

"Would you still maintain that it does not matter because it happens a 'long time after my last JMP?'"

After me, a flood of chaos.... :lol

redskull

Quote from: dioxin on August 24, 2010, 05:09:03 PM
The CPU can do just as good a job at predicting it.

I am not so sure this is true. I'm unaware of anything that says your average CPU is smart enough to determine an unconditional jump target the first time through. The BTB doesn't get filled in until *after* the jump, so the first time the branch is evaluated it uses the static prediction, even for something as simple as "JMP 12345". If you have anything that says different, I'd like to see it.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government