The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: Twister on August 23, 2010, 09:31:34 PM

Title: Fastest Absolute Function
Post by: Twister on August 23, 2010, 09:31:34 PM
Now we are going to get into some real competition. :bdg

abs proc uses edx dwNum:DWORD
mov eax, dwNum
mov edx, eax
and eax, 0F0000000h
cmp eax, edx
je @F
xor edx, 0F0000000h
xchg eax, edx
@@:
ret
abs endp
Title: Re: Fastest Absolute Function
Post by: lingo on August 23, 2010, 09:52:08 PM
Why do you use "uses edx"? Hutch, it's contagious... :lol
        pop ecx
        pop eax 
        cdq
        xor eax,edx 
        sub eax,edx
        jmp ecx
Title: Re: Fastest Absolute Function
Post by: dioxin on August 23, 2010, 10:28:50 PM
Quotejmp ecx
You aren't really advocating that, for speed, you should pop a return address off the stack and jump to it, are you? That messes up the branch prediction mechanism, which has shortcuts for a paired CALL-RETURN.

Title: Re: Fastest Absolute Function
Post by: clive on August 23, 2010, 11:41:57 PM
Quote from: GTX
abs proc uses edx dwNum:DWORD
mov eax, dwNum
mov edx, eax
and eax, 0F0000000h
cmp eax, edx
je @F
xor edx, 0F0000000h
xchg eax, edx
@@:
ret
abs endp


Whole boat load of fail there.

C:\MASM>test42 80000000
80000000 80000000

C:\MASM>test42 f0000000
F0000000 F0000000

C:\MASM>test42 FFFFFFFF
FFFFFFFF 0FFFFFFF

C:\MASM>test42 0
00000000 00000000

C:\MASM>test42 1
00000001 F0000001

C:\MASM>test42 2
00000002 F0000002
Title: Re: Fastest Absolute Function
Post by: hutch-- on August 23, 2010, 11:45:33 PM
Both objections are correct: EDX is a transient register, so it does not need to be preserved. Paul's comment is also correct: CALL / RET are paired in hardware, so jumping back to the following address in the calling proc messes up that pairing.
Title: Re: Fastest Absolute Function
Post by: dedndave on August 24, 2010, 12:48:06 AM
there is so little code involved, i think it would be better implemented as a macro
the CALL/RET overhead isn't needed
assuming the value is already in EAX, the basic code is:

        cdq
        xor     eax,edx
        sub     eax,edx

i think that's 5 bytes - same length as a CALL   :bg
if you want to preserve EDX, then it's 7 bytes (not really necessary, but i can see times when it would be nice)
abs     MACRO
        push    edx
        cdq
        xor     eax,edx
        sub     eax,edx
        pop     edx
abs     ENDM


maybe you can have your cake and eat it, too
abs_edx MACRO
        cdq
        xor     eax,edx
        sub     eax,edx
abs_edx ENDM

abs     MACRO
        push    edx
        abs_edx
        pop     edx
abs     ENDM


note: before the code-nazi accuses me of "stealing" his code
i think that is a well-known snippet
i previously published essentially the same code in my Ling Long Kai Fang routines
notice how i did not accuse him - lol
Title: Re: Fastest Absolute Function
Post by: lingo on August 24, 2010, 02:10:36 AM
dioxin and Hutch,

"That messes up the branch prediction mechanism"
It is a little bit wrong, but I agree with you in general... Why? Please read this:

"B.5.3.3 Mispredicted Returns
19. Mispredicted Return Instruction Rate: BR_RET_MISSP_EXEC/BR_RET_EXEC
The processor has a special mechanism that tracks CALL-RETURN pairs. The
processor assumes that every CALL instruction has a matching RETURN instruction.
If a RETURN instruction restores a return address, which is not the one stored during
the matching CALL, the code incurs a misprediction penalty.
"


But I don't have a RETURN instruction in my code, hence I don't have a misprediction penalty either;  :lol
and I'm not responsible for the testing program, which has the matching/non-matching RETURN instruction somewhere... :lol
Title: Re: Fastest Absolute Function
Post by: hutch-- on August 24, 2010, 04:05:27 AM
 :bg

Lingo,

Now you have me worried, even if you don't have a RET, the caller must CALL and with your jump to the return address you mess up the CALL RET pairing.
Title: Re: Fastest Absolute Function
Post by: sinsi on August 24, 2010, 05:11:47 AM
>because, it doesn't
For 1 value it doesn't seem to work
        cdq
        xor     eax,edx
        sub     eax,edx

Try it for eax=80000000h
Title: Re: Fastest Absolute Function
Post by: MichaelW on August 24, 2010, 05:22:51 AM
Perhaps I'm missing some key point here. How could branch prediction be an issue for code where there are no conditional branches?
Title: Re: Fastest Absolute Function
Post by: MichaelW on August 24, 2010, 05:51:17 AM
Well, I apparently am missing something.

;====================================================================
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
;====================================================================
    .data
    .code
;====================================================================

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
abs proc dwNum:DWORD
    mov eax, [esp+4]
    cdq
    xor eax,edx
    sub eax,edx
    ret 4
abs endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

;====================================================================

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
absL proc dwNum:DWORD
    pop ecx
    pop eax
    cdq
    xor eax,edx
    sub eax,edx
    jmp ecx
absL endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

;====================================================================
start:
;====================================================================

    invoke abs, 0
    print ustr$(eax),13,10
    invoke abs, 1
    print ustr$(eax),13,10
    invoke abs, -1
    print ustr$(eax),13,10
    invoke abs, 2147483647
    print ustr$(eax),13,10
    invoke abs, -2147483648
    print ustr$(eax),13,10,13,10

    invoke absL, 0
    print ustr$(eax),13,10
    invoke absL, 1
    print ustr$(eax),13,10
    invoke absL, -1
    print ustr$(eax),13,10
    invoke absL, 2147483647
    print ustr$(eax),13,10
    invoke absL, -2147483648
    print ustr$(eax),13,10,13,10

    invoke Sleep, 3000

    REPEAT 3

      counter_begin 1000, HIGH_PRIORITY_CLASS
          REPEAT 4
              invoke abs, -1
          ENDM
      counter_end
      print str$(eax)," cycles",13,10

      counter_begin 1000, HIGH_PRIORITY_CLASS
          REPEAT 4
              invoke absL, -1
          ENDM
      counter_end
      print str$(eax)," cycles",13,10

    ENDM
    print chr$(13,10)

    inkey "Press any key to exit..."
    exit
   
;====================================================================
end start


22 cycles
60 cycles
21 cycles
60 cycles
21 cycles
60 cycles

Title: Re: Fastest Absolute Function
Post by: sinsi on August 24, 2010, 06:02:55 AM
Is that your venerable PIII?
Q6600

15 cycles
13 cycles
15 cycles
13 cycles
15 cycles
13 cycles

I think agner has something about branch prediction, mostly 'ignore it on newer cpus'.

You cheat by using ustr$ since we are testing signed numbers.
80000000h gives a different result (hint: it starts with '-')
Title: Re: Fastest Absolute Function
Post by: MichaelW on August 24, 2010, 06:18:22 AM
QuoteIs that your venerable PIII?

Yes, so what I'm missing here is a modern computer.

Title: Re: Fastest Absolute Function
Post by: NightWare on August 24, 2010, 07:46:06 AM
Quote from: lingo on August 24, 2010, 02:10:36 AM
"B.5.3.3 Mispredicted Returns
19. Mispredicted Return Instruction Rate: BR_RET_MISSP_EXEC/BR_RET_EXEC
The processor has a special mechanism that tracks CALL-RETURN pairs. The
processor assumes that every CALL instruction has a matching RETURN instruction.
If a RETURN instruction restores a return address, which is not the one stored during
the matching CALL, the code incurs a misprediction penalty.
"

::) i don't know where this stupidity comes from, but there is NO special mechanism... the call instruction simply stores the return address in the cache. so unless you need an enormous number of addresses in your function there is no possible misprediction (and if you have an enormous number of addresses then YOU WILL HAVE MISPREDICTIONS anyway). nothing to do with the use of the "ret" instruction or not (in itself). so there is no misprediction with "jmp ecx"... but "jmp ecx" IS ALSO A STUPIDITY, lingo, stop with that... i'm pretty sure all the beginners exit their functions like that now... :'(
guys, USE 1 BYTE instructions, if they're there, there is a reason !!!
Title: Re: Fastest Absolute Function
Post by: dedndave on August 24, 2010, 08:33:09 AM
well, RET 4 is not a single byte
and JMP reg32 is probably the fastest of all near branches
but any speed advantage is probably more related to how the stack parms are loaded at the beginning of the routine
        pop     ecx
        pop     eax
.
.
        jmp     ecx

is logically faster than
        push    ebp
        mov     ebp,esp
        mov     eax,[ebp+8]
.
.
        pop     ebp
        ret     4

i doubt it saves all that much time
it makes the code, shall we say "non-standard", though
for this simple function (which i still hold is better as a macro), it seems ok
but for more complicated functions, there is a disadvantage in that the parm is gone
it may be accessed only once, and it doesn't necessarily work well for functions with more than a few parms
and in many cases, it is better to have the parm on the stack to access again

i do have a problem with calling the standard method "stupid" or "idiotic"
there is nothing wrong with learning to write functions in a straightforward manner
do we really care if a few bytes are saved ? - not with today's computers
do we really care if a few clock cycles are saved ? - same answer, i think

if you stick a function inside a loop and call it 10 million times, then it matters
but, i may be inclined to put that loop inside the proc and call it with an iteration parm
well - that is if you are really concerned about speed

this is a good place for a macro, guys
the code of the function is the same length as CALL - doh !

note - i realized my error and edited my previous posts
it was either too much pot in the previous century or not enough coffee in this one
Title: Re: Fastest Absolute Function
Post by: dioxin on August 24, 2010, 11:10:32 AM
Lingo (and others),
   
QuoteBut I don't have a RETURN instruction in my code, hence I don't have a misprediction penalty either
You miss the point.

CALL, JMP and RETURN are all branch instructions so they are subject to branch prediction optimisations.
CALL 1234567 is 100% predictable, the address to jump to is in the instruction.
JMP edx is 100% unpredictable. There is no way for the CPU to know in advance what the address will be.
RET is also 100% unpredictable. There is no way for the CPU to know in advance what the address will be on top of the stack until the stack is popped.

As an optimisation, the branch prediction logic maintains a private stack of return addresses.
Each CALL has the appropriate return address stacked and then, when the RET instruction is encountered the known address is popped off the private stack and the RET becomes a 100% predictable branch.
This works until someone fails to match the CALLs and RETs when the private stack gets out of sync and all returns revert to be 100% mispredicted. Not just the last return but all stacked returns are now mispredicted. The stack is typically 16 items deep allowing the last 16 CALL-RETURN pairs to be predicted correctly.

Paul.

Title: Re: Fastest Absolute Function
Post by: lingo on August 24, 2010, 12:21:41 PM
NightWare,

"i don't know where this stupidity comes from, but there is NO special mechanism."

"This stupidity comes from" the Intel 64 and IA-32 Architectures Optimization Reference Manual from November 2009.

Hutch and dioxin,

"you mess up the CALL RET pairing."
and
"Not just the last return but all stacked returns are now mispredicted."

I agree, but I don't care what will happen AFTER the FINISH of my program.


Title: Re: Fastest Absolute Function
Post by: dioxin on August 24, 2010, 12:33:35 PM
Lingo,
Quotebut I don't care what will happen AFTER the FINISH of my program
Your program isn't finished until your JMP is complete. RET will complete quicker.

Paul.
Title: Re: Fastest Absolute Function
Post by: MichaelW on August 24, 2010, 12:57:38 PM
This is an attempt to determine if the problem is with the jmp ecx being mispredicted. I couldn't see any good way to make the destination operand a constant, so I settled for a uniform return address instead of one that changes on each call.

;====================================================================
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
;====================================================================
    .data
    .code
;====================================================================

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
abs proc dwNum:DWORD
    mov eax, [esp+4]
    cdq
    xor eax,edx
    sub eax,edx
    ret 4
abs endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

;====================================================================

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
absL proc dwNum:DWORD
    pop ecx
    pop eax
    cdq
    xor eax,edx
    sub eax,edx
    jmp ecx
absL endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

;====================================================================
start:
;====================================================================

    invoke abs, 0
    print str$(eax),13,10
    invoke abs, 1
    print str$(eax),13,10
    invoke abs, -1
    print str$(eax),13,10
    invoke abs, 2147483647
    print str$(eax),13,10
    invoke abs, -2147483648
    print str$(eax),13,10,13,10

    invoke absL, 0
    print str$(eax),13,10
    invoke absL, 1
    print str$(eax),13,10
    invoke absL, -1
    print str$(eax),13,10
    invoke absL, 2147483647
    print str$(eax),13,10
    invoke absL, -2147483648
    print str$(eax),13,10,13,10

    invoke Sleep, 3000

    REPEAT 3

      counter_begin 1000, HIGH_PRIORITY_CLASS
          REPEAT 4
              invoke abs, -1
          ENDM
      counter_end
      print str$(eax)," cycles",13,10

      counter_begin 1000, HIGH_PRIORITY_CLASS
          REPEAT 4
              invoke absL, -1
          ENDM
      counter_end
      print str$(eax)," cycles",13,10

      counter_begin 1000, HIGH_PRIORITY_CLASS
          mov ebx, 4
        @@:
          invoke abs, -1
          dec ebx
          jnz @B
      counter_end
      print str$(eax)," cycles",13,10

      counter_begin 1000, HIGH_PRIORITY_CLASS
          mov ebx, 4
        @@:
          invoke absL, -1
          dec ebx
          jnz @B
      counter_end
      print str$(eax)," cycles",13,10

      print chr$(13,10)

    ENDM

    inkey "Press any key to exit..."
    exit

;====================================================================
end start


Running on a P3:

22 cycles
60 cycles
25 cycles
30 cycles

21 cycles
60 cycles
24 cycles
30 cycles

21 cycles
60 cycles
25 cycles
30 cycles

Title: Re: Fastest Absolute Function
Post by: NightWare on August 24, 2010, 01:54:12 PM
Quote from: dioxin on August 24, 2010, 11:10:32 AM
CALL 1234567 is 100% predictable, the address to jump to is in the instruction.
JMP edx is 100% unpredictable. There is no way for the CPU to know in advance what the address will be.
RET is also 100% unpredictable. There is no way for the CPU to know in advance what the address will be on top of the stack until the stack is popped.
:bdg it seems that you have a sense of humour... call 1234567 IS NOT PREDICTABLE AT 100% (or near it, unless it has been used recently...). prediction is an attempt by the cpu to read the needed page of code, nothing else... here your 100% prediction is just the EXECUTION of the instruction... a misprediction costs ~80 clock cycles, not the reading of the address from the cache or the stack.
jmp (here) and ret are 100% predictable (or near it) because the return address HAS BEEN STORED (full physical location) in the cache by the call instruction.

MichaelW, with a loop you can't see the mispredictions, they're absorbed by the loop... the cache reads it once, or twice (here the mispredictions could exist), and once it's in the cache you can't see them again; the cost is divided by the number of loops, it disappears far after the decimal point...
Title: Re: Fastest Absolute Function
Post by: dedndave on August 24, 2010, 02:01:32 PM
for the value 80000000h, it returns 80000000h
that is correct - it is now an UNSIGNED INTEGER VALUE
if you like, you can zero the EDX register and call it a 33-bit or 64-bit signed integer value or whatever - lol
but the result in EAX will be the same

there is no way to represent +2147483648 with a SIGNED 32 bit integer
Title: Re: Fastest Absolute Function
Post by: hutch-- on August 24, 2010, 02:45:23 PM
This is what Intel have to say about mismatching CALL and RET.

Quote
3.4.1.4 Inlining, Calls and Returns
The return address stack mechanism augments the static and dynamic predictors to
optimize specifically for calls and returns. It holds 16 entries, which is large enough
to cover the call depth of most programs. If there is a chain of more than 16 nested
calls and more than 16 returns in rapid succession, performance may degrade.
The trace cache in Intel NetBurst microarchitecture maintains branch prediction
information for calls and returns. As long as the trace with the call or return remains
in the trace cache and the call and return targets remain unchanged, the depth limit
of the return address stack described above will not impede performance.
To enable the use of the return stack mechanism, calls and returns must be matched
in pairs. If this is done, the likelihood of exceeding the stack depth in a manner that
will impact performance is very low.
Title: Re: Fastest Absolute Function
Post by: dedndave on August 24, 2010, 03:01:17 PM
the impact of the messed up trace stack should be measurable
although, that is really a subject for a different thread
Title: Re: Fastest Absolute Function
Post by: lingo on August 24, 2010, 03:41:10 PM
Hutch and Paul,
I know that, but all these negative events may appear when the CPU executes the NEXT RET instruction, hence a long time after my last JMP.
Title: Re: Fastest Absolute Function
Post by: daydreamer on August 24, 2010, 04:23:22 PM
what about a SSE parallel ABS function?
.data
align 16
zero           real4 0.0,0.0,0.0,0.0
one         real4 1.0,1.0,1.0,1.0
minustwo    real4 -2.0,-2.0,-2.0,-2.0
.code
;ABSPS XMM0
movaps xmm2,xmm0        ;back up xmm0
cmpltps xmm0,zero       ;all-ones mask where less than zero
andps xmm0,minustwo     ;-2.0 where negative, 0.0 elsewhere
movaps xmm1,one
addps xmm1,xmm0         ;multiplier: 1.0+(-2.0) = -1.0 if negative, else 1.0
movaps xmm0,xmm2        ;restore xmm0
mulps xmm0,xmm1         ;multiplying by -1.0 turns negative numbers positive
Title: Re: Fastest Absolute Function
Post by: Rockoon on August 24, 2010, 04:58:05 PM
Quote from: lingo on August 24, 2010, 03:41:10 PM
Hutch and Paul,
I know that, but all these negative events may appear when the CPU executes the NEXT RET instruction, hence a long time after my last JMP.

Lets take your argument to the extreme and suppose that the performance penalty is a billion clock cycles (a good percentage of an entire second even on modern gear)

Would you still maintain that it does not matter because it happens a 'long time after my last JMP?'

If it matters in the billion-cycle penalty case, how can we ignore without justification the few-cycles penalty case?
Title: Re: Fastest Absolute Function
Post by: Rockoon on August 24, 2010, 05:01:47 PM
Quote from: daydreamer on August 24, 2010, 04:23:22 PM
what about a SSE parallel ABS function?

Why can't you just clear the high bit of the floats with an ANDPS on the value 7FFFFFFFh?

Am I missing something to do with IEEE encodings of special values? (NaN, etc..) ?
Title: Re: Fastest Absolute Function
Post by: dioxin on August 24, 2010, 05:09:03 PM
Lingo,
Quotemay appear when the CPU executes NEXT instruction RET
No, they appear on your JMP, not the next one. Your jmp is not predictable on its first execution. RET is.


Nightware,
Quotecall 1234567 IS NOT PREDICTABLE
Of course it's predictable. I can predict it now even if you haven't yet written the program to use it.
That instruction will CALL the address 1234567. Try it and see if I predicted it correctly. The CPU can do just as good a job of predicting it.

Quotejmp (here) and ret are 100% predictable (or near) coz the return address HAS BEEN STORED (full physic location)
Not true. JMP ecx is NOT predictable because the CPU does not have advance knowledge of the content of ecx.
We're not talking about cache misses here, the code might well be already cached. We're talking about how the CPU knows which instruction to run next.
JMP ecx is not resolvable until the jump instruction is executed so all the speculative execution stuff that the CPU can do will fail.
RET is resolvable well in advance because the private stack of RET addresses in the branch prediction circuitry has a copy of the address.

QuoteThere is no way for the CPU to know in advance what the address will be on top of the stack until the stack is popped.
Did I not say that earlier? That's why the CPU keeps a private stack of the last 16 return addresses needed from the last 16 CALLs. The address IS known because it's taken from the private CALL-RETURN stack, not the CPU stack.
This mechanism works provided you match CALL and RET. As soon as you stop pairing CALL and RET then the private CALL-RETURN stack will get out of sync and will mispredict the return.

Michael's program demonstrates the problem. He gets considerably quicker execution with RET than with pop and jmp.
You can also try something like this:

call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc
call TestFunc


TestFunc:
RET

TestFunc:
pop eax
jmp eax


Time how long it takes to do the 10 CALLs to the first version of TestFunc, then change it to the 2nd version and try again.
You can't do the 10 CALLs in a loop because the return address is then constant so the normal branch prediction logic follows the previous path and always gets it right after the first go.

On this simple code I get the POP-JMP version taking 5 times longer than the RET version. (106 vs 22 clks)

Michael's code gives a 3 to 1 difference. Either way, they show JMP instead of RET wastes a lot of time.

Paul.




Title: Re: Fastest Absolute Function
Post by: lingo on August 24, 2010, 05:17:33 PM
"Would you still maintain that it does not matter because it happens a 'long time after my last JMP?'"

After me, a flood of chaos.... :lol
Title: Re: Fastest Absolute Function
Post by: redskull on August 24, 2010, 05:28:12 PM
Quote from: dioxin on August 24, 2010, 05:09:03 PMThe CPU can do just as good a job at predicting it.

I am not so sure this is true.  I'm unaware of anything that says your average CPU is smart enough to determine where an unconditional jump target is the first time through.  The BTB doesn't get filled in until *after* the jump, so the first time the branch is evaluated, it uses the static prediction, even if you have something as simple as "JMP 12345".  If you have anything that says different, I'd like to see it.

-r
Title: Re: Fastest Absolute Function
Post by: dioxin on August 24, 2010, 05:34:35 PM
Redskull,
CALL 1234567 is unconditional and goes to a fixed address. The CPU knows this fixed address so it can start fetching subsequent instructions with 100% certainty. Such branches are 100% predicted by the CPU.
I'm NOT saying the same about JMP ecx, you'd better ask Lingo why he thinks that one is predictable.

Paul.
Title: Re: Fastest Absolute Function
Post by: dioxin on August 24, 2010, 05:40:00 PM
Redskull,
QuoteIf you have anything that says different, I'd like to see it.
See the Intel 64 and IA-32 Architectures Optimization Reference Manual
Order Number 248966-016
Section 2.1.2.1 Branch Prediction Unit

I'd copy the section here but I can't cut and paste from the document.

Paul.
Title: Re: Fastest Absolute Function
Post by: redskull on August 24, 2010, 06:11:29 PM
3.4.1.3  Static Prediction
Branches that do not have a history in the BTB (see Section 3.4.1, "Branch Prediction Optimization") are predicted using a static prediction algorithm. Pentium 4, Pentium M, Intel Core Solo and Intel Core Duo processors have similar static prediction algorithms that:
• predict unconditional branches to be taken
• predict indirect branches to be NOT taken

The first time the CPU encounters JMP 12345, there is no BTB history, and it predicts (correctly) the branch to be taken to the target.  When the jump is actually retired, the BTB is updated with the address (12345).  Before the same jump is executed again, more jumps fill up the BTB and bump out the original one, and a new one is unlucky enough to have the same set value.  The next time the first JMP comes around, the CPU sees what it thinks is a BTB entry for it, and uses the stored target address from the other jump to start fetching.  Would this not cause a misprediction for an unconditional direct jump?

99.9% is not 100%

-r
Title: Re: Fastest Absolute Function
Post by: dioxin on August 24, 2010, 06:26:48 PM
Redskull,
Quoteand uses the stored target address from the other jump to start fetching.
Why would it do that? The other jump is another jump in another place in code. If there is history of this jump then that history would be used. If there is no history of this jump then the rules are followed from the beginning again.


Quote from the above reference:
QuoteThe Branch Prediction Unit makes the following types of predictions:
..
Direct calls and jumps. Targets are read as a target array without regarding the taken or not taken prediction.

Indirect calls and jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior.
Paul.
Title: Re: Fastest Absolute Function
Post by: redskull on August 24, 2010, 06:38:04 PM
Quote from: dioxin on August 24, 2010, 06:26:48 PMWhy would it do that? The other jump is another jump in another place in code. If there is history of this jump then that history would be used. If there is no history of this jump then the rules are followed from the beginning again.

Because it doesn't use the entire 32 bits of the branch instruction address to identify the BTB entry.  Other jumps at certain set multiples will have identical entries.

r
Title: Re: Fastest Absolute Function
Post by: jj2007 on August 24, 2010, 06:58:26 PM
I wonder whether the "penalty" is worth arguing. Here is a testbed with three different calling variants: C style, a hybrid vararg (i.e. no add esp, nnn after the call) using a jmp dword ptr [], and two procs with fixed number of args:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
6       cycles for CcallP  3
10      cycles for CcallP  6
6       cycles for VarArgP 3
10      cycles for VarArgP 6
5       cycles for FixArgP 3
8       cycles for FixArgP 6

6       cycles for CcallP  3
10      cycles for CcallP  6
6       cycles for VarArgP 3
10      cycles for VarArgP 6
5       cycles for FixArgP 3
8       cycles for FixArgP 6

Title: Re: Fastest Absolute Function
Post by: clive on August 24, 2010, 07:06:06 PM
Quote from: redskull
Because it doesn't use the entire 32 bits of the branch instruction address to identify the BTB entry.  Other jumps at certain set multiples will have identical entries.

Quote from: redskull
The first time the CPU encounters JMP 12345, there is no BTB history, and it predicts (correctly) the branch be taken to the target.  When the jump is actually retired, the BTB is updated with the address (12345).

Isn't it tracking the SOURCE instruction location (with some granularity), not the TARGET? Knowing the target address is fun and all, but isn't that irrelevant to the taken/not-taken metrics? Hopefully the code at the TARGET is cached, but either way it's the filling of the pipeline with the correct execution path that is saving you the 20-30 cycle stall, and the speculative prefetch down both paths that gets the cache lines populated.

Quote from: jj2007I wonder whether the "penalty" is worth arguing.

Depends, I've seen the JMP ECX construct expose the pipeline stall on an Atom, something along the lines of the return address being at, or within a couple of instructions, to another branch. If you get repetitively nailed by this it's going to be a bigger problem.
Title: Re: Fastest Absolute Function
Post by: redskull on August 24, 2010, 07:22:30 PM
Quote from: clive on August 24, 2010, 07:06:06 PM
Isn't it tracking the SOURCE instruction location (with some granularity), not the TARGET?

It is my understanding it tracks both.  It does the source to track the entries itself (and to predict sets larger than 1 jump), but also the target for helping out with indirect jumps.  That way, it can assign a unique BTB entry for each target as it's encountered, not just source, and it can hazard a guess as to where it will go.  I have no firm evidence to back that up, however.

-r
Title: Re: Fastest Absolute Function
Post by: Rockoon on August 24, 2010, 09:47:06 PM
Quote from: lingo on August 24, 2010, 05:17:33 PM
"Would you still maintain that it does not matter because it happens a 'long time after my last JMP?'"

After me, a flood of chaos.... :lol

You seem to have a real problem answering my Yes or No questions. Do they make you uncomfortable?
Title: Re: Fastest Absolute Function
Post by: clive on August 24, 2010, 09:52:42 PM
Quote from: redskullIt is my understanding it tracks both.

It strikes me that doing that for run-of-the-mill Jxx branches would add a hideous amount of silicon for little appreciable benefit; it always knows implicitly where the drop-through case and the target are. I'm not even sure it would help for jump-table-like constructs (switch/case), where it's always going to branch somewhere but with significantly more than 2 possible outcomes.
Title: Re: Fastest Absolute Function
Post by: clive on August 24, 2010, 09:59:13 PM
Quote from: Rockoon
Quote from: lingo
"Would you still maintain that it does not matter because it happens a 'long time after my last JMP?'"

After me, a flood of chaos.... :lol

You seem to have a real problem answering my Yes or No questions. Do they make you uncomfortable?

Actually I thought lingo's answer was actually funnier, and more accurate, than the Yes/No response you were looking for. Something along the lines of Other/Don't Know/Don't Care which all the stupid political polls or customer survey forms have. A bit like answering the race question on the census with Nascar or Formula 1. Do Hispanics like Nascar? And how should that have any impact on the apportionment of federal tax dollars?
Title: Re: Fastest Absolute Function
Post by: redskull on August 24, 2010, 11:12:03 PM
Quote from: clive on August 24, 2010, 09:52:42 PM
It strikes me that doing that for run of the mill Jxx branches would add a hideous amount of silicon for little appreciable benefit, it always knows where the drop through case and target are implicitly.

True, but the tide these days is turning to abstract base classes and virtual functions.  If compilers are turning every call into an indirect one, then all the other branch prediction silicon is for naught.  Of course, it's all conjecture based on my limited knowledge of undocumented features, so who knows what's really going on.

-r
Title: Re: Fastest Absolute Function
Post by: MichaelW on August 25, 2010, 12:08:42 AM
Quote from: NightWare on August 24, 2010, 01:54:12 PM
MichaelW, with a loop you can't see the mispredictions, they're absorbed by the loop... the cache reads it once, or twice (here the mispredictions could exist), and once it's in the cache you can't see them again; the cost is divided by the number of loops, it disappears far after the decimal point...

That is why I used a loop. I reasoned that if there were mispredictions for the jmp ecx code in the repeated calls, I could eliminate some of them by doing the calls from a loop where the jump destination was the same for every call. For small repeat and loop counts it worked the way I expected. But as I increased the counts I expected the loop cycles for the jmp ecx code to eventually drop below the loop cycles for the ret code, and for some reason that I don't understand, this did not happen.
Title: Re: Fastest Absolute Function
Post by: daydreamer on August 25, 2010, 10:26:29 AM
Quote from: Rockoon on August 24, 2010, 05:01:47 PM
Quote from: daydreamer on August 24, 2010, 04:23:22 PM
what about a SSE parallel ABS function?

Why can't you just clear the high bit of the floats with an ANDPS on the value 7FFFFFFFh?

Am I missing something to do with the IEEE encodings of special values (NaN, etc.)?
I'll test that and see if it works; I'm just not sure whether the IEEE encoding is as simple as just clearing the high bit.
It sucks to see this thread sidetracked into a branch prediction debate.
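For illustration, Rockoon's ANDPS suggestion can be sketched with SSE intrinsics in C (a hedged sketch: the mask value 7FFFFFFFh is from his post, the function name is mine, and it assumes an x86 target with SSE2):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Parallel ABS of four packed floats: one ANDPS against a mask of
   four 7FFFFFFFh dwords clears all four sign bits at once. */
static __m128 abs_ps(__m128 x)
{
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    return _mm_and_ps(x, mask);  /* compiles to ANDPS xmm, [mask] */
}
```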
Title: Re: Fastest Absolute Function
Post by: jj2007 on August 25, 2010, 01:15:56 PM
Quote from: daydreamer on August 25, 2010, 10:26:29 AM
I'll test that and see if it works; I'm just not sure whether the IEEE encoding is as simple as just clearing the high bit.
It sucks to see this thread sidetracked into a branch prediction debate.

If the float is a valid number, the clearing should work. But I thought the OP wanted dwords...?

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
14      cycles for AbsLingo, pos
9       cycles for AbsJJp proc, pos
1       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

9       cycles for AbsLingo, neg
9       cycles for AbsJJp proc, neg
3       cycles for mov arg, AbsJJ(arg), neg
3       cycles for SetAbsJJ arg, neg

11      cycles for AbsLingo, pos
8       cycles for AbsJJp proc, pos
1       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

9       cycles for AbsLingo, neg
9       cycles for AbsJJp proc, neg
1       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg


Lingo's code is a bit bloated but for once it does not throw exceptions :bg

AbsJJ MACRO arg  ; use as mov MyNewDword, AbsJJ(OldDword)
  mov eax, arg
  test eax, eax
  .if Sign?
        neg eax
  .endif
  EXITM <eax>
ENDM

SetAbsJJ MACRO arg  ; use as SetAbsJJ MyDword
  ifdifi <eax>, <arg>
        mov eax, arg
  endif
  test eax, eax
  .if Sign?
        neg arg
  .endif
ENDM
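For comparison, lingo's cdq / xor / sub sequence is the classic branchless idiom, which in C looks roughly like this (a sketch; the function name is mine, and it assumes the compiler's signed right shift is arithmetic, which is what CDQ gives you in the assembly version):

```c
#include <stdint.h>

/* Branchless 32-bit absolute value, the C counterpart of
   cdq / xor eax,edx / sub eax,edx.  'mask' is 0 for non-negative
   inputs and -1 (all ones) for negative ones; note that, like NEG,
   this leaves INT32_MIN unchanged. */
static int32_t abs32(int32_t x)
{
    int32_t mask = x >> 31;      /* sign spread into every bit, like CDQ */
    return (x ^ mask) - mask;    /* xor eax,edx / sub eax,edx */
}
```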
Title: Re: Fastest Absolute Function
Post by: Rockoon on August 25, 2010, 01:53:08 PM
Quote from: daydreamer on August 25, 2010, 10:26:29 AM
I'll test that and see if it works; I'm just not sure whether the IEEE encoding is as simple as just clearing the high bit.

Well, I am quite certain that for valid float numbers (both normal and denormal!) just clearing the high bit works. Neither the 32-bit nor the 64-bit IEEE encoding uses two's complement; they just use the sign bit as a flag.

I honestly don't know if the high bit is set in the special error condition values (Not a Number, etc.) though... if it's always set, then problems could arise; if it's only set when sign means something (+Infinity, -Infinity, ...) then no problems should ever arise that weren't already problems.
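This can be checked in C by clearing the sign bit through the raw bit pattern (a sketch; the function name is mine). The sign bit is indeed only a flag, so the same masking also maps -Inf to +Inf and negative NaNs to positive NaNs without raising any exception:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Absolute value of a float by clearing bit 31, the IEEE-754 sign
   flag, in the raw bit pattern.  No FPU operation is involved, so
   no exception can be raised even for NaN or Inf inputs. */
static float abs_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* reinterpret the bits, no conversion */
    u &= 0x7FFFFFFFu;           /* clear the sign bit */
    memcpy(&f, &u, sizeof f);
    return f;
}
```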
Title: Re: Fastest Absolute Function
Post by: lingo on August 26, 2010, 03:26:53 PM
"The thieves are always liars"

"Lingo's code is a bit bloated but for once it does not throw exceptions"

I replaced my code with JJ's code in his "test" program
and received different numbers for identical algos: :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
4       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

4       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
0       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg

4       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

4       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
0       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg


--- ok ---


Title: Re: Fastest Absolute Function
Post by: Rockoon on August 26, 2010, 03:53:45 PM

:dance:


AMD Phenom(tm) II X6 1055T Processor (SSE3)
6       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-2      cycles for mov arg, AbsJJ(arg), pos
-3      cycles for SetAbsJJ arg, pos

6       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
-2      cycles for mov arg, AbsJJ(arg), neg
-3      cycles for SetAbsJJ arg, neg

6       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-2      cycles for mov arg, AbsJJ(arg), pos
-1      cycles for SetAbsJJ arg, pos

6       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
-2      cycles for mov arg, AbsJJ(arg), neg
-3      cycles for SetAbsJJ arg, neg


Title: Re: Fastest Absolute Function
Post by: jj2007 on August 26, 2010, 04:11:22 PM
Quote from: Rockoon on August 26, 2010, 03:53:45 PM

:dance:

AMD Phenom(tm) II X6 1055T Processor (SSE3)
6       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-2      cycles for mov arg, AbsJJ(arg), pos
-3      cycles for SetAbsJJ arg, pos

Minus three cycles is pretty fast, where can I buy that CPU?

Seriously: Does it change if you use SetProcessAffinityMask? If not, we'd have to look at MichaelW's calibration loop.
Title: Re: Fastest Absolute Function
Post by: MichaelW on August 26, 2010, 05:07:15 PM
Processors have gotten so fast now that occasionally they skip back in time :bg

Assuming that everything is running on a single core, results like this indicate that something interfered with the reference loop, making the overhead count larger than the test count. This is why I insert a delay of several seconds before I start timing, to allow time for the system activities involved in launching an app to subside.

Title: Re: Fastest Absolute Function
Post by: Rockoon on August 26, 2010, 06:33:51 PM
Quote from: jj2007 on August 26, 2010, 04:11:22 PM
Quote from: Rockoon on August 26, 2010, 03:53:45 PM

:dance:

AMD Phenom(tm) II X6 1055T Processor (SSE3)
6       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-2      cycles for mov arg, AbsJJ(arg), pos
-3      cycles for SetAbsJJ arg, pos

Minus three cycles is pretty fast, where can I buy that CPU?

Seriously: Does it change if you use SetProcessAffinityMask? If not, we'd have to look at MichaelW's calibration loop.

No change. The results are consistent across many runs, with affinity and without.

The timing code uses CPUID as the out-of-order serializing fence, surrounding RDTSC with two such fences, but that's not actually good enough!

In more direct terms, these are the possibilities:

(A) All of the preceding instructions are completed before CPUID begins.
(B) None of the following instructions are begun before CPUID ends.
(C) Both (A) and (B)

For CPUID to be a foolproof fence that bars all pairings with it (rather than just an out-of-order fence), it must obey (C), but that is not part of the specification. The specification demands (A) or (B), but that does not preclude the code being timed from pairing up with the CPUID instruction itself at one end or the other, which is what appears to be happening on my AMD.

Basically, CPUID is an out-of-order fence, but is not by specification a pairing fence.

Agner Fog uses the same CPUID serialization technique for his timing code, but he repeats the code being timed a hundred times (by default) between the start and end RDTSCs, so pairing effects with CPUID would be at most 1% of the total count. There, he is more concerned with code throughput than with code latency... with latency being deduced from throughput, knowledge of the execution units, and reasoning about the dependency chains.
Title: Re: Fastest Absolute Function
Post by: jj2007 on August 26, 2010, 06:37:14 PM
Quote from: MichaelW on August 26, 2010, 05:07:15 PM
This is why I insert a delay of several seconds before I start timing, to allow time for the system activities involved in launching an app to subside.

You can try to change the SleepMs variable, but a delay of 2,000 ms delivers exactly the same results as a delay of 50ms:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
4       cycles for AbsLingo, pos
3       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos
Title: Re: Fastest Absolute Function
Post by: Rockoon on August 26, 2010, 06:39:36 PM
Quote from: MichaelW on August 26, 2010, 05:07:15 PM
Processors have gotten so fast now that occasionally they skip back in time :bg

Assuming that everything is running on a single core, results like this indicate that something interfered with the reference loop, making the overhead count larger than the test count. This is why I insert a delay of several seconds before I start timing, to allow time for the system activities involved in launching an app to subside.



Unfortunately this explanation isn't valid given the timing macros' structure. The baseline is calibrated repeatedly, as each counter_begin "call" re-evaluates the baseline; changing the code's REPEAT to 10 yields the same "irrational" results on all repetitions.


Title: Re: Fastest Absolute Function
Post by: dedndave on August 26, 2010, 06:54:43 PM
i always wanted a computer that was done BEFORE i hit enter - lol
select a single core - REALTIME_PRIORITY_CLASS - keep the test bursts brief
Title: Re: Fastest Absolute Function
Post by: jj2007 on August 26, 2010, 06:58:12 PM
Quote from: lingo on August 26, 2010, 03:26:53 PM
"The thieves are always liars"

"Lingo's code is a bit bloated but for once it does not throw exceptions"

I replaced my code with JJ's code in his "test" program
and received different numbers for identical algos: :lol

While I am happy to see that you liked my code, I am very sorry that you don't get the "correct" results. Maybe because you forgot to insert some of your magic bytes before progstart?

But you are right that the testbed should be neutral, so I added the a16 macro that aligns the code and places 16 rets before progstart. The results are now exactly the same for both identical progs (3 cycles), but I am afraid your original algo is just utterly, painfully slow:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
8       cycles for AbsLingo, pos
3       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos


But it does not raise exceptions :U
Title: Re: Fastest Absolute Function
Post by: jj2007 on August 26, 2010, 07:05:05 PM
Quote from: dedndave on August 26, 2010, 06:54:43 PM
select a single core

Done, see zip under your post.
Title: Re: Fastest Absolute Function
Post by: lingo on August 26, 2010, 07:33:29 PM
"The thieves are always liars"[/U]

I replaced my code with JJ's code in his "test" program
and received different numbers for identical algos: 
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
4       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

4       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
0       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg

4       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
0       cycles for mov arg, AbsJJ(arg), pos
0       cycles for SetAbsJJ arg, pos

4       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
0       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg


--- ok ---
Title: Re: Fastest Absolute Function
Post by: Rockoon on August 26, 2010, 07:54:42 PM
AMD Phenom(tm) II X6 1055T Processor (SSE3)
7       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-3      cycles for mov arg, AbsJJ(arg), pos
-2      cycles for SetAbsJJ arg, pos

7       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
-3      cycles for mov arg, AbsJJ(arg), neg
-2      cycles for SetAbsJJ arg, neg

7       cycles for AbsLingo, pos
2       cycles for AbsJJp proc, pos
-3      cycles for mov arg, AbsJJ(arg), pos
-2      cycles for SetAbsJJ arg, pos

7       cycles for AbsLingo, neg
2       cycles for AbsJJp proc, neg
-3      cycles for mov arg, AbsJJ(arg), neg
-2      cycles for SetAbsJJ arg, neg


Nice and consistent.. still, this timing method just isn't practical. Hasn't been for years.
Title: Re: Fastest Absolute Function
Post by: Rockoon on August 26, 2010, 08:10:19 PM
Unfortunately, the latest AMD CodeAnalyst no longer includes a Pipeline Simulation mode. With that mode I could have at least shown that the code was indeed pairing up with CPUID on my AMD.

Yay for downgrades?

Title: Re: Fastest Absolute Function
Post by: jj2007 on August 26, 2010, 09:14:30 PM
Quote from: lingo on August 26, 2010, 07:33:29 PM
I changed my code with JJ's code in his "test" program
and received different numbers for equal algos:

Lingo, you repeat yourself, and that problem has been dealt with in reply #54. Your attachment has old code.
Title: Re: Fastest Absolute Function
Post by: jj2007 on August 26, 2010, 09:17:06 PM
Quote from: Rockoon on August 26, 2010, 08:10:19 PM
code was indeed pairing up with CPUID on my AMD.

How would that affect/not affect the reference loop? Would we need instructions that do pair up with CPUID in the ref loop?
Title: Re: Fastest Absolute Function
Post by: Rockoon on August 26, 2010, 11:40:15 PM
Quote from: jj2007 on August 26, 2010, 09:17:06 PM
Quote from: Rockoon on August 26, 2010, 08:10:19 PM
code was indeed pairing up with CPUID on my AMD.

How would that affect/not affect the reference loop? Would we need instructions that do pair up with CPUID in the ref loop?

There is no reference loop. It's a straight...

CPUID
RDTSC
nothing here
CPUID
RDTSC

...sequence

Essentially, it's only measuring the time it takes to execute the timing overhead.

The out-of-order execution prevention probably happens in the pipeline itself, by not pushing u-ops forward into the scheduler, but unless both sides of CPUID are fenced, CPUID's u-ops themselves can get paired with either the instructions before or after.

Title: Re: Fastest Absolute Function
Post by: dedndave on August 27, 2010, 12:55:22 AM
so um - how would you suggest we measure the reference measurement time?
Title: Re: Fastest Absolute Function
Post by: MichaelW on August 27, 2010, 02:56:00 AM
Quote from: Rockoon on August 26, 2010, 11:40:15 PM
There is no reference loop. It's a straight...

CPUID
RDTSC
nothing here
CPUID
RDTSC

...sequence

Essentially, it's only measuring the time it takes to execute the timing overhead.

If you are speaking of the counter_begin code, there is a reference loop:

        xor   eax, eax        ;; Use same CPUID input value for each call
        cpuid                 ;; Flush pipe & wait for pending ops to finish
        rdtsc                 ;; Read Time Stamp Counter

        push  edx             ;; Preserve high-order 32 bits of start count
        push  eax             ;; Preserve low-order 32 bits of start count
        mov   __counter__loop__counter__, loopcount
        xor   eax, eax
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      @@:                     ;; Start an empty reference loop
        sub   __counter__loop__counter__, 1
        jnz   @B

        xor   eax, eax
        cpuid                 ;; Make sure loop instructions finish
        rdtsc                 ;; Read end count
        . . .


The pairing problem has been discussed before. The cycle count for CPUID varies with the input value, so for consistency it's necessary to use the same value for each execution. Even if CPUID cannot pair with the surrounding instructions, the instruction used to set the input value can. The bottom line is that these macros cannot reasonably be used to time code that executes in only a few cycles.
Title: Re: Fastest Absolute Function
Post by: dedndave on August 27, 2010, 03:49:22 AM
i suppose you could run 1 iteration of the test code for a reference measurement (or 2, perhaps)
then run 10001 iterations (10000 requested, for example) to run the test
then take the difference in times
and divide by 10000 to find the avg test time

but, i am not sure that will solve the granularity issue with the CPUID instruction
it takes, let's say, 80 clock cycles and can be +/- a few cycles
that means that you may not be able to measure times with more resolution than a few cycles

the problem seems to be that CPUID is a serializing instruction used to force inline execution of RDTSC
that mechanism, alone, is going to wobble a few cycles - lol
Michael has tried using other serializing instructions, and has found CPUID to yield the best results
Title: Re: Fastest Absolute Function
Post by: jj2007 on August 27, 2010, 06:41:28 AM
Quote from: Rockoon on August 26, 2010, 11:40:15 PM
There is no reference loop. It's a straight...
...
Essentially, it's only measuring the time it takes to execute the timing overhead.

The out-of-order execution prevention probably happens in the pipeline itself, by not pushing u-ops forward into the scheduler, but unless both sides of CPUID are fenced, CPUID's u-ops themselves can get paired with either the instructions before or after.

Right. I had not looked at it for ages.
Usually, for suspiciously short timings, a REPEAT 100 inside the counter_ structure helps to get more reasonable values:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1769    cycles for 100*AbsLingo, pos
741     cycles for 100*AbsJJp proc, pos
304     cycles for mov arg, 100*AbsJJ(arg), pos
248     cycles for 100*SetAbsJJ arg, pos

8       cycles for AbsLingo, neg
4       cycles for AbsJJp proc, neg
0       cycles for mov arg, AbsJJ(arg), neg
0       cycles for SetAbsJJ arg, neg
Title: Re: Fastest Absolute Function
Post by: lingo on August 27, 2010, 11:11:46 AM
"The thieves are always liars"[/U]
Bla..bla..bla. As you saw, your test program is proven crap, and hence your results too... :lol
Title: Re: Fastest Absolute Function
Post by: dioxin on August 27, 2010, 03:28:46 PM
Michael,
from the timing macros:
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      @@:                     ;; Start an empty reference loop

..
..
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      label:                  ;; Start test loop


Does this not leave room for a few clks of error? The CPUID finishes and then an unknown number of padding NOPs execute to get the alignment for the next instruction. The amount of padding varies from 0 to 15 bytes for an alignment of 16, so the reference loop and the real loop may well have different padding included in the timing. As far as I can see there is no attempt to keep the padding identical for the reference loop and the timing loop.

Paul.
Title: Re: Fastest Absolute Function
Post by: MichaelW on August 28, 2010, 12:54:34 AM
Hi Paul,

Yes it does, or at least probably does. Assembled with ML 6.15 there are 45 bytes between the labels, when it should be some multiple of 16. The alignment padding will consist of 1 or 2 relatively fast instructions, so I would think the error would be at most 2 cycles. After 24 hours on my feet I'm not thinking too clearly, but I think it should be possible to determine the distance at assembly time and control it without polluting either loop by adding an appropriate number of nops after the RDTSC following the reference loop.

Does the alignment of the labels even matter for anything after the P3?
Title: Re: Fastest Absolute Function
Post by: hutch-- on August 28, 2010, 07:39:32 AM
How about something like this ?


    align 16

    jmp lbl0    ; 0EB0Eh

    nop         ; 12 nops
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop

  lbl0:

    rdtsc       ; 0F31

  lbl_16_BYTE_align:
Title: Re: Fastest Absolute Function
Post by: MichaelW on August 28, 2010, 10:29:48 AM
What I had in mind was to modify the counter_begin macro such that the distance between the loop labels would be a fixed multiple of 16 bytes, regardless of the MASM (or JWASM) version used to assemble the code. Doing it cleanly would require knowing the distance to a forward label, and so far I have not been able to make this work. So for a quick test I changed the macro to display the number of padding bytes, so they can be inserted manually. Running on a P3, which generally produces very consistent cycle counts, in the short time I had to play with it I could not see any significant difference.

Edit:

I corrected an error in my modified macro, and moved everything into the attachment. I tested the original macro against the modified macro, at random alignments, and to maximize the consistency I used REALTIME_PRIORITY_CLASS and avoided accessing the console during the test. I used an empty timing loop to make any differences in cycle counts between the reference and timing loops obvious. Running on a P3 I cannot detect any significant differences with the empty timing loop, and only very small differences with one or a few instructions in the loop. Depending on your assembler, you may need to change the number of padding nops for the reference and/or timing loops.
Title: Re: Fastest Absolute Function
Post by: Rockoon on August 29, 2010, 01:43:33 PM
To cover the specifics of modern chips, you really want 64-byte alignment to deal with the largest cache lines in use today.

This can be important because, even on full cache hits, most modern CPUs take an extra cycle to decode instructions which straddle cache lines. Specifically, the decoder can only see one cache line at a time: its "window" into memory is a cache line, and even with full-on cache hits it takes one cycle for the decoder to "switch" to the next "window".
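The straddling condition itself is a one-liner; here is a hedged C sketch (the function name is mine, and the 64-byte line size is the figure cited above):

```c
#include <stdint.h>

/* Does the byte range [addr, addr+len) cross a 64-byte cache line
   boundary?  When it does, the decoder needs an extra cycle to
   switch its "window" to the next line. */
static int straddles_line(uint64_t addr, uint64_t len)
{
    return (addr / 64) != ((addr + len - 1) / 64);
}
```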

The simple solution is to stop trying to time such short snippets of code, because quite frankly they are too short to give useful results even if you could time them perfectly anyway. In such short snippets, throughput is usually more important than latency.