Performance Improvement: Top down or bottom up ?

Started by James Ladd, June 05, 2006, 04:28:19 AM

Previous topic - Next topic

James Ladd

All,

I think I read somewhere here that jumping to the bottom
of some code and executing it from the bottom up was
actually faster than from the top down?

Has anyone done some testing on this and found any proof
other than the initial thought that it ensures code is in
the cache?

Also, GoASM supports branch hinting and Im wondering if the same
is supported in MASM and if not directly, can it be supported
indirectly ?

Rgs, James

donkey

Branch hinting is supported in the Pentium 4 and later, it is simply a mod prefix on the Jcc instruction so yes, you can insert it directly in MASM. I have never heard that "bottom up" execution is faster.
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

1rDirEctoALgran0

I suppose that :
A) "jumping to the bottom of some code" is a jump to a label situated after the current instruction.
and
B) "executing it from the bottom up" is a jump to a label situated before the current instruction.
GoAsm assemble the case A in 2 bytes and the case B in 5 bytes. (and +2 bytes if you add HINT.BRANCH or HINT.NOBRANCH)
So the case B is slower than the case A. The cache increases also the difference.
I think that there is no optimisation in the case B yet, perhaps in the future...

Regards.

Patrick

James Ladd

Thanks for the responses.
Nice to see you back and online again Donkey.

I probably wasnt clear as Im not suggesting the techniques be used together.
They were seperate questions.

Can someone show my the MASM code for doing a branch hint ?

Rgs, James.

donkey

Hi James,


The hint bytes are inserted just before a conditional jump...

2Eh - hint that the branch will not occur most of the time.
3Eh - hint that the branch will occur most of the time.

In MASM you would use DB to insert the appropriate byte

db 2Eh
jnz @F
xor eax,eax
@@:

However, it is good to remember that using branch hints will cause an exception on all processors before the P4.
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

James Ladd

Donkey or someone,

Is there some conditional assembly I can use to know if Im compiling for a P4 ?

donkey

You can use #IFDEF/#ENDIF

P4ONLY = TRUE

#IFDEF P4ONLY
  hint.nobranch
#ENDIF
jz >
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

hutch--

James,

The general idea with branch prediction is that a conditional jump is predicted backwards unless it has been run enough times to be in the branch prediction buffer.

Jumping into a loop at a label that is not at the beginning of the loop is very common and it is usually in code that has been manually jump optimised to use the default branch prediction and usualy in conjunction with branch reduction. You will se some improvement if you get all of the conditional jumps properly predicted but its relatively minor in timing terms.

Where you will see inprovements that are useful is reducing memory reads and writes as they are far slower than direct register reads and writes. This means making as many registers available as possible and very carefully reusing them to avoid memory reads and writes in critical code.

As a general hint, preserve all of the necessary registers and if it is of benefit, also remove the stack frame as it also frees up EBP, start with LOCAL values until you get the code up and running then carefully replace the locals with registers while removing the memory reads and writes and you will come close in terms of top speed.

Also note that an optimisation on one processor often makes it slower on another so you have to learn to do some averaging across both Intel and AMD hardware.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

zooba

'WHILE' style loops are often implemented with the condition at the end and a jump to the condition before executing the code:

(in pseudo C/ASM)
while(eax < 200)
{
    eax++;
}


often compiles to:
    jmp CheckCondition
StartLoop:
    inc eax
CheckCondition:
    cmp eax, 200
    jl  StartLoop


rather than:
StartLoop:
    cmp eax, 200
    jnl EndLoop
    inc eax
    jmp StartLoop


The first optimises for the case where the loop runs at least once, whereas the second is for the case where the loop doesn't run. Branch prediction, while realistically having little influence, will also default to correct predictions for the first one (jl goes backwards) while the second one will cause trouble.

Possibly the first implementation is similar to a 'DO-WHILE' loop, making the compiler's job easier.

Cheers,

Zooba :U

donkey

Hi zooba,

In the second example the default branch prediction will be that the jnl is not taken so there is still a maximum of 1 unpredicted branch and cache flush, essentially the same as in the first example. Intel Branch prediction defaults are:

Forward conditional branches are never taken
Backward conditional branches are always taken

Using branch hints can in some cases increase speed by 20-30 percent, no small feat if you have some complex branches in your program like a switch/case type block where some cases are rare and some are common.

Donkey
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

James Ladd

Great responses from All, Thanks.

I am still wondering if there is a default define that I can use to detect if I am assembling for
a P4 ?


1rDirEctoALgran0

No there is no default define to detect a P4 at assembly time. :'(
Meanwhile you can use the CPUID instruction at execution time.
You will be able to know exactly the type of processor running.

Patrick

James Ladd

Dang.

Guess I'll have to add it as a define from the command line then. Oh well.

donkey

Hi James,

You can define your own conditional compilation for a P4 if you like. You would do it this way...

#define FORP4ONLY

CODE SECTION

#IFDEF FORP4ONLY
hint.branch jz >>.target
#ELSE
jz >>.target
#ENDIF


to remove P4 specific code just comment out the line, or as you said you can include the define on the command line using /d FORP4ONLY

Donkey
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

James Ladd

would anyone know how how to do this in Gnu Assembler?