Branchless?

Alloy · February 26, 2006, 09:03:48 PM

Is branchless code faster than cached loops?

sluggy · February 27, 2006, 12:12:16 AM

We'll see how this thread goes, we may need to move it to either the Lab or the Colosseum, depends on how things go :bg

IMO, branchless code has to be faster, as it is the same code but without the compares and jumps, which means you have less cycles because of less instructions and it also means no jump mispredictions.

This is all fine and good, but then you have the code readability/maintainability issue. I have only ever coded unrolled loops that had up to 10 iterations (and they were short loops) - my personal opinion is that beyond this the code gets to hard to read, for very little benefit.

zooba · February 27, 2006, 08:49:37 AM

Depends on the code. It's very unlikely you can replace a loop with branchless code as you must then provide for each possible iteration.

However, in some cases a single branch can be much much more efficient than 'branchless' code. For example, adding a 32-bit value to a 128-bit number:

Branchless:

Code Select

add dwNumber[0], eax
adc dwNumber[4], 0
adc dwNumber[8], 0
adc dwNumber[12], 0

With Branches:

Code Select

add dwNumber[0], eax
jnc @F
inc dwNumber[4]
jnc @F
inc dwNumber[8]
jnc @F
inc dwNumber[12]
@@:

As usual, it all comes down to the application. A boolean-NOT operation on an integer is slightly more efficient in the branchless code:

Branchless:

Code Select

neg eax
sbb eax, eax

With Branch:

Code Select

or  eax, eax
jz  @eax_is_zero
xor eax, eax
jmp @done

@eax_is_zero:
mov eax, -1    ; alternatives (dec/sub/not) don't improve by much

@done:

Cheers,

Zooba :U

Tedd · February 27, 2006, 12:06:38 PM

Looping code is not necessarily bad.
The problem with long sequences of instructions is that they're only decoded once. The decoder caches the instruction sizes to speed up decoding, so decoding the same code second time around is much faster, but in sequential code every instruction must be decoded first-time.
Obviously the benefits of this aren't too significant with a few loops, or probably with short loops.

dsouza123 · February 27, 2006, 03:18:21 PM

With Branches: has a bug,
inc doesn't affect the carry flag, the zero flag is the one to check.

Code Select


add dwNumber[0], eax
jnc @F
inc dwNumber[4]
jnz @F
inc dwNumber[8]
jnz @F
inc dwNumber[12]
@@:

zooba · February 27, 2006, 10:40:54 PM

Quote from: dsouza123 on February 27, 2006, 03:18:21 PM
With Branches: has a bug,
inc doesn't affect the carry flag, the zero flag is the one to check.
Code Select Expand
add dwNumber[0], eax jnc @F inc dwNumber[4] jnz @F inc dwNumber[8] jnz @F inc dwNumber[12] @@:

Oops... :red

I 'optimised' that on the fly from 128bit + 128bit code :eek

Alloy · February 28, 2006, 02:38:51 AM

I don't find branchless that hard to read, the unrolled code can be placed in a procedure or an include file. James Leiterman has a chapter on it in his 32/64-bit 80x86 Assembly Language Architecture book with some interesting optimizations to avoid branches. I just wasn't sure and don't have a way of testing that using an instruction cache would be quicker than precached loop code.

News:

Branchless?

Alloy

sluggy

zooba

Tedd

dsouza123

zooba

Alloy