News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Branchless?

Started by Alloy, February 26, 2006, 09:03:48 PM

Previous topic - Next topic

Alloy

Is branchless code faster than cached loops?
We all used to be something else. Nature has always recycled.

sluggy

We'll see how this thread goes, we may need to move it to either the Lab or the Colosseum, depends on how things go  :bg

IMO, branchless code has to be faster, as it is the same code but without the compares and jumps, which means you have less cycles because of less instructions and it also means no jump mispredictions.

This is all fine and good, but then you have the code readability/maintainability issue. I have only ever coded unrolled loops that had up to 10 iterations (and they were short loops) - my personal opinion is that beyond this the code gets to hard to read, for very little benefit.

zooba

Depends on the code. It's very unlikely you can replace a loop with branchless code as you must then provide for each possible iteration.

However, in some cases a single branch can be much much more efficient than 'branchless' code. For example, adding a 32-bit value to a 128-bit number:

Branchless:
add dwNumber[0], eax
adc dwNumber[4], 0
adc dwNumber[8], 0
adc dwNumber[12], 0


With Branches:
add dwNumber[0], eax
jnc @F
inc dwNumber[4]
jnc @F
inc dwNumber[8]
jnc @F
inc dwNumber[12]
@@:


As usual, it all comes down to the application. A boolean-NOT operation on an integer is slightly more efficient in the branchless code:

Branchless:
neg eax
sbb eax, eax


With Branch:
or  eax, eax
jz  @eax_is_zero
xor eax, eax
jmp @done

@eax_is_zero:
mov eax, -1    ; alternatives (dec/sub/not) don't improve by much

@done:


Cheers,

Zooba :U

Tedd

Looping code is not necessarily bad.
The problem with long sequences of instructions is that they're only decoded once. The decoder caches the instruction sizes to speed up decoding, so decoding the same code second time around is much faster, but in sequential code every instruction must be decoded first-time.
Obviously the benefits of this aren't too significant with a few loops, or probably with short loops.
No snowflake in an avalanche feels responsible.

dsouza123

With Branches: has a bug,
inc doesn't affect the carry flag, the zero flag is the one to check.

add dwNumber[0], eax
jnc @F
inc dwNumber[4]
jnz @F
inc dwNumber[8]
jnz @F
inc dwNumber[12]
@@:

zooba

Quote from: dsouza123 on February 27, 2006, 03:18:21 PM
With Branches: has a bug,
inc doesn't affect the carry flag, the zero flag is the one to check.

add dwNumber[0], eax
jnc @F
inc dwNumber[4]
jnz @F
inc dwNumber[8]
jnz @F
inc dwNumber[12]
@@:


Oops...  :red

I 'optimised' that on the fly from 128bit + 128bit code :eek

Alloy

   I don't find branchless that hard to read, the unrolled code can be placed in a procedure or an include file. James Leiterman has a chapter on it in his 32/64-bit 80x86 Assembly Language Architecture book with some interesting optimizations to avoid branches. I just wasn't sure and don't have a way of testing that using an instruction cache would be quicker than precached loop code.
We all used to be something else. Nature has always recycled.