News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Improve code with lots of branching

Started by a_h, January 19, 2006, 07:31:44 PM

Previous topic - Next topic

a_h

Hi all!

Currently I try to speed up code that looks like this


for(i=0;;i++)
{
code=getCode();

if (code>16000)
   tab=table[code>>12-16]
else if (code>8000)
   tab=table[code>>8-8]
else if (code>4000)
  tab=table[code>>6-4]
else if (code>2000)
  tab=table[code>>4-4]
else if (code>500)
  tab=table[code-4]
else
<error>
}


I would be very happy if someone has an idea how to improve that code - I'm searching for a faster solution for a P4. Actually I thought about using cmov, but I see no nice way to rewrite the branching part in assembly (using cmov).

Thanks for your help!

Regards, Hannes

gabor

Hi!

Before anything else I wanted to say that such control structures like switch or if-elsif-elsif...else can be very slow for the last cases. I mean if there is a switch with 50 cases then the 50th case may always have to wait until the other 49 is tested. If there are typical cases that will happen in say 80% of all cases then I think it is a good idea to bring those cases infront of the others. If average performance is important, like when all cases have the same possibility then this solution won't bring nothing.  :'(

I would do it this way in asm:
The code the base of the condition is in eax. The shift count is in ecx and the the value that is substracted is in edx. After the array index is calculated the value to be passed to tab can be easily retrieved.

The shifts are to the right, so it is a division by 2^n,isn't it?


@Start:
        call getCode  ; I suggest this proc returnes code in eax
        mov    ebx,eax
        mov    ebx,eax
@1:
        mov    ecx,12
        mov    edx,16
        cmp    eax,16000
        ja @End
@2:
        mov    cl,8
        mov    dl,8
        cmp    eax,8000
        ja @End
@3:
        mov    cl,6
        mov    dl,4
        cmp    eax,4000
        ja @End
@4:
        mov    cl,dl   ; edx is already 4
        cmp    eax,2000
        ja @End
@5:
        xor    cl,cl
        cmp    eax,500
        jna @Error
@End:
        shr    eax,cl
        sub    eax,edx    ; if the index in eax fits into a byte range (the table is like this table[256]) this can be sub al,dl
        mov    tab,[offset table+eax]
        jmp @Start
@Error:
        ...


By placing the slow index calculation to a common place and making the cases as small as possible I tried to mprove average performance: all cases should run fast. This way the cmov instruction is not needed. I guess it is not that fast on the P4, but I am not sure of this.
My idea shows just one point of view. I know there are other better solutions and many ideas for instruction flow optimization, so I can hardly wait the other posts.

Greets, Gábor

Tedd

Unless there is some kind of pattern that can be used to calculate the shift/offset, then it is quite difficult to do this kind of thing any other way. All I can suggest is nesting the range tests, which will reduce the number of tests per case. Though in this example there aren't so many that it will make too much difference.

for(i=0;;i++)
{
code=getCode();

if (code > 4000) {
  if (code>16000)
    tab=table[code>>12-16]
  else if (code>8000)
     tab=table[code>>8-8]
  else //if (code>4000)
    tab=table[code>>6-4]
} else {
  else if (code>2000)
    tab=table[code>>4-4]
  else if (code>500)
    tab=table[code-4]
  else //(code<=500)
    <error>
  }
}


Assigning the shift and offset values, and then performing the shift and table lookup at the end will not really speed up the code, as each one is only performed once per iteration anyway. However, it will go to make the code smaller, which means it can better fit into cache, and may provide a small speedup that way.
No snowflake in an avalanche feels responsible.

Petroizki

In my opinion, simple switch "table" would make things smoother, or at least less forward jumps!:
call getCode

cmp eax, 16000
ja @over_16000

cmp eax, 8000
ja @over_8000

cmp eax, 4000
ja @over_4000

cmp eax, 2000
ja @over_2000

cmp eax, 500
ja @over_500

<error>
...

@continue:
mov eax, table[eax - 4]
<eax == tab>
...

@over_16000:
shr eax, 12
sub eax, 12
jmp @continue
@over_8000:
shr eax, 8
sub eax, 4
jmp @continue
@over_4000:
shr eax, 6
jmp @continue
@over_2000:
shr eax, 4
jmp @continue
@over_500:
jmp @continue


Branch elimination, without cmov:  :eek
I guess this could be made faster somehow.  :wink
call getCode

cmp eax, 500
jna @error

; code > 500
mov edx, eax
cmp eax, 2001
sbb ecx, ecx
and edx, ecx
or eax, ecx

; code > 2000
cmp eax, 4001
mov ebx, eax
sbb ecx, ecx
shr ebx, 4
or eax, ecx
and ebx, ecx
add edx, ebx

; code > 4000
cmp eax, 8001
mov ebx, eax
sbb ecx, ecx
shr ebx, 6
or eax, ecx
and ebx, ecx
add edx, ebx

; code > 8000
cmp eax, 16001
mov ebx, eax
sbb ecx, ecx
shr ebx, 8
or eax, ecx
sub ebx, 4
and ebx, ecx
add edx, ebx

; code > 16000
shr eax, 12
cmp edx, 1
sbb ecx, ecx
sub eax, 12
and eax, ecx
add edx, eax

mov eax, table[edx - 4]
<eax == tab>
...

@error:
<error>
...


I don't have time to see which one would be faster on my comp.

a_h

Thanks for all your replies!

Unfortunately I still hadn't time to try them all out!

The big problem of my original code are mispredicted branches so I hoped to avoid branching somehow. What do you think of something like this


if (code>=16)
__asm
{
mov esi, code
bsr eax, esi; // find index of highest 1 bit
mov edx, [pDCTtabNonI+4*eax-16] // adjusted addr of righ table
mov ecx, [DCTShiftTab+4*eax-16]    // amount to shift code
shr esi, CL
imul esi, 3 // size of entries
add edx, esi // addr of proper entry of proper table
mov tab, edx // save
}



with 2 lookup tables?

What do you think? Cheers, Hannes

Tedd

Well it doesn't jump :U

How about replacing
imul esi,3
with
lea esi,[esi+2*esi]
It's a handy little cheat :bg
No snowflake in an avalanche feels responsible.

a_h

Well, I still hadn't time to test all of your ideas; however, the code in my previous post needs 1/3 less time (P4) than the original branching code. I'm curious how the other ideas perform!

Cheers, Hannes

PS: thanks Tedd for reminding me of that one!