Print Page - Improve code with lots of branching

Title: Improve code with lots of branching
Post by: a_h on January 19, 2006, 07:31:44 PM

Hi all!

Currently I try to speed up code that looks like this


for(i=0;;i++)
{
code=getCode();

if (code>16000)
   tab=table[code>>12-16]
else if (code>8000)
   tab=table[code>>8-8]
else if (code>4000)
  tab=table[code>>6-4]
else if (code>2000)
  tab=table[code>>4-4]
else if (code>500)
  tab=table[code-4]
else
 <error>
}

I would be very happy if someone has an idea how to improve that code - I'm searching for a faster solution for a P4. Actually I thought about using cmov, but I see no nice way to rewrite the branching part in assembly (using cmov).

Thanks for your help!

Regards, Hannes

Title: Re: Improve code with lots of branching
Post by: gabor on January 20, 2006, 07:58:45 AM

Hi!

Before anything else I wanted to say that such control structures like switch or if-elsif-elsif...else can be very slow for the last cases. I mean if there is a switch with 50 cases then the 50th case may always have to wait until the other 49 is tested. If there are typical cases that will happen in say 80% of all cases then I think it is a good idea to bring those cases infront of the others. If average performance is important, like when all cases have the same possibility then this solution won't bring nothing. :'(

I would do it this way in asm:
The code the base of the condition is in eax. The shift count is in ecx and the the value that is substracted is in edx. After the array index is calculated the value to be passed to tab can be easily retrieved.

The shifts are to the right, so it is a division by 2^n,isn't it?

Code Select


@Start:
        call getCode  ; I suggest this proc returnes code in eax
        mov    ebx,eax
        mov    ebx,eax
@1:
        mov    ecx,12
        mov    edx,16
        cmp    eax,16000
        ja @End
@2:
        mov    cl,8
        mov    dl,8
        cmp    eax,8000
        ja @End
@3:
        mov    cl,6
        mov    dl,4
        cmp    eax,4000
        ja @End
@4:
        mov    cl,dl   ; edx is already 4
        cmp    eax,2000
        ja @End
@5:
        xor    cl,cl
        cmp    eax,500
        jna @Error
@End:
        shr    eax,cl
        sub    eax,edx    ; if the index in eax fits into a byte range (the table is like this table[256]) this can be sub al,dl
        mov    tab,[offset table+eax]
        jmp @Start
@Error:
        ...

By placing the slow index calculation to a common place and making the cases as small as possible I tried to mprove average performance: all cases should run fast. This way the cmov instruction is not needed. I guess it is not that fast on the P4, but I am not sure of this.
My idea shows just one point of view. I know there are other better solutions and many ideas for instruction flow optimization, so I can hardly wait the other posts.

Greets, Gábor

Title: Re: Improve code with lots of branching
Post by: Tedd on January 20, 2006, 01:52:36 PM

Unless there is some kind of pattern that can be used to calculate the shift/offset, then it is quite difficult to do this kind of thing any other way. All I can suggest is nesting the range tests, which will reduce the number of tests per case. Though in this example there aren't so many that it will make too much difference.

Code Select

for(i=0;;i++)
{
code=getCode();

if (code > 4000) {
  if (code>16000)
    tab=table[code>>12-16]
  else if (code>8000)
     tab=table[code>>8-8]
  else //if (code>4000)
    tab=table[code>>6-4]
} else {
  else if (code>2000)
    tab=table[code>>4-4]
  else if (code>500)
    tab=table[code-4]
  else //(code<=500)
    <error>
  }
}

Assigning the shift and offset values, and then performing the shift and table lookup at the end will not really speed up the code, as each one is only performed once per iteration anyway. However, it will go to make the code smaller, which means it can better fit into cache, and may provide a small speedup that way.

Title: Re: Improve code with lots of branching
Post by: Petroizki on January 20, 2006, 03:18:12 PM

In my opinion, simple switch "table" would make things smoother, or at least less forward jumps!:

Code Select

	call getCode

	cmp eax, 16000
	ja @over_16000

	cmp eax, 8000
	ja @over_8000

	cmp eax, 4000
	ja @over_4000

	cmp eax, 2000
	ja @over_2000

	cmp eax, 500
	ja @over_500

	<error>
	...

@continue:
	mov eax, table[eax - 4]
	<eax == tab>
	...

@over_16000:
	shr eax, 12
	sub eax, 12
	jmp @continue
@over_8000:
	shr eax, 8
	sub eax, 4
	jmp @continue
@over_4000:
	shr eax, 6
	jmp @continue
@over_2000:
	shr eax, 4
	jmp @continue
@over_500:
	jmp @continue

Branch elimination, without cmov: :eek
I guess this could be made faster somehow. :wink

Code Select

	call getCode

	cmp eax, 500
	jna @error
	
	; code > 500
	mov edx, eax
	cmp eax, 2001
	sbb ecx, ecx
	and edx, ecx
	or eax, ecx
	
	; code > 2000
	cmp eax, 4001
	mov ebx, eax
	sbb ecx, ecx
	shr ebx, 4
	or eax, ecx
	and ebx, ecx
	add edx, ebx
	
	; code > 4000
	cmp eax, 8001
	mov ebx, eax
	sbb ecx, ecx
	shr ebx, 6
	or eax, ecx
	and ebx, ecx
	add edx, ebx
	
	; code > 8000
	cmp eax, 16001
	mov ebx, eax
	sbb ecx, ecx
	shr ebx, 8
	or eax, ecx
	sub ebx, 4
	and ebx, ecx
	add edx, ebx
	
	; code > 16000
	shr eax, 12
	cmp edx, 1
	sbb ecx, ecx
	sub eax, 12
	and eax, ecx
	add edx, eax
	
	mov eax, table[edx - 4]
	<eax == tab>
	...
	
@error:
	<error>
	...

I don't have time to see which one would be faster on my comp.

Title: Re: Improve code with lots of branching
Post by: a_h on January 24, 2006, 06:51:05 PM

Thanks for all your replies!

Unfortunately I still hadn't time to try them all out!

The big problem of my original code are mispredicted branches so I hoped to avoid branching somehow. What do you think of something like this

Code Select


if (code>=16)
		__asm
		{
			mov		esi, code
			bsr		eax, esi;		// find index of highest 1 bit
			mov		edx, [pDCTtabNonI+4*eax-16] // adjusted addr of righ table
			mov		ecx, [DCTShiftTab+4*eax-16]		   // amount to shift code
			shr		esi, CL				
			imul	esi, 3				// size of entries
			add		edx, esi			// addr of proper entry of proper table
			mov		tab, edx			// save
		}

with 2 lookup tables?

What do you think? Cheers, Hannes

Title: Re: Improve code with lots of branching
Post by: Tedd on January 25, 2006, 12:25:11 PM

Well it doesn't jump :U

How about replacing

Code Select

imul esi,3
with

Code Select

lea esi,[esi+2*esi]
It's a handy little cheat :bg

Title: Re: Improve code with lots of branching
Post by: a_h on January 27, 2006, 10:45:59 PM

Well, I still hadn't time to test all of your ideas; however, the code in my previous post needs 1/3 less time (P4) than the original branching code. I'm curious how the other ideas perform!

Cheers, Hannes

PS: thanks Tedd for reminding me of that one!

The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: a_h on January 19, 2006, 07:31:44 PM