Hi all!
Currently I try to speed up code that looks like this
for(i=0;;i++)
{
code=getCode();
if (code>16000)
tab=table[code>>12-16]
else if (code>8000)
tab=table[code>>8-8]
else if (code>4000)
tab=table[code>>6-4]
else if (code>2000)
tab=table[code>>4-4]
else if (code>500)
tab=table[code-4]
else
<error>
}
I would be very happy if someone has an idea how to improve that code - I'm searching for a faster solution for a P4. Actually I thought about using cmov, but I see no nice way to rewrite the branching part in assembly (using cmov).
Thanks for your help!
Regards, Hannes
Hi!
Before anything else I wanted to say that such control structures like switch or if-elsif-elsif...else can be very slow for the last cases. I mean if there is a switch with 50 cases then the 50th case may always have to wait until the other 49 is tested. If there are typical cases that will happen in say 80% of all cases then I think it is a good idea to bring those cases infront of the others. If average performance is important, like when all cases have the same possibility then this solution won't bring nothing. :'(
I would do it this way in asm:
The code the base of the condition is in eax. The shift count is in ecx and the the value that is substracted is in edx. After the array index is calculated the value to be passed to tab can be easily retrieved.
The shifts are to the right, so it is a division by 2^n,isn't it?
@Start:
call getCode ; I suggest this proc returnes code in eax
mov ebx,eax
mov ebx,eax
@1:
mov ecx,12
mov edx,16
cmp eax,16000
ja @End
@2:
mov cl,8
mov dl,8
cmp eax,8000
ja @End
@3:
mov cl,6
mov dl,4
cmp eax,4000
ja @End
@4:
mov cl,dl ; edx is already 4
cmp eax,2000
ja @End
@5:
xor cl,cl
cmp eax,500
jna @Error
@End:
shr eax,cl
sub eax,edx ; if the index in eax fits into a byte range (the table is like this table[256]) this can be sub al,dl
mov tab,[offset table+eax]
jmp @Start
@Error:
...
By placing the slow index calculation to a common place and making the cases as small as possible I tried to mprove average performance: all cases should run fast. This way the cmov instruction is not needed. I guess it is not that fast on the P4, but I am not sure of this.
My idea shows just one point of view. I know there are other better solutions and many ideas for instruction flow optimization, so I can hardly wait the other posts.
Greets, Gábor
Unless there is some kind of pattern that can be used to calculate the shift/offset, then it is quite difficult to do this kind of thing any other way. All I can suggest is nesting the range tests, which will reduce the number of tests per case. Though in this example there aren't so many that it will make too much difference.
for(i=0;;i++)
{
code=getCode();
if (code > 4000) {
if (code>16000)
tab=table[code>>12-16]
else if (code>8000)
tab=table[code>>8-8]
else //if (code>4000)
tab=table[code>>6-4]
} else {
else if (code>2000)
tab=table[code>>4-4]
else if (code>500)
tab=table[code-4]
else //(code<=500)
<error>
}
}
Assigning the shift and offset values, and then performing the shift and table lookup at the end will not really speed up the code, as each one is only performed once per iteration anyway. However, it will go to make the code smaller, which means it can better fit into cache, and may provide a small speedup that way.
In my opinion, simple switch "table" would make things smoother, or at least less forward jumps!:
call getCode
cmp eax, 16000
ja @over_16000
cmp eax, 8000
ja @over_8000
cmp eax, 4000
ja @over_4000
cmp eax, 2000
ja @over_2000
cmp eax, 500
ja @over_500
<error>
...
@continue:
mov eax, table[eax - 4]
<eax == tab>
...
@over_16000:
shr eax, 12
sub eax, 12
jmp @continue
@over_8000:
shr eax, 8
sub eax, 4
jmp @continue
@over_4000:
shr eax, 6
jmp @continue
@over_2000:
shr eax, 4
jmp @continue
@over_500:
jmp @continue
Branch elimination, without cmov: :eek
I guess this could be made faster somehow. :wink
call getCode
cmp eax, 500
jna @error
; code > 500
mov edx, eax
cmp eax, 2001
sbb ecx, ecx
and edx, ecx
or eax, ecx
; code > 2000
cmp eax, 4001
mov ebx, eax
sbb ecx, ecx
shr ebx, 4
or eax, ecx
and ebx, ecx
add edx, ebx
; code > 4000
cmp eax, 8001
mov ebx, eax
sbb ecx, ecx
shr ebx, 6
or eax, ecx
and ebx, ecx
add edx, ebx
; code > 8000
cmp eax, 16001
mov ebx, eax
sbb ecx, ecx
shr ebx, 8
or eax, ecx
sub ebx, 4
and ebx, ecx
add edx, ebx
; code > 16000
shr eax, 12
cmp edx, 1
sbb ecx, ecx
sub eax, 12
and eax, ecx
add edx, eax
mov eax, table[edx - 4]
<eax == tab>
...
@error:
<error>
...
I don't have time to see which one would be faster on my comp.
Thanks for all your replies!
Unfortunately I still hadn't time to try them all out!
The big problem of my original code are mispredicted branches so I hoped to avoid branching somehow. What do you think of something like this
if (code>=16)
__asm
{
mov esi, code
bsr eax, esi; // find index of highest 1 bit
mov edx, [pDCTtabNonI+4*eax-16] // adjusted addr of righ table
mov ecx, [DCTShiftTab+4*eax-16] // amount to shift code
shr esi, CL
imul esi, 3 // size of entries
add edx, esi // addr of proper entry of proper table
mov tab, edx // save
}
with 2 lookup tables?
What do you think? Cheers, Hannes
Well it doesn't jump :U
How about replacing
imul esi,3
with
lea esi,[esi+2*esi]
It's a handy little cheat :bg
Well, I still hadn't time to test all of your ideas; however, the code in my previous post needs 1/3 less time (P4) than the original branching code. I'm curious how the other ideas perform!
Cheers, Hannes
PS: thanks Tedd for reminding me of that one!