What is the most efficient way to unroll loops in assembly?
I have an inner loop that will be (usually) between 5 and 20 iterations. This number will be constant for each call of the inner loop, but unknown until run-time.
If I knew the size of the loop before-hand I could do something like this:
doThing(4)
doThing(3)
doThing(2)
doThing(1)
doThing(0)
I know C++ allows use of switch-case statements to unroll the loop efficiently.
I am wondering
(1) is there a better way to do it in assembly
(2) if not, how would I build a "skip-table" like the C++ compiler does automatically
(3) Can I take advantage of the fact that the inner loop executes the same number of iterations each time?
you can still unroll it
you pay a penalty for testing if you are done with each iteration
but, you can still get an advantage by "widening" the range of the index values
the inner loop that runs the same number of iterations can probably be unrolled, nicely :P
here is an example of an unrolled loop
it is loosely based on a design by Randy Hyde
however, i managed to squeeze a bit of speed out of it with some rather signifigant changes
one thing i watched for when i was working on it was branch distance
i tried to keep all the branches that are executed repeatedly in the SHORT range (+127 to -128 bytes)
if i unroll it one more time, some of the repeated branches become NEAR and speed is lost
so - keeping the code small for each iteration is important
notice the indexes that are added to EAX to unroll - this is what i am refering to as "widening"
([eax], [eax+4], [eax+8], etc)
these indexes happen in place of updating the pointer register
;*****************************************************************
OPTION PROLOGUE:None
OPTION EPILOGUE:None
ALIGN 16
SzLen PROC lpszStr:LPSTR
;SzLen - Get Length of Zero-Terminated String
;DednDave - Dave Sheldon - 8-2010
mov edx,[esp+4]
mov ecx,7F7F7F7Fh
test dl,3
push ebx
jz SzLen2
mov ebx,[edx]
xor eax,eax
or bl,bl
jz SzLen0
inc eax
or bh,bh
jz SzLen0
test dl,2
jnz SzLen1
inc eax
test ebx,0FF0000h
jnz SzLen1
SzLen0: pop ebx
ret 4
SzLen1: or dl,3
inc edx
SzLen2: push esi
push edi
mov esi,01010101h
mov edi,80808080h
SzLen3: mov eax,edx
mov ebx,[eax]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+4]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+8]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+12]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+16]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+20]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+24]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+28]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jnz SzLen4
mov ebx,[eax+32]
add edx,4
and ebx,ecx
sub ebx,esi
and ebx,edi
jz SzLen3
SzLen4: mov ebx,[edx-4]
lea eax,[edx-4]
or bl,bl
jz SzLen5
inc eax
or bh,bh
jz SzLen5
inc eax
test ebx,0FF0000h
jz SzLen5
inc eax
test ebx,0FF000000h
jnz SzLen3
SzLen5: pop edi
pop esi
sub eax,[esp+8]
pop ebx
ret 4
SzLen ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;*****************************************************************
You will find it has much to do with the Thing you want to Do. Depending on the code you do occasionally get decent improvements with unrolling, generally in cases where the loop overhead is a factor but it is often the case that unrolling does between little and nothing and in some situations where you get no speed gain from unrolling, you also get a cache penalty because of the excessive size of the loop code.
Set up a timing method and try variations while being aware that often such variations vary from one machine to another.
Hi,
Untested, but should be close to what would work. Unless
of course it is completely wrong.
HTH,
Steve N.
.DATA
; Make a jump table holding the addresses of labels that access
; an unrolled loop.
JumpTable DD Label0, Label1, Label2, Label3, Label4
.CODE
; Then use the size to enter the unrolled loop at the right place.
; (Make sure you don't exceed the size of your loop!)
MOV ESI,LoopSize
SHL ESI,2 ; Change index to DWORD array address.
JMP [JumpTable+ESI]
Label4:
doThing(4)
Label3:
doThing(3)
Label2:
doThing(2)
Label1:
doThing(1)
Label0:
doThing(0)
@dedndave
Could you explain the SHORT and NEAR in a little bit more detail?
Is the op-code few bytes for jumping a short distance and longer for jumping a longer distance? Is the speed difference noticable?
Also, how can I determine if the compiler is doing a SHORT or NEAR jump? i.e. how can I determine the size of the assembled code in between my jump statements?
@FORTRANS
If I am using x64, do I need to change the JumpTable pointers to 64 bit, or are they relative addresses so that 32 bit will work just as well?
well - short branches are 2 bytes in size and near branches are 5 bytes
but, it's not really the size of the branch that matters - it's the speed
i have noticed that short branches in loops are signifigantly faster, at least on a P4
also - near branches "prefer" 16-aligned target addresses - short branches don't seem to care
one way to see if near or short branches are being generated is to use the "short" specifier :P
jz short target
if the assembler is unable to reach the target with a short branch, it will spit out an error - lol
another trick is to temporarily disable instructions for newer processors
that usually doesn't work too well, because you want to use newer instructions elsewhere in the loop
but, on the 286 and older processors, all conditional branches had to be short
.286
jz target
.586
again, the assembler will spit out an error if target is too far away
the way i usually do it is to generate an assembler listing and examine the code
you can do this by adding the "/Fl" switch to the MASM command line
you can see if short or near branches are used, and how far off you are
this method also gives you the opportunity to review other generated code
sometimes, you see something in the generated code that doesn't seem obvious in the source
when you use the "Fl" switch, the assembler spits out all kinds of stuff
it will create a huge LST file
you can reduce the size by using this at the beginning of the source...
.XCREF
.NOLIST
INCLUDE \masm32\include\masm32rt.inc
.LIST
now - let me point you at some good reading...
http://www.agner.org/optimize/
The main thing to watch for when unrolling is that your loop still fits into cache, if not then you're going to lose the benefit of unrolling. So there's a limit to the number of times you should unroll a loop, which is dependent on the size of that loop (a shorter loops can be unrolled more times than a longer one.)
Indirectly, this is why 'short' jumps are better - they imply a smaller loop which is more likely to fit into cache. (They're also encoded as 2 bytes instead of 3, which arguably causes a small penalty, but this is only on the first loop iteration; a few extra bytes spent encoding jumps means a few less bytes for useful instructions while still fitting into cache.)
Also, then, the start of your loop should be aligned to cache-line size, so that you have the maximum remaining space to fit your loop into. And just to make things easy, cache-line size is entirely dependent on the cpu model!
Quote from: ASMManiac on February 16, 2012, 04:48:00 PM
@FORTRANS
If I am using x64, do I need to change the JumpTable pointers to 64 bit, or are they relative addresses so that 32 bit will work just as well?
Hi,
Well, you can ask in the 64-bit subforum. I have not
looked at 64-bit code enough to say what is valid in long
mode.
Regards,
Steve N.
I am having trouble creating the jump table while using labels created with a macro.
It says: Symbol not defined Loop1_0.
It seems like the assembler is trying to find Loop1_0 in the source but can't until it has evalueated the macro for loop.
I am using Jwasm.
Here is my code:
.data
JumpTable1 DQ Loop1_0, Loop1_1, Loop1_2, Loop1_3
.code
minFilt1_16A PROC uses r10 r11 r12 r13 r14 rax
... other code...
@@:
FOR arg, <0,1,2,3,4>
@CatStr(Loop1_, <&arg>, <:>)
movdqu xmm1, [SRC + j - &arg * 16]
pminub xmm0, xmm1
movdqu [blk1_128 + j - &arg * 16], xmm0
ENDM
... other code....
minFilt1_16A
declare the labels with two colons
QuoteLoop1_0::