Code alignment: how to do it and what is the benefit?

Started by dsouza123, February 14, 2006, 10:49:03 PM

V Coder

I have tested:

@@:
inc eax        ; single-byte instruction
sub ecx, 1
jnz @B


On the Pentium III, this has basically no effect: the timings are identical to those without the inc eax. On the Pentium 4, the clocks increase to:

Loop Align Test:

0000_3B50 for 0 bytes.
0000_3B80 for 1 byte.
0000_3B60 for 2 bytes.
0000_3B80 for 3 bytes.
0000_3B64 for 4 bytes.
0000_3B84 for 5 bytes.
0000_3B64 for 6 bytes.
0000_7288 for 7 bytes.
0000_72A4 for 8 bytes.
0000_7224 for 9 bytes.
0000_733C for 10 bytes.
0000_7398 for 11 bytes.
0000_73A0 for 12 bytes.
0000_7380 for 13 bytes.
0000_7354 for 14 bytes.
0000_73AC for 15 bytes.

In other words, the effect shows up at the same byte displacements on both the Pentium III and the Pentium 4. Well, not quite: the Pentium 4 times increase too much...

However, the Pentium 4 timings do not increase any further when I use an 8-byte loop as follows:
@@:
movd edx, mm0  ; three-byte instruction
sub ecx, 1
jnz @B


On the other hand, the Pentium III timings change:
Loop Align Test:

0000_5DA2 for 0 bytes.
0000_4E58 for 1 byte.
0000_4E58 for 2 bytes.
0000_4E58 for 3 bytes.
0000_4E59 for 4 bytes.
0000_4E59 for 5 bytes.
0000_4E59 for 6 bytes.
0000_4E5A for 7 bytes.
0000_4E5A for 8 bytes.
0000_7565 for 9 bytes.
0000_7574 for 10 bytes.
0000_7568 for 11 bytes.
0000_7568 for 12 bytes.
0000_7567 for 13 bytes.
0000_7569 for 14 bytes.
0000_7569 for 15 bytes.

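Yes, the perfectly aligned 8-byte loop is slower than the same loop offset by 1-8 bytes!!! Also, the misalignment penalty now extends from 9-15 bytes instead of 12-15 bytes: the loop is now 8 bytes long (3 + 3 + 2), so any displacement above 8 pushes its last byte past the 16-byte boundary.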
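To make the 9-byte case concrete, here is a sketch of the loop displaced by 9 bytes. The test harness is not shown in this thread, so the align/db padding is only an assumption about how such a displacement could be produced; it also assumes MMX instructions are enabled (.mmx) and that the code segment is aligned to at least 16 bytes.

```asm
    align 16                ; start from a 16-byte boundary
    db   9 dup (90h)        ; nine one-byte NOPs: the loop label lands at offset 9
@@:
    movd edx, mm0           ; 3 bytes, offsets 9-11
    sub  ecx, 1             ; 3 bytes, offsets 12-14
    jnz  @B                 ; 2 bytes, offsets 15-16: the second byte is in the next 16-byte block
```

With a displacement of 8 the jnz would end at offset 15 and the whole loop would stay inside one block, which matches the jump in the timings between 8 and 9 bytes.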

Interpretation/Recommendation:
The Pentium III still manages to pair (triple) the instructions. The Pentium 4 executes/decodes instructions one at a time???

Align: it will probably help your code, but test to determine exactly how much, if at all.
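The thread does not show how the clock counts were taken. One common approach, given here as an assumption rather than the harness actually used, is to read the time-stamp counter before and after the loop. The iteration count and the use of esi are hypothetical, and rdtsc needs .586 or later.

```asm
    rdtsc                   ; EDX:EAX = time-stamp counter
    mov  esi, eax           ; keep the low 32 bits of the start count
    mov  ecx, 10000         ; hypothetical iteration count
    align 16                ; padding runs once, before the loop is entered
@@:
    sub  ecx, 1
    jnz  @B
    rdtsc                   ; read the counter again
    sub  eax, esi           ; EAX = elapsed clocks (low 32 bits)
```

The result includes the overhead of rdtsc itself and of the padding, so it is best used to compare variants of the same loop rather than as an absolute figure.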

EduardoS

V Coder,
Here I tested the misaligned code without crossing a page boundary, and it took the same time as the aligned code. So code alignment by itself isn't that important, but avoiding page boundary crossings is,
... and code alignment helps to avoid those crossings...

V Coder

On the Athlon:
Loop Align Test:

0000_4E3D for 0 bytes.
0000_4E57 for 1 byte.
0000_4F66 for 2 bytes.
0000_4E4F for 3 bytes.
0000_4F9F for 4 bytes.
0000_4E52 for 5 bytes.
0000_4F91 for 6 bytes.
0000_4E56 for 7 bytes.
0000_4E5A for 8 bytes.
0000_4E50 for 9 bytes.
0000_4E54 for 10 bytes.
0000_4E51 for 11 bytes.
0000_7570 for 12 bytes.
0000_756D for 13 bytes.
0000_756B for 14 bytes.
0000_756A for 15 bytes.

with inc eax
Loop Align Test:

0000_4E3F for 0 bytes.
0000_4E5D for 1 byte.
0000_4FB5 for 2 bytes.
0000_4E54 for 3 bytes.
0000_4F77 for 4 bytes.
0000_4E5D for 5 bytes.
0000_4F70 for 6 bytes.
0000_4E58 for 7 bytes.
0000_4FD0 for 8 bytes.
0000_4E55 for 9 bytes.
0000_4E55 for 10 bytes.
0000_7570 for 11 bytes.
0000_756D for 12 bytes.
0000_7571 for 13 bytes.
0000_7570 for 14 bytes.
0000_7570 for 15 bytes.

With movd mm0, edx
Loop Align Test:

0000_69A7 for 0 bytes.
0000_61EF for 1 byte.
0000_62F5 for 2 bytes.
0000_61E3 for 3 bytes.
0000_62D9 for 4 bytes.
0000_6559 for 5 bytes.
0000_6552 for 6 bytes.
0000_65B2 for 7 bytes.
0000_674D for 8 bytes.
0000_7A8D for 9 bytes.
0000_79AA for 10 bytes.
0000_7944 for 11 bytes.
0000_7954 for 12 bytes.
0000_7950 for 13 bytes.
0000_7A90 for 14 bytes.
0000_79DC for 15 bytes.

Now the Athlon happily handles everything that fits completely within one 16-byte block, with no penalty. (Well, actually, it executes the MMX instruction in the same cycle as the sub, but the MMX result has a longer latency, hence the longer duration of even the 1-8 byte displacements compared to the previous tests. A long integer instruction would probably have executed in the same time as the previous tests. Both the Pentium III and the Athlon execute the jnz in a separate cycle from the sub, whereas the Pentium 4 apparently executes the jnz in the same cycle.) When the loop sits within a single 16-byte block, the jnz does not need a separate decode cycle. Note again the effect at 0 displacement.

So, code for the Athlon (which can decode/execute up to three integer instructions per clock cycle), and everything will be optimal for other processors. That is, ensure that the code at the targets of jumps, branches and calls does not straddle a 16-byte boundary: let the first three instructions from the target of a jump, call or branch all fit within one 16-byte block.

I optimized a compute-bound program with very long (46-63 instruction) loops based on this information, and got a one or two percent speed improvement as a result.
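As a minimal sketch of the recommendation above, the simplest way to keep the first instructions at a loop entry inside one 16-byte block is to put MASM's ALIGN directive in front of the label. This assumes the code segment itself is aligned to at least 16 bytes (otherwise the assembler rejects the directive); the iteration count is hypothetical.

```asm
    mov  ecx, 10000         ; hypothetical iteration count
    align 16                ; assembler pads so the next label starts on a 16-byte boundary
@@:
    inc  eax                ; 1 byte, offset 0
    sub  ecx, 1             ; 3 bytes, offsets 1-3
    jnz  @B                 ; 2 bytes, offsets 4-5: the whole loop fits in one block
```

Padding costs code size, so for targets that are reached rarely it may not be worth it; as noted earlier in the thread, measure before and after.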