Code alignment, how to do it and what is the benefit ?

Started by dsouza123, February 14, 2006, 10:49:03 PM

dsouza123

  From postings in masmforum and items in the MASM Reference help file,
both data alignment (mostly) and code alignment are mentioned.
The reasons for data alignment on 4, 8 or (preferred) 16 byte boundaries are explained:
32-bit access, a single read/write instead of two, and not straddling page boundaries.
Also, certain instructions work best, or only, with aligned data (SSE2).

  How is code aligned ? .align statement ? nops ?  something else ?
Is it done once or throughout the code ?
Are there utilities/program options to detect and/or fix alignment problems, code and/or data ?

What are the benefits of code alignment ?
What are the effects of not aligning code ?

Anyone have tips for alignment and/or examples showing good code alignment ?

Any other issues with code alignment missed ?

EduardoS

To align code, just use

align 16

MASM will insert nops, lea instructions, or whatever else is needed to align the code.

Most processors read 16 bytes at a time from the L1 code cache, then decode, schedule and execute them. Modern processors can decode up to 3 instructions per clock if those instructions are in the same 16-byte block, so if your code is aligned the processor will decode faster (it can help in some algos).
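
The arithmetic behind align 16 can be sketched in a few lines. This is Python rather than MASM, purely for illustration; the helper names are made up for this example, and it says nothing about which filler instructions the assembler actually picks.

```python
def padding_for(offset, boundary=16):
    """Bytes of filler needed to move `offset` up to the next multiple of `boundary`."""
    return (-offset) % boundary


def crosses_boundary(offset, length, boundary=16):
    """True if an instruction of `length` bytes starting at `offset` spans two blocks."""
    return offset // boundary != (offset + length - 1) // boundary


print(padding_for(0x1003))            # 13 bytes of filler before the label
print(padding_for(0x1010))            # 0, already aligned
print(crosses_boundary(0x100E, 3))    # True: bytes 0x100E..0x1010
```

A label placed right after align 16 always starts a fresh 16-byte block, so the first few instructions after it cannot straddle a block boundary unless they add up to more than 16 bytes.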

MichaelW

I have done tests that showed very significant effects for data alignment, but not for code alignment. Running this code on a P3:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
    include timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT equ 10000000
    REPEAT_COUNT equ 100

    invoke Sleep,1000

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        mov   ebx, REPEAT_COUNT
        align 4
      @@:
        sub   ebx, 1
        jnz   @B
    counter_end
    push  eax
    print "conditional jump back to aligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        mov   ebx, REPEAT_COUNT
        align 4
        nop
      @@:
        sub   ebx, 1
        jnz   @B
    counter_end
    push  eax
    print "conditional jump back to misaligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        jmp   @F
        align 4
      @@:
      ENDM
    counter_end
    push  eax
    print "jump forward to aligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        jmp   @F
        align 4
        nop
      @@:
      ENDM
    counter_end
    push  eax
    print "jump forward to misaligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        call alignedproc
      ENDM
    counter_end
    push  eax
    print "call aligned procedure : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        call misalignedproc
      ENDM
    counter_end
    push  eax
    print "call misaligned procedure : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
alignedproc proc
    ret
alignedproc endp
align 4
nop
misalignedproc proc
    ret
misalignedproc endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

I get these results with a variation of no more than one cycle.

conditional jump back to aligned label : 206 cycles
conditional jump back to misaligned label : 206 cycles
jump forward to aligned label : 197 cycles
jump forward to misaligned label : 197 cycles
call aligned procedure : 424 cycles
call misaligned procedure : 425 cycles


I tested only align 4 and (align 4) + 1, but in the past I have had similar results for other alignments. Perhaps there is a better way to perform the timing and/or other code that will be more sensitive to alignment.




lingo

"..I tested only align 4 and (align 4) + 1, but in the past I have had similar results for other alignments.
Perhaps there is a better way to perform the timing and/or other code that will be more sensitive to alignment."


The P3 is very sensitive to code alignment, but you did it the wrong way.
Why?
Because you align BEFORE THE PROC rather than the LOOPS in the proc.

See chapter "15. Instruction fetch (PPro, PII and PIII)" from
"How to optimize for the Pentium family of microprocessors" by A. Fog

Regards,
Lingo

MichaelW

Quote from: lingo on February 15, 2006, 03:46:22 PM
Because you align BEFORE THE PROC rather than the LOOPS in the proc.

I don't understand. I did align before the procedures, and the loops were not in procedures. But you are right that I did not do it correctly, at least for a P3. If I modify the code so the first instruction after the conditional jump and in the misaligned procedure crosses a 16-byte boundary, and so the first instruction after the forward jump is a jump that crosses a 16-byte boundary, then I do see a substantial effect.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
    include timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT equ 1000000
    REPEAT_COUNT equ 1000

    invoke Sleep,1000

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        mov   ebx, REPEAT_COUNT
        align 16
      @@:
        sub   ebx, 1
        jnz   @B
    counter_end
    push  eax
    print "conditional jump back to aligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        mov   ebx, REPEAT_COUNT
        align 16
        nops 15
      @@:
        sub   ebx, 1
        jnz   @B
    counter_end
    push  eax
    print "conditional jump back to misaligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        jmp   @F
        align 16
      @@:
      ENDM
    counter_end
    push  eax
    print "jump forward to aligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        jmp   @F
        align 16
        nops 15
      @@:
      ENDM
    counter_end
    push  eax
    print "jump forward to misaligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        call alignedproc
      ENDM
    counter_end
    push  eax
    print "call aligned procedure : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        call misalignedproc
      ENDM
    counter_end
    push  eax
    print "call misaligned procedure : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 16
alignedproc proc
    xor   eax, eax
    ret
alignedproc endp
align 16
nops 15
misalignedproc proc
    xor   eax, eax
    ret
misalignedproc endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


conditional jump back to aligned label : 2019 cycles
conditional jump back to misaligned label : 3036 cycles
jump forward to aligned label : 6064 cycles
jump forward to misaligned label : 19946 cycles
call aligned procedure : 7963 cycles
call misaligned procedure : 8996 cycles
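
To put those numbers in proportion, here is a small Python sketch (not part of the test program) that turns the P3 cycle counts above into slowdown factors. The dictionary keys are just labels chosen for this example.

```python
# Cycle counts copied from the P3 run above: (aligned, misaligned).
results = {
    "conditional jump back": (2019, 3036),
    "jump forward": (6064, 19946),
    "call procedure": (7963, 8996),
}


def slowdown(aligned, misaligned):
    """How many times slower the misaligned case ran."""
    return misaligned / aligned


for name, (a, m) in results.items():
    print(f"{name}: {slowdown(a, m):.2f}x")
# conditional jump back: 1.50x
# jump forward: 3.29x
# call procedure: 1.13x
```

With REPEAT_COUNT at 1000, the first pair works out to roughly 2 cycles per loop iteration when aligned and 3 when the loop straddles a 16-byte boundary.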


u

Also, if you match loop alignment with a small loop size, you can get extra speed. MichaelW once made my jaw drop when the following code of his ran twice as fast as a 370+ line "optimize each case"-type proc of mine for adding two float arrays (DestFloat += SrcFloat):

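; Assumes no assembler-generated stack frame (hence the OPTION lines),
; 16-byte aligned arrays, and cnt a non-zero multiple of 4.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
XMMLoop proc src:DWORD, dst:DWORD, cnt:DWORD
    push edi
    push esi
    mov ecx, [esp+20]                   ; cnt
    mov esi, [esp+12]                   ; src
    mov edi, [esp+16]                   ; dst
    align 16
    @@:
        movaps XMM0, [edi+ecx*4-16]     ; load 4 floats from dst
        addps XMM0, [esi+ecx*4-16]      ; add 4 floats from src
        movaps [edi+ecx*4-16], XMM0     ; store the sums back to dst
    sub ecx, 4
    jnz @B
    pop esi
    pop edi
    ret 12
XMMLoop endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef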

This one takes 1 cycle per added float. If you make the loop any longer, the performance drops, contrary to my expectations. I guess in tiny loops the CPU needn't decode the last few instructions again if they're the same (and taken from the same address).
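
For anyone puzzled by the [edi+ecx*4-16] addressing, here is a rough Python model of the index walk. It is only a sketch of what the loop computes, not of its timing; the function name and the asserted preconditions are assumptions for this example.

```python
def xmm_loop(src, dst):
    """Model of the loop's index walk: dst[i] += src[i], four floats at a time, from the end."""
    count = len(dst)
    # Assumed preconditions: equal lengths and a non-zero multiple of 4.
    assert len(src) == count and count > 0 and count % 4 == 0
    ecx = count
    while ecx != 0:
        # [edi+ecx*4-16] addresses the 4 floats at indices ecx-4 .. ecx-1
        for i in range(ecx - 4, ecx):
            dst[i] += src[i]
        ecx -= 4          # sub ecx, 4 / jnz @B
    return dst


print(xmm_loop([10.0] * 8, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))
```

Counting ecx down to zero lets the sub double as the loop test, which is part of why the loop body stays so short.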

daydreamer

After reading Agner Fog's manual:
if your code is going to be in a library, should you also align the whole lib within an 8 KB block so that it does not cross such a boundary?


TOTGEBOREN

I ran twice and got better results for misaligned data!

AXP-M (Barton) 2400+ 1,8GHz

Mark Jones

Quote from: AMD Athlon XP 2500+
conditional jump back to aligned label : 2073 cycles
conditional jump back to misaligned label : 3118 cycles
jump forward to aligned label : 2073 cycles
jump forward to misaligned label : 7212 cycles
call aligned procedure : 10861 cycles
call misaligned procedure : 11982 cycles

daydreamer

Quote from: TOTGEBOREN on February 16, 2006, 07:45:34 PM
I ran twice and got better results for misaligned data!

AXP-M (Barton) 2400+ 1,8GHz
Same here, AXP 3000+ at only 2 GHz:
conditional jump back to aligned label : 228 cycles
conditional jump back to misaligned label : 232 cycles
jump forward to aligned label : 989 cycles
jump forward to misaligned label : 908 cycles
call aligned procedure : 2616 cycles
call misaligned procedure : 1051 cycles



u

AXP 2000+ (1.6GHz)

conditional jump back to aligned label : 214 cycles
conditional jump back to misaligned label : 212 cycles
jump forward to aligned label : 909 cycles
jump forward to misaligned label : 907 cycles
call aligned procedure : 2390 cycles
call misaligned procedure : 1045 cycles


This is hilarious ^^
[edit]: this is with the older, incorrect version of the benchmark

dioxin

From the Athlon Optimization Guide:
Quote
Align Branch Targets in Program Hot Spots:
In program hot spots (as determined by either profiling or loop
nesting analysis), place branch targets at or near the beginning
of 16-byte aligned code windows. This guideline improves
performance inside hotspots by maximizing the number of
instruction fills into the instruction-byte queue and preserves Icache
space in branch-intensive code outside such hotspots.

I'm sure I also read somewhere that instructions which straddle two cache lines are decoded as slow VectorPath instructions instead of fast DirectPath instructions.

Paul

EduardoS

On an XP 2000+:

conditional jump back to aligned label : 2014 cycles
conditional jump back to misaligned label : 3023 cycles
jump forward to aligned label : 2001 cycles
jump forward to misaligned label : 7047 cycles
call aligned procedure : 10444 cycles
call misaligned procedure : 11444 cycles


Also, I made a small change to the routine: the jump-forward test had 30 bytes between jumps in the misaligned case and 14 in the aligned case, so I made it 14 for both, and I aligned the procedure calls:


conditional jump back to aligned label : 2016 cycles
conditional jump back to misaligned label : 3016 cycles
jump forward to aligned label : 2000 cycles
jump forward to misaligned label : 3006 cycles
call aligned procedure : 4541 cycles
call misaligned procedure : 6040 cycles




V Coder

*** REMEMBER THAT MASM has an align bug: the padding used depends on the distance, and at a distance of 5 or 12 bytes the padding clears the carry flag by (ill-advisedly) using add.

http://win.asmcommunity.net/board/index.php?topic=22291.0

MASM uses the following for alignment padding:
 0 bytes = nothing
 1 byte  = NOP {90}
 2 bytes = MOV edi, edi {8BFF}
 3 bytes = lea ecx, [ecx+0x0] {8D4900}
 4 bytes = lea esp, [esp+0x0] {8D642400}
 5 bytes = add eax, 0x0 {0500000000}
 6 bytes = lea ebx, [ebx+0x0] {8D9B00000000}
 7 bytes = lea esp, [esp+0x0] {8DA42400000000}
 8 bytes = 7 + 1
 9 bytes = 7 + 2
10 bytes = 7 + 3
11 bytes = 7 + 4
12 bytes = 7 + 5
13 bytes = 7 + 6
14 bytes = 7 + 7
15 bytes = 7 + 7 + 1
{16 bytes = nothing}
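
As a cross-check of the list above, this Python sketch rebuilds the filler sequence for each padding size and reports which sizes include the add. The table is transcribed from the list; the function names are invented for this example, and the only outside fact relied on is that NOP, MOV and LEA leave the flags alone while ADD writes them.

```python
# Filler chosen per padding size, transcribed from the list above.
BASE = {
    1: "nop",
    2: "mov edi, edi",
    3: "lea ecx, [ecx+0]",
    4: "lea esp, [esp+0]",
    5: "add eax, 0",
    6: "lea ebx, [ebx+0]",
    7: "lea esp, [esp+0]",
}


def fillers(size):
    """Instructions emitted for `size` padding bytes (0..15), following the list above."""
    out = []
    while size > 7:
        out.append(BASE[7])
        size -= 7
    if size:
        out.append(BASE[size])
    return out


def touches_flags(size):
    """True if any filler for this size is the flag-writing add."""
    return any(ins.startswith("add") for ins in fillers(size))


print([n for n in range(16) if touches_flags(n)])   # [5, 12]
```

So only two of the sixteen possible distances are affected, which is why the problem can hide until unrelated code changes shift the loop label.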

This align bug caused serious problems in my program:

..init loop variables
..start processing loop data {carry may be set...}
**align 16 **{inadvertently cleared carry}
..loop:
..process data
..jns loop
..finish processing loop data

In addition, my program has not been able to benefit from aligning the loops, which are executed up to six times (Pentium III/Pentium 4 version) or 16 times (the Athlon version can't use align because of the above). Is it because the loop is executed at most six times, and the align padding may have its own penalty?

..setup mmx constants
..jmp start
**I never tried aligning here. I should.
..init loop variables
**align here does not appear to help.
..loop:
..process data
..jns loop
..if yy then output result; jmp start
..if zz then jmp init loop variables
..start
..check for exit signal, etc
..init data variables
..jmp init loop variables.

I guess the best I can hope for is by aligning 'init loop variables'.

V Coder

Quote from: MichaelW on February 16, 2006, 02:17:50 AM
If I modify the code so the first instruction after the conditional jump and in the misaligned procedure crosses a 16-byte boundary, and so the first instruction after the forward jump is a jump that crosses a 16-byte boundary, then I do see a substantial effect.

We are here comparing the best case (aligned to a 16-byte boundary) with the worst case (misaligned so as to cross a 16-byte boundary). What happens in the average case, e.g. when misaligned by 1-14 instead of 15 bytes?

Edit: From my testing this becomes a problem in the 2 instruction loop for byte displacements of 12-15 bytes on the Pentium III, and 7-15 on the Pentium 4. Will test the Athlon tomorrow. What about if three or more instructions are present?
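
One possible reading of the Pentium III threshold, sketched in Python: assuming sub ecx, 1 assembles to 3 bytes and the short jnz to 2 (an assumption about the encodings, not something measured here), the loop body is 5 bytes and first spills over a 16-byte boundary at a displacement of 12.

```python
LOOP_BYTES = 3 + 2   # assumed: sub ecx, 1 (3 bytes) + short jnz (2 bytes)


def straddles(offset, length=LOOP_BYTES, block=16):
    """True if `length` bytes starting `offset` bytes past a boundary spill into the next block."""
    return offset + length > block


print([k for k in range(16) if straddles(k)])   # [12, 13, 14, 15]
```

That lines up with the 12-15 range seen on the Pentium III. It does not account for the Pentium 4 starting at 7, so something else must be going on there.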

// RDTSC Check Program by V Coder
// Written in HLA
//

program Testloopalign;
#include( "stdlib.hhf" );
#include( "w.hhf" ) // Standard windows stuff.

static
align (4);
curr_pro: uns32;
prio_cls: uns32;
curr_th: uns32;
th_prio: uns32;


#macro startall;
#asm
          push edx          ; reserve space on stack
          push eax          ;

          rdtsc
          mov [esp], eax    ; instead of pushing eax & edx as below
          mov [esp+4], edx  ; mov uses less clock cycles than push


; routine to test
mov ecx, 10000
#endasm
#endmacro;

#macro endall;
#asm
; routine to test
     @@:
        sub   ecx, 1
        jnz   @B
; end routine

          rdtsc             ; Apparently rdtsc takes 13 cycles
          sub eax, [esp]
          sbb edx, [esp+4]
;         sub eax, 0eh      ; compensation for the rdtsc and the push, push if used after
;         sbb edx, 0        ; 14 cycles on a Pentium MMX
                            ;  9 cycles on a K6-2
;         sub eax, 0eh      ; compensation for the rdtsc and the mov, mov if used after
;         sbb edx, 0        ; 14 cycles on a Pentium MMX
                            ;  9 cycles on a K6-2

          add esp,8         ; remove edx, eax from stack
#endasm
#endmacro;


begin Testloopalign;

console.cls();
console.gotoxy(4, 15);
stdout.put ( nl "Loop Align Test:", nl nl);

     w.GetCurrentProcess();
     mov (eax, curr_pro);
     w.GetPriorityClass(curr_pro);     
     mov (eax, prio_cls);   
//     stdout.put(nl "Process Priority Class: ", prio_cls);
     w.SetPriorityClass(curr_pro, w.HIGH_PRIORITY_CLASS);     
     mov (eax, prio_cls);   
//     stdout.put(nl "Process Priority Class: ", prio_cls);

     w.GetCurrentThread();
     mov (eax, curr_th);
     w.GetThreadPriority (curr_th);
     mov (eax, th_prio);   
//     stdout.put(nl "Thread Priority: ", th_prio);
     w.SetThreadPriority (curr_th, w.THREAD_PRIORITY_HIGHEST);
     w.GetThreadPriority (curr_th);
     mov (eax, th_prio);   
//     stdout.put(nl "Thread Priority: ", th_prio);



startall;
#asm
align 16
#endasm
endall;
          stdout.put ( (type dword eax), " for 0 bytes." nl);

startall;
#asm
align 16
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 1 byte." nl);

startall;
#asm
align 16
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 2 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 3 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 4 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 5 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 6 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 7 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 8 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
nop
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 9 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 10 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 11 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 12 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 13 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 14 bytes." nl);

startall;
#asm
align 16
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 15 bytes." nl);

end Testloopalign;


Results: Pentium 4 (hex values are clock cycles)
Loop Align Test:

0000_2FB0 for 0 bytes.
0000_2FD0 for 1 byte.
0000_2FAC for 2 bytes.
0000_2FD0 for 3 bytes.
0000_2FB8 for 4 bytes.
0000_2FCC for 5 bytes.
0000_2FBC for 6 bytes.
0000_42B4 for 7 bytes.
0000_45F0 for 8 bytes.
0000_45D4 for 9 bytes.
0000_45BC for 10 bytes.
0000_457C for 11 bytes.
0000_45C0 for 12 bytes.
0000_4598 for 13 bytes.
0000_45E8 for 14 bytes.
0000_45C8 for 15 bytes.

Results: Pentium III
Loop Align Test:

0000_4E52 for 0 bytes.
0000_4E57 for 1 byte.
0000_4E57 for 2 bytes.
0000_4E57 for 3 bytes.
0000_4E57 for 4 bytes.
0000_4E58 for 5 bytes.
0000_4E55 for 6 bytes.
0000_4E58 for 7 bytes.
0000_4E59 for 8 bytes.
0000_4E59 for 9 bytes.
0000_4E5A for 10 bytes.
0000_4E59 for 11 bytes.
0000_7566 for 12 bytes.
0000_7567 for 13 bytes.
0000_7569 for 14 bytes.
0000_7571 for 15 bytes.
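
The hex counts are easier to compare in decimal. The Python sketch below converts both result lists and finds the first displacement where the loop gets noticeably slower. The 1.2x cut-off is an arbitrary choice for this example, and the function names are made up.

```python
# Hex cycle counts transcribed from the two result lists above, displacements 0..15.
P4 = "2FB0 2FD0 2FAC 2FD0 2FB8 2FCC 2FBC 42B4 45F0 45D4 45BC 457C 45C0 4598 45E8 45C8".split()
P3 = "4E52 4E57 4E57 4E57 4E57 4E58 4E55 4E58 4E59 4E59 4E5A 4E59 7566 7567 7569 7571".split()

ITERATIONS = 10000   # mov ecx, 10000 in the test program


def cycles(hex_counts):
    """Convert the hex strings to integers."""
    return [int(h, 16) for h in hex_counts]


def first_slow_offset(hex_counts, factor=1.2):
    """First displacement whose count exceeds `factor` times the 0-byte count."""
    counts = cycles(hex_counts)
    for offset, c in enumerate(counts):
        if c > factor * counts[0]:
            return offset
    return None


for name, data in (("Pentium 4", P4), ("Pentium III", P3)):
    c = cycles(data)
    print(name, "first slow displacement:", first_slow_offset(data))
    print("  cycles per iteration at 0 bytes:", c[0] / ITERATIONS)
    print("  cycles per iteration at 15 bytes:", c[15] / ITERATIONS)
```

On the Pentium III this comes to about 2 cycles per iteration up to 11 bytes and about 3 from 12 bytes on, the same 2-versus-3 pattern as in MichaelW's conditional jump test.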