Title: Code alignment, how to do it and what is the benefit ?
Post by: dsouza123 on February 14, 2006, 10:49:03 PM

Postings in masmforum and items in the MASM Reference help file mention
both data alignment (mostly) and code alignment.
The reasons for data alignment on 4, 8 or (preferred) 16 byte boundaries are explained:
32-bit access, a single read/write instead of two, and not straddling page boundaries.
Also, certain instructions work best, or only, with aligned data (SSE2).

How is code aligned ? .align statement ? nops ? something else ?
Is it done once or throughout the code ?
Are there utilities/program options to detect and/or fix alignment problems, code and/or data ?

What are the benefits of code alignment ?
What are the effects of not aligning code ?

Anyone have tips for alignment and/or examples showing good code alignment ?

Any other issues with code alignment missed ?

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: EduardoS on February 14, 2006, 11:20:21 PM

To align code, just use:

Code:

align 16

MASM will insert nops, do-nothing lea instructions, or whatever else is needed to align the code.

Most processors read 16 bytes at a time from the L1 code cache, then decode, schedule and execute them. Modern processors can decode up to 3 instructions per clock if those instructions are in the same 16-byte block, so aligned code can decode faster (this can help in some algorithms).
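
As a rough illustration of the 16-byte fetch idea, here is a minimal Python sketch that checks whether an instruction of a given length, starting at a given address, spills over into the next 16-byte block. The function name `crosses_block` is made up for this example, and the 16-byte block size is simply the figure quoted in the post.

```python
BLOCK = 16  # fetch block size assumed in the post above

def crosses_block(address, length, block=BLOCK):
    """Return True if an instruction occupying [address, address + length)
    spans two different aligned blocks."""
    first = address // block
    last = (address + length - 1) // block
    return first != last

# A 3-byte instruction at offset 13 ends at offset 15: still inside the block.
print(crosses_block(13, 3))   # False
# The same instruction one byte later spills into the next block.
print(crosses_block(14, 3))   # True
```

Whether such a crossing actually costs a cycle depends on the processor, which is what the timings later in the thread try to measure.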

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: MichaelW on February 15, 2006, 12:11:24 AM

I have done tests that showed very significant effects for data alignment, but not for code alignment. Running this code on a P3:

Code:


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
    include timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT equ 10000000
    REPEAT_COUNT equ 100

    invoke Sleep,1000

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        mov   ebx, REPEAT_COUNT
        align 4
      @@:
        sub   ebx, 1
        jnz   @B
    counter_end
    push  eax
    print "conditional jump back to aligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        mov   ebx, REPEAT_COUNT
        align 4
        nop
      @@:
        sub   ebx, 1
        jnz   @B
    counter_end
    push  eax
    print "conditional jump back to misaligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        jmp   @F
        align 4
      @@:
      ENDM
    counter_end
    push  eax
    print "jump forward to aligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        jmp   @F
        align 4
        nop
      @@:
      ENDM
    counter_end
    push  eax
    print "jump forward to misaligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        call alignedproc
      ENDM
    counter_end
    push  eax
    print "call aligned procedure : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        call misalignedproc
      ENDM
    counter_end
    push  eax
    print "call misaligned procedure : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 4
alignedproc proc
    ret
alignedproc endp
align 4
nop
misalignedproc proc
    ret
misalignedproc endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

I get these results with a variation of no more than one cycle.

Code:


conditional jump back to aligned label : 206 cycles
conditional jump back to misaligned label : 206 cycles
jump forward to aligned label : 197 cycles
jump forward to misaligned label : 197 cycles
call aligned procedure : 424 cycles
call misaligned procedure : 425 cycles

I tested only align 4 and (align 4) + 1, but in the past I have had similar results for other alignments. Perhaps there is a better way to perform the timing and/or other code that will be more sensitive to alignment.

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: lingo on February 15, 2006, 03:46:22 PM

"..I tested only align 4 and (align 4) + 1, but in the past I have had similar results for other alignments.
Perhaps there is a better way to perform the timing and/or other code that will be more sensitive to alignment."

P3 is very sensitive to code alignment, but you did it the wrong way :lol
Why?
Because you align BEFORE THE PROC rather than the LOOPS in the proc.

See chapter "15. Instruction fetch (PPro, PII and PIII)" from
"How to optimize for the Pentium family of microprocessors" by A. Fog

Regards,
Lingo

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: MichaelW on February 16, 2006, 02:17:50 AM

Quote from: lingo on February 15, 2006, 03:46:22 PM
Because you align BEFORE THE PROC rather than the LOOPS in the proc.

I don't understand. I did align before the procedures, and the loops were not in procedures. But you are right that I did not do it correctly, at least for a P3. If I modify the code so the first instruction after the conditional jump and in the misaligned procedure crosses a 16-byte boundary, and so the first instruction after the forward jump is a jump that crosses a 16-byte boundary, then I do see a substantial effect.

Code:


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
    include timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT equ 1000000
    REPEAT_COUNT equ 1000

    invoke Sleep,1000

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        mov   ebx, REPEAT_COUNT
        align 16
      @@:
        sub   ebx, 1
        jnz   @B
    counter_end
    push  eax
    print "conditional jump back to aligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        mov   ebx, REPEAT_COUNT
        align 16
        nops 15
      @@:
        sub   ebx, 1
        jnz   @B
    counter_end
    push  eax
    print "conditional jump back to misaligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        jmp   @F
        align 16
      @@:
      ENDM
    counter_end
    push  eax
    print "jump forward to aligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        jmp   @F
        align 16
        nops 15
      @@:
      ENDM
    counter_end
    push  eax
    print "jump forward to misaligned label : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        call alignedproc
      ENDM
    counter_end
    push  eax
    print "call aligned procedure : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        call misalignedproc
      ENDM
    counter_end
    push  eax
    print "call misaligned procedure : "
    pop   eax
    print ustr$(eax)," cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 16
alignedproc proc
    xor   eax, eax
    ret
alignedproc endp
align 16
nops 15
misalignedproc proc
    xor   eax, eax
    ret
misalignedproc endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Code:


conditional jump back to aligned label : 2019 cycles
conditional jump back to misaligned label : 3036 cycles
jump forward to aligned label : 6064 cycles
jump forward to misaligned label : 19946 cycles
call aligned procedure : 7963 cycles
call misaligned procedure : 8996 cycles

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: u on February 16, 2006, 03:59:48 AM

Also, if you match loop alignment with small loop size, you can get extra speed. MichaelW once made my jaw drop when the following code of his performed two times faster than a 370+ lines "optimize each case"-type proc of mine for adding two float arrays (DestFloat += SrcFloat):

Code:


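; Assumes prologue/epilogue generation is disabled (e.g. OPTION PROLOGUE:NONE
; and OPTION EPILOGUE:NONE), otherwise the [esp+N] offsets below are wrong.
; src and dst must be 16-byte aligned (movaps); cnt must be a nonzero multiple of 4.
XMMLoop proc src:DWORD, dst:DWORD, cnt:DWORD
    push edi
    push esi
    mov ecx, [esp+20]       ; cnt
    mov esi, [esp+12]       ; src
    mov edi, [esp+16]       ; dst
    align 16
    @@:
        movaps XMM0, [edi+ecx*4-16]     ; load 4 floats from dst
        addps XMM0, [esi+ecx*4-16]      ; add 4 floats from src
        movaps [edi+ecx*4-16], XMM0     ; store the sums back to dst
    sub ecx, 4
    jnz @B
    pop esi
    pop edi
    ret 12
XMMLoop endp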

This one takes 1 cycle per added float. If you make the loop any longer, the performance drops, contrary to my expectations. I guess that in tiny loops the CPU needn't decode the last few instructions again if they're the same (and fetched from the same address).
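
To make the indexing in the loop above easier to follow, here is a pure-Python model of what it computes. This is only a sketch: the name `xmm_loop_model` and the argument check are additions for illustration, and the model says nothing about timing.

```python
def xmm_loop_model(src, dst, cnt):
    """Pure-Python model of the loop: dst[i] += src[i], four floats per pass,
    walking from the end of the arrays toward the start."""
    # Guard added for this sketch; the assembly version performs no such check.
    if cnt <= 0 or cnt % 4 != 0:
        raise ValueError("cnt must be a positive multiple of 4")
    ecx = cnt
    while True:
        base = ecx - 4              # [reg + ecx*4 - 16] is element index ecx - 4
        for lane in range(4):       # one movaps/addps handles 4 packed floats
            dst[base + lane] += src[base + lane]
        ecx -= 4                    # sub ecx, 4
        if ecx == 0:                # jnz @B falls through when ecx reaches 0
            break

src = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
dst = [10.0] * 8
xmm_loop_model(src, dst, 8)
print(dst)   # [11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]
```

Counting ecx down to zero lets the `sub` set the flags for the `jnz`, so no separate compare is needed in the loop.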

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: daydreamer on February 16, 2006, 05:21:25 PM

After reading Agner Fog's manual:
if your code is going to be in a library, should you also align the whole library within an 8 KB block so that it does not cross such a boundary?

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: TOTGEBOREN on February 16, 2006, 07:45:34 PM

I ran twice and got better results for misaligned data!

AXP-M (Barton) 2400+ 1,8GHz

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: Mark Jones on February 16, 2006, 08:28:33 PM

Quote from: AMD Athlon XP 2500+
conditional jump back to aligned label : 2073 cycles
conditional jump back to misaligned label : 3118 cycles
jump forward to aligned label : 2073 cycles
jump forward to misaligned label : 7212 cycles
call aligned procedure : 10861 cycles
call misaligned procedure : 11982 cycles

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: daydreamer on February 16, 2006, 08:49:15 PM

Quote from: TOTGEBOREN on February 16, 2006, 07:45:34 PM
I ran twice and got better results for misaligned data!

AXP-M (Barton) 2400+ 1,8GHz

same here AXP 3000+ only 2ghz
:eek

Code:

conditional jump back to aligned label : 228 cycles
conditional jump back to misaligned label : 232 cycles
jump forward to aligned label : 989 cycles
jump forward to misaligned label : 908 cycles
call aligned procedure : 2616 cycles
call misaligned procedure : 1051 cycles

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: u on February 17, 2006, 01:47:19 PM

AXP 2000+ (1.6GHz)

Code:


conditional jump back to aligned label : 214 cycles
conditional jump back to misaligned label : 212 cycles
jump forward to aligned label : 909 cycles
jump forward to misaligned label : 907 cycles
call aligned procedure : 2390 cycles
call misaligned procedure : 1045 cycles

This is hilarious ^^
[edit]: this is with the older, incorrect version of the benchmark :red

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: dioxin on February 17, 2006, 06:46:33 PM

From the Athlon Optimization Guide:

Quote
Align Branch Targets in Program Hot Spots:
In program hot spots (as determined by either profiling or loop
nesting analysis), place branch targets at or near the beginning
of 16-byte aligned code windows. This guideline improves
performance inside hotspots by maximizing the number of
instruction fills into the instruction-byte queue and preserves Icache
space in branch-intensive code outside such hotspots.

I'm sure I also read somewhere that instructions which straddle 2 cache lines are decoded as slow VectorPath instructions instead of fast DirectPath instructions.

Paul

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: EduardoS on February 17, 2006, 08:50:02 PM

In a XP 2000+

Code:


conditional jump back to aligned label : 2014 cycles
conditional jump back to misaligned label : 3023 cycles
jump forward to aligned label : 2001 cycles
jump forward to misaligned label : 7047 cycles
call aligned procedure : 10444 cycles
call misaligned procedure : 11444 cycles

Also, I made a small change to the routine: the jump-forward test was using 30 bytes between jumps for the misaligned case and 14 for the aligned case. I made it 14 for both, and aligned the procedure calls:

Code:


conditional jump back to aligned label : 2016 cycles
conditional jump back to misaligned label : 3016 cycles
jump forward to aligned label : 2000 cycles
jump forward to misaligned label : 3006 cycles
call aligned procedure : 4541 cycles
call misaligned procedure : 6040 cycles

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: V Coder on February 20, 2006, 01:41:13 AM

*** REMEMBER THAT MASM has an align bug: the padding code it emits depends on the distance, and at a distance of 5 or 12 bytes it (ill-advisedly) uses add, which clears the carry flag.

http://win.asmcommunity.net/board/index.php?topic=22291.0

MASM uses the following for alignment padding:
0 bytes = nothing
1 byte = NOP {90}
2 bytes = MOV edi, edi {8BFF}
3 bytes = lea ecx, [ecx+0x0] {8D4900}
4 bytes = lea esp, [esp+0x0] {8D642400}
5 bytes = add eax, 0x0 {0500000000}
6 bytes = lea ebx, [ebx+0x0] {8D9B00000000}
7 bytes = lea esp, [esp+0x0] {8DA42400000000}
8 bytes = 7 + 1
9 bytes = 7 + 2
10 bytes = 7 + 3
11 bytes = 7 + 4
12 bytes = 7 + 5
13 bytes = 7 + 6
14 bytes = 7 + 7
15 bytes = 7 + 7 + 1
{16 bytes = nothing}
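
The table above can be turned into a small lookup to see which pad sizes include the flag-changing `add`. This is a sketch in Python that takes the table at face value; the helper names `padding_pieces` and `alters_flags` are made up for illustration and the table has not been checked against any particular MASM version.

```python
# Filler MASM emits for 1..7 bytes of padding, as listed in the post above.
FILLERS = {
    1: "nop",
    2: "mov edi, edi",
    3: "lea ecx, [ecx+0x0]",
    4: "lea esp, [esp+0x0]",
    5: "add eax, 0x0",
    6: "lea ebx, [ebx+0x0]",
    7: "lea esp, [esp+0x0]",
}

def padding_pieces(pad):
    """Split a pad of 0..15 bytes into filler sizes, following the table:
    up to 7 bytes is one filler, 8..14 is 7 + remainder, 15 is 7 + 7 + 1."""
    if not 0 <= pad <= 15:
        raise ValueError("pad must be in 0..15")
    pieces = []
    while pad > 7:
        pieces.append(7)
        pad -= 7
    if pad:
        pieces.append(pad)
    return pieces

def alters_flags(pad):
    """True if the padding contains the 5-byte 'add eax, 0x0' filler."""
    return 5 in padding_pieces(pad)

print([pad for pad in range(16) if alters_flags(pad)])  # [5, 12]
for pad in (5, 12):
    print(pad, [FILLERS[size] for size in padding_pieces(pad)])
```

Only pads of 5 and 12 bytes contain the `add`, which matches the two distances named at the top of the post.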

This align bug caused serious problems in my program:

..init loop variables
..start processing loop data {carry may be set...}
**align 16 **{inadvertently cleared carry}
..loop:
..process data
..jns loop
..finish processing loop data

In addition, my program has not benefited from aligning the loops, which are executed up to six times (Pentium III/Pentium 4 version) or 16 times (Athlon version, which can't use align because of the above). Is it because the loop is executed at most six times, and the align padding may have its own penalty?

..setup mmx constants
..jmp start
**I never tried aligning here. I should.
..init loop variables
**align here does not appear to help.
..loop:
..process data
..jns loop
..if yy then output result; jmp start
..if zz then jmp init loop variables
..start
..check for exit signal, etc
..init data variables
..jmp init loop variables.

I guess the best I can hope for is by aligning 'init loop variables'.

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: V Coder on February 20, 2006, 02:49:23 AM

Quote from: MichaelW on February 16, 2006, 02:17:50 AM
If I modify the code so the first instruction after the conditional jump and in the misaligned procedure crosses a 16-byte boundary, and so the first instruction after the forward jump is a jump that crosses a 16-byte boundary, then I do see a substantial effect.

Here we are comparing the best case (aligned to a 16-byte boundary) with the worst case (misaligned so that an instruction crosses a 16-byte boundary). What happens in the average case, e.g. when misaligned by 1-14 instead of 15 bytes?

Edit: From my testing this becomes a problem in the 2 instruction loop for byte displacements of 12-15 bytes on the Pentium III, and 7-15 on the Pentium 4. Will test the Athlon tomorrow. What about if three or more instructions are present?

Code:

// RDTSC Check Program by V Coder
// Written in HLA
//

program Testloopalign;
#include( "stdlib.hhf" );
#include( "w.hhf" )			// Standard windows stuff.

static
align (4);
	curr_pro:	uns32;
	prio_cls:	uns32;
	curr_th:	 uns32;
	th_prio:	uns32;


#macro startall;
#asm
          push edx          ; reserve space on stack
          push eax          ;

          rdtsc
          mov [esp], eax    ; instead of pushing eax & edx as below
          mov [esp+4], edx  ; mov uses less clock cycles than push


; routine to test
	mov ecx, 10000
#endasm
#endmacro;

#macro endall;
#asm
; routine to test
     @@:
        sub   ecx, 1
        jnz   @B
; end routine

          rdtsc             ; Apparently rdtsc takes 13 cycles
          sub eax, [esp]
          sbb edx, [esp+4]
;         sub eax, 0eh      ; compensation for the rdtsc and the push, push if used after
;         sbb edx, 0        ; 14 cycles on a Pentium MMX
                            ;  9 cycles on a K6-2
;         sub eax, 0eh      ; compensation for the rdtsc and the mov, mov if used after
;         sbb edx, 0        ; 14 cycles on a Pentium MMX
                            ;  9 cycles on a K6-2

          add esp,8         ; remove edx, eax from stack
#endasm
#endmacro;


begin Testloopalign;

	console.cls();
	console.gotoxy(4, 15);
	stdout.put ( nl "Loop Align Test:", nl nl);

     w.GetCurrentProcess();
     mov (eax, curr_pro);
     w.GetPriorityClass(curr_pro);     
     mov (eax, prio_cls);    
//     stdout.put(nl "Process Priority Class: ", prio_cls);
     w.SetPriorityClass(curr_pro, w.HIGH_PRIORITY_CLASS);     
     mov (eax, prio_cls);    
//     stdout.put(nl "Process Priority Class: ", prio_cls);

     w.GetCurrentThread();
     mov (eax, curr_th);
     w.GetThreadPriority (curr_th);
     mov (eax, th_prio);    
//     stdout.put(nl "Thread Priority: ", th_prio);
     w.SetThreadPriority (curr_th, w.THREAD_PRIORITY_HIGHEST);
     w.GetThreadPriority (curr_th);
     mov (eax, th_prio);    
//     stdout.put(nl "Thread Priority: ", th_prio);



startall;
#asm
	align 16
#endasm
endall;
          stdout.put ( (type dword eax), " for 0 bytes." nl);

startall;
#asm
	align 16
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 1 byte." nl);

startall;
#asm
	align 16
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 2 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 3 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 4 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 5 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 6 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 7 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 8 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 9 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 10 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 11 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 12 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 13 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 14 bytes." nl);

startall;
#asm
	align 16
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
#endasm
endall;
          stdout.put ( (type dword eax), " for 15 bytes." nl);

end Testloopalign;

Results: Pentium 4 *Hex values are clock cycles
Loop Align Test:

0000_2FB0 for 0 bytes.
0000_2FD0 for 1 byte.
0000_2FAC for 2 bytes.
0000_2FD0 for 3 bytes.
0000_2FB8 for 4 bytes.
0000_2FCC for 5 bytes.
0000_2FBC for 6 bytes.
0000_42B4 for 7 bytes.
0000_45F0 for 8 bytes.
0000_45D4 for 9 bytes.
0000_45BC for 10 bytes.
0000_457C for 11 bytes.
0000_45C0 for 12 bytes.
0000_4598 for 13 bytes.
0000_45E8 for 14 bytes.
0000_45C8 for 15 bytes.

Results: Pentium III
Loop Align Test:

0000_4E52 for 0 bytes.
0000_4E57 for 1 byte.
0000_4E57 for 2 bytes.
0000_4E57 for 3 bytes.
0000_4E57 for 4 bytes.
0000_4E58 for 5 bytes.
0000_4E55 for 6 bytes.
0000_4E58 for 7 bytes.
0000_4E59 for 8 bytes.
0000_4E59 for 9 bytes.
0000_4E5A for 10 bytes.
0000_4E59 for 11 bytes.
0000_7566 for 12 bytes.
0000_7567 for 13 bytes.
0000_7569 for 14 bytes.
0000_7571 for 15 bytes.
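
The hex figures are easier to compare as cycles per loop iteration. The sketch below, in Python, converts a few of the values listed above, assuming the 10000 iterations set by `mov ecx, 10000` in the test program and ignoring the uncompensated rdtsc overhead. The helper name `cycles` is made up for this example.

```python
ITERATIONS = 10000  # from "mov ecx, 10000" in the test program

# A few of the values reported above, copied as printed.
samples = {
    "Pentium 4, 0 bytes": "0000_2FB0",
    "Pentium 4, 8 bytes": "0000_45F0",
    "Pentium III, 0 bytes": "0000_4E52",
    "Pentium III, 12 bytes": "0000_7566",
}

def cycles(hex_text):
    """Convert an HLA-style hex string such as '0000_2FB0' to an integer."""
    return int(hex_text.replace("_", ""), 16)

for label, text in samples.items():
    total = cycles(text)
    print(f"{label:22s} {total:6d} cycles, {total / ITERATIONS:.2f} per iteration")
```

On these numbers the Pentium 4 goes from roughly 1.2 to roughly 1.8 cycles per iteration, and the Pentium III from roughly 2 to roughly 3.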

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: V Coder on February 20, 2006, 03:59:41 AM

I have tested:

Code:

@@:
inc eax {single byte instruction}
sub ecx, 1
jnz @B

On the Pentium III, this has basically no effect: the timings are identical to those without inc eax. On the Pentium 4, the clocks increase to:

Loop Align Test:

0000_3B50 for 0 bytes.
0000_3B80 for 1 byte.
0000_3B60 for 2 bytes.
0000_3B80 for 3 bytes.
0000_3B64 for 4 bytes.
0000_3B84 for 5 bytes.
0000_3B64 for 6 bytes.
0000_7288 for 7 bytes.
0000_72A4 for 8 bytes.
0000_7224 for 9 bytes.
0000_733C for 10 bytes.
0000_7398 for 11 bytes.
0000_73A0 for 12 bytes.
0000_7380 for 13 bytes.
0000_7354 for 14 bytes.
0000_73AC for 15 bytes.

In other words, same effect at the same byte displacements for both Pentium III and Pentium 4. Well not actually, the Pentium 4 times increase too much...

However, the Pentium 4 timings do not increase any further when I use an 8 byte loop as follows:

Code:

@@:
movd edx, mm0 {three byte instruction}
sub ecx, 1
jnz @B

On the other hand, the Pentium III timings change:
Loop Align Test:

0000_5DA2 for 0 bytes.
0000_4E58 for 1 byte.
0000_4E58 for 2 bytes.
0000_4E58 for 3 bytes.
0000_4E59 for 4 bytes.
0000_4E59 for 5 bytes.
0000_4E59 for 6 bytes.
0000_4E5A for 7 bytes.
0000_4E5A for 8 bytes.
0000_7565 for 9 bytes.
0000_7574 for 10 bytes.
0000_7568 for 11 bytes.
0000_7568 for 12 bytes.
0000_7567 for 13 bytes.
0000_7569 for 14 bytes.
0000_7569 for 15 bytes.

Yes the perfectly aligned 8 byte loop is slower than that offset by 1-8 bytes!!! Also, the misaligned effect now extends from 9-15 bytes instead of 12-15 bytes.
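
The 12-15 and 9-15 ranges are what simple arithmetic on the loop size would predict. Assuming the usual encodings (3 bytes for `sub ecx, 1`, 2 bytes for a short `jnz`, 1 byte for `inc eax`, 3 bytes for the `movd`), the three loops are 5, 6 and 8 bytes long. A minimal Python sketch, with a made-up helper name, lists the start offsets at which a loop of a given length no longer fits in one 16-byte block:

```python
def crossing_offsets(loop_bytes, block=16):
    """Offsets within an aligned block at which a loop of loop_bytes bytes
    would extend past the end of that block."""
    return [off for off in range(block) if off + loop_bytes > block]

for name, size in [("sub/jnz", 5), ("inc/sub/jnz", 6), ("movd/sub/jnz", 8)]:
    print(f"{name:14s} {size} bytes: crosses at offsets {crossing_offsets(size)}")
```

This gives 12-15 for the 5-byte loop and 9-15 for the 8-byte loop, in line with the Pentium III figures. It does not explain the Pentium 4 slowdown that starts at 7 bytes, nor the slower 0-byte case for the 8-byte loop.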

Interpretation/Recommendation:
The Pentium III still manages to pair (triple) the instructions. The Pentium 4 executes/decodes instructions one at a time???

Align - it will probably help your code, but test to determine exactly how much if at all.

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: EduardoS on February 20, 2006, 10:50:32 AM

V Coder,
Here I tested misaligned code that does not cross a 16-byte boundary, and it took the same time as the aligned code. So alignment itself isn't so important, but avoiding boundary crossings is,
... and code alignment helps to avoid crossings...

Title: Re: Code alignment, how to do it and what is the benefit ?
Post by: V Coder on February 20, 2006, 10:14:38 PM

On the Athlon:
Loop Align Test:

0000_4E3D for 0 bytes.
0000_4E57 for 1 byte.
0000_4F66 for 2 bytes.
0000_4E4F for 3 bytes.
0000_4F9F for 4 bytes.
0000_4E52 for 5 bytes.
0000_4F91 for 6 bytes.
0000_4E56 for 7 bytes.
0000_4E5A for 8 bytes.
0000_4E50 for 9 bytes.
0000_4E54 for 10 bytes.
0000_4E51 for 11 bytes.
0000_7570 for 12 bytes.
0000_756D for 13 bytes.
0000_756B for 14 bytes.
0000_756A for 15 bytes.

with inc eax
Loop Align Test:

0000_4E3F for 0 bytes.
0000_4E5D for 1 byte.
0000_4FB5 for 2 bytes.
0000_4E54 for 3 bytes.
0000_4F77 for 4 bytes.
0000_4E5D for 5 bytes.
0000_4F70 for 6 bytes.
0000_4E58 for 7 bytes.
0000_4FD0 for 8 bytes.
0000_4E55 for 9 bytes.
0000_4E55 for 10 bytes.
0000_7570 for 11 bytes.
0000_756D for 12 bytes.
0000_7571 for 13 bytes.
0000_7570 for 14 bytes.
0000_7570 for 15 bytes.

With movd mm0, edx
Loop Align Test:

0000_69A7 for 0 bytes.
0000_61EF for 1 byte.
0000_62F5 for 2 bytes.
0000_61E3 for 3 bytes.
0000_62D9 for 4 bytes.
0000_6559 for 5 bytes.
0000_6552 for 6 bytes.
0000_65B2 for 7 bytes.
0000_674D for 8 bytes.
0000_7A8D for 9 bytes.
0000_79AA for 10 bytes.
0000_7944 for 11 bytes.
0000_7954 for 12 bytes.
0000_7950 for 13 bytes.
0000_7A90 for 14 bytes.
0000_79DC for 15 bytes.

Now the Athlon happily handles everything that is aligned to fit completely within the 16 byte boundary with no penalty. (Well actually, it executes the mmx instruction in the same cycle as the sub, but it has a longer latency for the mmx result, thus the longer duration of even the 1-8 displacement compared to the previous tests. A long integer instruction would probably have executed in the same time as the previous tests. Both Pentium III and Athlon execute the jnz in a separate cycle from the sub, whereas the Pentium 4 apparently executes the jnz in the same cycle.) Being in the same 16 byte boundary, the jnz does not need a separate decode cycle. Note again also the effect on 0 displacement.

So, code for the Athlon (which can decode/execute up to three integer instructions per clock cycle), and everything will be optimal for other processors. That is, ensure that the targets of jumps, branches and calls avoid 16-byte boundaries: let the first three instructions at the target of a jump, call or branch all fit within one 16-byte block.
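
One way to apply this rule by hand is to work out how much padding a branch target needs. The Python sketch below does that under the assumption that padding is cheap and that instruction lengths are known from a listing file. The function name `pad_needed` is made up for illustration.

```python
def pad_needed(offset, first_three_lengths, block=16):
    """Bytes of padding to insert before a branch target at `offset`
    (address modulo block) so that its first three instructions fit
    inside one aligned block. Returns 0 if they already fit."""
    span = sum(first_three_lengths)
    if span > block:
        raise ValueError("the three instructions are longer than one block")
    offset %= block
    if offset + span <= block:
        return 0
    return block - offset  # move the target to the start of the next block

# An 8-byte group (3 + 3 + 2) starting at offset 4 already fits.
print(pad_needed(4, [3, 3, 2]))   # 0
# Starting at offset 9 it would cross, so pad 7 bytes to reach the boundary.
print(pad_needed(9, [3, 3, 2]))   # 7
```

Padding only when a crossing would occur keeps the code smaller than a blanket `align 16` before every label.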

I optimized a compute bound program based on this information with very long (46-63 instruction) loops, and got one or two percent speed improvement as a result.

The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: dsouza123 on February 14, 2006, 10:49:03 PM