Alignment

Started by Posit, May 14, 2005, 11:01:10 AM

Posit

Alright, after seeing a 42% speed increase in one procedure simply by aligning a single label on a 4-byte boundary, I'm sold on alignment. In Assembly Optimization Tips, Mark Larson says he aligns 2 byte data on 2 byte boundaries, 4 on 4, 8 on 8, etc., and he mentions aligning code on 4 byte boundaries. I'm hoping a couple of the gurus here can discuss their general approach to alignment, how they align what, and when it is overkill, before I go wild and start aligning everything in sight on 32-byte boundaries.

James Ladd

When I make a proc I always precede it with "align 4".
When I make a structure I try to make the byte count divisible by 4.

hutch--

Posit,

It's pretty simple stuff once you get the swing of it. Modern hardware reads 32-bit blocks no matter what the data size is, and it reads them along 4-byte boundaries. Byte data is 1-byte aligned, word data is 2-byte aligned and dword data is 4-byte aligned.

This is basically the notion of natural alignment based on the size of the data, and it matters for the larger data sizes like QWORD and OWORD as well.

With the simple data types this example may help to make sense of it.


0123-0123-0123-0123-0123-0123-0123-0123


Assuming that the beginning of this memory block is at least 4-byte aligned, you read a DWORD as "0123". If you don't, and end up crossing a 4-byte boundary with something like "3-012", the processor takes 2 reads to get the single 4 bytes, so it is slower.

BYTE data can be any single byte.

WORD data should be "01" or "23"

DWORD data should only be "0123".

Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Posit

Thanks striker, I'll do that for procedures from now on.

The guidelines for aligning data seem simple enough. How about code though? Is there any reason to align the target of jumps on more than a 4 byte boundary, for instance?

James Ladd

I go with the suggestions so far, like "align 4" for procs and making structures pad out to a 4 byte boundary.
Trying to work out if jumps should be aligned right now is probably overkill.
Unless of course the only thing left to do for your application right now is optimisation at this level.

AeroASM


hutch--


AeroASM

Whoops, I meant, why not "12"?

Mark Jones

So does "align x" apply only to the next item or to the entire scope? I.e., is the second align here redundant?


align 4
    lea eax, myvar
align 4
@@:
    mov dword ptr [foo],eax
    inc al
    jz @B


Same with data, do you need to align any different-sized elements, like this?


.data
align 4
    myvar  DB  0
align 8
    math   DQ  0
align 2
    count  DW  0


If that's the case, is there any way to make MASM do this automatically? :)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Posit

Not specific to MASM, but the Intel documentation recommends arranging data from largest to smallest to help keep track of alignment. For example, if a QWORD aligned on 8 bytes is followed by a DWORD, the DWORD will automatically be aligned on 4 bytes.

AeroASM

I don't understand how you can align the "entire scope".

Align x just means make the next byte fall on a boundary of x. If you understand org, then align x means org (the next memory location divisible by x). Thus anything you want on an x-byte boundary must be preceded by its own align.

hutch--

Posit is right here. If you have to stack variables without wasting space, start with the biggest ones; the rest, placed in descending order of size, will also be aligned.

MichaelW

I'm not sure that this is all valid and correct.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive
    .MMX

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\user32.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    includelib \masm32\lib\user32.lib
    include \masm32\macros\macros.asm
    include timers.asm

    _EMMS equ 1
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
        aligned_mmx64     dd 0,0
        aligned_real8     REAL8 0.0
        aligned_mmx32     dd 0
        aligned_real4     REAL4 0.0
        aligned_dword     dd 0
        aligned_word      dw 0
        db 0              ; misalign by 1 byte
        misaligned_mmx64  dd 0,0
        misaligned_real8  REAL8 0.0
        misaligned_mmx32  dd 0
        misaligned_real4  REAL4 0.0
        misaligned_dword  dd 0
        dw 0              ; misalign across dword boundary
        misaligned_word   dw 0
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT    EQU 10000000
    REPEAT_COUNT  EQU 100

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      mov   eax,OFFSET aligned_mmx64
      REPEAT REPEAT_COUNT
        movq  mm0,[eax]
        movq  [eax],mm0
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("aligned_mmx64    : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        fld   aligned_real8
        fstp  aligned_real8
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("aligned_real8    : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      mov   eax,OFFSET aligned_mmx32
      REPEAT REPEAT_COUNT
        movd  mm0,[eax]
        movd  [eax],mm0
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("aligned_mmx32    : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        fld   aligned_real4
        fstp  aligned_real4
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("aligned_real4    : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov   eax,aligned_dword
        mov   aligned_dword,eax
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("aligned_dword    : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov   ax,aligned_word
        mov   aligned_word,ax
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("aligned_word     : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      mov   eax,OFFSET misaligned_mmx64
      REPEAT REPEAT_COUNT
        movq  mm0,[eax]
        movq  [eax],mm0
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("misaligned_mmx64 : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        fld   misaligned_real8
        fstp  misaligned_real8
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("misaligned_real8 : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      mov   eax,OFFSET misaligned_mmx32
      REPEAT REPEAT_COUNT
        movd  mm0,[eax]
        movd  [eax],mm0
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("misaligned_mmx32 : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        fld   misaligned_real4
        fstp  misaligned_real4
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("misaligned_real4 : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov   eax,misaligned_dword
        mov   misaligned_dword,eax
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("misaligned_dword : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov   ax,misaligned_word
        mov   misaligned_word,ax
      ENDM
    counter_end
    mov   ebx,eax
    print chr$("misaligned_word  : ")
    print ustr$(ebx)
    print chr$(" cycles",13,10)

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Results on my P3:

aligned_mmx64    : 498 cycles
aligned_real8    : 499 cycles
aligned_mmx32    : 497 cycles
aligned_real4    : 365 cycles
aligned_dword    : 498 cycles
aligned_word     : 358 cycles
misaligned_mmx64 : 1709 cycles
misaligned_real8 : 1001 cycles
misaligned_mmx32 : 1001 cycles
misaligned_real4 : 315 cycles
misaligned_dword : 1001 cycles
misaligned_word  : 312 cycles



eschew obfuscation

Mark Jones

Another interesting test! Here's an AMD XP 1800+


aligned_mmx64    : 719 cycles
aligned_real8    : 763 cycles
aligned_mmx32    : 720 cycles
aligned_real4    : 754 cycles
aligned_dword    : 447 cycles
aligned_word     : 586 cycles
misaligned_mmx64 : 2191 cycles
misaligned_real8 : 2210 cycles
misaligned_mmx32 : 2174 cycles
misaligned_real4 : 753 cycles
misaligned_dword : 1545 cycles
misaligned_word  : 585 cycles

thomasantony

Hi,
   It seems misalignment is better for WORD and REAL4 data.

Thomas
There are 10 types of people in the world. Those who understand binary and those who don't.

