Alright, after seeing a 42% speed increase in one procedure simply by aligning a single label on a 4-byte boundary, I'm sold on alignment. In Assembly Optimization Tips, Mark Larson says he aligns 2 byte data on 2 byte boundaries, 4 on 4, 8 on 8, etc., and he mentions aligning code on 4 byte boundaries. I'm hoping a couple of the gurus here can discuss their general approach to alignment, how they align what, and when it is overkill, before I go wild and start aligning everything in sight on 32-byte boundaries.
When I make a proc I always precede it with "align 4".
When I make a structure I try to make the count of bytes divisible by 4.
Posit,
It's pretty simple stuff once you get the swing of it. Modern hardware reads 32 bit blocks no matter what the data size is, and it reads them along 4 byte boundaries. Byte data is 1 byte aligned, word data is 2 byte aligned and dword data is 4 byte aligned.
This is basically the notion of natural alignment based on the size of the data, and it is important up into the larger data sizes like QWORD and OWORD as well.
With the simple data types this example may help to make sense of it.
0123-0123-0123-0123-0123-0123-0123-0123
Assuming that the beginning of this memory block is at least 4 byte aligned, you read a DWORD as "0123". If you don't, and end up crossing a 4 byte boundary with something like "3-012", the processor takes 2 reads to get the single 4 bytes, so it is slower.
BYTE data can be any single byte.
WORD data should be "01" or "23"
DWORD data should only be "0123".
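As a quick way to see which case an address falls into, here is a minimal sketch. The labels filler and odd_dd are made up for illustration, and the bare ret at the entry point is only there to keep the sketch short. The low two bits of an address are its position in the "0123" pattern, so masking with 3 tells you whether a DWORD starts on a boundary.

```masm
.386
.model flat, stdcall
option casemap :none

.data
        align 4
filler  db 0                    ; hypothetical 1 byte item sitting on a 4 byte boundary
odd_dd  dd 0                    ; so this DWORD starts at position "1" and spans "123-0"

.code
start:
        mov eax, OFFSET odd_dd  ; address of the DWORD
        and eax, 3              ; eax = address mod 4, 0 would mean 4 byte aligned
        ret                     ; eax is 1 here, so odd_dd crosses a 4 byte boundary
end start
```

Put an "align 4" in front of odd_dd and the same test leaves 0 in eax.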
Thanks striker, I'll do that for procedures from now on.
The guidelines for aligning data seem simple enough. How about code though? Is there any reason to align the target of jumps on more than a 4 byte boundary, for instance?
I go with the suggestions so far, like "align 4" for procs and making structures pad out to a 4 byte boundary.
Trying to work out if jumps should be aligned right now is probably overkill.
Unless of course the only thing left to do for your application right now is optimisation at this level.
Try it.
Whoops, I meant, why not "12" for WORD data?
So does "align x" apply only to the item that follows it, or to the entire scope? I.e., is the second align here redundant?
align 4
lea eax, myvar
align 4
@@:
mov dword ptr [foo],eax
inc al
jz @B
Same with data, do you need to align any different-sized elements, like this?
.data
align 4
myvar DB 0
align 8
math DQ 0
align 2
count DW 0
If that's the case, is there any way to make MASM do this automatically? :)
Not specific to MASM, but the Intel documentation recommends arranging data from largest to smallest to help keep track of alignment. For example, if a QWORD aligned on 8 bytes is followed by a DWORD, the DWORD will automatically be aligned on 4 bytes.
I don't understand how you can align the "entire scope".
Align x just means make the next byte be on a boundary of x. If you understand org, then align x means org (the next mem location divisible by x). Thus anything you want to be on an x byte boundary must be preceded by an align.
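To make the location counter view concrete, here is a minimal sketch. The labels are made up for illustration, and the offsets in the comments are relative to the first align, not to the start of the section. Each align only pads up to the item that follows it. Whatever comes after that item stays aligned only if the sizes happen to work out.

```masm
.386
.model flat, stdcall
option casemap :none

.data
        align 4
b_flag  db 1            ; offset 0, 1 byte long, so the counter is now 1
d_loose dd 2            ; offset 1: not aligned, the align above did not carry over
        align 4         ; counter is 5, gets padded up to 8
d_fixed dd 3            ; offset 8: aligned
d_after dd 4            ; offset 12: aligned, but only because d_fixed is 4 bytes long

.code
start:
        mov eax, d_loose ; DWORD read that crosses a 4 byte boundary
        mov eax, d_fixed ; DWORD read on a 4 byte boundary
        ret
end start
```

So the second align in a sequence is redundant only when the counter already happens to sit on the boundary, in which case it emits no padding at all.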
Posit is right here. If you have to stack variables without wasting space, start with the bigger ones first; the ones that follow in descending order of size will then also be aligned.
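A minimal sketch of the largest-first layout, reusing the names from the question above (math, count, myvar) plus a made-up DWORD called total. It assumes the segment alignment allows "align 8", since ALIGN cannot ask for more than the segment itself is aligned to. Offsets in the comments are relative to the align.

```masm
.386
.model flat, stdcall
option casemap :none

.data
        align 8         ; the only align needed for the whole group
math    dq 0            ; offset 0:  8 byte aligned
total   dd 0            ; offset 8:  4 byte aligned (hypothetical extra item)
count   dw 0            ; offset 12: 2 byte aligned
myvar   db 0            ; offset 14: bytes can sit anywhere

.code
start:
        mov eax, total  ; every access below is naturally aligned
        mov cx, count
        mov dl, myvar
        ret
end start
```

No padding bytes are emitted between the items, so nothing is wasted. The catch is that this only holds while every item's size is a multiple of the size of the item after it. Drop an odd-length string into the middle and you are back to needing an explicit align.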
I'm not sure that this is all valid and correct.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
.MMX
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib
include \masm32\macros\macros.asm
include timers.asm
_EMMS equ 1
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
aligned_mmx64 dd 0,0
aligned_real8 REAL8 0.0
aligned_mmx32 dd 0
aligned_real4 REAL4 0.0
aligned_dword dd 0
aligned_word dw 0
db 0 ; misalign by 1 byte
misaligned_mmx64 dd 0,0
misaligned_real8 REAL8 0.0
misaligned_mmx32 dd 0
misaligned_real4 REAL4 0.0
misaligned_dword dd 0
dw 0 ; misalign across dword boundary
misaligned_word dw 0
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
LOOP_COUNT EQU 10000000
REPEAT_COUNT EQU 100
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
mov eax,OFFSET aligned_mmx64
REPEAT REPEAT_COUNT
movq mm0,[eax]
movq [eax],mm0
ENDM
counter_end
mov ebx,eax
print chr$("aligned_mmx64 : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
fld aligned_real8
fstp aligned_real8
ENDM
counter_end
mov ebx,eax
print chr$("aligned_real8 : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
mov eax,OFFSET aligned_mmx32
REPEAT REPEAT_COUNT
movd mm0,[eax]
movd [eax],mm0
ENDM
counter_end
mov ebx,eax
print chr$("aligned_mmx32 : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
fld aligned_real4
fstp aligned_real4
ENDM
counter_end
mov ebx,eax
print chr$("aligned_real4 : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
mov eax,aligned_dword
mov aligned_dword,eax
ENDM
counter_end
mov ebx,eax
print chr$("aligned_dword : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
mov ax,aligned_word
mov aligned_word,ax
ENDM
counter_end
mov ebx,eax
print chr$("aligned_word : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
mov eax,OFFSET misaligned_mmx64
REPEAT REPEAT_COUNT
movq mm0,[eax]
movq [eax],mm0
ENDM
counter_end
mov ebx,eax
print chr$("misaligned_mmx64 : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
fld misaligned_real8
fstp misaligned_real8
ENDM
counter_end
mov ebx,eax
print chr$("misaligned_real8 : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
mov eax,OFFSET misaligned_mmx32
REPEAT REPEAT_COUNT
movd mm0,[eax]
movd [eax],mm0
ENDM
counter_end
mov ebx,eax
print chr$("misaligned_mmx32 : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
fld misaligned_real4
fstp misaligned_real4
ENDM
counter_end
mov ebx,eax
print chr$("misaligned_real4 : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
mov eax,misaligned_dword
mov misaligned_dword,eax
ENDM
counter_end
mov ebx,eax
print chr$("misaligned_dword : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
mov ax,misaligned_word
mov misaligned_word,ax
ENDM
counter_end
mov ebx,eax
print chr$("misaligned_word : ")
print ustr$(ebx)
print chr$(" cycles",13,10)
mov eax, input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Results on my P3:
aligned_mmx64 : 498 cycles
aligned_real8 : 499 cycles
aligned_mmx32 : 497 cycles
aligned_real4 : 365 cycles
aligned_dword : 498 cycles
aligned_word : 358 cycles
misaligned_mmx64 : 1709 cycles
misaligned_real8 : 1001 cycles
misaligned_mmx32 : 1001 cycles
misaligned_real4 : 315 cycles
misaligned_dword : 1001 cycles
misaligned_word : 312 cycles
Another interesting test! Here's an AMD XP 1800+
aligned_mmx64 : 719 cycles
aligned_real8 : 763 cycles
aligned_mmx32 : 720 cycles
aligned_real4 : 754 cycles
aligned_dword : 447 cycles
aligned_word : 586 cycles
misaligned_mmx64 : 2191 cycles
misaligned_real8 : 2210 cycles
misaligned_mmx32 : 2174 cycles
misaligned_real4 : 753 cycles
misaligned_dword : 1545 cycles
misaligned_word : 585 cycles
Hi,
It seems misalignment is better for WORD and REAL4 data :)
Thomas
The bigger the data size the worse the penalty for mis-aligned data. I did some code to show the alignment problems with an MMX version of a string copy routine.
http://www.masmforum.com/simple/index.php?topic=1589.45 - search for "alignment". The original code ran in 87 cycles accessing 8 bytes at a time (MMX registers). The misaligned code caused it to run in about 250-280 cycles depending on how unaligned it was. Other than the misaligned data, there are no other changes to the code.
Quote from: hutch-- on May 15, 2005, 01:58:13 AM
BYTE data can be any single byte.
WORD data should be "01" or "23"
DWORD data should only be "0123".
Intel's documentation (for the PIV) claims that word accesses of the form "12" are also fine. Also, most of the time you only get a big hit if the data object crosses a cache line.
Cheers,
Randy Hyde
OK, so now I know how to align my data and procedures.
But am I right in thinking I can align the code within a procedure using an align statement as well?
If I use this align keyword, will MASM put NOPs in the code to make it pad out?
Quote: If I use this align keyword, will MASM put NOPs in the code to make it pad out?
We beat that subject to death here:
http://www.masmforum.com/simple/index.php?topic=1622.0
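For anyone who just wants the short version, here is a minimal sketch of an align inside code. The label sumloop and the loop itself are made up for illustration, and the byte counts assume start itself lands on a 4 byte boundary. The gap is filled with bytes that execute as no-ops, and since the align sits in the fall-through path, that filler runs once on the way into the loop.

```masm
.386
.model flat, stdcall
option casemap :none

.code
start:
        xor eax, eax        ; 2 bytes
        mov ecx, 1000       ; 5 bytes, 7 so far
        align 4             ; 1 byte of filler brings the counter to 8
sumloop:                    ; loop target now sits on a 4 byte boundary
        add eax, ecx
        dec ecx
        jnz sumloop         ; the jump goes back to sumloop, past the filler
        ret
end start
```

Whether it buys you anything depends on the processor and on how hot the loop is, so time it both ways before keeping it.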
Michael, Thanks for beating it one more time :)