Align by 4 or 16?

Started by Seb, November 21, 2006, 08:29:08 PM

Seb

Well, hello again! :bg

I've got a quick question I'd like to get an answer to. I've seen functions and labels aligned by both 4 and 16 bytes, but I've never really understood why people align differently. I've read Mark's great article on optimization, in which he says he usually aligns his code by 4 bytes. On the other hand, if you jump to the second page of the InString Speed Test thread and download the latest ZIP file, you'll see that both lingo and Hutch align their code by 16 bytes. I've aligned my own code by both 4 and 16 bytes, and there's usually no difference in speed for me, but sometimes the 16-byte aligned function performs better by a few clocks, so I ask myself and the people here: why? If you've got a good paper on alignment in general to point me at, that'd be appreciated, too. Thanks.

While I'm on the subject of alignment, I've got another question pushed onto the stack. ::) Say I have a byte array, like this:


.data?

szPath db MAX_PATH+1 dup(?)


What would I align it by? 4? 16? Or would I leave it unaligned?

Regards,
Seb

hutch--

Seb,

It's very much a "suck it and see" approach. The idea behind 16-byte aligned procedures is to place the target on the next boundary. While it only occasionally makes a difference, when you are timing multiple algorithms you want to make sure that a preceding one does not affect the following one, so you align both to 16 bytes.

I have seen instances where aligned labels in source code make the algo slower, so it's always worth the effort to test the results.

What really IS important is to align data if you are reading it in larger than 1-byte pieces, otherwise the processor may have to make two reads to get the data.
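
To picture the data case, here is a rough MASM sketch (not from the post itself; `buf` and the offsets are made up for illustration, and it assumes the usual MASM32 `.486` / `.model flat, stdcall` prologue). How much the misaligned read actually costs depends on the processor, so it is still worth timing:

```asm
.data
align 4
buf db 16 dup(0)                 ; hypothetical buffer, starts on a 4-byte boundary

.code
    mov eax, dword ptr [buf]     ; address is a multiple of 4: one aligned DWORD read
    mov eax, dword ptr [buf+1]   ; bytes 1..4 span two aligned DWORDs: misaligned read
    mov eax, dword ptr [buf+4]   ; next multiple of 4: aligned again
```
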
Download site for MASM32: https://masm32.com
New MASM Forum: https://masm32.com/board/index.php

donkey

Hi Seb,

Alignment is critical and depends on a number of factors, but in general you can get by aligning data to its size: BYTEs = no alignment, WORDs = ALIGN 2, DWORDs = ALIGN 4, QWORDs = ALIGN 8. I usually align structures to 16 bytes, but 4 will do in most cases. You should also consider code alignment, especially where loops are concerned; this can have an even bigger impact on execution speed than data alignment. Another thing to consider is the cache line size of the processor: it is always preferable to fit a tight loop into a single cache line (L1 cache), which can significantly increase execution speed. Take for example my MMX version of lstrlenA. You might wonder why I have inserted NOPs in the code; this is strictly to align the instructions to specific boundaries and thereby increase throughput. Try timing it with and without the NOPs and you can see the difference. A really interesting test is to try it on different processors and see if that affects run time...

lszLenMMX FRAME pString

mov eax,[pString]
nop
nop ; fill in stack frame+mov to 8 bytes

pxor mm0,mm0
nop ; fill pxor to 4 bytes
pxor mm1,mm1
nop ; fill pxor to 4 bytes

: ; this is aligned to 16 bytes
movq mm0,[eax]
pcmpeqb mm0,mm1
add eax,8
pmovmskb ecx,mm0
or ecx,ecx
jz <

sub eax,[pString]

bsf ecx,ecx
sub eax,8
add eax,ecx

emms


   RET

ENDF
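
For the "align data to its size" rule, a minimal MASM-syntax sketch could look like the following. All names are hypothetical and this is only one way to lay things out; the short loop at the end just shows where an ALIGN would go in code, and it assumes the usual MASM32 prologue:

```asm
.data
bFlag   db 0            ; BYTE: no alignment needed
align 2
wCount  dw 0            ; WORD on a 2-byte boundary
align 4
dwSize  dd 0            ; DWORD on a 4-byte boundary
align 8
qwTicks dq 0            ; QWORD on an 8-byte boundary
align 16
rcWork  dd 4 dup(0)     ; stand-in for a 16-byte structure, on a 16-byte boundary

.code
    mov ecx, 1000
align 16                ; assembler pads here so the loop entry sits on a 16-byte boundary
count_loop:
    dec ecx
    jnz count_loop
```

Any filler the assembler inserts before `count_loop` is executed once on the way in, which is usually negligible next to the loop itself, but as with everything here it is worth timing both ways.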


Donkey
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

ecube

Great question. I think this information should be added to Mark's page or to an asm optimization guide in general, where it's readily available so everyone benefits. I didn't know about the NOPs, but I do remember wondering why you did that, donkey.

Seb

Hi guys,

thanks for your answers. I'm starting to get it now, but if any of you have more information on that "NOP" stuff, I'd greatly appreciate it if you shared it. :U

donkey:

Thanks for that code. I translated it to MASM and tested it: the version with NOPs was 9-10 clocks faster each time. I've attached the sample code if anyone else wishes to try.


35 timing lszLenMMX - Result: 43
44 timing lszLenMMX_ - Result: 43
35 timing lszLenMMX - Result: 43
44 timing lszLenMMX_ - Result: 43
35 timing lszLenMMX - Result: 43
44 timing lszLenMMX_ - Result: 43
35 timing lszLenMMX - Result: 43
44 timing lszLenMMX_ - Result: 43
35 timing lszLenMMX - Result: 43
45 timing lszLenMMX_ - Result: 43
35 timing lszLenMMX - Result: 43
45 timing lszLenMMX_ - Result: 43
35 timing lszLenMMX - Result: 43
45 timing lszLenMMX_ - Result: 43
35 timing lszLenMMX - Result: 43
44 timing lszLenMMX_ - Result: 43
----------------------
35 average lszLenMMX
44 average lszLenMMX_
----------------------


Regards,
Seb

[attachment deleted by admin]

Mark_Larson

Quote from: hutch-- on November 21, 2006, 11:48:29 PM
Seb,

It's very much a "suck it and see" approach. The idea behind 16-byte aligned procedures is to place the target on the next boundary. While it only occasionally makes a difference, when you are timing multiple algorithms you want to make sure that a preceding one does not affect the following one, so you align both to 16 bytes.

I have seen instances where aligned labels in source code make the algo slower, so it's always worth the effort to test the results.

What really IS important is to align data if you are reading it in larger than 1-byte pieces, otherwise the processor may have to make two reads to get the data.

I actually do both 4 and 16 for code alignment; I try both. And like Hutch said, timing is very important because sometimes the aligned code will be slower. I also align procedure entries on a 64-byte boundary if they need to be fast, and time that as well to make sure.
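
One way to make the "try both" approach painless is to drive the alignment from a single equate and re-time after changing it. This is only a sketch with hypothetical names (`ALGO_ALIGN`, `AlgoA`, `AlgoB`), not taken from anyone's actual test code, and it only covers the 4 and 16 cases under the usual MASM32 prologue:

```asm
ALGO_ALIGN equ 16           ; hypothetical switch: set to 4, rebuild, and time again

.code

align ALGO_ALIGN
AlgoA proc                  ; placeholder for the first algorithm under test
    xor eax, eax
    ret
AlgoA endp

align ALGO_ALIGN            ; realign so AlgoA's length cannot shift AlgoB's entry
AlgoB proc                  ; placeholder for the second algorithm under test
    mov eax, 1
    ret
AlgoB endp
```

Aligning every entry point the same way also covers the timing concern raised earlier in the thread: a change in the size of one routine no longer moves the start of the next.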

I also recently added a macro that adds up to 64 bytes of padding to my data section if it isn't already over 64 bytes, so that the data and code don't share the same cache line (which causes the code to slow down).


BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

zooba

Quote from: Mark_Larson on November 22, 2006, 03:38:27 PM
I also recently added a macro that adds up to 64 bytes of padding to my data section if it isn't already over 64 bytes, so that the data and code don't share the same cache line (which causes the code to slow down).

Won't Windows allocate the data and code on separate pages anyway?