Misaligned Memory Access?

cman · April 29, 2008, 09:06:07 PM

I'm reading "Computer Architecture: A Quantitative Approach" and came upon something I found unclear. The authors state:

Quote
Misalignment causes hardware complications, since memory is typically aligned on a word or double-word boundary. A misaligned memory access will, therefore, take multiple aligned memory references.

Does this mean the processor will have to access all aligned memory locations that contain a misaligned address and then extract the proper bits to access the data contained in the misaligned address? I'm a bit foggy on what the author is saying here! Thanks for any information! :bg

u · April 29, 2008, 10:50:14 PM

Yes.
It gets even worse when you write to an unaligned location, as you can deduce.

But caches and write queues generally hide much of the performance loss, and anyway memory buses nowadays are 64-bit and 128-bit wide (where only SSE can really shine). I'm just not sure whether there are CPUs with 256-bit buses.
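
To illustrate why an unaligned write costs more than an unaligned read, here is a minimal C sketch. It assumes a hypothetical memory that can only transfer whole aligned 32-bit words, so a store that straddles a boundary has to read both words, merge in the new bytes, and write both back. The names `store_dword` and `word_accesses` are made up for illustration; real hardware with byte-enable signals, caches, and write queues may avoid part of this work.

```c
#include <assert.h>
#include <stdint.h>

/* Counts every aligned word transfer (read or write) in this toy model. */
unsigned word_accesses = 0;

/* Store a little-endian DWORD at any byte address, touching memory only
   in whole aligned 32-bit words (hypothetical model, not real hardware). */
void store_dword(uint32_t *mem, uint32_t byte_addr, uint32_t value)
{
    assert(mem != 0);

    uint32_t idx   = byte_addr / 4;         /* first aligned word touched   */
    unsigned shift = (byte_addr & 3u) * 8;  /* bit offset inside that word  */

    if (shift == 0) {                       /* aligned: one plain write     */
        mem[idx] = value;
        word_accesses += 1;
        return;
    }

    /* Bits of the first word that get overwritten; the same pattern
       marks the bits of the second word that must be preserved. */
    uint32_t mask = 0xFFFFFFFFu << shift;

    mem[idx]     = (mem[idx]     & ~mask) | (value << shift);
    mem[idx + 1] = (mem[idx + 1] &  mask) | (value >> (32 - shift));
    word_accesses += 4;                     /* two reads plus two writes    */
}
```

With the words 0x33221100 and 0x77665544 in memory, `store_dword(mem, 1, 0xDDCCBBAA)` leaves 0xCCBBAA00 and 0x776655DD and counts four word transfers, against one for an aligned store.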

hutch-- · April 29, 2008, 11:58:49 PM

It's pretty straightforward stuff: the hardware does memory access in its native word size, on native word size alignment. If you want a DWORD that straddles a 4-byte boundary, you get two memory accesses to read it instead of the one you would get if it were aligned.
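
The "two aligned accesses, then extract the right bits" idea can be sketched in C. This is only a model, assuming a hypothetical memory that is read one aligned 32-bit word at a time and numbered little-endian; `read_aligned_word`, `load_dword`, and `aligned_reads` are illustrative names, not anything the CPU exposes.

```c
#include <assert.h>
#include <stdint.h>

/* Number of aligned word reads performed so far in this toy model. */
unsigned aligned_reads = 0;

/* The only way to touch memory in this model: one aligned 32-bit read. */
uint32_t read_aligned_word(const uint32_t *mem, uint32_t byte_addr)
{
    assert(byte_addr % 4 == 0);
    aligned_reads++;
    return mem[byte_addr / 4];
}

/* Load a little-endian DWORD from any byte address using aligned reads. */
uint32_t load_dword(const uint32_t *mem, uint32_t byte_addr)
{
    uint32_t base  = byte_addr & ~3u;        /* aligned address at or below */
    unsigned shift = (byte_addr & 3u) * 8;   /* bit offset inside that word */

    uint32_t lo = read_aligned_word(mem, base);
    if (shift == 0)
        return lo;                           /* aligned: one read is enough */

    /* Misaligned: fetch the next word too and splice the halves together. */
    uint32_t hi = read_aligned_word(mem, base + 4);
    return (lo >> shift) | (hi << (32 - shift));
}
```

With the words 0x33221100 and 0x77665544 in memory, `load_dword(mem, 0)` needs one aligned read, while `load_dword(mem, 1)` needs two and returns 0x44332211.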

MichaelW · April 30, 2008, 07:38:39 AM

This is a quick, crude test:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
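    .data
      ; Each single pad byte shifts the following array one more byte
      ; off a dword boundary (assuming .data itself starts dword-aligned).
      aligned     dd 100 dup(0)
      db 0
      misaligned1 dd 100 dup(0)   ; address mod 4 = 1
      db 0
      misaligned2 dd 100 dup(0)   ; address mod 4 = 2
      db 0
      misaligned3 dd 100 dup(0)   ; address mod 4 = 3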
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000

    ; Read tests: 100 unrolled dword loads per timed pass.
    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, aligned+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, aligned",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, misaligned1+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned1",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, misaligned2+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned2",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, misaligned3+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned3",13,10,13,10

    ; Write tests: the same four arrays, this time storing EAX.
    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov aligned+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, aligned",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov misaligned1+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned1",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov misaligned2+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned2",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov misaligned3+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned3",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
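
The layout of the four arrays in the test can be checked with a little arithmetic, sketched here in C. It assumes `aligned` is the first item in a dword-aligned `.data` section, which the listing suggests but does not guarantee; `array_offset` and `misalignment` are illustrative helper names.

```c
#include <assert.h>

enum { DWORDS = 100, DWORD_SIZE = 4, PAD_BYTES = 1 };

/* Byte offset of array k from the start of .data:
   k = 0 is `aligned`, k = 1..3 are `misaligned1`..`misaligned3`.
   Each array is 400 bytes and is followed by one `db 0` pad byte. */
unsigned array_offset(unsigned k)
{
    assert(k <= 3);
    return k * (DWORDS * DWORD_SIZE + PAD_BYTES);
}

/* How far past a 4-byte boundary every dword of array k starts. */
unsigned misalignment(unsigned k)
{
    return array_offset(k) % DWORD_SIZE;
}
```

Under that assumption the arrays start at offsets 0, 401, 802, and 1203, so their dwords sit 0, 1, 2, and 3 bytes past a 4-byte boundary. Every dword in the three misaligned arrays therefore straddles a boundary.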

Results on my P3 (first group reads, second group writes):

96 cycles, aligned
142 cycles, misaligned1
137 cycles, misaligned2
141 cycles, misaligned3

110 cycles, aligned
272 cycles, misaligned1
255 cycles, misaligned2
272 cycles, misaligned3
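
To put the P3 numbers in proportion, the penalty can be expressed as a ratio to the aligned case. The C sketch below only does that division; `slowdown` and `cycles_per_access` are illustrative names, and the figure of 100 accesses per timed pass comes from the `REPEAT 100` in the test.

```c
#include <assert.h>

/* Ratio of a misaligned timing to the aligned baseline. */
double slowdown(unsigned misaligned_cycles, unsigned aligned_cycles)
{
    assert(aligned_cycles > 0);
    return (double)misaligned_cycles / (double)aligned_cycles;
}

/* Average cycles per access; each timed pass performs 100 moves. */
double cycles_per_access(unsigned total_cycles)
{
    return total_cycles / 100.0;
}
```

For the figures above, `slowdown(142, 96)` is about 1.48 and `slowdown(272, 110)` is about 2.47. On this P3, misaligned reads cost roughly 1.4 to 1.5 times the aligned case and misaligned writes roughly 2.3 to 2.5 times.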

u · April 30, 2008, 08:45:17 AM

Sempron 3000+, DDR400 with poor timings

eax = 48 cycles reading of 100 dwords
eax = 97
eax = 97
eax = 98

eax = 46
eax = 95
eax = 95
eax = 95

Pretty consistent, thanks to good write-queues.

movq (MMX) takes:
56 cycles (so 28 cycles per 100 dwords) on reading 100 aligned qwords,
144 cycles on reading misaligned qwords
112 cycles on writing aligned qwords (56 cycles per 100 dwords)
146 cycles on writing misaligned qwords

SSE takes:
200 cycles (so 50 cycles per 100 dwords) on reading 100 aligned owords (via movaps)
203 cycles on reading aligned owords (but with movups)
295 cycles on reading misaligned owords

212 cycles on writing aligned owords with movaps
213 cycles on writing aligned owords with movups
444 cycles on writing misaligned owords

The SSE results look wrong, but I triple-checked them and also tried using all the xmm registers to avoid possible stalls: same results. It seems to show that my Sempron has a 64-bit bus to memory [or is it actually the bus to the L1/L2?] and that it doesn't speed up SSE to the expected level. It also shows that it optimizes queued aligned DWORD stores quite well (even though half of the stored DWORDs are not QWORD-aligned ;) )

[edit: fixed-up my explanations from "48 cycles/dword" to "48 cycles per 100 dwords" and so on. Man, these cpus are beasts]
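
Since the dword, MMX, and SSE timings each move a different amount of data per instruction, it helps to normalize them to cycles per 100 dwords, as the post does by hand. A minimal C sketch of that conversion follows; `cycles_per_100_dwords` is an illustrative name.

```c
#include <assert.h>

/* Convert the cycles measured for 100 elements into cycles per 100 dwords.
   bytes_per_element is 4 for dwords, 8 for qwords (movq), and
   16 for owords (movaps/movups). */
double cycles_per_100_dwords(unsigned cycles, unsigned bytes_per_element)
{
    assert(bytes_per_element >= 4 && bytes_per_element % 4 == 0);
    return cycles / (bytes_per_element / 4.0);
}
```

For aligned reads this gives 48 for plain dwords, 28 for movq (56 cycles, 8-byte elements), and 50 for movaps (200 cycles, 16-byte elements). Per byte moved, the SSE loads come out no faster than plain dword loads on this machine, which is the oddity the post points out.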

cman · April 30, 2008, 07:22:15 PM

Wow, thanks for your time on this, everyone! :bg Hopefully my study of computer architecture will sharpen my assembly skills (it's not enough just to know algorithms in this language!). Thanks again.
