Misaligned Memory Access?

Started by cman, April 29, 2008, 09:06:07 PM

cman

I'm reading "Computer Architecture: A Quantitative Approach" and came upon something I found unclear. The author states:

Quote
Misalignment causes hardware complications, since memory is typically aligned on a word or double-word boundary. A misaligned memory access will, therefore, take multiple aligned memory references.

Does this mean the processor will have to access all aligned memory locations that contain a misaligned address and then extract the proper bits to access the data contained in the misaligned address? I'm a bit foggy on what the author is saying here! Thanks for any information! :bg

u

Yes.
It gets even worse when you write to an unaligned location, as you can deduce.
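
To make the write case concrete, here is a minimal sketch of what a misaligned DWORD store amounts to if only aligned DWORD accesses are available: two read-modify-write cycles instead of one plain write. Real hardware does this in the bus/cache logic, not with instructions; the buffer name buf and the test values are made up purely for illustration.

```asm
    include \masm32\include\masm32rt.inc
    .data
      align 4
      buf dd 0AAAAAAAAh, 0BBBBBBBBh   ; two aligned DWORDs (hypothetical test data)
    .code
start:
    mov eax, 12345678h        ; value to store at byte offset 1 (misaligned)

    ; low aligned DWORD: keep its byte 0, replace bytes 1..3
    mov ecx, buf              ; read
    and ecx, 000000FFh        ; keep the byte that must not change
    mov edx, eax
    shl edx, 8                ; low 3 bytes of the value move up one byte
    or  ecx, edx              ; merge
    mov buf, ecx              ; write back

    ; high aligned DWORD: replace byte 0, keep bytes 1..3
    mov ecx, buf+4            ; read
    and ecx, 0FFFFFF00h       ; keep the bytes that must not change
    mov edx, eax
    shr edx, 24               ; top byte of the value moves down
    or  ecx, edx              ; merge
    mov buf+4, ecx            ; write back

    mov esi, buf              ; esi = 345678AAh
    mov edi, buf+4            ; edi = 0BBBBBB12h
    print hex$(esi),13,10
    print hex$(edi),13,10
    inkey "Press any key to exit..."
    exit
end start
```

A single "mov dword ptr [buf+1], eax" leaves the same bytes in memory; the sketch only spells out the extra work hidden behind it.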

But caches and write queues generally hide most of the performance loss, and anyway memory buses nowadays are 64-bit and 128-bit wide (where only SSE can really shine). I'm just not sure whether there are CPUs with 256-bit buses.

hutch--

It's pretty straightforward stuff: the hardware does memory access in its native word size, on the native word-size alignment. If you want a DWORD that straddles a 4-byte boundary, you get two memory accesses to read it instead of the one you would get if it were aligned.
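
As a rough illustration of those two accesses, the sketch below assembles a DWORD that straddles a 4-byte boundary from two aligned reads and compares it with the direct misaligned load. It only models the idea in software; the buffer name buf and its contents are assumptions chosen for the example.

```asm
    include \masm32\include\masm32rt.inc
    .data
      align 4
      buf db 11h,22h,33h,44h,55h,66h,77h,88h   ; two aligned DWORDs (hypothetical data)
    .code
start:
    ; The DWORD at buf+1 straddles the 4-byte boundary at buf+4.
    mov esi, dword ptr [buf]      ; first aligned access:  44332211h
    mov edi, dword ptr [buf+4]    ; second aligned access: 88776655h
    shrd esi, edi, 8              ; drop 1 low byte, shift 1 byte in from edi
                                  ; esi = 55443322h
    mov ebx, dword ptr [buf+1]    ; the same DWORD loaded directly: 55443322h

    print hex$(esi)," assembled from two aligned reads",13,10
    print hex$(ebx)," read directly from buf+1",13,10
    inkey "Press any key to exit..."
    exit
end start
```

Both lines should show the same value; the shift count is 8 times the byte offset within the aligned DWORD, so offsets 2 and 3 would use 16 and 24.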

MichaelW

This is a quick, crude test:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      aligned     dd 100 dup(0)
      db 0
      misaligned1 dd 100 dup(0)
      db 0
      misaligned2 dd 100 dup(0)
      db 0
      misaligned3 dd 100 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, aligned+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, aligned",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, misaligned1+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned1",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, misaligned2+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned2",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, misaligned3+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned3",13,10,13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov aligned+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, aligned",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov misaligned1+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned1",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov misaligned2+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned2",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov misaligned3+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned3",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Results on my P3 (the first group is the reads, the second group is the writes):

96 cycles, aligned
142 cycles, misaligned1
137 cycles, misaligned2
141 cycles, misaligned3

110 cycles, aligned
272 cycles, misaligned1
255 cycles, misaligned2
272 cycles, misaligned3

u

Sempron 3000+, DDR400 @ bad timing

eax = 48 (cycles for reading 100 dwords)
eax = 97
eax = 97
eax = 98

eax = 46
eax = 95
eax = 95
eax = 95

Pretty consistent, thanks to good write-queues.

movq (MMX) takes:
56 cycles (so 28 cycles per 100 dwords) on reading 100 aligned qwords,
144 cycles on reading misaligned qwords
112 cycles on writing aligned qwords (56 cycles per 100 dwords)
146 cycles on writing misaligned qwords

SSE takes:
200 cycles (so 50 cycles per 100 dwords) on reading 100 aligned owords (via movaps)
203 cycles on reading aligned owords (but with movups)
295 cycles on reading misaligned owords

212 cycles on writing aligned owords with movaps
213 cycles on writing aligned owords with movups
444 cycles on writing misaligned owords

The SSE results look wrong, but I triple-checked, and tried using all the xmm registers to avoid possible stalls: same results. It just proves my Sempron has a 64-bit bus to memory [or is it actually the bus to the L1/L2?] and that it doesn't accelerate SSE to the expected levels. And that it optimizes queued aligned DWORD stores quite well (even though half of the stored DWORDs are not QWORD-aligned ;) )

[edit: fixed up my explanations from "48 cycles/dword" to "48 cycles per 100 dwords" and so on. Man, these CPUs are beasts]

cman

Wow, thanks for your time on this, everyone! :bg Hopefully my study of Computer Architecture will sharpen my assembly skills (it's not enough just to know algorithms in this language!). Thanks again.