I'm reading "Computer Architecture: A Quantitative Approach" and came across something I found unclear. The author states:
"Misalignment causes hardware complications, since memory is typically aligned on a word or double-word boundary. A misaligned memory access will, therefore, take multiple aligned memory references."
Does this mean the processor has to access every aligned memory location that the misaligned access overlaps, and then extract the proper bits to get at the data stored at the misaligned address? I'm a bit foggy on what the author is saying here! Thanks for any information! :bg
Yes.
It gets even worse when you write to an unaligned location, as you can deduce.
But caches and write queues generally hide much of the performance loss, and nowadays memory buses are 64-bit or 128-bit anyway (only SSE can really take advantage of the latter). I'm just not sure whether there are CPUs with 256-bit buses.
It's pretty straightforward stuff: the hardware accesses memory in its native word size, on native-word-size alignment. If you want a DWORD that straddles a 4-byte boundary, you get two memory accesses to read it instead of the one you would get if it were aligned.
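To make the "two accesses instead of one" point concrete, here is a rough sketch in C (not MASM, and not how any particular CPU is wired). It assumes a made-up memory model, word_memory, that can only be touched one aligned 32-bit word at a time, and it counts the aligned accesses that a little-endian DWORD load or store needs. The names word_memory, load_dword and store_dword are invented for this illustration. The store is modeled as a read-modify-write of both words, which is a pessimistic assumption: real buses usually have byte-enable lines, so the reads may not be needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_COUNT 8

/* Hypothetical model: memory reachable only as aligned 32-bit words. */
typedef struct {
    uint32_t words[WORD_COUNT]; /* backing store, 32 bytes */
    unsigned accesses;          /* aligned word accesses performed so far */
} word_memory;

static uint32_t read_word(word_memory *m, size_t index) {
    assert(index < WORD_COUNT);
    m->accesses++;
    return m->words[index];
}

static void write_word(word_memory *m, size_t index, uint32_t value) {
    assert(index < WORD_COUNT);
    m->accesses++;
    m->words[index] = value;
}

/* Load a little-endian DWORD from any byte address. */
uint32_t load_dword(word_memory *m, size_t addr) {
    size_t index = addr / 4;
    unsigned shift = (unsigned)(addr % 4) * 8; /* bit offset inside the word */
    uint32_t low = read_word(m, index);
    if (shift == 0)
        return low;                            /* aligned: one access */
    uint32_t high = read_word(m, index + 1);   /* misaligned: second access */
    return (low >> shift) | (high << (32 - shift));
}

/* Store a little-endian DWORD to any byte address. */
void store_dword(word_memory *m, size_t addr, uint32_t value) {
    size_t index = addr / 4;
    unsigned shift = (unsigned)(addr % 4) * 8;
    if (shift == 0) {
        write_word(m, index, value);           /* aligned: one access */
        return;
    }
    /* Misaligned: read both words, merge the new bytes in, write both back. */
    uint32_t mask = 0xFFFFFFFFu << shift;      /* low-word bits that get replaced */
    uint32_t low = read_word(m, index);
    uint32_t high = read_word(m, index + 1);
    low = (low & ~mask) | (value << shift);
    high = (high & mask) | (value >> (32 - shift));
    write_word(m, index, low);
    write_word(m, index + 1, high);
}
```

With words[0] = 0x44332211 and words[1] = 0x88776655, load_dword(&m, 0) costs one access, while load_dword(&m, 1) costs two and returns 0x55443322. A misaligned store_dword costs four accesses in this model, which is the "even worse" case for writes.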
This is a quick, crude test:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
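    aligned     dd 100 dup(0)   ; assumed to start on a 4-byte boundary
                db 0            ; 1 pad byte: next array sits at offset 1 mod 4
    misaligned1 dd 100 dup(0)
                db 0            ; next array sits at offset 2 mod 4
    misaligned2 dd 100 dup(0)
                db 0            ; next array sits at offset 3 mod 4
    misaligned3 dd 100 dup(0)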
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000              ; pause before timing starts

    ; ---- 100 DWORD loads from each array ----
    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, aligned+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, aligned",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, misaligned1+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned1",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, misaligned2+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned2",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov eax, misaligned3+N*4
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned3",13,10,13,10

    ; ---- 100 DWORD stores to each array ----
    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov aligned+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, aligned",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov misaligned1+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned1",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov misaligned2+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned2",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      N=0
      REPEAT 100
        mov misaligned3+N*4, eax
        N=N+1
      ENDM
    counter_end
    print ustr$(eax)," cycles, misaligned3",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Results on my P3:
96 cycles, aligned
142 cycles, misaligned1
137 cycles, misaligned2
141 cycles, misaligned3
110 cycles, aligned
272 cycles, misaligned1
255 cycles, misaligned2
272 cycles, misaligned3
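Sempron 3000+, DDR400 @ bad timings (cycles per 100 dwords, in the same order as the test above):
eax = 48 (reading, aligned)
eax = 97 (reading, misaligned1)
eax = 97 (reading, misaligned2)
eax = 98 (reading, misaligned3)
eax = 46 (writing, aligned)
eax = 95 (writing, misaligned1)
eax = 95 (writing, misaligned2)
eax = 95 (writing, misaligned3)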
Pretty consistent, thanks to good write-queues.
movq (MMX) takes:
56 cycles (so 28 cycles per 100 dwords) on reading 100 aligned qwords,
144 cycles on reading misaligned qwords
112 cycles on writing aligned qwords (56 cycles per 100 dwords)
146 cycles on writing misaligned qwords
SSE takes:
200 cycles (so 50 cycles per 100 dwords) on reading 100 aligned owords (via movaps)
203 cycles on reading aligned owords (but with movups)
295 cycles on reading misaligned owords
212 cycles on writing aligned owords with movaps
213 cycles on writing aligned owords with movups
444 cycles on writing misaligned owords
The SSE results look wrong, but I triple-checked, and I also tried using all the xmm registers to avoid possible stalls: same results. It suggests my Sempron has a 64-bit bus to memory [or is it actually the bus to the L1/L2?] and doesn't accelerate SSE to the expected levels. It also suggests that it optimizes queued aligned DWORD stores quite well (even though half of the stored DWORDs are not QWORD-aligned ;) )
[edit: fixed-up my explanations from "48 cycles/dword" to "48 cycles per 100 dwords" and so on. Man, these cpus are beasts]
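For anyone who wants to compare the DWORD, MMX and SSE numbers on the same footing, the "per 100 dwords" figures in parentheses are just the total cycles scaled by how many dwords were actually moved. A small C sketch of that arithmetic (the function name cycles_per_100_dwords is made up for this post):

```c
#include <stddef.h>

/* Scale a measured cycle count to "cycles per 100 dwords".
   elements          = number of loads or stores timed (100 in the tests)
   bytes_per_element = 4 for dword, 8 for qword (movq), 16 for oword (SSE) */
double cycles_per_100_dwords(double total_cycles, size_t elements,
                             size_t bytes_per_element) {
    double dwords = (double)(elements * bytes_per_element) / 4.0;
    return total_cycles * 100.0 / dwords;
}
```

For example, cycles_per_100_dwords(56, 100, 8) gives 28 and cycles_per_100_dwords(200, 100, 16) gives 50, matching the figures quoted above. By the same arithmetic the misaligned reads come out at 72 for movq (144 cycles) and 73.75 for SSE (295 cycles).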
Wow, thanks for your time on this, everyone! :bg Hopefully my study of Computer Architecture will sharpen my assembly skills (it's not enough just to know algorithms in this language!). Thanks again.