I am trying to replace the first option below using an aligned access:
movdqu xmm4, [esi+1]
The idea was to get a further chunk...
movdqa xmm7, [esi+16] ; load another 16 bytes, aligned
... and then to shuffle a copy of xmm1 and xmm7
movdqa xmm4, xmm1
pshufb xmm4, xmm7, imm8
That doesn't work; there is no pshufb in SSE2. All of the workarounds below work, but they are much slower than the simple unaligned access. Any ideas?
@@: movdqa xmm1, [esi] ; load 16 bytes from current aligned address
if 0
movdqu xmm4, [esi+1] ; load another 16 bytes - 2500 cycles on Celeron M
elseif 1
movdqa xmm4, xmm1 ; works but slooooow, +370 ... +860 cycles more
psrldq xmm4, 1 ; shift out the lowest byte
if 0 ; +860 cycles
pextrw eax, xmm4, 7 ; extract the highest word
mov ah, [esi+16] ; replace hibyte with a byte from memory
pinsrw xmm4, eax, 7 ; insert the highest word
elseif 1 ; +375 cycles, best workaround
movzx eax, word ptr [esi+16-1] ; get two bytes from memory
pinsrw xmm4, eax, 7 ; insert the highest word
else
pinsrw xmm4, word ptr [esi+16-1], 7 ; insert highest word from memory, +390
endif
else
movdqa xmm7, [esi+16] ; load another 16 bytes, aligned
movdqa xmm4, xmm1 ; works but slooooow, +470
pslldq xmm7, 15 ; keep only the lowest byte, i.e. [esi+16], moved to the top position
psrldq xmm4, 1 ; shift out the lowest byte
por xmm4, xmm7 ; combine: xmm4 now holds bytes [esi+1] ... [esi+16]
endif
... do stuff...
jmp @B
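For reference only, a note on the shuffle idea at the top: the instruction that concatenates two registers and shifts by an immediate is palignr, and it is SSSE3, so it is no help on an SSE2-only machine. A minimal sketch of what it would look like, assuming xmm1 holds the chunk at [esi] and xmm7 the chunk at [esi+16]:
movdqa xmm4, xmm7 ; SSSE3 only: the destination supplies the high half of the pair
palignr xmm4, xmm1, 1 ; concatenate xmm4:xmm1, shift right by 1 byte -> bytes [esi+1] ... [esi+16]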
hi,
I don't think there is a way to beat movdqu in a single access, because it does the same stuff you do "by hand" in one instruction. This may differ on sequential memory access...
regards, qWord
Well, at least the last option could do it entirely in registers, i.e. with only one aligned memory access:
movdqa xmm1, xmm6 ; get [esi] from previous loop
movdqa xmm7, [esi+16] ; load another 16 bytes, aligned
movdqa xmm6, xmm7 ; save [esi+16] for the next loop
movdqa xmm4, xmm1 ; works but slooooow, +470
pslldq xmm7, 15
psrldq xmm4, 1
por xmm4, xmm7
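For completeness, a minimal sketch of how that register-only variant might sit in the loop. The priming load before the label and the add esi, 16 are assumptions on my part, not taken from the code above:
movdqa xmm6, [esi] ; prime xmm6 with the first aligned 16 bytes
@@: movdqa xmm1, xmm6 ; the [esi] chunk, carried over from the previous iteration
movdqa xmm7, [esi+16] ; the only memory access per iteration
movdqa xmm6, xmm7 ; save [esi+16] for the next iteration
movdqa xmm4, xmm1
pslldq xmm7, 15 ; keep only byte [esi+16], moved to the top position
psrldq xmm4, 1 ; shift out the lowest byte
por xmm4, xmm7 ; xmm4 = bytes [esi+1] ... [esi+16]
; ... do stuff ...
add esi, 16
jmp @B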
But thanks anyway. I wasn't really expecting a miracle solution... :wink:
EDIT: Just tested it on a P4. In contrast to the poor performance on the Celeron M, the code above is a lot faster on the P4 - about 350 cycles less (ca. 7%, but the loop has some more elements).