SSE2: no pshufb...

Started by jj2007, April 29, 2009, 10:01:30 PM

jj2007

I am trying to replace the first option below using an aligned access:
      movdqu xmm4, [esi+1]

The idea was to get a further chunk...
      movdqa xmm7, [esi+16]   ; load another 16 bytes, aligned
... and then to shuffle a copy of xmm1 and xmm7
      movdqa xmm4, xmm1
      pshufb xmm4, xmm7, imm8

That doesn't work: there is no pshufb in SSE2. All the workarounds below work, but they are much slower than the simple unaligned access. Any ideas?

@@: movdqa xmm1, [esi] ; load 16 bytes from current aligned address
if 0
    movdqu xmm4, [esi+1] ; load another 16 bytes - 2500 cycles on Celeron M
elseif 1
    movdqa xmm4, xmm1 ; works but slooooow, +370 ... +860 cycles more
    psrldq xmm4, 1 ; shift out the lowest byte
    if 0 ; +860 cycles
        pextrw eax, xmm4, 7 ; extract the highest word
        mov ah, [esi+16] ; replace hibyte with a byte from memory
        pinsrw xmm4, eax, 7 ; insert the highest word
    elseif 1 ; +375 cycles, best workaround
        movzx eax, word ptr [esi+16-1] ; get two bytes from memory
        pinsrw xmm4, eax, 7 ; insert the highest word
    else
        pinsrw xmm4, word ptr [esi+16-1], 7 ; insert highest word from memory, +390
    endif
else
    movdqa xmm7, [esi+16] ; load another 16 bytes, aligned
    movdqa xmm4, xmm1 ; works but slooooow, +470
    pslldq xmm7, 15
    psrldq xmm4, 1
    por xmm4, xmm7
endif
... do stuff...
jmp @B
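
For anyone who wants to check that the workarounds above really deliver the same 16 bytes as the unaligned load, here is a minimal sketch in C with SSE2 intrinsics rather than MASM, so it can be compiled and tested on its own. The function names and the test buffer are made up for illustration; each intrinsic is commented with the instruction it stands for. It only checks correctness and says nothing about cycle counts.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Reference: the plain unaligned load, i.e. movdqu xmm4, [esi+1]. */
__m128i load_unaligned_plus1(const uint8_t *p)
{
    return _mm_loadu_si128((const __m128i *)(p + 1));
}

/* Workaround with two aligned loads: psrldq / pslldq / por. */
__m128i load_shift_or(const uint8_t *p)
{
    assert(((uintptr_t)p & 15) == 0);                        /* movdqa needs 16-byte alignment */
    __m128i lo = _mm_load_si128((const __m128i *)p);         /* movdqa xmm1, [esi] */
    __m128i hi = _mm_load_si128((const __m128i *)(p + 16));  /* movdqa xmm7, [esi+16] */
    hi = _mm_slli_si128(hi, 15);                             /* pslldq xmm7, 15 */
    lo = _mm_srli_si128(lo, 1);                              /* psrldq xmm4, 1 */
    return _mm_or_si128(lo, hi);                             /* por xmm4, xmm7 */
}

/* Workaround with one aligned load plus a 16-bit insert: psrldq / pinsrw. */
__m128i load_pinsrw(const uint8_t *p)
{
    assert(((uintptr_t)p & 15) == 0);
    __m128i v = _mm_srli_si128(_mm_load_si128((const __m128i *)p), 1);
    uint16_t w;
    memcpy(&w, p + 15, sizeof w);      /* movzx eax, word ptr [esi+16-1] */
    return _mm_insert_epi16(v, w, 7);  /* pinsrw xmm4, eax, 7 */
}

/* Returns 1 if both workarounds reproduce the unaligned load byte for byte. */
int workarounds_match(void)
{
    __m128i storage[2];                /* 32 bytes, 16-byte aligned */
    uint8_t *buf = (uint8_t *)storage;
    for (int i = 0; i < 32; i++)
        buf[i] = (uint8_t)(i * 7 + 3); /* arbitrary test pattern */

    __m128i ref = load_unaligned_plus1(buf);
    int a = _mm_movemask_epi8(_mm_cmpeq_epi8(ref, load_shift_or(buf)));
    int b = _mm_movemask_epi8(_mm_cmpeq_epi8(ref, load_pinsrw(buf)));
    return a == 0xFFFF && b == 0xFFFF;
}
```

Calling `workarounds_match()` from a small `main` is enough to run the check. The pextrw/mov ah variant is left out because it builds the same high word as the movzx version.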

qWord

hi,

I don't think there is a way to beat movdqu for a single access, because it does in one instruction the same work you are doing "by hand". This may differ for sequential memory access...

regards, qWord
FPU in a trice: SmplMath
It's that simple!

jj2007

Well, at least the last option could do it entirely in registers, i.e. with only one aligned memory access:
      movdqa xmm1, xmm6   ; get [esi] from previous loop
      movdqa xmm7, [esi+16]   ; load another 16 bytes, aligned
      movdqa xmm6, xmm7   ; save for next loop
      movdqa xmm4, xmm1   ; works but slooooow, +470
      pslldq xmm7, 15
      psrldq xmm4, 1
      por xmm4, xmm7
But thanks anyway. I wasn't expecting miracle solutions... :wink

EDIT: Just tested it on a P4. In contrast to its poor performance on the Celeron M, the code above is a lot faster on the P4: about 350 cycles fewer (ca. 7%, though the loop contains some other elements as well).
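
The register-carrying idea (one aligned load per iteration, with the previous chunk kept in a register) can be sketched as a loop. This is again C with SSE2 intrinsics rather than MASM, so it can be built and checked standalone. The names `shifted_copy` and `shifted_copy_matches`, and the destination buffer, are hypothetical and not from the original loop, which does other work with xmm4. Treat it as a correctness sketch, not a timing claim.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* For each aligned 16-byte block i of src (0 <= i < nblocks), store the
   16 bytes starting at src + 16*i + 1 into dst + 16*i.
   Assumes src is 16-byte aligned and holds nblocks + 1 readable blocks. */
void shifted_copy(uint8_t *dst, const uint8_t *src, size_t nblocks)
{
    assert(((uintptr_t)src & 15) == 0);
    /* Prime the carried register (xmm6 in the snippet above). */
    __m128i cur = _mm_load_si128((const __m128i *)src);
    for (size_t i = 0; i < nblocks; i++) {
        /* movdqa xmm7, [esi+16]: the only memory read per iteration */
        __m128i next = _mm_load_si128((const __m128i *)(src + 16 * (i + 1)));
        /* psrldq xmm4, 1 / pslldq xmm7, 15 / por xmm4, xmm7 */
        __m128i out = _mm_or_si128(_mm_srli_si128(cur, 1),
                                   _mm_slli_si128(next, 15));
        _mm_storeu_si128((__m128i *)(dst + 16 * i), out);
        cur = next;  /* keep the untouched copy for the next iteration */
    }
}

/* Returns 1 if the loop output equals a plain byte copy from src + 1. */
int shifted_copy_matches(void)
{
    __m128i storage[5];                 /* 80 bytes, 16-byte aligned */
    uint8_t *src = (uint8_t *)storage;
    uint8_t dst[64];
    for (int i = 0; i < 80; i++)
        src[i] = (uint8_t)(i * 13 + 5); /* arbitrary test pattern */
    shifted_copy(dst, src, 4);
    return memcmp(dst, src + 1, 64) == 0;
}
```

Note that the copy carried over has to be taken before the left shift destroys the freshly loaded register, which is why `cur = next` uses the unshifted value.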