I am trying to replace the first option below using an aligned access:
movdqu xmm4, [esi+1]
The idea was to get a further chunk...
movdqa xmm7, [esi+16] ; load another 16 bytes, aligned
... and then to shuffle a copy of xmm1 and xmm7
movdqa xmm4, xmm1
pshufb xmm4, xmm7, imm8
That doesn't work; there is no pshufb in SSE2. All of the workarounds below work, but they are much slower than the simple unaligned access. Any ideas?
@@: movdqa xmm1, [esi] ; load 16 bytes from current aligned address
if 0
movdqu xmm4, [esi+1] ; load another 16 bytes - 2500 cycles on Celeron M
elseif 1
movdqa xmm4, xmm1 ; works but slooooow, +370 ... +860 cycles more
psrldq xmm4, 1 ; shift out the lowest byte
if 0 ; +860 cycles
pextrw eax, xmm4, 7 ; extract the highest word
mov ah, [esi+16] ; replace hibyte with a byte from memory
pinsrw xmm4, eax, 7 ; insert the highest word
elseif 1 ; +375 cycles, best workaround
movzx eax, word ptr [esi+16-1] ; get two bytes from memory
pinsrw xmm4, eax, 7 ; insert the highest word
else
pinsrw xmm4, word ptr [esi+16-1], 7 ; insert highest word from memory, +390
endif
else
movdqa xmm7, [esi+16] ; load another 16 bytes, aligned
movdqa xmm4, xmm1 ; works but slooooow, +470
pslldq xmm7, 15 ; keep only the lowest byte, i.e. [esi+16], moved to the top position
psrldq xmm4, 1 ; shift out the lowest byte
por xmm4, xmm7 ; combine: xmm4 now holds bytes [esi+1] ... [esi+16]
endif
... do stuff...
jmp @B
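For reference only, a note on the shuffle idea at the top: the instruction that concatenates two registers and shifts by an immediate is palignr, and it is SSSE3, so it is no help on an SSE2-only machine. A minimal sketch of what it would look like, assuming xmm1 holds the chunk at [esi] and xmm7 the chunk at [esi+16]:
movdqa xmm4, xmm7 ; SSSE3 only: the destination supplies the high half of the pair
palignr xmm4, xmm1, 1 ; concatenate xmm4:xmm1, shift right by 1 byte -> bytes [esi+1] ... [esi+16]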
hi,
I don't think there is a way to beat movdqu in a single access, because it does the same stuff you do "by hand" in one instruction. This may differ on sequential memory access...
regards, qWord
Well, at least the last option could do it entirely in registers, i.e. with only one aligned memory access:
movdqa xmm1, xmm6 ; get [esi] from previous loop
movdqa xmm7, [esi+16] ; load another 16 bytes, aligned
movdqa xmm6, xmm7 ; save [esi+16] for the next loop
movdqa xmm4, xmm1 ; works but slooooow, +470
pslldq xmm7, 15
psrldq xmm4, 1
por xmm4, xmm7
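For completeness, a minimal sketch of how that register-only variant might sit in the loop. The priming load before the label and the add esi, 16 are assumptions on my part, not taken from the code above:
movdqa xmm6, [esi] ; prime xmm6 with the first aligned 16 bytes
@@: movdqa xmm1, xmm6 ; the [esi] chunk, carried over from the previous iteration
movdqa xmm7, [esi+16] ; the only memory access per iteration
movdqa xmm6, xmm7 ; save [esi+16] for the next iteration
movdqa xmm4, xmm1
pslldq xmm7, 15 ; keep only byte [esi+16], moved to the top position
psrldq xmm4, 1 ; shift out the lowest byte
por xmm4, xmm7 ; xmm4 = bytes [esi+1] ... [esi+16]
; ... do stuff ...
add esi, 16
jmp @B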
But thanks anyway. I wasn't really expecting a miracle solution... :wink:
EDIT: Just tested it on a P4. In contrast to the poor performance on the Celeron M, the code above is a lot faster on the P4 - about 350 cycles less (ca. 7%, but the loop has some more elements).