SSE3: lddqu macro for ML version 6.15

Started by jj2007, April 02, 2010, 09:06:37 AM


jj2007

Some of us are using MASM version 6.15 because a) it's available and b) it supports SSE2, in contrast to version 6.14 that ships with Masm32. However, one frequently used SSE3 instruction, LDDQU, is still missing:

movapd xmm1, oword ptr [esi] ; 16-byte aligned mem-to-xmm move
movupd xmm0, oword ptr [esi] ; unaligned version of movapd
lddqu  xmm0, oword ptr [esi] ; unaligned version of movapd, cache-friendly


This macro provides lddqu functionality for ML 6.15.
; Usage variants (they all produce the same code):
; ldq xmm0, oword ptr [esi] ; full version, like lddqu xmm0, oword ptr [esi]
; ldq xmm0, [esi]           ; short version, like movupd xmm0, [esi]
; ldq xmm0, esi             ; lazy version

ldq MACRO xreg, reg ; substitute for SSE3 lddqu xmm0, [esi] with ml v615
LOCAL isx, isr, isb
  isb INSTR <reg>, <[>    ; bracketed form, e.g. [esi]?
  if isb
    ; extract the three letters after the bracket, e.g. esi from [esi]
    isr INSTR <eax/ecx/edx/ebx/eSp/eBp/esi/edi>, @SubStr(<reg>, isb+1, 3)
  else
    isr INSTR <eax/ecx/edx/ebx/eSp/eBp/esi/edi>, <reg>
  endif
  ; eSp/eBp are capitalised so that esp and ebp fail the case-sensitive
  ; INSTR match: with mod=00, r/m=100 requires a SIB byte and r/m=101
  ; means a plain disp32, so neither register can be encoded here
  if isr
    isx INSTR <xmm0xmm1xmm2xmm3xmm4xmm5xmm6xmm7>, <xreg>
    if isx
      isx=(isx-1)*2       ; INSTR pos 4*N+1 becomes 8*N, the ModRM reg field
      db 0F2h, 0Fh, 0F0h  ; LDDQU opcode bytes: F2 0F F0 /r
      db isx+(isr-1)/4    ; ModRM byte, mod=00: xmmN, [reg]
    else
      echo <xreg> is not a valid xmm register
      .err
    endif
  else
    echo <reg> is not a valid register
    .err
  endif
ENDM
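
To check the macro, you can compare its output against the real mnemonic under an SSE3-aware assembler (ML 7.x or JWasm, for instance); a minimal sketch, with the expected bytes derived from the opcode and ModRM arithmetic above:

; sanity check: both lines should emit the same four bytes,
; F2 0F F0 06 (F2 0F F0 = LDDQU opcode, 06 = ModRM for xmm0, [esi])
ldq xmm0, [esi]  ; macro version, assembles with ML 6.15
; lddqu xmm0, oword ptr [esi] ; native mnemonic, SSE3-aware assembler only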


More info on lddqu:

Quote
lddqu
This instruction may improve performance relative to MOVDQU if the source operand crosses a cache line boundary. In situations that require the data loaded by LDDQU be modified and stored to the same location, use MOVDQU or MOVDQA instead of LDDQU. To move a double quadword to or from memory locations that are known to be aligned on 16-byte boundaries, use the MOVDQA instruction.

Implementation Notes - If the source is aligned to a 16-byte boundary, based on the implementation, the 16 bytes may be loaded more than once. For that reason, the usage of LDDQU should be avoided when using uncached or write-combining (WC) memory regions. For uncached or WC memory regions, keep using MOVDQU. [JJ: which means it performs well in cached benchmarks but can be slow in some types of real-life apps, e.g. when loading large files that don't fit into the cache].
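
Condensed into a rule of thumb (my summary of the quoted notes, not Intel's wording):

; picking the 128-bit load, condensed from the notes above
movdqa xmm0, [esi] ; source known to be 16-byte aligned
lddqu  xmm0, [esi] ; unaligned, load-only, normal cached memory
movdqu xmm0, [esi] ; unaligned, but the data is modified and stored
                   ; back to the same place, or memory is uncached/WC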

Cache/page lines and LDDQU
I'm a developer for the x264 video encoder.  When Loren Merritt, another developer, was testing the performance of the SAD assembly routines, he noticed that the number of clocks required varied wildly.  In particular, when the unaligned data to be loaded crossed a cache line boundary, which happened 25% of the time for blocks 16 bytes wide, up to eight times as many clock cycles were required for the entire operation, depending on the processor.  This issue existed not only for SSE2 load operations like MOVDQU, but also for MMX, and on processors going all the way back to the Pentium 3 (though obviously the cache line intervals varied); no processors older than the Pentium 3 were tested.  Page boundaries also turned up every 64 cache lines, so a load crossed one 1/256th of the time.  Data crossing a page line resulted in up to 100 times slower performance: thousands of clock cycles for a single 16x16 SAD operation that normally took 48.  However, when testing the same operations on AMD processors, we noticed no measurable penalty at all for unaligned loads across cache lines, and furthermore the page line penalty was a mere 5-10%.
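
If you want to reproduce the effect, a crude RDTSC loop is enough. The sketch below is my own illustration (not x264 code); buffer stands for any .data block of 192 bytes or more:

; crude cycle count: MOVDQU inside one 64-byte cache line vs. one
; that straddles two lines (illustration only, not x264 code)
    lea    esi, buffer
    add    esi, 63
    and    esi, -64     ; round up to the start of a cache line
    rdtsc
    mov    ebx, eax     ; low dword of the start count will do
    mov    ecx, 1000000
@@: movdqu xmm0, [esi]  ; offset 0: never crosses a line
    dec    ecx
    jnz    @B
    rdtsc
    sub    eax, ebx     ; rough cycles for the non-crossing case
    ; repeat with [esi+56]: bytes 56..71 cross the 64-byte boundary
    ; on every iteration, and the difference is the crossing penalty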

We created a workaround using PALIGNR for SSSE3 (i.e. for Core 2s) that eliminated most of the performance hit.  We would have preferred to use LDDQU, but our testing showed that LDDQU brings no benefit on Core 2s; rather, it appears to be executed as two MOVDQUs or similar, since the performance doesn't improve.  However, LDDQU works correctly on Prescotts and other NetBurst-based processors, along with the Core 1, which allowed us to implement the LDDQU-based workaround successfully on those SSE3 chips.
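
For reference, the core of such a PALIGNR workaround looks roughly like this (my reconstruction of the idea, not the actual x264 code): two aligned loads cover the misaligned 16 bytes, and PALIGNR stitches them together. Since the shift count must be an immediate, production code dispatches to one variant per possible misalignment; the sketch hard-codes an offset of 5:

; SSSE3 PALIGNR workaround, sketched (not the actual x264 code)
; goal: read 16 unaligned bytes from [esi+5] without MOVDQU/LDDQU
; assumes esi is 16-byte aligned; needs an SSSE3-capable assembler
    movdqa  xmm0, [esi]    ; aligned low  block, bytes  0..15
    movdqa  xmm1, [esi+16] ; aligned high block, bytes 16..31
    palignr xmm1, xmm0, 5  ; xmm1 = bytes 5..20 = the unaligned data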