Quadruple dot product without haddps

Started by Neo, December 31, 2007, 07:26:25 AM


Neo

Hi!  It's been years since I've been on the MASM Forum, but I figured there's no time like the present to get back at it.  I also posted this question on ASMCommunity.net a couple of hours ago.

Lately, I've been hard at work on my editor project, PwnIDE alpha, and the video tutorial to go with it.  The tutorial walks through making a very CPU-intensive screen saver, which I'm almost done making, but I ran into a problem... my HP laptop died.  I managed to salvage the files from it using a USB hard drive enclosure, but my old laptop doesn't have SSE3, and that means that I can't do fast and simple things like:


mulps   xmm0,xmm4   ;xmm0 = P0x*Q0x,P0y*Q0y,P0z*Q0z,P0a*Q0a
mulps   xmm1,xmm5   ;xmm1 = P1x*Q1x,P1y*Q1y,P1z*Q1z,P1a*Q1a
mulps   xmm2,xmm6   ;xmm2 = P2x*Q2x,P2y*Q2y,P2z*Q2z,P2a*Q2a
mulps   xmm3,xmm7   ;xmm3 = P3x*Q3x,P3y*Q3y,P3z*Q3z,P3a*Q3a
haddps  xmm0,xmm1   ;xmm0 = partial sums for P0.Q0 and P1.Q1
haddps  xmm2,xmm3   ;xmm2 = partial sums for P2.Q2 and P3.Q3
haddps  xmm0,xmm2   ;xmm0 = P0.Q0,P1.Q1,P2.Q2,P3.Q3


There's always the brute force way of swizzling the data (equivalent to matrix transpose) after the multiplications, then doing 3 addps's, but that's definitely not the most efficient.  Any advice on good ways to approach this type of problem?
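
For reference, here's roughly what that brute-force transpose route looks like with intrinsics. Just a rough sketch, with p0..p3/q0..q3 as placeholder names for the P and Q vectors above:


#include <xmmintrin.h>

/* returns [P0.Q0, P1.Q1, P2.Q2, P3.Q3] using SSE1 only */
static inline __m128 dot4_transpose(__m128 p0, __m128 p1, __m128 p2, __m128 p3,
                                    __m128 q0, __m128 q1, __m128 q2, __m128 q3)
{
    __m128 m0 = _mm_mul_ps(p0, q0);    /* P0x*Q0x, P0y*Q0y, P0z*Q0z, P0a*Q0a */
    __m128 m1 = _mm_mul_ps(p1, q1);
    __m128 m2 = _mm_mul_ps(p2, q2);
    __m128 m3 = _mm_mul_ps(p3, q3);
    _MM_TRANSPOSE4_PS(m0, m1, m2, m3); /* the swizzle: m0=x, m1=y, m2=z, m3=a terms */
    m0 = _mm_add_ps(m0, m1);           /* the three addps's */
    m2 = _mm_add_ps(m2, m3);
    return _mm_add_ps(m0, m2);
}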

Not all cases need all 4 values; some only need the first three products to have meaningful values.  I think I've figured out a sufficiently fast way for the cases where I've got just a single dot product to do.  I'll keep the SSE3 versions of the functions in and do a CPUID check to see if the CPU can run them, else it'll run the slower versions.
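
By the way, the CPUID check is just leaf 1, ECX bit 0. A rough C sketch (the function name is mine):


#include <intrin.h>   /* MSVC intrinsic header for __cpuid */

/* returns nonzero if CPUID.01h:ECX bit 0 (SSE3) is set */
static int has_sse3(void)
{
    int info[4];
    __cpuid(info, 1);
    return (info[2] & 1) != 0;
}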

Here's a screenshot I took before my laptop died, in case people are curious:

asmfan

Interesting project. Is it open source or something like that, or is it commercial?
Talking about SSE3 with its horizontal operations - I think SSE(2) with pshuf* & shufp* can emulate everything if needed. But maybe think first about changing the algorithm. Horizontal vs. vertical operations... it depends on the data storage layout (AoS vs. SoA).
Russia is a weird place

u

AFAIK, the 3-shufps approach was the optimal one in SSE1/2. Or maybe with "movhlps" you could get some speedup. Here's some code I wrote that way (it sums the first 3 floats of a register, IIRC ^^") :

Vec4 struct
x real4 ?
y real4 ?
z real4 ?
w real4 ?
Vec4 ends

;=====[[ makeShadowSSE >>===\
makeShadowSSE proc ; edi = outVec, esi=inVec, ebx = Plane
assume esi:ptr Vec4
assume edi:ptr Vec4
assume ebx:ptr Vec4
;----[ make xmm3 = Plane.normal ]-----\
movups xmm3,[ebx]   ; xmm3 = plane (x,y,z,w)
movaps xmm0,xmm3
mulps  xmm0,xmm0    ; xmm0 = px^2,py^2,pz^2,pw^2
movaps xmm1,xmm0
movhlps xmm2,xmm0   ; xmm2.0 = pz^2
shufps xmm1,xmm1,1  ; xmm1.0 = py^2
addss xmm0,xmm2
addss xmm0,xmm1     ; xmm0.0 = px^2+py^2+pz^2
rsqrtss xmm0,xmm0
; now xmm0.0 = 1/sqrt(px^2+py^2+pz^2)
shufps xmm0,xmm0,0  ; broadcast it to all 4 lanes
mulps xmm3,xmm0 ; now xmm3 = P.normal
;-------------------------------------/
;---[ make xmm4 = dist ]---------\
movups xmm0,[esi]   ; xmm0 = inVec
movaps xmm4,xmm0
mulps  xmm4,xmm3    ; xmm4 = inVec*P.normal, componentwise
movaps xmm1,xmm4
movhlps xmm2,xmm4   ; xmm2.0 = z product
shufps xmm1,xmm1,1  ; xmm1.0 = y product
addss xmm4,xmm2
addss xmm4,xmm1     ; xmm4.0 = dot(inVec, P.normal)
addss xmm4,[ebx].w  ; + plane distance
shufps xmm4,xmm4,0  ; broadcast dist
;--------------------------------/
;----[ make outvec ]------\
mulps xmm3,xmm4     ; xmm3 = dist*P.normal
addps xmm3,xmm0     ; outVec = inVec + dist*P.normal
movups [edi],xmm3
;-------------------------/


assume esi:nothing
assume edi:nothing
assume ebx:nothing
ret
makeShadowSSE endp
;=======/


There's a coder here or on the other board, nicknamed "daydreamer", who has much more experience with SSE than most of us (if not all); hopefully he joins the discussion :) .
It's amazing how fucked-up SSE was made by Intel; and it's even more amazing how AltiVec also stumbles in this computation... and horrible how a gaming machine like PS3 also lacks a full horizontal-sum instruction in either core. While it's all quite possible and not hard to implement in silicon. 5 iterations of SSE upgrades, and Intel/AMD never understood this major flaw ...
Please use a smaller graphic in your signature.

u

Here's a recent solution, and a statement that haddps doesn't bring much improvement:
http://www.kvraudio.com/forum/viewtopic.php?p=2827383

For archival purposes, I'll quote the code:

method1)

    MOVHLPS     XMM1,XMM0         ; XMM1 low half = x2,x3
    ADDPS       XMM0,XMM1         ; XMM0 = x0+x2, x1+x3, ...
    MOVUPS      XMM1,XMM0
    SHUFPS      XMM1,XMM1,$55     ; broadcast x1+x3
    ADDPS       XMM0,XMM1         ; XMM0.0 = x0+x1+x2+x3


method2)

The most efficient is probably to group them by chunks of 4:

/*  return [a.sum() b.sum() c.sum() d.sum()] */
inline __m128 sum4(__m128 a, __m128 b, __m128 c, __m128 d) {
    /* [a0+a2 c0+c2 a1+a3 c1+c3] */
    __m128 s1 = _mm_add_ps(_mm_unpacklo_ps(a,c),_mm_unpackhi_ps(a, c));
    /* [b0+b2 d0+d2 b1+b3 d1+d3] */
    __m128 s2 = _mm_add_ps(_mm_unpacklo_ps(b,d),_mm_unpackhi_ps(b, d));
    /* [a0+a2 b0+b2 c0+c2 d0+d2]+
       [a1+a3 b1+b3 c1+c3 d1+d3] */
    return _mm_add_ps(_mm_unpacklo_ps(s1,s2),_mm_unpackhi_ps( s1,s2));
}
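
In other words (my own addition, with p0..p3/q0..q3 standing in for the vectors from the first post), the quadruple dot product becomes:

__m128 dots = sum4(_mm_mul_ps(p0, q0), _mm_mul_ps(p1, q1),
                   _mm_mul_ps(p2, q2), _mm_mul_ps(p3, q3));
/* dots = [P0.Q0, P1.Q1, P2.Q2, P3.Q3], SSE1/2 only */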

Please use a smaller graphic in your signature.

Neo

Quote from: asmfan on December 31, 2007, 09:43:40 AM
Interesting project. Is it open source or something like that, or is it commercial?
It will be open source, because part of the point is to compare the same operations done in assembly, C, and C++.  For that, people will want to see that I've written the C/C++ the way a regular C/C++ programmer would (not with tons of intrinsics, since then it might as well be assembly), and that the assembly isn't unmaintainable when written with PwnIDE.  I'm not an open source fanatic or anything, but when there's no chance of making any money on something and little chance of others stealing the credit, I figure I might as well.  I'll probably use a very loose license like MIT or BSD for the screensaver, because I really don't care what people do with the code.  :wink

Quote
Talking about SSE3 with its horizontal operations - I think SSE(2) with pshuf* & shufp* can emulate everything if needed. But maybe think first about changing the algorithm. Horizontal vs. vertical operations... it depends on the data storage layout (AoS vs. SoA).
True, but the multiplies need to be done vertically, and the adds need to be done horizontally, so I'll need to do something special.

Quote from: Ultrano on December 31, 2007, 12:43:37 PM
It's amazing how fucked-up SSE was made by Intel; and it's even more amazing how AltiVec also stumbles in this computation... and horrible how a gaming machine like PS3 also lacks a full horizontal-sum instruction in either core. While it's all quite possible and not hard to implement in silicon. 5 iterations of SSE upgrades, and Intel/AMD never understood this major flaw ...
Yeah, the quadruple dot product is also equivalent to multiplying a 4x4 matrix by a 4-element vector, so it's something that's done all the time.  Doing a matrix-vector multiply in just 7 simple instructions (6 if you don't need the last row) is pretty useful, because it'll run in about 30 clocks or better (about 26 or better without the last row).  I even figured out a way to do 4 of these matrix-vector multiplies together with the data ending up swizzled automatically so that I can do the multiply by their 1/z's without shuffling data.
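
In intrinsics terms, the 7-instruction version I mean is roughly this (just a sketch; r0..r3 are the matrix rows and v is the vector, names mine):

#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* returns [r0.v, r1.v, r2.v, r3.v] with 4 mulps + 3 haddps */
static inline __m128 mat4_vec4(__m128 r0, __m128 r1, __m128 r2, __m128 r3,
                               __m128 v)
{
    __m128 m0 = _mm_mul_ps(r0, v);
    __m128 m1 = _mm_mul_ps(r1, v);
    __m128 m2 = _mm_mul_ps(r2, v);
    __m128 m3 = _mm_mul_ps(r3, v);
    return _mm_hadd_ps(_mm_hadd_ps(m0, m1), _mm_hadd_ps(m2, m3));
}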

Quote from: Ultrano on December 31, 2007, 01:17:45 PM
Here's a recent solution, and a statement that haddps doesn't bring much improvement:
http://www.kvraudio.com/forum/viewtopic.php?p=2827383
Thanks!  :bg
I think haddps could bring a significant improvement: in the cases where I use it, I can mostly avoid temporary variables, but without it I need at least 2 temporary vector variables for the horizontal add.  I haven't tested it yet, though, so I don't know.  When you're rendering about 88,000 semi-transparent spheres at 20fps, you need all of the speed you can get.  One of the other projects I've got a friend helping with will mean that we won't need to rely on anecdotal evidence and guesses to judge performance, but that won't be ready for several months.  It should be fun.  :wink

c0d1f1ed

Quote from: Ultrano on December 31, 2007, 12:43:37 PMWhile it's all quite possible and not hard to implement in silicon. 5 iterations of SSE upgrades, and Intel/AMD never understood this major flaw ...
I don't think it's that simple. They would have to add floating-point adders to the execution units, and these would only be used for a dot product; for everything else they'd be a waste of transistors. Furthermore, this instruction would have a longer latency anyway (meaning little benefit over implementing it with separate instructions), and could jeopardize the low latency of the ordinary mulps instruction. (You can see that haddps was much easier to add, since it's only a rewiring of addps.)

Last but not least, SSE was always intended to be used with SoA data, i.e. doing four dot products in parallel. So I think it's very understandable it hasn't been added to SSE yet.

Anyway, your prayers have been heard, because AMD plans to add a dot product (and three-operand multiply-accumulate) to SSE5.

Rockoon

I ditto the SoA.

I have found that most posters at a game programming forum, when asking about SSE, are disappointed with its performance.

If you aren't using SoA then you aren't really leveraging SSE effectively. It wasn't designed for AoS, and while the later incarnations have some AoS-friendly features, they simply aren't complete.

The SoA mindset also comes in handy when working with GPUs, where you can (for instance) leverage the 112 processors of a $280 8800GT to do true massively parallel SIMD.

...

If you want to know why SSE is the way it is... think of how many ways 4-wide registers are useful with an AoS mentality. Pretty much just 4-component vectors and 4x4 matrices, right? Now think about it in terms of an SoA mentality: still useful for 4-component work, right? But also useful for 1 component, 2 components, 3 components, 999 components, etc., etc.
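
To illustrate (a sketch of mine, with the separate px/py/pz/pw and qx/qy/qz/qw arrays as the assumed SoA layout): four dot products at once with nothing but vertical multiplies and adds, no shuffles at all:

#include <xmmintrin.h>

/* out[i] = P_i . Q_i for i = 0..3; arrays assumed 16-byte aligned, 4 floats each */
void dot4_soa(const float *px, const float *py, const float *pz, const float *pw,
              const float *qx, const float *qy, const float *qz, const float *qw,
              float *out)
{
    __m128 d = _mm_mul_ps(_mm_load_ps(px), _mm_load_ps(qx));
    d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(py), _mm_load_ps(qy)));
    d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(pz), _mm_load_ps(qz)));
    d = _mm_add_ps(d, _mm_mul_ps(_mm_load_ps(pw), _mm_load_ps(qw)));
    _mm_store_ps(out, d);
}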

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.