Passing SSE memory pointers across functions

softwareguy256 · November 29, 2006, 04:13:29 PM

My goal is to write a well-encapsulated matrix function as opposed to inlined code. I am getting fairly strange errors when passing a float array to a function and then loading it into a register:

   float m1[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
   Transpose4x4(m1);
   _asm
   {
      movups xmm0,m1
   }

Transpose4x4 does exactly the same thing as the inlined assembly after it, but it loads garbage data into xmm0 instead of 0,1,2,3 as expected. I spent about 30 minutes trying various things and cannot figure out what is going on.

From the transpose method:

void Transpose4x4(float *pMat)
{
   _asm
   {
      movups xmm0,pMat
   }
}

Register output:
XMM00 = +1.83601E-039 XMM01 = +0.00000E+000 XMM02 = +0.00000E+000 XMM03 = +1.#QNANE+000

From the locally inlined code:
XMM00 = +0.00000E+000 XMM01 = +1.00000E+000 XMM02 = +2.00000E+000 XMM03 = +3.00000E+000
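
A likely explanation, offered as a sketch rather than a confirmed diagnosis: in MSVC inline assembly a C variable name stands for that variable's own storage. m1 is an array, so "movups xmm0,m1" reads the floats themselves; pMat is a pointer, so "movups xmm0,pMat" reads 16 bytes starting at the pointer variable on the stack. The first lane in the dump above, +1.83601E-039, is roughly what a 32-bit stack address near 0x0013FE00 looks like when its bits are reinterpreted as a float, which fits this reading. The usual inline-asm remedy is to load the pointer into a general register first and then dereference it. The C sketch below shows the same idea with the SSE intrinsic _mm_loadu_ps, which dereferences the pointer it is given; load_first_row is a hypothetical helper name, not something from the thread.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_storeu_ps */

/* Hypothetical helper: loads the four floats that pMat POINTS TO and copies
 * them to out. The MSVC inline-asm equivalent would be roughly:
 *     mov    eax, pMat      ; eax = the address held in the pointer
 *     movups xmm0, [eax]    ; load the data stored at that address
 * whereas "movups xmm0, pMat" reads the pointer variable itself.
 */
static void load_first_row(const float *pMat, float out[4])
{
    __m128 row = _mm_loadu_ps(pMat);  /* unaligned load through the pointer */
    _mm_storeu_ps(out, row);          /* write the lanes back for inspection */
}
```

Called with the m1 array from the post, out should receive the first four elements of m1 rather than the bytes of the pointer.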

Mark_Larson · November 29, 2006, 04:19:36 PM

Probably the way you set up your macro: you left off the #define. You can also write it as a procedure and tell the compiler to inline it, which makes it equivalent to a macro. You can also use MOVAPS instead of MOVUPS (much faster), since the data is 16-byte aligned.

Mark

softwareguy256 · November 29, 2006, 07:38:46 PM

I've tried removing the alignment macro, which is just __declspec(align(16)); that's not the issue. An inlined function might work, although inlining is just a hint to the compiler and not a guarantee. I would still like to know what is causing this problem, however.

Mark_Larson · November 29, 2006, 08:28:02 PM

Quote from: softwareguy256 on November 29, 2006, 07:38:46 PM
I've tried removing the alignment macro, which is just __declspec(align(16)); that's not the issue. An inlined function might work, although inlining is just a hint to the compiler and not a guarantee. I would still like to know what is causing this problem, however.

I guess I didn't explain myself well. The Transpose macro wasn't done right: you need to use #define, add multi-line continuation, and get rid of the semicolon.

#define Transpose4x4(m1) \
   _asm \
   { \
      movups xmm0,m1 \
   }

Using MOVAPS was a suggestion to make your macro faster, not a suggestion to make it work correctly. MOVAPS is almost twice as fast as MOVUPS (10 cycles for MOVUPS and 6 cycles for MOVAPS). Those timings are for an Intel P4 processor. PSHUFD is even faster for moving data from memory to a register (4 cycles on a P4). MOVAPS on AMD is a lot faster relative to a P4. I use PSHUFD if I have a P4 processor. If I have AMD, I use MOVAPS.
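
To make the alignment requirement concrete, here is a minimal C sketch using intrinsics rather than inline assembly. _mm_load_ps corresponds to MOVAPS and needs a 16-byte-aligned address, while _mm_loadu_ps corresponds to MOVUPS and accepts any address. The helper names is_aligned16 and load_row are made up for illustration, and the sketch makes no claim about cycle counts on any particular CPU.

```c
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* _mm_load_ps, _mm_loadu_ps, _mm_storeu_ps */

/* Hypothetical helper: nonzero when p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: loads four floats from src and copies them to out.
 * Uses the aligned load (MOVAPS) only when the address allows it and
 * falls back to the unaligned load (MOVUPS) otherwise. */
static void load_row(const float *src, float out[4])
{
    __m128 row = is_aligned16(src) ? _mm_load_ps(src)
                                   : _mm_loadu_ps(src);
    _mm_storeu_ps(out, row);
}
```

With an array declared __declspec(align(16)) as in the thread, rows 0, 4, 8 and 12 take the aligned path; an address such as m1 + 1 takes the unaligned one.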

The same goes for doing it as an inline procedure. If you do it inline, it's a bit easier to maintain. If you use "__forceinline" it will ALWAYS inline the function unless you are building a debug version (there are a few other reasons it won't inline, but I've never had a problem; you can read the entire list here: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/_pluslang_inline_specifier.asp). That's what I use (assuming you use VC++).

__forceinline void Transpose4x4(float *pMat)
{
   _asm
   {
      movups xmm0,pMat
   }
}

softwareguy256 · November 29, 2006, 09:24:01 PM

Try writing the Transpose4x4 function you posted in a test app in a default VC++ console project. It won't work.
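
For readers who want an encapsulated transpose without inline assembly, one alternative (not what either poster used) is the _MM_TRANSPOSE4_PS macro from <xmmintrin.h>. The sketch below assumes a row-major 4x4 matrix stored in 16 consecutive floats and uses unaligned loads and stores, so it does not depend on the array being 16-byte aligned. Because the intrinsics take the pointer as an ordinary C argument, the data is reached through the pointer whether or not the compiler inlines the function.

```c
#include <xmmintrin.h>  /* __m128, _mm_loadu_ps, _mm_storeu_ps, _MM_TRANSPOSE4_PS */

/* Intrinsics variant of Transpose4x4 (a sketch, assuming row-major storage):
 * transposes the 4x4 float matrix at pMat in place. */
static void Transpose4x4(float *pMat)
{
    /* Load the four rows through the pointer. */
    __m128 r0 = _mm_loadu_ps(pMat + 0);
    __m128 r1 = _mm_loadu_ps(pMat + 4);
    __m128 r2 = _mm_loadu_ps(pMat + 8);
    __m128 r3 = _mm_loadu_ps(pMat + 12);

    /* Transpose the rows in registers; the macro rewrites r0..r3. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    /* Store the transposed rows back over the original matrix. */
    _mm_storeu_ps(pMat + 0, r0);
    _mm_storeu_ps(pMat + 4, r1);
    _mm_storeu_ps(pMat + 8, r2);
    _mm_storeu_ps(pMat + 12, r3);
}
```

If the matrix is known to be 16-byte aligned, the loads and stores could be swapped for _mm_load_ps and _mm_store_ps.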
