

Passing SSE memory pointers across functions

Started by softwareguy256, November 29, 2006, 04:13:29 PM



My goal is to write a well-encapsulated matrix function as opposed to inlined code.  I am getting fairly strange errors when passing a float array to a function and then loading it into a register:

   float m1[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
      movups xmm0,m1

Transpose does exactly the same thing as the inlined assembly after it, but it loads garbage data into xmm0 instead of 0,1,2,3 as expected.  I spent about 30 minutes trying various things and cannot figure out what is going on.

From the transpose method:
void Transpose4x4(float *pMat)
      movups xmm0,pMat
Register output:
XMM00 = +1.83601E-039 XMM01 = +0.00000E+000 XMM02 = +0.00000E+000 XMM03 = +1.#QNANE+000

From the locally inlined code:
XMM00 = +0.00000E+000 XMM01 = +1.00000E+000 XMM02 = +2.00000E+000 XMM03 = +3.00000E+000
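
One plausible reading of the two register dumps, which the thread itself does not confirm: in VC++ inline assembly, a C variable name used as an operand refers to that variable's own storage. For the local array m1 that storage is the float data, but for the parameter pMat it is the stack slot holding the address, so movups xmm0,pMat would load the pointer's bits and whatever follows them on the stack. The usual fix in 32-bit VC++ inline assembly is to load the pointer into a register first (mov eax,pMat, then movups xmm0,[eax]). The sketch below shows the same distinction in portable C++; the helper names are hypothetical and not from the original posts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies the first row by following the pointer's
// VALUE, the C++ counterpart of `mov eax,pMat` / `movups xmm0,[eax]`.
void LoadRowThroughPointer(const float *pMat, float row[4]) {
    assert(pMat != nullptr && row != nullptr);
    std::memcpy(row, pMat, 4 * sizeof(float));
}

// Hypothetical helper: reports whether the parameter's own storage is a
// different location from the matrix it points to. An operand spelled
// `pMat` addresses the former, not the latter.
bool ParameterLivesElsewhere(const float *pMat) {
    const void *slot = &pMat;  // where the pointer variable itself is stored
    const void *data = pMat;   // where the floats are stored
    return slot != data;
}

// Hypothetical helper: size of the object the parameter name refers to.
// It is one pointer, not sixteen floats.
std::size_t ParameterSize(const float *pMat) {
    return sizeof(pMat);
}
```

Called with the m1 array from the first post, LoadRowThroughPointer yields 0,1,2,3, while ParameterSize reports the size of a pointer rather than the 64 bytes that sizeof(m1) gives in the caller.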



Probably the way you set up your macro: you left off the #define.  You can also do it as a procedure and then just tell the compiler to inline the procedure, and then it's equivalent to a macro.  You can also use MOVAPS instead of MOVUPS (much faster), since the data is 16-byte aligned.

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage


I've tried removing the alignment macro, which is just __declspec(align(16)); that's not the issue.  An inlined function might work, although that is just a hint to the compiler and not a guarantee.  I would still like to know what is causing this problem, however.


Quote from: softwareguy256 on November 29, 2006, 07:38:46 PM
I've tried removing the alignment macro, which is just __declspec(align(16)); that's not the issue.  An inlined function might work, although that is just a hint to the compiler and not a guarantee.  I would still like to know what is causing this problem, however.

I guess I didn't explain myself well.  The Transpose macro wasn't done right.  You need to use #define, use multi-line continuation, and get rid of the semicolon.

#define Transpose4x4(m1) \
   _asm \
   { \
      movups xmm0,m1 \

Using MOVAPS was a suggestion to make your macro faster, not a suggestion to make it work correctly.  MOVAPS is almost twice as fast as MOVUPS (10 cycles for MOVUPS and 6 cycles for MOVAPS).  Those timings are for an Intel P4 processor.  PSHUFD is even faster for moving data from memory to a register (4 cycles on a P4).  MOVAPS on AMD is a lot faster relative to a P4.  I use PSHUFD if I have a P4 processor.  If I have AMD, I use MOVAPS.
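
The MOVAPS suggestion depends on the address being 16-byte aligned; an aligned load from an unaligned address faults rather than just running slower. Below is a minimal sketch, assuming an x86 target with SSE intrinsics (with a plain copy as a fallback elsewhere), that checks alignment before choosing between the aligned and unaligned load. IsAligned16, LoadRow and SKETCH_HAVE_SSE are hypothetical names for illustration, not part of the thread.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // _mm_load_ps, _mm_loadu_ps, _mm_storeu_ps
#define SKETCH_HAVE_SSE 1
#endif

// True when p sits on a 16-byte boundary, which MOVAPS (and the
// _mm_load_ps intrinsic) requires of its memory operand.
bool IsAligned16(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Copies four floats from src to dst. Uses the aligned load only when
// src permits it and returns true when the aligned path was taken.
bool LoadRow(const float *src, float dst[4]) {
    assert(src != nullptr && dst != nullptr);
    const bool aligned = IsAligned16(src);
#ifdef SKETCH_HAVE_SSE
    __m128 row = aligned ? _mm_load_ps(src)    // MOVAPS-style load
                         : _mm_loadu_ps(src);  // MOVUPS-style load
    _mm_storeu_ps(dst, row);
#else
    std::memcpy(dst, src, 4 * sizeof(float));  // portable fallback, no SSE
#endif
    return aligned;
}
```

A matrix declared with alignas(16) (the standard C++11 counterpart of __declspec(align(16))) takes the aligned path for each row, while a pointer one float past the start of it takes the unaligned path.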

Same with doing an inline procedure.  If you do it inline, it's a bit easier to maintain.  If you use "__forceinline", it will ALWAYS inline the function unless you are building a debug version (there are a few other reasons it won't inline, listed in the compiler documentation, but I've never had a problem).  That's what I use (assuming you use VC++):

__forceinline void Transpose4x4(float *pMat)
      movups xmm0,pMat
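
As a hedged sketch of the forced-inline approach outside VC++: the macro below maps to __forceinline on MSVC and to the always_inline attribute on GCC/Clang. SKETCH_FORCEINLINE is a hypothetical name, and the plain C++ loop body is only a stand-in for the SSE code that the thread leaves out, not the poster's implementation.

```cpp
#include <cassert>

#if defined(_MSC_VER)
#define SKETCH_FORCEINLINE __forceinline
#elif defined(__GNUC__)
#define SKETCH_FORCEINLINE inline __attribute__((always_inline))
#else
#define SKETCH_FORCEINLINE inline  // plain hint on other compilers
#endif

// In-place transpose of a row-major 4x4 matrix. The scalar swap loop is
// an illustrative stand-in for the SSE body omitted in the thread.
SKETCH_FORCEINLINE void Transpose4x4(float *pMat) {
    assert(pMat != nullptr);
    for (int r = 0; r < 4; ++r) {
        for (int c = r + 1; c < 4; ++c) {
            const float tmp = pMat[r * 4 + c];
            pMat[r * 4 + c] = pMat[c * 4 + r];
            pMat[c * 4 + r] = tmp;
        }
    }
}
```

Because the function takes a pointer and indexes through it, it reads the matrix data itself whether or not the compiler inlines the call.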


Try writing the Transpose4x4 function you posted in a test app in a default VC++ console project.  It won't work.