My goal is to write a well-encapsulated matrix function rather than inlined code. I am getting fairly strange errors when I pass a float array to a function and then load it into a register:
float m1[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
Transpose4x4(m1);
_asm
{
movups xmm0,m1
}
Transpose4x4 does exactly the same thing as the inlined assembly after it, but it loads garbage data into xmm0 instead of 0,1,2,3 as expected. I spent about 30 minutes trying different things and cannot figure out what is going on.
From the transpose method:
void Transpose4x4(float *pMat)
{
_asm
{
movups xmm0,pMat
}
}
Register output:
XMM00 = +1.83601E-039 XMM01 = +0.00000E+000 XMM02 = +0.00000E+000 XMM03 = +1.#QNANE+000
From the locally inlined code:
XMM00 = +0.00000E+000 XMM01 = +1.00000E+000 XMM02 = +2.00000E+000 XMM03 = +3.00000E+000
It's probably the way you set up your macro: you left off the #define. You can also do it as a procedure and tell the compiler to inline it, and then it's equivalent to a macro. You can also use MOVAPS instead of MOVUPS (much faster), since the data is 16-byte aligned.
Mark
I've tried removing the alignment macro, which is just __declspec(align(16)); that's not the issue. An inlined function might work, although inlining is just a hint to the compiler and not a guarantee. I would still like to know what is causing this problem, however.
Quote from: softwareguy256 on November 29, 2006, 07:38:46 PM
I've tried removing the alignment macro, which is just __declspec(align(16)); that's not the issue. An inlined function might work, although inlining is just a hint to the compiler and not a guarantee. I would still like to know what is causing this problem, however.
I guess I didn't explain myself well. The Transpose macro wasn't done right. You need to use #define, use multi-line continuation, and get rid of the semicolon.
#define Transpose4x4(m1) \
_asm \
{ \
movups xmm0,m1 \
}
Using MOVAPS was a suggestion to make your macro faster, not a suggestion to make it work correctly. MOVAPS is almost twice as fast as MOVUPS (10 cycles for MOVUPS versus 6 cycles for MOVAPS). Those timings are for an Intel P4 processor. PSHUFD is even faster for moving data from memory to a register (4 cycles on a P4). On AMD, MOVAPS is a lot faster relative to a P4. I use PSHUFD if I have a P4 processor; if I have AMD, I use MOVAPS.
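To make the alignment requirement concrete, here is a minimal sketch using SSE intrinsics rather than inline assembly. The helper names IsAligned16 and LoadRowAligned are hypothetical and not from this thread. _mm_load_ps is the intrinsic for an aligned load (the MOVAPS form) and faults on an address that is not 16-byte aligned, so the sketch checks the address first.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps

// Hypothetical helper: true when p sits on a 16-byte boundary.
inline bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: aligned load of four floats (MOVAPS semantics).
// The caller must pass a 16-byte aligned address; the assert documents that.
inline __m128 LoadRowAligned(const float* pRow)
{
    assert(IsAligned16(pRow));
    return _mm_load_ps(pRow);
}
```

With an array declared as alignas(16) float m1[16] (or __declspec(align(16)) on older VC++), every row start m1 + 0, m1 + 4, m1 + 8 and m1 + 12 satisfies the check. An address such as m1 + 1 does not, and would need _mm_loadu_ps (the MOVUPS form) instead.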
The inline procedure was likewise a suggestion rather than a fix: if you do it inline, it's a bit easier to maintain. If you use "__forceinline" it will ALWAYS inline the function unless you are building a debug version (there are a few other reasons it won't inline, but I've never had a problem; you can read the entire list here: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/_pluslang_inline_specifier.asp). That's what I use (assuming you use VC++):
__forceinline void Transpose4x4(float *pMat)
{
_asm
{
movups xmm0,pMat
}
}
Try writing the Transpose4x4 function you posted in a test app in a default VC++ console project. It won't work.
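One likely explanation for the garbage values, offered as a reading of the symptoms rather than a confirmed diagnosis: inside the function, pMat is a pointer variable, and in VC++ inline assembly a variable name refers to that variable's own storage. So movups xmm0,pMat reads 16 bytes starting at the pointer itself (the address value plus whatever sits next to it on the stack), not the floats it points to. The first lane in the register dump, +1.83601E-039, is consistent with a small stack address reinterpreted as a float. In the caller, m1 is the array itself, so the same instruction reads the data. The usual inline-assembly fix is to dereference explicitly: mov eax,pMat followed by movups xmm0,[eax].

The sketch below shows the same idea with SSE intrinsics, which take the pointer and dereference it for you. LoadFirstRow is a hypothetical helper added for illustration, and this Transpose4x4 body is an assumption about what the finished function might look like, not the original poster's code.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_storeu_ps, _MM_TRANSPOSE4_PS

// Hypothetical helper: copies the first four floats of the matrix into out.
// _mm_loadu_ps reads the memory that pMat points to (an unaligned load),
// which is what "mov eax,pMat / movups xmm0,[eax]" does in inline assembly.
inline void LoadFirstRow(const float* pMat, float out[4])
{
    assert(pMat != nullptr);
    __m128 row = _mm_loadu_ps(pMat);
    _mm_storeu_ps(out, row);
}

// Sketch of an encapsulated in-place transpose of a row-major 4x4 matrix.
inline void Transpose4x4(float* pMat)
{
    assert(pMat != nullptr);
    // Load the four rows through the pointer.
    __m128 r0 = _mm_loadu_ps(pMat + 0);
    __m128 r1 = _mm_loadu_ps(pMat + 4);
    __m128 r2 = _mm_loadu_ps(pMat + 8);
    __m128 r3 = _mm_loadu_ps(pMat + 12);
    // Standard macro from xmmintrin.h: transposes the four registers in place.
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    // Store the transposed rows back over the original data.
    _mm_storeu_ps(pMat + 0, r0);
    _mm_storeu_ps(pMat + 4, r1);
    _mm_storeu_ps(pMat + 8, r2);
    _mm_storeu_ps(pMat + 12, r3);
}
```

Called as Transpose4x4(m1) with the m1 array from the first post, this leaves the first row as 0,4,8,12. Because the function works through the pointer it is given, it behaves the same whether or not the compiler decides to inline it.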