Matrix and vector operations using FPU

Started by gabor, September 18, 2007, 10:54:32 PM


gabor

Hello coder friends!

This thread is intended to present solutions for several d3dx methods. So far I have matrix multiplication and vector normalization.
Please test it and share your results. I'd be glad to see some optimization advice; I guess other members would be happy too. (Maybe I'm too optimistic.)


Now, the short story behind it:
After I had taken a few steps into the strange and unfamiliar domain of D3D, I found myself surrounded by matrix and vector operations using floats.
I got the obvious idea to use the FPU for coding such operations, because I don't really know much about SSE and because I had heard opinions about the very good FPU performance of AMD procs (and I own an AMD Athlon). Btw, I too experienced amazing differences between P4 and Athlon FPU performance (about a 1:10 ratio!!). So I'd like to use the FPU with float (32-bit) precision. Stubborn as I am, I will surely get to an SSE implementation too.



Edit:
New zip contains the promised modifications. Enjoy! The matrix multiplication is not complete: it multiplies a matrix with a transposed matrix, which means it cannot be used to calculate the square of a matrix.


Greets, Gábor

[attachment deleted by admin]

GregL

Hi Gábor,

Since you are using arrays of REAL4 variables, I think this is a case where SSE would be the best solution, for example MULPS. It would be interesting to see what the actual timing differences are between x87 and SSE. I'm getting 168 on the vector3 normalization and 393 on the matrix multiplication. I have a Pentium D 940 (dual core, 3.2 GHz). I'm not sure the program is working correctly, though, as I'm getting all zeros on the 2nd and 3rd matrix multiply. If it were a console program, it would be much easier to copy and paste the results.

gabor

Hi Greg!

Thanks for your excellent suggestions! :U
- I'll create an SSE version for comparison,
- I'll convert the app to a console app.
The 2nd matrix in the multiplication was all zero :red, that is why zero matrices are shown...

My AMD Athlon XP 2500+ produces 82 and 209 cycles. Interesting, isn't it? According to my current investigations, the big cycle difference comes from the FPU accessing RAM (writing into it via FSTP). I'm positive this is not the case when SSE is involved. We'll see  :8)

I'll post the modified version soon...

Greets, Gábor

Draakie

Does this code make me look bloated? (wink)

Rockoon

168 cycles seems a little extreme for vec3 normalization 

Correct me if I am wrong, but I suspect the use of 3 division ops when a single division op will do.

Specifically, a normalization can be performed like this:

length = sqrt(x*x + y*y + z*z);
x /= length;
y /= length;
z /= length;

However, it can ALSO be performed like this:

reciplength = 1 / sqrt(x*x + y*y + z*z);
x *= reciplength;
y *= reciplength;
z *= reciplength;

This trades 2 expensive divisions for 3 cheap multiplications


Moving right along, the sqrt function is a big bottleneck, and SSE has a fast reciprocal square root op (it's just an approximation, but it is more than good enough for most purposes):

reciplength = rsqrt(x*x + y*y + z*z);
x *= reciplength;
y *= reciplength;
z *= reciplength;

This final version, even without SIMD (using only the scalar SSE instructions), will blaze in comparison to the original.
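In MASM terms it would come out something like this (off the top of my head, untested; vX, vY and vZ stand for the three REAL4 components):

    movss   xmm0, vX            ; load the three components
    movss   xmm1, vY
    movss   xmm2, vZ
    movss   xmm3, xmm0
    mulss   xmm3, xmm3          ; x*x
    movss   xmm4, xmm1
    mulss   xmm4, xmm4          ; y*y
    movss   xmm5, xmm2
    mulss   xmm5, xmm5          ; z*z
    addss   xmm3, xmm4
    addss   xmm3, xmm5          ; x*x + y*y + z*z
    rsqrtss xmm3, xmm3          ; approximate 1/sqrt(...)
    mulss   xmm0, xmm3          ; scale each component by the reciprocal length
    mulss   xmm1, xmm3
    mulss   xmm2, xmm3
    movss   vX, xmm0            ; store the normalized vector
    movss   vY, xmm1
    movss   vZ, xmm2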


When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

GregL

Quote: The 2nd matrix in the multiplication was all zero, that is why zero matrices are shown...
OK, I see that now.  :red

I have read that AMD is faster at x87 too, and it sure does look that way. It will be interesting to see how SSE compares.

daydreamer

Read about how to perform quaternion rotations with the help of SSE; quaternions avoid the errors that the usual rotation approach can produce when one value is zero, which is especially important if you want a 3D fly-around world to work.
If you want speed, unroll a proc so it performs many 1/sqrt(x*x+y*y+z*z) calculations at once; the better your CPU, the more mulps it can execute in parallel, rather than just one. See the sketch below.
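For example (a rough sketch of the idea, untested): store the vectors structure-of-arrays, so the x components of four vectors sit together in one aligned 16-byte block (xs, ys and zs are hypothetical ALIGN 16 REAL4 arrays), and rsqrtps then normalizes four vectors per pass:

    movaps  xmm0, xs            ; x components of four vectors
    movaps  xmm1, ys            ; y components
    movaps  xmm2, zs            ; z components
    movaps  xmm3, xmm0
    mulps   xmm3, xmm3          ; x*x for all four vectors at once
    movaps  xmm4, xmm1
    mulps   xmm4, xmm4          ; y*y
    movaps  xmm5, xmm2
    mulps   xmm5, xmm5          ; z*z
    addps   xmm3, xmm4
    addps   xmm3, xmm5          ; four sums x*x + y*y + z*z
    rsqrtps xmm3, xmm3          ; four approximate reciprocal lengths
    mulps   xmm0, xmm3          ; scale all components
    mulps   xmm1, xmm3
    mulps   xmm2, xmm3
    movaps  xs, xmm0            ; store four normalized vectors back
    movaps  ys, xmm1
    movaps  zs, xmm2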

GregL

I don't claim to be good at this SSE stuff, but this is what I came up with for the matrix multiply. I made matrix3 the destination so I could see something happening.


.686
.MODEL FLAT,STDCALL
OPTION CASEMAP:NONE
.XMM

INCLUDE kernel32.inc
INCLUDELIB kernel32.lib

FLOAT TYPEDEF REAL4

D3DMATRIX STRUCT
_11 FLOAT ?
_12 FLOAT ?
_13 FLOAT ?
_14 FLOAT ?
_21 FLOAT ?
_22 FLOAT ?
_23 FLOAT ?
_24 FLOAT ?
_31 FLOAT ?
_32 FLOAT ?
_33 FLOAT ?
_34 FLOAT ?
_41 FLOAT ?
_42 FLOAT ?
_43 FLOAT ?
_44 FLOAT ?
D3DMATRIX ENDS

.DATA

    ALIGN 16
    matrix1 D3DMATRIX <1.0,2.0,3.0,4.0, 2.0,4.0,6.0,8.0, 3.0,6.0,9.0,12.0, 4.0,8.0,12.0,16.0>
    matrix2 D3DMATRIX <1.0,2.0,3.0,4.0, 2.0,4.0,6.0,8.0, 3.0,6.0,9.0,12.0, 4.0,8.0,12.0,16.0>
    matrix3 D3DMATRIX <>

.CODE

start:

    call main
    INVOKE ExitProcess, eax

main PROC

    movaps xmm0, matrix1._11    ; load the four rows of matrix1
    movaps xmm1, matrix1._21
    movaps xmm2, matrix1._31
    movaps xmm3, matrix1._41

    movaps xmm4, matrix2._11    ; load the four rows of matrix2
    movaps xmm5, matrix2._21
    movaps xmm6, matrix2._31
    movaps xmm7, matrix2._41

    mulps xmm0, xmm4            ; multiply the rows element-wise
    mulps xmm1, xmm5
    mulps xmm2, xmm6
    mulps xmm3, xmm7

    movaps matrix3._11, xmm0    ; store the four result rows
    movaps matrix3._21, xmm1
    movaps matrix3._31, xmm2
    movaps matrix3._41, xmm3
   
    ret
main ENDP

END start


NightWare

hi,

It depends on the use, but mulps XMMx, Mem is generally advantageous, because there aren't enough XMM registers (since I suppose you're not going to stop your work with this matrix here...).

GregL

NightWare,

Yeah, that is better.


    movaps xmm0, matrix1._11
    movaps xmm1, matrix1._21
    movaps xmm2, matrix1._31
    movaps xmm3, matrix1._41
   
    mulps xmm0, matrix2._11
    mulps xmm1, matrix2._21
    mulps xmm2, matrix2._31
    mulps xmm3, matrix2._41
   
    movaps matrix3._11, xmm0
    movaps matrix3._21, xmm1
    movaps matrix3._31, xmm2
    movaps matrix3._41, xmm3


gabor

Hi folks!

Nice posts, thanks a lot!

Rockoon!
Thanks for your valuable notice! Adding one fdiv to calculate the reciprocal and then using fmul instead of fdiv 3 times really did speed things up.
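The change looks something like this (a rough sketch, not my exact code; vX, vY and vZ stand for the three REAL4 components):

    fld    vX
    fmul   st, st(0)            ; x*x
    fld    vY
    fmul   st, st(0)            ; y*y
    faddp  st(1), st
    fld    vZ
    fmul   st, st(0)            ; z*z
    faddp  st(1), st            ; x*x + y*y + z*z
    fsqrt                       ; length
    fld1
    fdivrp st(1), st            ; the single fdiv: 1.0/length
    fld    vX
    fmul   st, st(1)            ; fmul replaces fdiv from here on
    fstp   vX
    fld    vY
    fmul   st, st(1)
    fstp   vY
    fld    vZ
    fmulp  st(1), st            ; last fmul also pops the reciprocal
    fstp   vZ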

Greg!
I'm afraid your suggestion for the matrix mul is not correct, because according to your code the matrices (for 2D) would look like this:


matrix1=a b   matrix2=x y   matrix3=ax by
        d e           u v           du ev

and what we need is

matrix3=ax+bu ay+bv
        dx+eu dy+ev

I'm working on it too, I'll post it soon (I hope) with the modifications I promised. The usual SSE trick seems to be broadcasting each element of a row with shufps and accumulating; for one row it would look roughly like the sketch below (just the idea, untested, assuming the matrices are ALIGN 16 as in Greg's code):
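    ; row 0 of matrix3 = row 0 of matrix1 times matrix2
    movss  xmm0, matrix1._11
    shufps xmm0, xmm0, 0        ; broadcast element _11 to all 4 lanes
    mulps  xmm0, matrix2._11    ; _11 * row 0 of matrix2
    movss  xmm1, matrix1._12
    shufps xmm1, xmm1, 0
    mulps  xmm1, matrix2._21    ; _12 * row 1 of matrix2
    addps  xmm0, xmm1
    movss  xmm1, matrix1._13
    shufps xmm1, xmm1, 0
    mulps  xmm1, matrix2._31    ; _13 * row 2 of matrix2
    addps  xmm0, xmm1
    movss  xmm1, matrix1._14
    shufps xmm1, xmm1, 0
    mulps  xmm1, matrix2._41    ; _14 * row 3 of matrix2
    addps  xmm0, xmm1
    movaps matrix3._11, xmm0    ; repeat the pattern for rows 1..3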

Greets, Gábor

GregL

Gábor,

As you can tell, I have done very little with Direct3D. I thought it was just a regular matrix multiply. I need to do some reading on the subject.

  ...

I see how it goes now.



GregL

Gábor,

Looks like you got it. Good job.

Here are the results I get (Pentium D 940). How does it test on your AMD Athlon?

Testing vector3 normalization with FPU
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 93

Testing vector3 normalization with SSE
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 29


Testing matrix multiplication with FPU
1.00    2.00    3.00    4.00
2.00    4.00    6.00    8.00
3.00    6.00    9.00    12.00
4.00    8.00    12.00   16.00

0.10    0.20    0.30    0.40
0.20    0.40    0.60    0.80
0.30    0.60    0.90    1.20
0.40    0.80    1.20    1.60

3.00    6.00    9.00    12.00
6.00    12.00   18.00   24.00
9.00    18.00   27.00   36.00
12.00   24.00   36.00   48.00

cycle count 279

Testing matrix multiplication with SSE
1.00    2.00    3.00    4.00
2.00    4.00    6.00    8.00
3.00    6.00    9.00    12.00
4.00    8.00    12.00   16.00

0.10    0.20    0.30    0.40
0.20    0.40    0.60    0.80
0.30    0.60    0.90    1.20
0.40    0.80    1.20    1.60

3.00    6.00    9.00    12.00
6.00    12.00   18.00   24.00
9.00    18.00   27.00   36.00
12.00   24.00   36.00   48.00

cycle count 180


gabor

Hi folks!

Sorry for the late answer. I have made some progress on the topic. I'll post it later because I don't want to present it in an immature state.

My results, measured at the office on a 3.0 GHz P4:
Testing vector normalization with FPU
cycle count 140

Testing vector normalization with SSE
cycle count 56

Testing matrix multiplication with FPU
cycle count 452

Testing matrix multiplication with SSE
cycle count 187


I'm coding some quaternion operations right now...

Greets, Gábor

Rockoon

56 cycles is still way too much for SSE vector normalization... something is wrong.

Using only SSE scalar ops:

3 loads
3 multiplications
2 additions
1 reciprocal sqrt
3 more multiplications
3 stores
1 function return
----
16 instructions total


The worst latency here should be the rsqrtss, at 4 clocks on the P4; the rest should be between 1 and 3. Yet you seem to be averaging 3.5 clock cycles per instruction, an instruction throughput of only 0.286 per cycle.

Something is wrong with this picture. The P4 sucks, but it doesn't suck that badly.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.