Hello coder friends!
This thread is intended to present solutions for several d3dx methods. So far I have matrix multiplication and vector normalization.
Please test them and share your results. I'd be glad to see some optimization advice; I guess other members would be happy too. (Maybe I'm too optimistic.)
Now the short story behind:
After taking a few steps in the strange and unfamiliar domain of D3D, I found myself surrounded by matrix and vector operations on floats.
The obvious idea was to code such operations with the FPU, because I don't really know much about SSE and because I've heard about the very good FPU performance of AMD processors (and I own an AMD Athlon). By the way, I too have seen amazing differences between P4 and Athlon FPU performance (about a 1:10 ratio!!). So I'd like to use the FPU with float (32-bit) precision. Stubborn as I am, I will surely get to an SSE implementation too.
Edit:
The new zip contains the promised modifications. Enjoy! The matrix multiplication is not complete: it multiplies a matrix with a transposed matrix, which means it cannot be used to calculate the square of a matrix.
Greets, Gábor
[attachment deleted by admin]
Hi Gábor,
Since you are using arrays of REAL4 variables, I think this is a case where SSE would be the best solution, for example MULPS. It would be interesting to see the actual timing differences between x87 and SSE. I'm getting 168 on the vector3 normalization and 393 on the matrix multiplication. I have a Pentium D 940 (dual core, 3.2 GHz). I'm not sure the program is working correctly though, as I'm getting all zeros on the 2nd and 3rd matrix multiply. If it were a console program it would be much easier to copy and paste the results.
Hi Greg!
Thanks for your excellent suggestions! :U
- I'll create an SSE version for comparison,
- I'll convert the app to console app.
The 2nd matrix in the multiplication was all zero :red, that is why zero matrices are shown...
My AMD Athlon XP 2500+ produces 82 and 209 cycles. Interesting, isn't it? According to my current investigations, the big cycle difference comes from the FPU accessing RAM (writing to it via FSTP). I'm positive that this is not the case when SSE is involved. We'll see :8)
I'll post the modified version soon...
Greets, Gábor
[NUDGE]
Intel's Approximate Math library: http://www.intel.com/design/pentiumiii/devtools/AMaths.zip
168 cycles seems a little extreme for vec3 normalization
Correct me if I am wrong, but I suspect the use of 3 division ops when a single division op will do.
Specifically, a normalization can be performed like this:
length = sqrt(x*x + y*y + z*z);
x /= length;
y /= length;
z /= length;
However, it can ALSO be performed like this:
reciplength = 1 / sqrt(x*x + y*y + z*z);
x *= reciplength;
y *= reciplength;
z *= reciplength;
This trades 2 expensive divisions for 3 cheap multiplications
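In FPU terms that means one fdiv to form the reciprocal and three fmuls. A rough sketch (assumptions for illustration only: eax points at three consecutive REAL4 components and the vector is normalized in place):
fld real4 ptr [eax]      ; x
fmul st, st(0)           ; x*x
fld real4 ptr [eax+4]    ; y
fmul st, st(0)           ; y*y
faddp st(1), st          ; x*x + y*y
fld real4 ptr [eax+8]    ; z
fmul st, st(0)           ; z*z
faddp st(1), st          ; x*x + y*y + z*z
fsqrt                    ; length
fld1
fdivrp st(1), st         ; 1/length (the only division)
fld real4 ptr [eax]
fmul st, st(1)
fstp real4 ptr [eax]     ; x *= reciplength
fld real4 ptr [eax+4]
fmul st, st(1)
fstp real4 ptr [eax+4]   ; y *= reciplength
fld real4 ptr [eax+8]
fmulp st(1), st          ; last multiply also pops the reciprocal
fstp real4 ptr [eax+8]   ; z *= reciplength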
Moving right along, the sqrt function is a big bottleneck, and SSE has a fast reciprocal square root op (it's just an approximation, but more than good enough for most purposes):
reciplength = rsqrt(x*x + y*y + z*z);
x *= reciplength;
y *= reciplength;
z *= reciplength;
This final version, even without SIMD (using only the scalar SSE instructions), will blaze in comparison to the original.
Quote: "The 2nd matrix in the multiplication was all zero, that is why zero matrices are shown..."
OK, I see that now. :red
I have read that AMD is faster at x87 too, and it sure does look that way. It will be interesting to see how SSE compares.
Read up on how to perform quaternion rotations with the help of SSE; quaternions avoid the problems that the usual rotation approach can run into when one value is zero, which is especially important if you want a 3D fly-around world to work.
If you want speed, unroll the proc to perform many 1/sqrt(x*x + y*y + z*z) calculations at once; the better the CPU, the more mulps it can perform in parallel instead of just one.
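For illustration, batching four vectors per pass with rsqrtps could look roughly like this (a sketch only, assuming a structure-of-arrays layout where xs, ys and zs are hypothetical ALIGN 16 arrays of four REAL4 values each):
movaps xmm0, xs          ; x0 x1 x2 x3
movaps xmm1, ys          ; y0 y1 y2 y3
movaps xmm2, zs          ; z0 z1 z2 z3
movaps xmm3, xmm0
movaps xmm4, xmm1
movaps xmm5, xmm2
mulps xmm3, xmm3         ; x*x
mulps xmm4, xmm4         ; y*y
mulps xmm5, xmm5         ; z*z
addps xmm3, xmm4
addps xmm3, xmm5         ; four squared lengths at once
rsqrtps xmm3, xmm3       ; four approximate reciprocal lengths
mulps xmm0, xmm3         ; scale the x components
mulps xmm1, xmm3         ; scale the y components
mulps xmm2, xmm3         ; scale the z components
movaps xs, xmm0
movaps ys, xmm1
movaps zs, xmm2
The structure-of-arrays layout is what lets the four squared lengths land in one register; with the usual x,y,z layout you would first have to swizzle the components together.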
I don't claim to be good at this SSE stuff but this is what I came up with for the matrix multiply. I made matrix3 the destination so I could see something happening.
.686
.MODEL FLAT,STDCALL
OPTION CASEMAP:NONE
.XMM
INCLUDE kernel32.inc
INCLUDELIB kernel32.lib
FLOAT TYPEDEF REAL4
D3DMATRIX STRUCT
_11 FLOAT ?
_12 FLOAT ?
_13 FLOAT ?
_14 FLOAT ?
_21 FLOAT ?
_22 FLOAT ?
_23 FLOAT ?
_24 FLOAT ?
_31 FLOAT ?
_32 FLOAT ?
_33 FLOAT ?
_34 FLOAT ?
_41 FLOAT ?
_42 FLOAT ?
_43 FLOAT ?
_44 FLOAT ?
D3DMATRIX ENDS
.DATA
ALIGN 16
matrix1 D3DMATRIX <1.0,2.0,3.0,4.0, 2.0,4.0,6.0,8.0, 3.0,6.0,9.0,12.0, 4.0,8.0,12.0,16.0>
matrix2 D3DMATRIX <1.0,2.0,3.0,4.0, 2.0,4.0,6.0,8.0, 3.0,6.0,9.0,12.0, 4.0,8.0,12.0,16.0>
matrix3 D3DMATRIX <>
.CODE
start:
call main
INVOKE ExitProcess, eax
main PROC
movaps xmm0, matrix1._11
movaps xmm1, matrix1._21
movaps xmm2, matrix1._31
movaps xmm3, matrix1._41
movaps xmm4, matrix2._11
movaps xmm5, matrix2._21
movaps xmm6, matrix2._31
movaps xmm7, matrix2._41
mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
movaps matrix3._11, xmm0
movaps matrix3._21, xmm1
movaps matrix3._31, xmm2
movaps matrix3._41, xmm3
ret
main ENDP
END start
hi,
depending on the use, but mulps XMMx, mem is generally advantageous, because there aren't enough XMM registers (since I suppose you're not going to stop your work with this matrix...)
NightWare,
Yeah, that is better.
movaps xmm0, matrix1._11
movaps xmm1, matrix1._21
movaps xmm2, matrix1._31
movaps xmm3, matrix1._41
mulps xmm0, matrix2._11
mulps xmm1, matrix2._21
mulps xmm2, matrix2._31
mulps xmm3, matrix2._41
movaps matrix3._11, xmm0
movaps matrix3._21, xmm1
movaps matrix3._31, xmm2
movaps matrix3._41, xmm3
Hi folks!
Nice posts, thanks a lot!
Rockoon!
Thanks for your valuable advice! Adding one fdiv to calculate the reciprocal and then using fmul instead of fdiv three times really did speed things up.
Greg!
I'm afraid your suggestion for the matrix mul is not correct, because according to your code the matrices (for 2D) would look like
matrix1 = | a b |    matrix2 = | x y |    matrix3 = | ax by |
          | d e |              | u v |              | du ev |
and what we need is
matrix3 = | ax+bu  ay+bv |
          | dx+eu  dy+ev |
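For the full 4x4 case, one common SSE technique (shown here only as a sketch, not necessarily the code I will end up posting) is to broadcast each element of a matrix1 row with shufps, multiply it by the corresponding matrix2 row, and add the four products; esi, edi and edx are hypothetical pointers to matrix1, matrix2 and an aligned result:
mov ecx, 4                       ; four rows of the result
row_loop:
movss xmm0, real4 ptr [esi]      ; element 1 of the current matrix1 row
movss xmm1, real4 ptr [esi+4]    ; element 2
movss xmm2, real4 ptr [esi+8]    ; element 3
movss xmm3, real4 ptr [esi+12]   ; element 4
shufps xmm0, xmm0, 0             ; broadcast each element to all four lanes
shufps xmm1, xmm1, 0
shufps xmm2, xmm2, 0
shufps xmm3, xmm3, 0
mulps xmm0, [edi]                ; * row 1 of matrix2
mulps xmm1, [edi+16]             ; * row 2 of matrix2
mulps xmm2, [edi+32]             ; * row 3 of matrix2
mulps xmm3, [edi+48]             ; * row 4 of matrix2
addps xmm0, xmm1
addps xmm2, xmm3
addps xmm0, xmm2                 ; one complete row of the result
movaps [edx], xmm0
add esi, 16
add edx, 16
dec ecx
jnz row_loop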
I'm working on it too, I'll post it soon (I hope) with the modifications I promised.
Greets, Gábor
Gábor,
As you can tell, I have done very little with Direct3D. I thought it was just a regular matrix multiply. I need to do some reading on the subject.
...
I see how it goes now.
Gábor,
Looks like you got it. Good job.
Here's the results I get (Pentium D 940). How does it test on your AMD Athlon?
Testing vector3 normalization with FPU
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 93
Testing vector3 normalization with SSE
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 29
Testing matrix multiplication with FPU
1.00 2.00 3.00 4.00
2.00 4.00 6.00 8.00
3.00 6.00 9.00 12.00
4.00 8.00 12.00 16.00
0.10 0.20 0.30 0.40
0.20 0.40 0.60 0.80
0.30 0.60 0.90 1.20
0.40 0.80 1.20 1.60
3.00 6.00 9.00 12.00
6.00 12.00 18.00 24.00
9.00 18.00 27.00 36.00
12.00 24.00 36.00 48.00
cycle count 279
Testing matrix multiplication with SSE
1.00 2.00 3.00 4.00
2.00 4.00 6.00 8.00
3.00 6.00 9.00 12.00
4.00 8.00 12.00 16.00
0.10 0.20 0.30 0.40
0.20 0.40 0.60 0.80
0.30 0.60 0.90 1.20
0.40 0.80 1.20 1.60
3.00 6.00 9.00 12.00
6.00 12.00 18.00 24.00
9.00 18.00 27.00 36.00
12.00 24.00 36.00 48.00
cycle count 180
Hi folks!
Sorry for the late answer. I have made some progress on the topic; I'll post it later because I don't want to present it in an immature state.
These are my results from the office machine, a 3.0 GHz P4:
Testing vector normalization with FPU
cycle count 140
Testing vector normalization with SSE
cycle count 56
Testing matrix multiplication with FPU
cycle count 452
Testing matrix multiplication with SSE
cycle count 187
I'm coding some quaternion operations right now...
Greets, Gábor
56 cycles is still way too much for SSE vector normalization... something is wrong.
Using only SSE scalar ops:
3 loads
3 multiplications
2 additions
1 reciprocal sqrt
3 more multiplications
3 stores
1 function return
----
16 instructions total
The worst latency here should be the rsqrtss, at 4 clocks on the P4; the rest should be between 1 and 3. At 56 cycles for 16 instructions you are averaging 3.5 clock cycles per instruction, a throughput of only 0.286 instructions per cycle.
Something is wrong with this picture. The P4 sucks, but it doesn't suck that badly.
This is on my P3. I tried it multiple times and the matrix multiplication cycle counts varied by only about one cycle. Why would the SSE matrix multiplication be slower?
Testing vector3 normalization with FPU
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 114
Testing vector3 normalization with SSE
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 30
Testing matrix multiplication with FPU
1.00 2.00 3.00 4.00
2.00 4.00 6.00 8.00
3.00 6.00 9.00 12.00
4.00 8.00 12.00 16.00
0.10 0.20 0.30 0.40
0.20 0.40 0.60 0.80
0.30 0.60 0.90 1.20
0.40 0.80 1.20 1.60
3.00 6.00 9.00 12.00
6.00 12.00 18.00 24.00
9.00 18.00 27.00 36.00
12.00 24.00 36.00 48.00
cycle count 193
Testing matrix multiplication with SSE
1.00 2.00 3.00 4.00
2.00 4.00 6.00 8.00
3.00 6.00 9.00 12.00
4.00 8.00 12.00 16.00
0.10 0.20 0.30 0.40
0.20 0.40 0.60 0.80
0.30 0.60 0.90 1.20
0.40 0.80 1.20 1.60
3.00 6.00 9.00 12.00
6.00 12.00 18.00 24.00
9.00 18.00 27.00 36.00
12.00 24.00 36.00 48.00
cycle count 242
Similar results with an Athlon XP 3000:
Testing vector3 normalization with FPU
cycle count 48
Testing vector3 normalization with SSE
cycle count 22
Testing matrix multiplication with FPU
cycle count 168
Testing matrix multiplication with SSE
cycle count 239
Hi!
These are the results on my AMD Athlon XP 2500+.
Testing vector3 normalization with FPU
cycle count 52
Testing vector3 normalization with SSE
cycle count 22
Testing matrix multiplication with FPU
cycle count 172
Testing matrix multiplication with SSE
cycle count 252
About the cycle counts I have 2 things to share.
1. In my experience the P4 (and maybe other Pentiums too) has trouble accessing memory multiple times via the FPU. Reading or writing a single memory variable is normal; accessing another variable adds many cycles. I don't know why... Any ideas?
2. My P4 measurements were made on a dual core, and I've heard (from Ultrano) that this sort of CPU is kinda buggy with timing functions. This could mean my results are not quite precise.
I must admit that I haven't played around much with the SSE code; there could be optimization possibilities.
Finally, I attached the code to the kick-off post. Please have a look at the methods if you believe these results can be cut down; a speed improvement is a must for such basic functions.
Thanks and greets, Gábor
I tried to rewrite it to perform 4 operations in parallel, but it didn't compile due to a timing include file I don't have :(
Hi Daydreamer!
The timer macro is MichaelW's. You'll find it here: http://www.masm32.com/board/index.php?topic=770.0
I'm keen on seeing your version. Please share your results with us!
Greets, Gábor
Ideally this function requires only the relevant operations:
3 scalar multiplications, followed by 2 scalar additions, followed by a reciprocal square root, followed by 3 scalar multiplications.
In the current SSE version of the function there are plenty of CPU cycles where NONE of the relevant operations are happening, and in particular the swizzling seems to be "extra" work.
Note that the SSE-parallel rsqrt performs 2 extra rsqrt operations that aren't really needed (the 3 results are equal).
In a CPU that can always handle precisely 2 independent operations at a time, the "ideal" scalar-style processing would happen something like this:
0: load, load
1: load, copy
2: copy, copy
3: mul, mul
4: mul, add
5: add, stall
6: rsqrt, stall
7: mul, mul
8: mul, store
9: store, store
And in a CPU that can always handle precisely 3 independent operations at a time, the "ideal" SSE-scalar processing would happen something like this:
0: load, load, load
1: copy, copy, copy
2: mul, mul, mul
3: add, stall, stall
4: add, stall, stall
5: rsqrt, stall, stall
6: mul, mul, mul
7: store, store, store
I realize that these are simplified models of a CPU, but they illustrate the point:
there are only 3 cycles where an execution unit is stalled in the 3-unit model, and only 2 cycles with stalls in the 2-unit model.
The code in the current SSE-parallel version has 2 extra swizzle cycles that cannot be paired up, 1 SSE-parallel rsqrt cycle which duplicates effort, and all of the SSE-parallel operations are 4-component operations when only 3 components are actually needed. So even if there are absolutely no stalls elsewhere, it can at best only equal the ideal SSE-scalar methodology within the model.
I propose that this:
movss xmm0, [eax]        ; load x, y, z
movss xmm1, [eax + 4]
movss xmm2, [eax + 8]
movss xmm3, xmm0         ; keep copies for the final scaling
movss xmm4, xmm1
movss xmm5, xmm2
mulss xmm0, xmm0         ; x*x
mulss xmm1, xmm1         ; y*y
mulss xmm2, xmm2         ; z*z
addss xmm0, xmm1
addss xmm0, xmm2         ; x*x + y*y + z*z
rsqrtss xmm1, xmm0       ; approximate reciprocal length
mulss xmm3, xmm1         ; x * reciplength
mulss xmm4, xmm1         ; y * reciplength
mulss xmm5, xmm1         ; z * reciplength
movss [edx], xmm3        ; store the normalized vector
movss [edx + 4], xmm4
movss [edx + 8], xmm5
..is likely to be no worse than the SSE-parallel version on AMD64 (which is very similar to the 3-unit model), and it might in fact be substantially superior. Additionally, some ordering tweaks could be tested.