Hello coder friends!
This thread is intended to present solutions for several d3dx methods. So far I have matrix multiplication and vector normalization.
Please test them and share your results. I'd be glad to see some optimization advice; I guess other members would be happy too. (Maybe I'm too optimistic.)
Now the short story behind:
After taking a few steps in the strange and unfamiliar domain of D3D, I found myself surrounded by matrix and vector operations on floats.
The obvious idea was to code such operations with the FPU, because I don't really know much about SSE and because I've heard about the very good FPU performance of AMD processors (and I own an AMD Athlon). By the way, I too have seen amazing differences between P4 and Athlon FPU performance (about a 1:10 ratio!!). So I'd like to use the FPU with float (32-bit) precision. Stubborn as I am, I will surely get to an SSE implementation too.
Edit:
The new zip contains the promised modifications. Enjoy! The matrix multiplication is not complete: it multiplies a matrix with a transposed matrix, which means it cannot be used to calculate the square of a matrix.
Greets, Gábor
[attachment deleted by admin]
Hi Gábor,
Since you are using arrays of REAL4 variables, I think this is a case where SSE would be the best solution, for example MULPS. It would be interesting to see the actual timing differences between x87 and SSE. I'm getting 168 on the vector3 normalization and 393 on the matrix multiplication. I have a Pentium D 940 (dual core, 3.2 GHz). I'm not sure the program is working correctly though, as I'm getting all zeros on the 2nd and 3rd matrix multiply. If it were a console program it would be much easier to copy and paste the results.
Hi Greg!
Thanks for your excellent suggestions! :U
- I'll create an SSE version for comparison,
- I'll convert the app to console app.
The 2nd matrix in the multiplication was all zero :red, that is why zero matrices are shown...
My AMD Athlon XP 2500+ produces 82 and 209 cycles. Interesting, isn't it? According to my current investigations, the big cycle difference comes from the FPU accessing RAM (writing to it via FSTP). I'm positive that this is not the case when SSE is involved. We'll see :8)
I'll post the modified version soon...
Greets, Gábor
[NUDGE]
Intel's Approximate Math library: http://www.intel.com/design/pentiumiii/devtools/AMaths.zip
168 cycles seems a little extreme for vec3 normalization
Correct me if I am wrong, but I suspect the use of 3 division ops when a single division op will do.
Specifically, a normalization can be performed like this:
length = sqrt(x*x + y*y + z*z);
x /= length;
y /= length;
z /= length;
However, it can ALSO be performed like this:
reciplength = 1 / sqrt(x*x + y*y + z*z);
x *= reciplength;
y *= reciplength;
z *= reciplength;
This trades 2 expensive divisions for 3 cheap multiplications
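In FPU terms that means one fdiv to form the reciprocal and three fmuls. A rough sketch (assumptions for illustration only: eax points at three consecutive REAL4 components and the vector is normalized in place):
fld real4 ptr [eax]      ; x
fmul st, st(0)           ; x*x
fld real4 ptr [eax+4]    ; y
fmul st, st(0)           ; y*y
faddp st(1), st          ; x*x + y*y
fld real4 ptr [eax+8]    ; z
fmul st, st(0)           ; z*z
faddp st(1), st          ; x*x + y*y + z*z
fsqrt                    ; length
fld1
fdivrp st(1), st         ; 1/length (the only division)
fld real4 ptr [eax]
fmul st, st(1)
fstp real4 ptr [eax]     ; x *= reciplength
fld real4 ptr [eax+4]
fmul st, st(1)
fstp real4 ptr [eax+4]   ; y *= reciplength
fld real4 ptr [eax+8]
fmulp st(1), st          ; last multiply also pops the reciprocal
fstp real4 ptr [eax+8]   ; z *= reciplength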
Moving right along, the sqrt function is a big bottleneck, and SSE has a fast reciprocal square root op (it's just an approximation, but more than good enough for most purposes):
reciplength = rsqrt(x*x + y*y + z*z);
x *= reciplength;
y *= reciplength;
z *= reciplength;
This final version, even without SIMD (using only the scalar SSE instructions), will blaze in comparison to the original.
Quote: "The 2nd matrix in the multiplication was all zero, that is why zero matrices are shown..."
OK, I see that now. :red
I have read that AMD is faster at x87 too, and it sure does look that way. It will be interesting to see how SSE compares.
Read up on how to perform quaternion rotations with the help of SSE; quaternions avoid the problems that the usual rotation approach can run into when one value is zero, which is especially important if you want a 3D fly-around world to work.
If you want speed, unroll the proc to perform many 1/sqrt(x*x + y*y + z*z) calculations at once; the better the CPU, the more mulps it can perform in parallel instead of just one.
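For illustration, batching four vectors per pass with rsqrtps could look roughly like this (a sketch only, assuming a structure-of-arrays layout where xs, ys and zs are hypothetical ALIGN 16 arrays of four REAL4 values each):
movaps xmm0, xs          ; x0 x1 x2 x3
movaps xmm1, ys          ; y0 y1 y2 y3
movaps xmm2, zs          ; z0 z1 z2 z3
movaps xmm3, xmm0
movaps xmm4, xmm1
movaps xmm5, xmm2
mulps xmm3, xmm3         ; x*x
mulps xmm4, xmm4         ; y*y
mulps xmm5, xmm5         ; z*z
addps xmm3, xmm4
addps xmm3, xmm5         ; four squared lengths at once
rsqrtps xmm3, xmm3       ; four approximate reciprocal lengths
mulps xmm0, xmm3         ; scale the x components
mulps xmm1, xmm3         ; scale the y components
mulps xmm2, xmm3         ; scale the z components
movaps xs, xmm0
movaps ys, xmm1
movaps zs, xmm2
The structure-of-arrays layout is what lets the four squared lengths land in one register; with the usual x,y,z layout you would first have to swizzle the components together.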
I don't claim to be good at this SSE stuff but this is what I came up with for the matrix multiply. I made matrix3 the destination so I could see something happening.
.686
.MODEL FLAT,STDCALL
OPTION CASEMAP:NONE
.XMM
INCLUDE kernel32.inc
INCLUDELIB kernel32.lib
FLOAT TYPEDEF REAL4
D3DMATRIX STRUCT
_11 FLOAT ?
_12 FLOAT ?
_13 FLOAT ?
_14 FLOAT ?
_21 FLOAT ?
_22 FLOAT ?
_23 FLOAT ?
_24 FLOAT ?
_31 FLOAT ?
_32 FLOAT ?
_33 FLOAT ?
_34 FLOAT ?
_41 FLOAT ?
_42 FLOAT ?
_43 FLOAT ?
_44 FLOAT ?
D3DMATRIX ENDS
.DATA
ALIGN 16
matrix1 D3DMATRIX <1.0,2.0,3.0,4.0, 2.0,4.0,6.0,8.0, 3.0,6.0,9.0,12.0, 4.0,8.0,12.0,16.0>
matrix2 D3DMATRIX <1.0,2.0,3.0,4.0, 2.0,4.0,6.0,8.0, 3.0,6.0,9.0,12.0, 4.0,8.0,12.0,16.0>
matrix3 D3DMATRIX <>
.CODE
start:
call main
INVOKE ExitProcess, eax
main PROC
movaps xmm0, matrix1._11
movaps xmm1, matrix1._21
movaps xmm2, matrix1._31
movaps xmm3, matrix1._41
movaps xmm4, matrix2._11
movaps xmm5, matrix2._21
movaps xmm6, matrix2._31
movaps xmm7, matrix2._41
mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
movaps matrix3._11, xmm0
movaps matrix3._21, xmm1
movaps matrix3._31, xmm2
movaps matrix3._41, xmm3
ret
main ENDP
END start
hi,
depending on the use, but mulps XMMx, mem is generally advantageous, because there aren't enough XMM registers (since I suppose you're not going to stop your work with this matrix...)
NightWare,
Yeah, that is better.
movaps xmm0, matrix1._11
movaps xmm1, matrix1._21
movaps xmm2, matrix1._31
movaps xmm3, matrix1._41
mulps xmm0, matrix2._11
mulps xmm1, matrix2._21
mulps xmm2, matrix2._31
mulps xmm3, matrix2._41
movaps matrix3._11, xmm0
movaps matrix3._21, xmm1
movaps matrix3._31, xmm2
movaps matrix3._41, xmm3
Hi folks!
Nice posts, thanks a lot!
Rockoon!
Thanks for your valuable advice! Adding one fdiv to calculate the reciprocal and then using fmul instead of fdiv three times really did speed things up.
Greg!
I'm afraid your suggestion for the matrix mul is not correct, because according to your code the matrices (for 2D) would look like
matrix1 = | a b |    matrix2 = | x y |    matrix3 = | ax by |
          | d e |              | u v |              | du ev |
and what we need is
matrix3 = | ax+bu  ay+bv |
          | dx+eu  dy+ev |
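For the full 4x4 case, one common SSE technique (shown here only as a sketch, not necessarily the code I will end up posting) is to broadcast each element of a matrix1 row with shufps, multiply it by the corresponding matrix2 row, and add the four products; esi, edi and edx are hypothetical pointers to matrix1, matrix2 and an aligned result:
mov ecx, 4                       ; four rows of the result
row_loop:
movss xmm0, real4 ptr [esi]      ; element 1 of the current matrix1 row
movss xmm1, real4 ptr [esi+4]    ; element 2
movss xmm2, real4 ptr [esi+8]    ; element 3
movss xmm3, real4 ptr [esi+12]   ; element 4
shufps xmm0, xmm0, 0             ; broadcast each element to all four lanes
shufps xmm1, xmm1, 0
shufps xmm2, xmm2, 0
shufps xmm3, xmm3, 0
mulps xmm0, [edi]                ; * row 1 of matrix2
mulps xmm1, [edi+16]             ; * row 2 of matrix2
mulps xmm2, [edi+32]             ; * row 3 of matrix2
mulps xmm3, [edi+48]             ; * row 4 of matrix2
addps xmm0, xmm1
addps xmm2, xmm3
addps xmm0, xmm2                 ; one complete row of the result
movaps [edx], xmm0
add esi, 16
add edx, 16
dec ecx
jnz row_loop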
I'm working on it too, I'll post it soon (I hope) with the modifications I promised.
Greets, Gábor
Gábor,
As you can tell, I have done very little with Direct3D. I thought it was just a regular matrix multiply. I need to do some reading on the subject.
...
I see how it goes now.
Gábor,
Looks like you got it. Good job.
Here's the results I get (Pentium D 940). How does it test on your AMD Athlon?
Testing vector3 normalization with FPU
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 93
Testing vector3 normalization with SSE
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 29
Testing matrix multiplication with FPU
1.00 2.00 3.00 4.00
2.00 4.00 6.00 8.00
3.00 6.00 9.00 12.00
4.00 8.00 12.00 16.00
0.10 0.20 0.30 0.40
0.20 0.40 0.60 0.80
0.30 0.60 0.90 1.20
0.40 0.80 1.20 1.60
3.00 6.00 9.00 12.00
6.00 12.00 18.00 24.00
9.00 18.00 27.00 36.00
12.00 24.00 36.00 48.00
cycle count 279
Testing matrix multiplication with SSE
1.00 2.00 3.00 4.00
2.00 4.00 6.00 8.00
3.00 6.00 9.00 12.00
4.00 8.00 12.00 16.00
0.10 0.20 0.30 0.40
0.20 0.40 0.60 0.80
0.30 0.60 0.90 1.20
0.40 0.80 1.20 1.60
3.00 6.00 9.00 12.00
6.00 12.00 18.00 24.00
9.00 18.00 27.00 36.00
12.00 24.00 36.00 48.00
cycle count 180
Hi folks!
Sorry for the late answer. I have made some progress on the topic; I'll post it later because I don't want to present it in an immature state.
These are my results from the office machine, a 3.0 GHz P4:
Testing vector normalization with FPU
cycle count 140
Testing vector normalization with SSE
cycle count 56
Testing matrix multiplication with FPU
cycle count 452
Testing matrix multiplication with SSE
cycle count 187
I'm coding some quaternion operations right now...
Greets, Gábor
56 cycles is still way too much for SSE vector normalization... something is wrong.
Using only SSE scalar ops:
3 loads
3 multiplications
2 additions
1 reciprocal sqrt
3 more multiplications
3 stores
1 function return
----
16 instructions total
The worst latency here should be the rsqrtss, at 4 clocks on the P4; the rest should be between 1 and 3. At 56 cycles for 16 instructions you are averaging 3.5 clock cycles per instruction, a throughput of only 0.286 instructions per cycle.
Something is wrong with this picture. The P4 sucks, but it doesn't suck that badly.
This is on my P3. I tried it multiple times and the matrix multiplication cycle counts varied by only about one cycle. Why would the SSE matrix multiplication be slower?
Testing vector3 normalization with FPU
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 114
Testing vector3 normalization with SSE
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 30
Testing matrix multiplication with FPU
1.00 2.00 3.00 4.00
2.00 4.00 6.00 8.00
3.00 6.00 9.00 12.00
4.00 8.00 12.00 16.00
0.10 0.20 0.30 0.40
0.20 0.40 0.60 0.80
0.30 0.60 0.90 1.20
0.40 0.80 1.20 1.60
3.00 6.00 9.00 12.00
6.00 12.00 18.00 24.00
9.00 18.00 27.00 36.00
12.00 24.00 36.00 48.00
cycle count 193
Testing matrix multiplication with SSE
1.00 2.00 3.00 4.00
2.00 4.00 6.00 8.00
3.00 6.00 9.00 12.00
4.00 8.00 12.00 16.00
0.10 0.20 0.30 0.40
0.20 0.40 0.60 0.80
0.30 0.60 0.90 1.20
0.40 0.80 1.20 1.60
3.00 6.00 9.00 12.00
6.00 12.00 18.00 24.00
9.00 18.00 27.00 36.00
12.00 24.00 36.00 48.00
cycle count 242
Similar results with an Athlon XP 3000:
Testing vector3 normalization with FPU
cycle count 48
Testing vector3 normalization with SSE
cycle count 22
Testing matrix multiplication with FPU
cycle count 168
Testing matrix multiplication with SSE
cycle count 239
Hi!
These are the results on my AMD Athlon XP 2500+.
Testing vector3 normalization with FPU
cycle count 52
Testing vector3 normalization with SSE
cycle count 22
Testing matrix multiplication with FPU
cycle count 172
Testing matrix multiplication with SSE
cycle count 252
About the cycle counts I have 2 things to share.
1. In my experience the P4 (and maybe other Pentiums too) has trouble accessing memory multiple times via the FPU. Reading or writing a single memory variable is normal; accessing another variable adds many cycles. I don't know why... Any ideas?
2. My P4 measurements were made on a dual core, and I've heard (from Ultrano) that this sort of CPU is kinda buggy with timing functions. This could mean my results are not quite precise.
I must admit that I haven't played around much with the SSE code; there could be optimization possibilities.
Finally, I attached the code to the kick-off post. Please have a look at the methods if you believe these results can be cut down; a speed improvement is a must for such basic functions.
Thanks and greets, Gábor
I tried to rewrite it to perform 4 operations in parallel, but it didn't compile due to a timing include file I don't have :(
Hi Daydreamer!
The timer macro is MichaelW's. You'll find it here: http://www.masm32.com/board/index.php?topic=770.0
I'm keen on seeing your version. Please share your results with us!
Greets, Gábor
Ideally this function requires only the relevant operations:
3 scalar multiplications, followed by 2 scalar additions, followed by a reciprocal square root, followed by 3 scalar multiplications.
In the current SSE version of the function there are plenty of CPU cycles where NONE of the relevant operations are happening, and in particular the swizzling seems to be "extra" work.
Note that the SSE-parallel rsqrt performs 2 extra rsqrt operations that aren't really needed (the 3 results are equal).
In a CPU that can always handle precisely 2 independent operations at a time, the "ideal" scalar-style processing would happen something like this:
0: load, load
1: load, copy
2: copy, copy
3: mul, mul
4: mul, add
5: add, stall
6: rsqrt, stall
7: mul, mul
8: mul, store
9: store, store
And in a CPU that can always handle precisely 3 independent operations at a time, the "ideal" SSE-scalar processing would happen something like this:
0: load, load, load
1: copy, copy, copy
2: mul, mul, mul
3: add, stall, stall
4: add, stall, stall
5: rsqrt, stall, stall
6: mul, mul, mul
7: store, store, store
I realize that these are simplified models of a CPU, but they illustrate the point:
there are only 3 cycles where an execution unit is stalled in the 3-unit model, and only 2 cycles with stalls in the 2-unit model.
The code in the current SSE-parallel version has 2 extra swizzle cycles that cannot be paired up, 1 SSE-parallel rsqrt cycle which duplicates effort, and all of the SSE-parallel operations are 4-component operations when only 3 components are actually needed. So even if there are absolutely no stalls elsewhere, it can at best only equal the ideal SSE-scalar methodology within the model.
I propose that this:
movss xmm0, [eax]        ; load x, y, z
movss xmm1, [eax + 4]
movss xmm2, [eax + 8]
movss xmm3, xmm0         ; keep copies for the final scaling
movss xmm4, xmm1
movss xmm5, xmm2
mulss xmm0, xmm0         ; x*x
mulss xmm1, xmm1         ; y*y
mulss xmm2, xmm2         ; z*z
addss xmm0, xmm1
addss xmm0, xmm2         ; x*x + y*y + z*z
rsqrtss xmm1, xmm0       ; approximate reciprocal length
mulss xmm3, xmm1         ; x * reciplength
mulss xmm4, xmm1         ; y * reciplength
mulss xmm5, xmm1         ; z * reciplength
movss [edx], xmm3        ; store the normalized vector
movss [edx + 4], xmm4
movss [edx + 8], xmm5
..is likely to be no worse than the SSE-parallel version on AMD64 (which is very similar to the 3-unit model), and it might in fact be substantially superior. Additionally, some ordering tweaks could be tested.