Matrix and vector operations using FPU

Started by gabor, September 18, 2007, 10:54:32 PM


MichaelW

This is on my P3. I tried it multiple times and the matrix multiplication cycle counts varied by only about one cycle. Why would the SSE matrix multiplication be slower?

Testing vector3 normalization with FPU
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 114

Testing vector3 normalization with SSE
Input 2.00, 4.00, 8.00
Normalized 0.22, 0.44, 0.87
cycle count 30


Testing matrix multiplication with FPU
1.00    2.00    3.00    4.00
2.00    4.00    6.00    8.00
3.00    6.00    9.00    12.00
4.00    8.00    12.00   16.00

0.10    0.20    0.30    0.40
0.20    0.40    0.60    0.80
0.30    0.60    0.90    1.20
0.40    0.80    1.20    1.60

3.00    6.00    9.00    12.00
6.00    12.00   18.00   24.00
9.00    18.00   27.00   36.00
12.00   24.00   36.00   48.00

cycle count 193

Testing matrix multiplication with SSE
1.00    2.00    3.00    4.00
2.00    4.00    6.00    8.00
3.00    6.00    9.00    12.00
4.00    8.00    12.00   16.00

0.10    0.20    0.30    0.40
0.20    0.40    0.60    0.80
0.30    0.60    0.90    1.20
0.40    0.80    1.20    1.60

3.00    6.00    9.00    12.00
6.00    12.00   18.00   24.00
9.00    18.00   27.00   36.00
12.00   24.00   36.00   48.00

cycle count 242
eschew obfuscation

Jimg

Similar results on an Athlon XP 3000:

Testing vector3 normalization with FPU
cycle count 48

Testing vector3 normalization with SSE
cycle count 22

Testing matrix multiplication with FPU
cycle count 168

Testing matrix multiplication with SSE
cycle count 239

gabor

Hi!

These are the results on my AMD Athlon XP 2500+.
Testing vector3 normalization with FPU
cycle count 52

Testing vector3 normalization with SSE
cycle count 22

Testing matrix multiplication with FPU
cycle count 172

Testing matrix multiplication with SSE
cycle count 252


About the cycle counts I have two things to share.
1. I found that the P4 (and maybe other Pentiums too) has trouble accessing memory multiple times via the FPU. Reading or writing a single memory variable is normal, but accessing another variable adds many cycles. I don't know why... any ideas?
2. My P4 measurements were made on a dual-core CPU, and I've heard (from Ultrano) that this sort of CPU can be unreliable with timing functions, so my results may not be entirely precise.

I must admit that I haven't played around much with the SSE code; there may still be room for optimization.
Finally, I attached the code to the kick-off post. Please have a look at the routines if you think these results can be cut down; a speed improvement is a must for such basic functions.

Thanks and greets, Gábor



daydreamer

I tried to rewrite it to perform 4 operations in parallel, but it didn't compile because it needs a timing include file I don't have. :(

gabor

Hi Daydreamer!

The timer macros are MichaelW's; you'll find them here: http://www.masm32.com/board/index.php?topic=770.0
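For reference, here is a minimal sketch of how those macros are typically invoked (assumptions: the macro file from that topic is saved as timers.asm, and the counter_begin/counter_end macro names are unchanged):

include \masm32\include\masm32rt.inc
include timers.asm                        ; MichaelW's macro file from the topic above (filename assumed)

.code
start:
    counter_begin 1000, HIGH_PRIORITY_CLASS   ; run the timed block 1000 times at high priority
        ; code under test goes here, e.g. a call to the normalize proc
    counter_end                                ; average cycle count is returned in EAX
    print ustr$(eax), " cycles", 13, 10
    exit
end start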
I'm keen on seeing your version. Please share your results with us!

Greets, Gábor

Rockoon

Ideally this function requires only the relevant operations:

3 scalar multiplications, followed by 2 scalar additions, followed by a reciprocal square root, followed by 3 scalar multiplications (i.e. n = v * rsqrt(x*x + y*y + z*z)).

In the current SSE version of the function there are plenty of CPU cycles where NONE of the relevant operations are happening; in particular, the swizzling seems to be "extra" work.

Note that the SSE-Parallel rsqrt performs 2 extra rsqrt operations that aren't really needed (the 3 results are equal).
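
The SSE-Parallel routine from the attachment isn't quoted in this thread, but a typical packed-SSE normalize looks roughly like the sketch below (a hypothetical reconstruction, not gabor's code, assuming eax and edx point to 16-byte vectors with the unused fourth component set to 0). The shufps swizzles and the 4-wide rsqrtps are the "extra" work referred to above:

movups  xmm0, [eax]          ; x    y    z    0
movaps  xmm1, xmm0
mulps   xmm1, xmm1           ; x*x  y*y  z*z  0
movaps  xmm2, xmm1
shufps  xmm2, xmm2, 4Eh      ; swizzle: rotate halves -> z*z  0  x*x  y*y
addps   xmm1, xmm2
movaps  xmm2, xmm1
shufps  xmm2, xmm2, 0B1h     ; swizzle: swap pairs within each half
addps   xmm1, xmm2           ; every lane now holds x*x + y*y + z*z
rsqrtps xmm1, xmm1           ; four parallel rsqrt results, the extra lanes are redundant
mulps   xmm0, xmm1           ; scale all components by 1/length
movups  [edx], xmm0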

In a CPU that can always handle precisely 2 independent operations at a time, the "ideal" scalar-style processing would happen something like this:

0: load, load
1: load, copy
2: copy, copy
3: mul, mul
4: mul, add
5: add, stall
6: rsqrt, stall
7: mul, mul
8: mul, store
9: store, store

And in a CPU that can always handle precisely 3 independent operations at a time, the "ideal" SSE-Scalar processing would happen something like this:

0: load, load, load
1: copy, copy, copy
2: mul, mul, mul
3: add, stall, stall
4: add, stall, stall
5: rsqrt, stall, stall
6: mul, mul, mul
7: store, store, store

I realize that these are simplified models of a CPU, but they illustrate the point:

There are only 3 cycles where an execution unit is stalled out in the 3 execution unit model, and only 2 cycles with stalls in the 2-unit model.

The code in the current SSE-Parallel version has 2 extra swizzle cycles that cannot be paired up, 1 SSE-Parallel rsqrt cycle which duplicates effort, and all of the SSE-Parallel operations are 4-component operations when only 3 are actually needed. So even if there are absolutely no stalls elsewhere, it can at best only equal the ideal SSE-Scalar methodology within the model.

I propose that this:

; eax -> source vector3, edx -> destination vector3
movss xmm0, [eax]        ; x
movss xmm1, [eax + 4]    ; y
movss xmm2, [eax + 8]    ; z
movss xmm3, xmm0         ; keep copies of x, y, z for the final scaling
movss xmm4, xmm1
movss xmm5, xmm2
mulss xmm0, xmm0         ; x*x
mulss xmm1, xmm1         ; y*y
mulss xmm2, xmm2         ; z*z
addss xmm0, xmm1
addss xmm0, xmm2         ; x*x + y*y + z*z
rsqrtss xmm1, xmm0       ; approximate 1/sqrt(length squared)
mulss xmm3, xmm1         ; scale each component
mulss xmm4, xmm1
mulss xmm5, xmm1
movss [edx], xmm3        ; store the normalized vector
movss [edx + 4], xmm4
movss [edx + 8], xmm5

...is likely to be no worse than the SSE-Parallel version on AMD64 (which is very similar to the 3-unit model), and it might in fact be substantially superior. Additionally, some ordering tweaks could be tested.
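
As one example of such an ordering tweak (a hypothetical, untested variant of the listing above), the squaring of each component could be started as soon as it is loaded, so that the rsqrtss dependency chain begins earlier:

movss xmm0, [eax]        ; x
movss xmm3, xmm0
mulss xmm0, xmm0         ; x*x starts before y and z are even loaded
movss xmm1, [eax + 4]    ; y
movss xmm4, xmm1
mulss xmm1, xmm1         ; y*y
movss xmm2, [eax + 8]    ; z
movss xmm5, xmm2
mulss xmm2, xmm2         ; z*z
addss xmm0, xmm1
addss xmm0, xmm2         ; x*x + y*y + z*z
rsqrtss xmm1, xmm0
mulss xmm3, xmm1
mulss xmm4, xmm1
mulss xmm5, xmm1
movss [edx], xmm3
movss [edx + 4], xmm4
movss [edx + 8], xmm5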


When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.