Performance / Timing Weirdness ASM vs C#

Started by johnsa, June 15, 2008, 10:48:31 PM

johnsa

So I updated the C++ testpiece to use fsqrt and reciprocals.. it came down from 79ms to 70ms, still around 20ms slower than the asm versions at full FPU precision. I checked the dis-asm and C++ still puts a bit of overhead into the routine along with a few less than optimal fld/fstp combinations. But mostly it's quite close.

johnsa

Re-benchmarked this updated FPU setup against the SSE version (using the AOS model, one vector at a time). The SSE version comes in at 11ms as opposed to 25ms for the fastest FPU version.
Assuming one used SOA we could imagine that we'd see an 8x increase over the fastest FPU version.

jj2007

Quote from: johnsa on June 17, 2008, 04:57:22 PM

    timer_begin 10000000, HIGH_PRIORITY_CLASS
    invoke Vector3D_Normalize_FPU, ADDR v1, ADDR vr
    timer_end
    print ustr$(eax)
    print chr$(" Vector3D Normalize ms",13,10)

So.. on my machine both of these come in at around 530ms... while the C#.Net version still manages 150ms for the same number of iterations... so we're still about 4 times slower... and there are definitely no stack faults now.

Quote from: johnsa on June 18, 2008, 09:54:57 AM

C++ Test-App using straight 3 divs and fsqrt (no recip). 1,000,000 iterations.
156ms debug mode
94ms   release mode with pragma optimizations switched off
78ms   release mode all optimizations

ASM Test Piece (using reciprocal with fmuls) set to PC to REAL4 - 1,000,000 iterations.
MichaelW's Normalize 13ms
Vector3D Normalize   13ms

So now.. the ASM version is 5 times faster than the c++ version (mainly due to REAL4 PC and reciprocal).
Part of the factor 20 between "4*slower" and "5*faster" might be hidden in the counters:

timer_begin 10000000
C++ ...       1,000,000 iterations.

And the remaining factor 2 seems attributable to inverting the divs. Thanks a lot for clarifying this... I am working on a little FPU lib and got deeply worried when I saw your initial post  :wink

johnsa

Yeah I noticed the original posted piece had a 10 million counter, not 1 million :)

At least I'm happy to say that now the asm is considerably faster than its counterparts in C# and even C++... Even with PC set the same.

From 150ms to 25ms and then to 11ms with SSE or an effective 3ms if you maximise the throughput of SSE with 4 vectors in parallel.

The only thing I'm not too convinced about with the SSE version is that if you use homogeneous vectors with a W, in its standard AOS format (1 vector at a time) it modifies the W coordinate too.. which I don't think it should.. that should remain 0 for a direction and 1 for a coordinate in space... Thinking about the best way to get the AOS version to not touch the W.

NightWare

Quote from: johnsa on June 18, 2008, 12:51:27 PM
The only thing I'm not too convinced about with the SSE version is that if you use homogeneous vectors with a W, in its standard AOS format (1 vector at a time) it modifies the W coordinate too.. which I don't think it should.. that should remain 0 for a direction and 1 for a coordinate in space... Thinking about the best way to get the AOS version to not touch the W.
pffft... now it sucks...  :P
.data
Mask_210 DWORD 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0

.code
; NormalizeSse2 1
; (r = approx. 1/sqrt(X^2+Y^2+Z^2), as returned by rsqrtss)
movaps XMM0,OWORD PTR [esi] ;; ) XMM0 and XMM3 = W,Z,Y,X
movaps XMM3,XMM0 ;; )
mulps XMM0,XMM0 ;; XMM0 = _,Z^2,Y^2,X^2
pshufd XMM1,XMM0,001h ;; XMM1 = _,_,_,Y^2
movhlps XMM2,XMM0 ;; XMM2 = _,_,_,Z^2
addss XMM0,XMM1 ;; XMM0 = _,_,_,X^2+Y^2
addss XMM0,XMM2 ;; XMM0 = _,_,_,X^2+Y^2+Z^2
rsqrtss XMM0,XMM0 ;; XMM0 = _,_,_,r
pshufd XMM0,XMM0,000h ;; XMM0 = r,r,r,r
mulps XMM0,XMM3 ;; XMM0 = W*r,Z*r,Y*r,X*r
movhlps XMM3,XMM0 ;; XMM3 = W,Z,W*r,Z*r
shufps XMM0,XMM3,0C4h ;; XMM0 = W,Z*r,Y*r,X*r
movaps OWORD PTR [edi],XMM0 ;; store W,Z*r,Y*r,X*r (W unchanged)

; NormalizeSse2 2
; (r = approx. 1/sqrt(X^2+Y^2+Z^2), as returned by rsqrtps)
push DWORD PTR [esi+12] ;; save the original W on the stack
movdqa XMM0,OWORD PTR [esi] ;; XMM0 = W,Z,Y,X
movdqa XMM2,XMM0 ;; XMM2 = W,Z,Y,X
andps XMM0,OWORD PTR [Mask_210] ;; XMM0 = 0,Z,Y,X
mulps XMM0,XMM0 ;; XMM0 = 0,Z^2,Y^2,X^2
pshufd XMM1,XMM0,04Eh ;; XMM1 = Y^2,X^2,0,Z^2
addps XMM1,XMM0 ;; XMM1 = Y^2,X^2+Z^2,Y^2,X^2+Z^2
pshufd XMM0,XMM1,0B1h ;; XMM0 = X^2+Z^2,Y^2,X^2+Z^2,Y^2
addps XMM0,XMM1 ;; XMM0 = X^2+Y^2+Z^2 in all four lanes
rsqrtps XMM0,XMM0 ;; XMM0 = r,r,r,r
mulps XMM0,XMM2 ;; XMM0 = W*r,Z*r,Y*r,X*r
movdqa OWORD PTR [edi],XMM0 ;; store W*r,Z*r,Y*r,X*r
pop DWORD PTR [edi+12] ;; overwrite W*r with the saved W



johnsa

It's not so bad :) With adjustments made to my SSE normalize for AOS it still runs at 10ms for 1,000,000 iterations on my machine, approx. 20 cycles. Pentium M Centrino 1.8 GHz.

One other thing that should be catered for as an option is a single iteration of Newton-Raphson on the result returned by the reciprocal square root. This adds about 4ms / 5 cycles for me but will give you significantly better precision if needed.
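
For reference, the usual Newton-Raphson step for a reciprocal square root estimate y0 of 1/sqrt(x) is y1 = y0 * (1.5 - 0.5 * x * y0 * y0). The C sketch below shows just that step; the function name is made up and this is not johnsa's actual code. In the SSE routine the same arithmetic would be applied to the rsqrtss/rsqrtps result before the final multiply.

```c
#include <math.h>

/* One Newton-Raphson refinement of y0, an estimate of 1/sqrt(x).
   Derived from Newton's method on f(y) = 1/(y*y) - x. */
static float rsqrt_refine(float x, float y0)
{
    return y0 * (1.5f - 0.5f * x * y0 * y0);
}
```

Each step roughly squares the relative error, so an estimate good to about 12 bits comes out close to full single precision after one iteration. The cost is a few extra multiplies and one subtract.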