Dot Products

Started by johnsa, February 17, 2011, 03:24:34 PM

johnsa

Here are three variations of a dot product, with timings on my Core 2 Duo for 100,000,000 iterations:



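; 475 ms: SSE3, horizontal adds (haddps)
movaps xmm0,[esi]
movaps xmm1,[edi]
mulps xmm0,xmm1             ; xmm0 = (p0, p1, p2, p3), the four products
haddps xmm0,xmm0            ; xmm0 = (p0+p1, p2+p3, p0+p1, p2+p3)
haddps xmm0,xmm0            ; every lane = p0+p1+p2+p3
movss dp,xmm0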

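; 395 ms: SSE2, shuffle + add instead of haddps
movaps xmm0,[esi]
movaps xmm1,[edi]
mulps xmm0,xmm1             ; xmm0 = (p0, p1, p2, p3), the four products
pshufd xmm1,xmm0,00011011b  ; xmm1 = products in reversed lane order
addps xmm1,xmm0             ; lane 0 = p0+p3, lane 1 = p1+p2
pshufd xmm0,xmm1,00000001b  ; bring lane 1 down to lane 0
addps xmm0,xmm1             ; lane 0 = p0+p1+p2+p3
movss dp,xmm0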

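; 374 ms: FPU (sums three components only; the SSE versions sum all four lanes)
fld dword ptr [esi]
fmul dword ptr [edi]        ; st(0) = p0
fld dword ptr [esi+4]
fmul dword ptr [edi+4]      ; st(0) = p1, st(1) = p0
fld dword ptr [esi+8]
fmul dword ptr [edi+8]      ; st(0) = p2, st(1) = p1, st(2) = p0
faddp st(2),st              ; p0 += p2, pop: st(0) = p1, st(1) = p0+p2
faddp st(1),st              ; add p1, pop: st(0) = p0+p1+p2
fstp dword ptr dp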



I wanted to compare these against the SSE4.1 DPPS instruction, but I don't have access to an SSE4.1-capable machine at the moment.
Does anyone have other variations that might be an improvement, or could someone test these plus DPPS and report the results?
To be honest, I think a dot product (unless DPPS turns out faster) is best done on the FPU, as it's not a great candidate for vectorization. Plus, having it on the FPU means you could potentially interleave that code with other SSE code.

johnsa

OK, found a Core i7 here. Results:

187ms - SSE 4.1 DPPS
203ms - FPU
187ms - SSE2 version without haddps
203ms - SSE3 version with haddps
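
The DPPS code behind the 187ms figure isn't shown in this post. For anyone wanting to try it, a DPPS variant could look roughly like the sketch below. This is an assumption about the general shape, not the code that was timed; it also assumes an assembler that accepts SSE4.1 mnemonics and the same 16-byte aligned inputs at esi/edi as the versions above.

; SSE4.1 sketch (not the timed code)
movaps xmm0,[esi]
movaps xmm1,[edi]
dpps xmm0,xmm1,0F1h         ; high nibble 1111b: multiply and sum all 4 lanes
                            ; low nibble 0001b: write the sum to lane 0 only
movss dp,xmm0

For a three-component vector (to match what the FPU version sums), the immediate would be 071h instead, which leaves the fourth lane out of the sum.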