The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: jj2007 on December 26, 2011, 08:55:27 PM

Title: Multiply timings - reg32, FPU, SSE2
Post by: jj2007 on December 26, 2011, 08:55:27 PM
Just for fun...
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
276 cycles for mul      res=10000
146 cycles for imul     res=10000
615 cycles for fimul    res=10000
314 cycles for fmul     res=10000
478 cycles for pmuludq+movd     res=10000
298 cycles for pmuludq+mem4     res=10000
298 cycles for movss+mulps      res=10000

276 cycles for mul      res=10000
146 cycles for imul     res=10000
615 cycles for fimul    res=10000
314 cycles for fmul     res=10000
474 cycles for pmuludq+movd     res=10000
298 cycles for pmuludq+mem4     res=10000
298 cycles for movss+mulps      res=10000


counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
mov eax, 100
mov ecx, 100
mul ecx
ENDM
xchg eax, esi
counter_end
Print Str$("%i cycles for mul", eax), Str$("\tres=%i\n", esi)

... etc
mov eax, 100
imul eax, eax, 100
...
fild v1
fimul v2
fistp v3
...
fld v1r
fmul v2r
fstp v3r
...
mov eax, 100
movd xmm0, eax
mov ecx, 100
movd xmm1, ecx
pmuludq xmm1, xmm0
...
movd xmm1, v2
pmuludq xmm1, v1 ; must be 16-byte aligned!!!
...
movss xmm1, v2r
mulps xmm1, v1r ; must be 16-byte aligned!!!
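
A note on the "must be 16-byte aligned" comments: with a 128-bit memory operand, pmuludq and mulps read a full 16 bytes, and the address has to be a multiple of 16. The arithmetic is sketched below in Python, purely as an illustration. The helper names are made up here and the two addresses are just sample values.

```python
ALIGN = 16  # bytes; size of one XMM memory operand

def is_aligned(addr, align=ALIGN):
    """True if addr is a multiple of align (hypothetical helper)."""
    return addr % align == 0

def align_up(addr, align=ALIGN):
    """Round addr up to the next multiple of align (align must be a power of two)."""
    return (addr + align - 1) & ~(align - 1)

# Sample addresses only, chosen for illustration
for addr in (0x40C000, 0x40C004):
    print(hex(addr), is_aligned(addr), hex(align_up(addr)))
# 0x40c000 True 0x40c000
# 0x40c004 False 0x40c010
```

A dword load such as movd has no such requirement, which is why only the second operand of each pair carries the warning.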
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: dedndave on December 26, 2011, 10:56:33 PM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
214 cycles for mul      res=10000
99 cycles for imul      res=10000
569 cycles for fimul    res=10000
276 cycles for fmul     res=10000
603 cycles for pmuludq+movd     res=10000
220 cycles for pmuludq+mem4     res=10000
220 cycles for movss+mulps      res=10000

212 cycles for mul      res=10000
102 cycles for imul     res=10000
567 cycles for fimul    res=10000
275 cycles for fmul     res=10000
599 cycles for pmuludq+movd     res=10000
219 cycles for pmuludq+mem4     res=10000
223 cycles for movss+mulps      res=10000
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: hutch-- on December 27, 2011, 07:06:19 AM


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
179 cycles for mul      res=10000
94 cycles for imul      res=10000
308 cycles for fimul    res=10000
297 cycles for fmul     res=10000
175 cycles for pmuludq+movd     res=10000
195 cycles for pmuludq+mem4     res=10000
195 cycles for movss+mulps      res=10000

179 cycles for mul      res=10000
94 cycles for imul      res=10000
308 cycles for fimul    res=10000
297 cycles for fmul     res=10000
175 cycles for pmuludq+movd     res=10000
195 cycles for pmuludq+mem4     res=10000
195 cycles for movss+mulps      res=10000


--- ok ---
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: jj2007 on December 27, 2011, 07:30:02 AM
So imul rocks, and fimul sucks. Interesting that this cumbersome combination...

mov eax, 100
movd xmm0, eax
mov ecx, 100
movd xmm1, ecx
pmuludq xmm1, xmm0


... is so fast on the Quad.
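
For anyone who has not used pmuludq: it multiplies the low unsigned dword of each 64-bit half of the two registers and writes two 64-bit products. Here is a rough Python model of that behaviour, as a sketch only. The name pmuludq_model is hypothetical, and the registers are represented as plain 128-bit integers.

```python
MASK32 = 0xFFFFFFFF

def pmuludq_model(dst, src):
    """Model pmuludq on 128-bit values held as Python ints (hypothetical helper).

    Only bits 0..31 and 64..95 of each operand take part; the two
    64-bit products land in the low and high halves of the result.
    """
    lo = (dst & MASK32) * (src & MASK32)
    hi = ((dst >> 64) & MASK32) * ((src >> 64) & MASK32)
    return (hi << 64) | lo

xmm0 = 100                      # movd xmm0, eax  (upper bits are zeroed)
xmm1 = 100                      # movd xmm1, ecx
xmm1 = pmuludq_model(xmm1, xmm0)
print(xmm1 & MASK32)            # low dword of the product: 10000
```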
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: dedndave on December 27, 2011, 08:35:25 AM
might save a cycle or two   :P
mov eax, 100
mov ecx, 100
movd xmm0, eax
movd xmm1, ecx
pmuludq xmm1, xmm0
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: sinsi on December 27, 2011, 09:09:34 AM
Something odd...res is result?

AMD Phenom(tm) II X6 1100T Processor (SSE3)
187 cycles for mul      res=10000
92 cycles for imul      res=10000
280 cycles for fimul    res=10000
233 cycles for fmul     res=10000
565 cycles for pmuludq+movd     res=0
94 cycles for pmuludq+mem4      res=0
94 cycles for movss+mulps       res=0

186 cycles for mul      res=10000
92 cycles for imul      res=10000
279 cycles for fimul    res=10000
232 cycles for fmul     res=10000
564 cycles for pmuludq+movd     res=0
94 cycles for pmuludq+mem4      res=0
93 cycles for movss+mulps       res=0
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: jj2007 on December 27, 2011, 11:36:50 AM
Quote from: sinsi on December 27, 2011, 09:09:34 AM
Something odd...res is result?

AMD Phenom(tm) II X6 1100T Processor (SSE3)
94 cycles for movss+mulps       res=0


Odd indeed. Could you please test the attachment? The expected output is the value 100.00, three times.
Thanks, Jochen

include \masm32\MasmBasic\MasmBasic.inc
.data
v1   dd 100
v2   dq 100
v3   REAL8 100.0
   Init
   movd xmm1, v1
   PrintLine Str$("xmm1=%f", xmm1)
   movlps xmm2, v2
   PrintLine Str$("xmm2=%f", xmm2)
   movlps xmm3, v3
   PrintLine Str$("xmm3=%f", f:xmm3)
   Inkey "OK?"
   Exit
end start
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: sinsi on December 27, 2011, 11:53:46 AM

xmm1=100.0000
xmm2=100.0000
xmm3=100.0000
OK?
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: jj2007 on December 27, 2011, 05:13:10 PM
Quote from: sinsi on December 27, 2011, 11:53:46 AM
OK?

OK! For a moment, I feared that Str$(xmm1) was not working properly, but apparently it's a problem with the instructions themselves on the AMD. Olly might know - I can't resolve the mystery because I have no AMD around...
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: clive on December 27, 2011, 08:26:22 PM
AMD C-50 Processor (SSE4)
261 cycles for mul      res=10000
133 cycles for imul     res=10000
498 cycles for fimul    res=10000
297 cycles for fmul     res=10000
707 cycles for pmuludq+movd     res=0
309 cycles for pmuludq+mem4     res=0
301 cycles for movss+mulps      res=0

254 cycles for mul      res=10000
130 cycles for imul     res=10000
496 cycles for fimul    res=10000
295 cycles for fmul     res=10000
707 cycles for pmuludq+movd     res=0
303 cycles for pmuludq+mem4     res=0
305 cycles for movss+mulps      res=0
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: jj2007 on December 27, 2011, 09:02:37 PM
Mysterious. Sinsi, Clive, could you launch a test with Olly? I attach a version with int 3 before the pmuludqs start:

CPU Disasm
Address         Hex dump                Command                             Comments
00403BE0          CC                    int3
00403BE1          660F6E0D 04C04000     movd xmm1, [40C004]
00403BE9          660FF40D 00C04000     pmuludq xmm1, [40C000]
00403BF1          660F6E0D 04C04000     movd xmm1, [40C004]
00403BF9          660FF40D 00C04000     pmuludq xmm1, [40C000]
00403C01          660F6E0D 04C04000     movd xmm1, [40C004]
00403C09          660FF40D 00C04000     pmuludq xmm1, [40C000]


There is a second int 3 (for Olly newcomers: you can reach it by replacing the first one with a nop and then hitting F9):
invoke Sleep, SleepMs
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
int 3
REPEAT 100
movss xmm1, v2r
mulps xmm1, v1r ; must be 16-byte aligned!!!
ENDM
counter_end
movss v3r, xmm1
Print Str$("%i cycles for movss+mulps", eax), Str$("\tres=%i\n", v3r)
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: qWord on December 27, 2011, 09:19:37 PM
Maybe the XMM-registers are overwritten by a API call?
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: ERNST on December 27, 2011, 09:35:53 PM
Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
179 cycles for mul      res=10000
94 cycles for imul      res=10000
309 cycles for fimul    res=10000
299 cycles for fmul     res=10000
176 cycles for pmuludq+movd     res=0
195 cycles for pmuludq+mem4     res=0
196 cycles for movss+mulps      res=0

179 cycles for mul      res=10000
94 cycles for imul      res=10000
309 cycles for fimul    res=10000
299 cycles for fmul     res=10000
176 cycles for pmuludq+movd     res=0
196 cycles for pmuludq+mem4     res=0
196 cycles for movss+mulps      res=0
qWord is right. After SetPriorityClass has been called (Win 7 x64), XMM0 (which held 100) and XMM1 (which held 10000) are both 0.
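
The effect on the test can be shown with a toy model in Python. It is only a sketch: it assumes the call to SetPriorityClass happens inside counter_end and that it leaves xmm0 and xmm1 zeroed, as observed above. All function names are hypothetical.

```python
def counter_end_model(regs):
    # Hypothetical stand-in for counter_end: assume the API call inside it
    # leaves xmm0 and xmm1 zeroed, as observed on Win 7 x64.
    regs["xmm0"] = 0
    regs["xmm1"] = 0

def result_read_after(regs):
    """Read the result register only after the timing macro has run."""
    counter_end_model(regs)
    return regs["xmm1"]          # too late, the register is already trashed

def result_saved_before(regs):
    """Store the result to memory first, then end the timing."""
    saved = regs["xmm1"]         # plays the role of a movd/movss to a variable
    counter_end_model(regs)
    return saved

print(result_read_after({"xmm0": 100, "xmm1": 10000}))    # 0
print(result_saved_before({"xmm0": 100, "xmm1": 10000}))  # 10000
```

So the order matters: if the result is copied out of the register before the macro runs, the printed value survives whatever the API does to the XMM registers.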
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: jj2007 on December 27, 2011, 09:38:48 PM
Quote from: qWord on December 27, 2011, 09:19:37 PM
Maybe the XMM-registers are overwritten by a API call?

Thanks, qWord and ERNST. This version should work for two of the "bad boyz" ;-)
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: clive on December 27, 2011, 09:49:38 PM
Windbg suggests XMM1 is thus
0.000000e+000: 0.000000e+000: 0.000000e+000: 1.401298e-043
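
One possible reading of that dump, assuming the last field is the low dword shown as a single-precision float: 1.401298e-43 is what the integer 100 looks like when its bits are interpreted as a REAL4. A quick check in Python (sketch only, the helper name is made up):

```python
import struct

def dword_as_float(n):
    """Reinterpret the bits of a 32-bit unsigned integer as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", n))[0]

print("%e" % dword_as_float(100))         # 1.401298e-43
print("%e" % dword_as_float(10000))       # 1.401298e-41
print(dword_as_float(0x42C80000))         # 100.0, the bit pattern of a real 100.0
```

If that reading is right, the register held the integer operand 100 at that point rather than the product 10000.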
Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: qWord on December 27, 2011, 09:51:36 PM
Win7-x64
Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz (SSE4)
222 cycles for shl      res=10000
197 cycles for mul      res=10000
87 cycles for imul      res=10000
260 cycles for fimul    res=10000
219 cycles for fmul     res=10000
164 cycles for pmuludq+movd     res=0
190 cycles for pmuludq+mem4     res=10000
239 cycles for movss+mulps      res=10000

209 cycles for shl      res=10000
169 cycles for mul      res=10000
82 cycles for imul      res=10000
260 cycles for fimul    res=10000
171 cycles for fmul     res=10000
160 cycles for pmuludq+movd     res=0
202 cycles for pmuludq+mem4     res=10000
207 cycles for movss+mulps      res=10000


--- ok ---

Title: Re: Multiply timings - reg32, FPU, SSE2
Post by: jj2007 on January 02, 2012, 07:26:55 AM
Yep, that's it. In the meantime I found this old thread (http://www.masm32.com/board/index.php?topic=13765.msg108247#msg108247) by googling for "xmm abi"; it's actually the top hit.
There is also a post by sinsi pointing to the x64 register usage page (http://msdn.microsoft.com/en-us/library/9z1stfyw%28v=VS.100%29.aspx).

Mystery solved, thanks to all.

P.S.: Timings for P4:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
318 cycles for shl      res=10000
210 cycles for mul      res=10000
95 cycles for imul      res=10000 <<<<<<<<<<<< !!
569 cycles for fimul    res=10000
273 cycles for fmul     res=10000
599 cycles for pmuludq+movd     res=10000
274 cycles for pmuludq+mem4     res=10000
274 cycles for movss+mulps      res=10000