Just for fun...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
276 cycles for mul res=10000
146 cycles for imul res=10000
615 cycles for fimul res=10000
314 cycles for fmul res=10000
478 cycles for pmuludq+movd res=10000
298 cycles for pmuludq+mem4 res=10000
298 cycles for movss+mulps res=10000
276 cycles for mul res=10000
146 cycles for imul res=10000
615 cycles for fimul res=10000
314 cycles for fmul res=10000
474 cycles for pmuludq+movd res=10000
298 cycles for pmuludq+mem4 res=10000
298 cycles for movss+mulps res=10000
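counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100 ; unrolled 100 times at assembly time
mov eax, 100
mov ecx, 100
mul ecx ; edx:eax = eax * ecx
ENDM
xchg eax, esi ; keep the product in esi, counter_end returns the cycle count in eax
counter_end
Print Str$("%i cycles for mul", eax), Str$("\tres=%i\n", esi)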
... etc
mov eax, 100
imul eax, eax, 100
...
fild v1
fimul v2
fistp v3
...
fld v1r
fmul v2r
fstp v3r
...
mov eax, 100
movd xmm0, eax
mov ecx, 100
movd xmm1, ecx
pmuludq xmm1, xmm0
...
movd xmm1, v2
pmuludq xmm1, v1 ; memory operand v1 must be 16-byte aligned!!!
...
movss xmm1, v2r
mulps xmm1, v1r ; memory operand v1r must be 16-byte aligned!!!
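A note on the "must be 16-byte aligned" comments: with a memory operand, pmuludq and mulps read a full 16 bytes and fault if the address is not a multiple of 16. Here is a minimal C sketch of that check, not taken from the test program. The names v1_aligned and is_aligned16 are made up for illustration, and alignas(16) merely stands in for an align 16 directive placed in front of the data.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical operand for the memory form of pmuludq / mulps.
   alignas(16) plays the role of "align 16" in front of the data,
   and the array is 16 bytes long because the instruction reads all of it. */
alignas(16) uint32_t v1_aligned[4] = { 100, 0, 0, 0 };

/* Returns 1 if p is a multiple of 16, i.e. usable as a 128-bit
   SSE memory operand without an alignment fault, else 0. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Calling is_aligned16(v1_aligned) gives 1, while the address 4 bytes further on gives 0, which is the case the comments in the snippets warn about.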
Prescott w/ HTT
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
214 cycles for mul res=10000
99 cycles for imul res=10000
569 cycles for fimul res=10000
276 cycles for fmul res=10000
603 cycles for pmuludq+movd res=10000
220 cycles for pmuludq+mem4 res=10000
220 cycles for movss+mulps res=10000
212 cycles for mul res=10000
102 cycles for imul res=10000
567 cycles for fimul res=10000
275 cycles for fmul res=10000
599 cycles for pmuludq+movd res=10000
219 cycles for pmuludq+mem4 res=10000
223 cycles for movss+mulps res=10000
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
179 cycles for mul res=10000
94 cycles for imul res=10000
308 cycles for fimul res=10000
297 cycles for fmul res=10000
175 cycles for pmuludq+movd res=10000
195 cycles for pmuludq+mem4 res=10000
195 cycles for movss+mulps res=10000
179 cycles for mul res=10000
94 cycles for imul res=10000
308 cycles for fimul res=10000
297 cycles for fmul res=10000
175 cycles for pmuludq+movd res=10000
195 cycles for pmuludq+mem4 res=10000
195 cycles for movss+mulps res=10000
--- ok ---
So imul rocks, and fimul sucks. Interesting that this cumbersome combination...
mov eax, 100
movd xmm0, eax
mov ecx, 100
movd xmm1, ecx
pmuludq xmm1, xmm0
... is so fast on the Quad.
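For readers who have not met pmuludq: in each 64-bit half of the register it multiplies the low 32 bits of both operands as unsigned integers and stores the full 64-bit product. The following is a small C model of that behaviour, assuming nothing beyond the instruction's documented semantics. The Xmm type and the function names pmuludq_model and movd_model are hypothetical and only serve this sketch; it says nothing about timings.

```c
#include <stdint.h>

/* Models one 128-bit XMM register as two 64-bit lanes. */
typedef struct {
    uint64_t lane[2];
} Xmm;

/* Model of "movd xmm, r32": the value lands in the low dword,
   all remaining bits of the register are zeroed. */
Xmm movd_model(uint32_t value)
{
    Xmm r = { { value, 0 } };
    return r;
}

/* Model of "pmuludq dst, src": per 64-bit lane, multiply the low
   32 bits of both operands (unsigned) and keep the 64-bit product.
   The upper 32 bits of each input lane are ignored. */
Xmm pmuludq_model(Xmm dst, Xmm src)
{
    Xmm result;
    for (int i = 0; i < 2; i++) {
        uint64_t a = dst.lane[i] & 0xFFFFFFFFu;
        uint64_t b = src.lane[i] & 0xFFFFFFFFu;
        result.lane[i] = a * b;
    }
    return result;
}
```

With both inputs produced by movd_model(100), the low lane of the result is 10000, matching the res=10000 lines in the timings.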
This ordering might save a cycle or two :P
mov eax, 100
mov ecx, 100
movd xmm0, eax
movd xmm1, ecx
pmuludq xmm1, xmm0
Something odd...res is result?
AMD Phenom(tm) II X6 1100T Processor (SSE3)
187 cycles for mul res=10000
92 cycles for imul res=10000
280 cycles for fimul res=10000
233 cycles for fmul res=10000
565 cycles for pmuludq+movd res=0
94 cycles for pmuludq+mem4 res=0
94 cycles for movss+mulps res=0
186 cycles for mul res=10000
92 cycles for imul res=10000
279 cycles for fimul res=10000
232 cycles for fmul res=10000
564 cycles for pmuludq+movd res=0
94 cycles for pmuludq+mem4 res=0
93 cycles for movss+mulps res=0
Quote from: sinsi on December 27, 2011, 09:09:34 AM
Something odd...res is result?
AMD Phenom(tm) II X6 1100T Processor (SSE3)
94 cycles for movss+mulps res=0
Odd indeed. Could you please test the attachment? The expected output is 100.00, three times.
Thanks, Jochen
include \masm32\MasmBasic\MasmBasic.inc
.data
v1 dd 100
v2 dq 100
v3 REAL8 100.0
Init
movd xmm1, v1
PrintLine Str$("xmm1=%f", xmm1)
movlps xmm2, v2
PrintLine Str$("xmm2=%f", xmm2)
movlps xmm3, v3
PrintLine Str$("xmm3=%f", f:xmm3)
Inkey "OK?"
Exit
end start
xmm1=100.0000
xmm2=100.0000
xmm3=100.0000
OK?
Quote from: sinsi on December 27, 2011, 11:53:46 AM
OK?
OK! For a moment, I feared that Str$(xmm1) was not working properly, but apparently it's a problem with the instructions themselves on the AMD. Olly might know - I can't resolve the mystery because I have no AMD around...
AMD C-50 Processor (SSE4)
261 cycles for mul res=10000
133 cycles for imul res=10000
498 cycles for fimul res=10000
297 cycles for fmul res=10000
707 cycles for pmuludq+movd res=0
309 cycles for pmuludq+mem4 res=0
301 cycles for movss+mulps res=0
254 cycles for mul res=10000
130 cycles for imul res=10000
496 cycles for fimul res=10000
295 cycles for fmul res=10000
707 cycles for pmuludq+movd res=0
303 cycles for pmuludq+mem4 res=0
305 cycles for movss+mulps res=0
Mysterious. Sinsi, Clive, could you launch a test with Olly? I've attached a version with an int 3 before the pmuludq sequence starts:
CPU Disasm
Address Hex dump Command Comments
00403BE0 CC int3
00403BE1 660F6E0D 04C04000 movd xmm1, [40C004]
00403BE9 660FF40D 00C04000 pmuludq xmm1, [40C000]
00403BF1 660F6E0D 04C04000 movd xmm1, [40C004]
00403BF9 660FF40D 00C04000 pmuludq xmm1, [40C000]
00403C01 660F6E0D 04C04000 movd xmm1, [40C004]
00403C09 660FF40D 00C04000 pmuludq xmm1, [40C000]
There is a second int 3 (for Olly newcomers: to reach it, replace the first one with a nop, then hit F9):
invoke Sleep, SleepMs
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
int 3
REPEAT 100
movss xmm1, v2r
mulps xmm1, v1r ; must be 16-byte aligned!!!
ENDM
counter_end
movss v3r, xmm1
Print Str$("%i cycles for movss+mulps", eax), Str$("\tres=%i\n", v3r)
Maybe the XMM registers are overwritten by an API call?
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
179 cycles for mul res=10000
94 cycles for imul res=10000
309 cycles for fimul res=10000
299 cycles for fmul res=10000
176 cycles for pmuludq+movd res=0
195 cycles for pmuludq+mem4 res=0
196 cycles for movss+mulps res=0
179 cycles for mul res=10000
94 cycles for imul res=10000
309 cycles for fimul res=10000
299 cycles for fmul res=10000
176 cycles for pmuludq+movd res=0
196 cycles for pmuludq+mem4 res=0
196 cycles for movss+mulps res=0
qWord is right. After SetPriorityClass is called (Win 7 x64), XMM0 (100) and XMM1 (10000) are both set to 0.
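To make the failure mode concrete, here is a toy C simulation of it. It is only a model: ToyRegs, toy_api_call and the two test functions are invented names, the "API call" is a stand-in that zeroes the registers as observed above, and the assumption is that the call sits between the multiply and the point where the result is read. Under the x64 convention XMM0 to XMM5 are volatile, so a callee may leave anything in them.

```c
#include <stdint.h>
#include <string.h>

/* Toy register file: only the low dword of XMM0..XMM5 is modelled. */
typedef struct {
    uint32_t xmm_low[6];
} ToyRegs;

/* Hypothetical stand-in for an API call such as SetPriorityClass.
   A callee may clobber volatile registers; zeroing them mimics
   what was reported in the thread. */
void toy_api_call(ToyRegs *regs)
{
    memset(regs->xmm_low, 0, sizeof regs->xmm_low);
}

/* Problematic order: the product is read from xmm1 after the call. */
uint32_t read_result_after_call(void)
{
    ToyRegs regs = { { 100, 100, 0, 0, 0, 0 } };
    regs.xmm_low[1] *= regs.xmm_low[0];   /* pmuludq xmm1, xmm0 */
    toy_api_call(&regs);                  /* call clobbers xmm0..xmm5 */
    return regs.xmm_low[1];               /* too late, value is gone */
}

/* Safe order: the product is copied out before the call. */
uint32_t save_result_before_call(void)
{
    ToyRegs regs = { { 100, 100, 0, 0, 0, 0 } };
    regs.xmm_low[1] *= regs.xmm_low[0];   /* pmuludq xmm1, xmm0 */
    uint32_t saved = regs.xmm_low[1];     /* e.g. movd esi, xmm1 */
    toy_api_call(&regs);
    return saved;
}
```

The first function returns 0 and the second 10000, which mirrors the res=0 and res=10000 lines in the results. The integer tests using eax/esi are not affected in the same way because the result is already moved to esi before counter_end.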
Quote from: qWord on December 27, 2011, 09:19:37 PM
Maybe the XMM registers are overwritten by an API call?
Thanks, qWord and ERNST. This version should work for two of the "bad boyz" ;-)
WinDbg shows XMM1 as:
0.000000e+000: 0.000000e+000: 0.000000e+000: 1.401298e-043
Win7-x64
Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (SSE4)
222 cycles for shl res=10000
197 cycles for mul res=10000
87 cycles for imul res=10000
260 cycles for fimul res=10000
219 cycles for fmul res=10000
164 cycles for pmuludq+movd res=0
190 cycles for pmuludq+mem4 res=10000
239 cycles for movss+mulps res=10000
209 cycles for shl res=10000
169 cycles for mul res=10000
82 cycles for imul res=10000
260 cycles for fimul res=10000
171 cycles for fmul res=10000
160 cycles for pmuludq+movd res=0
202 cycles for pmuludq+mem4 res=10000
207 cycles for movss+mulps res=10000
--- ok ---
Yep, that's it. In the meantime I found this old thread (http://www.masm32.com/board/index.php?topic=13765.msg108247#msg108247) by googling for xmm abi - it's actually the top hit :bg
There is also a post by sinsi pointing to the x64 register usage page (http://msdn.microsoft.com/en-us/library/9z1stfyw%28v=VS.100%29.aspx).
Mystery solved, thanks to all :U
P.S.: Timings for P4:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
318 cycles for shl res=10000
210 cycles for mul res=10000
95 cycles for imul res=10000 <<<<<<<<<<<< !!
569 cycles for fimul res=10000
273 cycles for fmul res=10000
599 cycles for pmuludq+movd res=10000
274 cycles for pmuludq+mem4 res=10000
274 cycles for movss+mulps res=10000