Multiply timings - reg32, FPU, SSE2

Started by jj2007, December 26, 2011, 08:55:27 PM

jj2007

Just for fun...
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
276 cycles for mul      res=10000
146 cycles for imul     res=10000
615 cycles for fimul    res=10000
314 cycles for fmul     res=10000
478 cycles for pmuludq+movd     res=10000
298 cycles for pmuludq+mem4     res=10000
298 cycles for movss+mulps      res=10000

276 cycles for mul      res=10000
146 cycles for imul     res=10000
615 cycles for fimul    res=10000
314 cycles for fmul     res=10000
474 cycles for pmuludq+movd     res=10000
298 cycles for pmuludq+mem4     res=10000
298 cycles for movss+mulps      res=10000


counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
mov eax, 100
mov ecx, 100
mul ecx
ENDM
xchg eax, esi
counter_end
Print Str$("%i cycles for mul", eax), Str$("\tres=%i\n", esi)

... etc
mov eax, 100
imul eax, eax, 100
...
fild v1
fimul v2
fistp v3
...
fld v1r
fmul v2r
fstp v3r
...
mov eax, 100
movd xmm0, eax
mov ecx, 100
movd xmm1, ecx
pmuludq xmm1, xmm0
...
movd xmm1, v2
pmuludq xmm1, v1 ; must be 16-byte aligned!!!
...
movss xmm1, v2r
mulps xmm1, v1r ; must be 16-byte aligned!!!

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
214 cycles for mul      res=10000
99 cycles for imul      res=10000
569 cycles for fimul    res=10000
276 cycles for fmul     res=10000
603 cycles for pmuludq+movd     res=10000
220 cycles for pmuludq+mem4     res=10000
220 cycles for movss+mulps      res=10000

212 cycles for mul      res=10000
102 cycles for imul     res=10000
567 cycles for fimul    res=10000
275 cycles for fmul     res=10000
599 cycles for pmuludq+movd     res=10000
219 cycles for pmuludq+mem4     res=10000
223 cycles for movss+mulps      res=10000

hutch--



Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
179 cycles for mul      res=10000
94 cycles for imul      res=10000
308 cycles for fimul    res=10000
297 cycles for fmul     res=10000
175 cycles for pmuludq+movd     res=10000
195 cycles for pmuludq+mem4     res=10000
195 cycles for movss+mulps      res=10000

179 cycles for mul      res=10000
94 cycles for imul      res=10000
308 cycles for fimul    res=10000
297 cycles for fmul     res=10000
175 cycles for pmuludq+movd     res=10000
195 cycles for pmuludq+mem4     res=10000
195 cycles for movss+mulps      res=10000


--- ok ---

jj2007

So imul rocks, and fimul sucks. Interesting that this cumbersome combination...

mov eax, 100
movd xmm0, eax
mov ecx, 100
movd xmm1, ecx
pmuludq xmm1, xmm0


... is so fast on the Quad.

dedndave

might save a cycle or two   :P
mov eax, 100
mov ecx, 100
movd xmm0, eax
movd xmm1, ecx
pmuludq xmm1, xmm0

sinsi

Something odd...res is result?

AMD Phenom(tm) II X6 1100T Processor (SSE3)
187 cycles for mul      res=10000
92 cycles for imul      res=10000
280 cycles for fimul    res=10000
233 cycles for fmul     res=10000
565 cycles for pmuludq+movd     res=0
94 cycles for pmuludq+mem4      res=0
94 cycles for movss+mulps       res=0

186 cycles for mul      res=10000
92 cycles for imul      res=10000
279 cycles for fimul    res=10000
232 cycles for fmul     res=10000
564 cycles for pmuludq+movd     res=0
94 cycles for pmuludq+mem4      res=0
93 cycles for movss+mulps       res=0
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on December 27, 2011, 09:09:34 AM
Something odd...res is result?

AMD Phenom(tm) II X6 1100T Processor (SSE3)
94 cycles for movss+mulps       res=0


Odd indeed. Could you please test the attachment? All three lines should print 100.00.
Thanks, Jochen

include \masm32\MasmBasic\MasmBasic.inc
.data
v1   dd 100
v2   dq 100
v3   REAL8 100.0
   Init
   movd xmm1, v1
   PrintLine Str$("xmm1=%f", xmm1)
   movlps xmm2, v2
   PrintLine Str$("xmm2=%f", xmm2)
   movlps xmm3, v3
   PrintLine Str$("xmm3=%f", f:xmm3)
   Inkey "OK?"
   Exit
end start

sinsi


xmm1=100.0000
xmm2=100.0000
xmm3=100.0000
OK?

jj2007

Quote from: sinsi on December 27, 2011, 11:53:46 AM
OK?

OK! For a moment, I feared that Str$(xmm1) was not working properly, but apparently it's a problem with the instructions themselves on the AMD. Olly might know - I can't resolve the mystery because I have no AMD around...

clive

AMD C-50 Processor (SSE4)
261 cycles for mul      res=10000
133 cycles for imul     res=10000
498 cycles for fimul    res=10000
297 cycles for fmul     res=10000
707 cycles for pmuludq+movd     res=0
309 cycles for pmuludq+mem4     res=0
301 cycles for movss+mulps      res=0

254 cycles for mul      res=10000
130 cycles for imul     res=10000
496 cycles for fimul    res=10000
295 cycles for fmul     res=10000
707 cycles for pmuludq+movd     res=0
303 cycles for pmuludq+mem4     res=0
305 cycles for movss+mulps      res=0
It could be a random act of randomness. Those happen a lot as well.

jj2007

Mysterious. Sinsi, Clive, could you launch a test with Olly? I attach a version with int 3 before the pmuludqs start:

CPU Disasm
Address         Hex dump                Command                             Comments
00403BE0          CC                    int3
00403BE1          660F6E0D 04C04000     movd xmm1, [40C004]
00403BE9          660FF40D 00C04000     pmuludq xmm1, [40C000]
00403BF1          660F6E0D 04C04000     movd xmm1, [40C004]
00403BF9          660FF40D 00C04000     pmuludq xmm1, [40C000]
00403C01          660F6E0D 04C04000     movd xmm1, [40C004]
00403C09          660FF40D 00C04000     pmuludq xmm1, [40C000]


There is a second int 3 (for Olly noobs: you can reach the second one by replacing the first one with a nop, then hitting F9):
invoke Sleep, SleepMs
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
int 3
REPEAT 100
movss xmm1, v2r
mulps xmm1, v1r ; must be 16-byte aligned!!!
ENDM
counter_end
movss v3r, xmm1
Print Str$("%i cycles for movss+mulps", eax), Str$("\tres=%i\n", v3r)

qWord

Maybe the XMM registers are overwritten by an API call?
FPU in a trice: SmplMath
It's that simple!

ERNST

Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
179 cycles for mul      res=10000
94 cycles for imul      res=10000
309 cycles for fimul    res=10000
299 cycles for fmul     res=10000
176 cycles for pmuludq+movd     res=0
195 cycles for pmuludq+mem4     res=0
196 cycles for movss+mulps      res=0

179 cycles for mul      res=10000
94 cycles for imul      res=10000
309 cycles for fimul    res=10000
299 cycles for fmul     res=10000
176 cycles for pmuludq+movd     res=0
196 cycles for pmuludq+mem4     res=0
196 cycles for movss+mulps      res=0
qWord is right. After SetPriorityClass is called (Win 7 x64), XMM0 (which held 100) and XMM1 (which held 10000) are both set to 0.
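
If an API call inside the timing macros really does zero the XMM registers, one possible workaround is to copy the result out before counter_end runs, the same way the mul test saves eax with xchg eax, esi. The following is only a sketch under that assumption: counter_begin, counter_end, LOOP_COUNT, v1r, v2r and v3r are used exactly as in the snippets earlier in this thread and their definitions are not shown here. That counter_end is the macro making the SetPriorityClass call is an assumption, not something confirmed above.

```masm
; Sketch only, not the fixed attachment: macros and variables are the
; thread's own, and the API call inside counter_end is an assumption.
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
movss xmm1, v2r
mulps xmm1, v1r        ; must be 16-byte aligned
ENDM
movss v3r, xmm1        ; save the result while xmm1 is still intact
counter_end            ; any API call in here may clobber xmm0/xmm1
Print Str$("%i cycles for movss+mulps", eax), Str$("\tres=%i\n", v3r)

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
mov eax, 100
movd xmm0, eax
mov ecx, 100
movd xmm1, ecx
pmuludq xmm1, xmm0
ENDM
movd esi, xmm1         ; low dword of the product goes to a general-purpose register
counter_end            ; esi is assumed to survive, as in the mul test
Print Str$("%i cycles for pmuludq+movd", eax), Str$("\tres=%i\n", esi)
```

The extra store sits inside the timed region, so it adds a small constant to each measurement. That seems an acceptable price for a res= value that can be trusted.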

jj2007

Quote from: qWord on December 27, 2011, 09:19:37 PM
Maybe the XMM registers are overwritten by an API call?

Thanks, qWord and ERNST. This version should work for two of the "bad boyz" ;-)

clive

Windbg suggests XMM1 is thus
0.000000e+000: 0.000000e+000: 0.000000e+000: 1.401298e-043
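
A possible reading of that dump, assuming Windbg prints the four lanes as single-precision floats: the non-zero lane is not a garbage value. It is the bit pattern 00000064h, that is the integer 100, interpreted as a denormal float. The smallest positive single-precision denormal is 2^-149, so:

```latex
\[
100 \times 2^{-149} \approx 100 \times 1.401298 \times 10^{-45} = 1.401298 \times 10^{-43}
\]
```

For comparison, the float 100.0 would be stored as 42C80000h. If this reading is right, the lane holds an integer 100 loaded by movd and not a floating-point value or the product 10000.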