I am testing algos that yield the max and min values in a given array of 128 REAL8 variables. One uses the FPU, the other SSE2.
Grateful for some timings and/or suggestions,
Jochen
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3747 cycles for fCmpFpu
1794 cycles for fCmpXmm
1784 cycles for fCmpXmmNf
What if you use MAXPD and MINPD?
Alex
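A rough C model of what the packed variant would do, in case it helps to picture it. The two-element arrays lo and hi stand in for the two halves of an XMM register that MINPD and MAXPD would update together. The function names, the even-count requirement and the choice to start from the first pair are assumptions made for this sketch; they are not taken from the test code. is_aligned16 reflects the fact that MINPD/MAXPD with a memory operand need a 16-byte aligned address (loading through MOVUPD into a register first would avoid that).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p could be used directly as a minpd/maxpd memory operand. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Two-lane min/max model: lo[0..1] and hi[0..1] play the role of the two
   halves of an XMM register. Assumes n is even, n >= 2, and no NaNs. */
void minmax_packed_model(const double *a, size_t n,
                         double *min_out, double *max_out)
{
    assert(n >= 2 && n % 2 == 0);
    double lo[2] = { a[0], a[1] };
    double hi[2] = { a[0], a[1] };

    for (size_t i = 2; i < n; i += 2) {
        for (int lane = 0; lane < 2; lane++) {   /* both lanes at once in SSE2 */
            double x = a[i + lane];
            lo[lane] = lo[lane] < x ? lo[lane] : x;
            hi[lane] = hi[lane] > x ? hi[lane] : x;
        }
    }

    /* Final reduction: combine the two lanes into one result each. */
    *min_out = lo[0] < lo[1] ? lo[0] : lo[1];
    *max_out = hi[0] > hi[1] ? hi[0] : hi[1];
}
```

With 128 values the outer loop runs 63 times, handling two values per pass, and the lanes are only combined once at the end.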
AMD Sempron(tm) Processor 3100+ (SSE3)
1520 cycles for fCmpFpu
1096 cycles for fCmpXmm
1089 cycles for fCmpXmmNf
1513 cycles for fCmpFpu
1099 cycles for fCmpXmm
1099 cycles for fCmpXmmNf
85 bytes for fCmpFpu
78 bytes for fCmpXmm
78 bytes for fCmpXmmNf
OMG.... What's with the Pentium timings? My AMD clocks at 1.8 GHz.... Now I know why my code runs as fast as yours :lol
is it safe to assume all the values are valid (i.e. no NANs) ?
well - had an idea - but don't think it's valid - lol
let's try another idea...
can't you just test the high order dword (as though they were integers - without using the FPU) ?
only if they are equal do you need to compare the remaining bits
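The integer-compare idea can be sketched in C roughly as below, assuming IEEE 754 binary64 values and no NaNs. One catch is that the raw bit patterns of negative doubles sort in reverse order as integers, so the sketch remaps the bits first. order_key and less_by_bits are hypothetical names used only for illustration. In 32-bit assembly the comparison would look at the high dword first and fall through to the low dword only when the high dwords are equal; the 64-bit integer compare here does both in one step.

```c
#include <stdint.h>
#include <string.h>

/* Map a double's bit pattern to an unsigned key whose integer order
   matches the floating-point order (NaNs excluded by assumption). */
uint64_t order_key(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);           /* reinterpret the 8 bytes */
    if (bits & 0x8000000000000000ULL)
        return ~bits;                         /* negative: flip all bits to reverse the order */
    return bits | 0x8000000000000000ULL;      /* non-negative: set the top bit so it sorts above negatives */
}

/* Nonzero if a < b, decided by integer comparison only. */
int less_by_bits(double a, double b)
{
    return order_key(a) < order_key(b);
}
```

Note that this key sorts -0.0 below +0.0, whereas an FPU or SSE2 compare treats the two as equal.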
Quote from: Antariy on October 25, 2010, 01:52:57 PM
What if you use MAXPD and MINPD?
Alex
Alex, you are a real friend :bg
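.Repeat
    minsd xmm2, REAL8 ptr [edx]    ; xmm2 = running minimum
    maxsd xmm3, REAL8 ptr [edx]    ; xmm3 = running maximum
    add edx, 8                      ; advance to the next REAL8
    dec ecx                         ; count down
.Until Sign?                        ; exit once ecx drops below zero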
Under 1000 cycles on the Pentium - thanxalot :U
Note that maxpd raises an exception if its memory operand is not 16-byte aligned; maxsd has no such requirement.
@Dave: > is it safe to assume all the values are valid (i.e. no NANs) ?
Yes, the array is composed of valid REAL8 numbers
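On the NaN question, a small C model of the minsd/maxsd selection rule shows why the assumption matters. "minsd dst, src" keeps dst only if dst < src and otherwise returns src, so a NaN read from memory replaces the running minimum, and the next valid value then replaces the NaN. The function names are hypothetical, and starting from the first array element is an assumption of this sketch rather than what the test code necessarily does.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Same selection rule as "minsd dst, src" / "maxsd dst, src":
   keep dst only if the comparison holds, otherwise take src. */
double minsd_model(double dst, double src) { return dst < src ? dst : src; }
double maxsd_model(double dst, double src) { return dst > src ? dst : src; }

/* Scalar min/max scan using the rule above. Requires n > 0. */
void minmax_scalar_model(const double *a, size_t n,
                         double *min_out, double *max_out)
{
    assert(n > 0);
    double lo = a[0], hi = a[0];
    for (size_t i = 1; i < n; i++) {
        lo = minsd_model(lo, a[i]);
        hi = maxsd_model(hi, a[i]);
    }
    *min_out = lo;
    *max_out = hi;
}

/* Optional pre-check: how many elements are NaN? */
size_t count_nans(const double *a, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (isnan(a[i]))
            count++;
    return count;
}
```

For the input { 1.0, NaN, 5.0 } this model reports a minimum of 5.0: the 1.0 seen before the NaN is lost. With an array of valid REAL8 numbers, as here, the problem does not arise.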
AMD Turion(tm) 64 X2 Mobile Technology TL-52 (SSE3)
1517 cycles for fCmpFpu
1089 cycles for fCmpXmm
1094 cycles for fCmpXmmNf
1516 cycles for fCmpFpu
1088 cycles for fCmpXmm
1100 cycles for fCmpXmmNf
85 bytes for fCmpFpu
78 bytes for fCmpXmm
78 bytes for fCmpXmmNf
--- ok ---
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
3091 cycles for fCmpFpu
1670 cycles for fCmpXmm
1718 cycles for fCmpXmmNf
5766 cycles for fCmpFpu
1912 cycles for fCmpXmm
1812 cycles for fCmpXmmNf
85 bytes for fCmpFpu
78 bytes for fCmpXmm
78 bytes for fCmpXmmNf
--- ok ---
If doing it with the FPU, wouldn't it make more sense (and give more speed) to use the FCOMI and FCMOV instructions to avoid the slow interaction with the flags via ax?
Paul.
Quote from: dioxin on October 25, 2010, 08:22:26 PM
If doing it with the FPU, wouldn't it make more sense (and give more speed) to use the FCOMI and FCMOV instructions to avoid the slow interaction with the flags via ax?
Paul.
Yes it does - thanks Paul :U
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1151 cycles for fCmpFpu (fcomi)
1727 cycles for fCmpFpu2 (fcom)
1162 cycles for fCmpXmm (comisd)
725 cycles for fMinMax (minsd)
721 cycles for fMinMax1 (minsd)
1326 cycles for fCmpFpu (fcomi)
1727 cycles for fCmpFpu2 (fcom)
1158 cycles for fCmpXmm (comisd)
731 cycles for fMinMax (minsd)
743 cycles for fMinMax1 (minsd)
1326 cycles for fCmpFpu (fcomi)
1727 cycles for fCmpFpu2 (fcom)
1163 cycles for fCmpXmm (comisd)
725 cycles for fMinMax (minsd)
721 cycles for fMinMax1 (minsd)
75 bytes for fCmpFpu
85 bytes for fCmpFpu2
77 bytes for fCmpXmm
60 bytes for fMinMax
66 bytes for fMinMax1
Note that the first loop is consistently faster, no idea why.
I get an:
error A2070: invalid instruction operands
on lines 236, 254 and 256.... Why would that happen? It is an SSE2 instruction and I have SSE2
movsd xmm0, qword ptr fMinMaxHigh-4 ; about 1.79e308
movsd REAL8 ptr [eax], xmm0
movsd REAL8 ptr [eax], xmm1
Quote from: oex on October 25, 2010, 10:51:27 PM
I get an:
error A2070: invalid instruction operands
on lines 236, 254 and 256.... Why would that happen? It is an SSE2 instruction and I have SSE2
movsd xmm0, qword ptr fMinMaxHigh-4 ; about 1.79e308
movsd REAL8 ptr [eax], xmm0
movsd REAL8 ptr [eax], xmm1
You are probably using ML 6.15? This is a bug in that version: it confuses the SSE2 MOVSD with the integer (string) MOVSD.
Just download ML 8 - it works.
Alex
Jochen, here are the results:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
2246 cycles for fCmpFpu (fcomi)
5878 cycles for fCmpFpu2 (fcom)
1657 cycles for fCmpXmm (comisd)
712 cycles for fMinMax (minsd)
708 cycles for fMinMax1 (minsd)
2368 cycles for fCmpFpu (fcomi)
5856 cycles for fCmpFpu2 (fcom)
1705 cycles for fCmpXmm (comisd)
712 cycles for fMinMax (minsd)
925 cycles for fMinMax1 (minsd)
2290 cycles for fCmpFpu (fcomi)
5740 cycles for fCmpFpu2 (fcom)
1662 cycles for fCmpXmm (comisd)
700 cycles for fMinMax (minsd)
706 cycles for fMinMax1 (minsd)
75 bytes for fCmpFpu
85 bytes for fCmpFpu2
77 bytes for fCmpXmm
60 bytes for fMinMax
66 bytes for fMinMax1
Hardware is still faster :bg
Alex
Ah kk yep used 6.15 ty Alex....
http://www.masm32.com/board/index.php?topic=12719.msg98468#msg98468
AMD Sempron(tm) Processor 3100+ (SSE3)
1263 cycles for fCmpFpu (fcomi)
1667 cycles for fCmpFpu2 (fcom)
1144 cycles for fCmpXmm (comisd)
408 cycles for fMinMax (minsd)
413 cycles for fMinMax1 (minsd)
1321 cycles for fCmpFpu (fcomi)
1670 cycles for fCmpFpu2 (fcom)
1128 cycles for fCmpXmm (comisd)
409 cycles for fMinMax (minsd)
413 cycles for fMinMax1 (minsd)
1333 cycles for fCmpFpu (fcomi)
1677 cycles for fCmpFpu2 (fcom)
1126 cycles for fCmpXmm (comisd)
408 cycles for fMinMax (minsd)
409 cycles for fMinMax1 (minsd)
75 bytes for fCmpFpu
85 bytes for fCmpFpu2
77 bytes for fCmpXmm
60 bytes for fMinMax
66 bytes for fMinMax1
Intel(R) Atom(TM) CPU N270 @ 1.60GHz (SSE4)
4539 cycles for fCmpFpu (fcomi)
5228 cycles for fCmpFpu2 (fcom)
3785 cycles for fCmpXmm (comisd)
1257 cycles for fMinMax (minsd)
1278 cycles for fMinMax1 (minsd)
4159 cycles for fCmpFpu (fcomi)
5313 cycles for fCmpFpu2 (fcom)
3664 cycles for fCmpXmm (comisd)
1265 cycles for fMinMax (minsd)
1260 cycles for fMinMax1 (minsd)
4158 cycles for fCmpFpu (fcomi)
5338 cycles for fCmpFpu2 (fcom)
3663 cycles for fCmpXmm (comisd)
1257 cycles for fMinMax (minsd)
1263 cycles for fMinMax1 (minsd)
75 bytes for fCmpFpu
85 bytes for fCmpFpu2
77 bytes for fCmpXmm
60 bytes for fMinMax
66 bytes for fMinMax1
--- ok ---