
Floating point comparisons

Started by jj2007, October 25, 2010, 01:45:53 PM


jj2007

I am testing an algo that yields the max and min values in a given array of 128 REAL8 variables. One uses the FPU, the other SSE2.
Grateful for some timings and/or suggestions,
Jochen

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3747    cycles for fCmpFpu
1794    cycles for fCmpXmm
1784    cycles for fCmpXmmNf

Antariy

What if you use MAXPD and MINPD?

Alex
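
For readers who have not met the packed instructions: MINPD and MAXPD operate on two REAL8 values at once, so a loop built on them keeps two running minima and two running maxima and merges them at the end. Below is a minimal sketch of that two-lane idea in plain C rather than MASM. The function name `min_max_2lanes` and the DBL_MAX start values are assumptions made for illustration, not code from the attachment, and like the rest of the thread it assumes the array holds no NaNs.

```c
#include <stddef.h>
#include <float.h>

/* Hypothetical helper: min and max of n doubles, two lanes per step,
   mimicking what one MINPD/MAXPD pair would do on a pair of REAL8s. */
void min_max_2lanes(const double *a, size_t n,
                    double *out_min, double *out_max)
{
    double mn0 = DBL_MAX,  mn1 = DBL_MAX;    /* running minima, lane 0 and 1 */
    double mx0 = -DBL_MAX, mx1 = -DBL_MAX;   /* running maxima, lane 0 and 1 */
    size_t i = 0;

    /* Main loop: elements i and i+1 go to independent lanes. */
    for (; i + 1 < n; i += 2) {
        mn0 = (mn0 < a[i])     ? mn0 : a[i];
        mn1 = (mn1 < a[i + 1]) ? mn1 : a[i + 1];
        mx0 = (mx0 > a[i])     ? mx0 : a[i];
        mx1 = (mx1 > a[i + 1]) ? mx1 : a[i + 1];
    }

    /* Odd element count: fold the last value into lane 0. */
    if (i < n) {
        mn0 = (mn0 < a[i]) ? mn0 : a[i];
        mx0 = (mx0 > a[i]) ? mx0 : a[i];
    }

    /* Merge the two lanes into the final result. */
    *out_min = (mn0 < mn1) ? mn0 : mn1;
    *out_max = (mx0 > mx1) ? mx0 : mx1;
}
```

With 128 values, as in the test array, the main loop runs 64 times and the odd-count branch is never taken.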

oex


AMD Sempron(tm) Processor 3100+ (SSE3)
1520    cycles for fCmpFpu
1096    cycles for fCmpXmm
1089    cycles for fCmpXmmNf

1513    cycles for fCmpFpu
1099    cycles for fCmpXmm
1099    cycles for fCmpXmmNf

85       bytes for fCmpFpu
78       bytes for fCmpXmm
78       bytes for fCmpXmmNf


OMG.... What's with the Pentium timings? My AMD clocks at 1.8Ghz.... Now I know why my code runs as fast as yours :lol
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

dedndave

is it safe to assume all the values are valid (i.e. no NANs) ?
well - had an idea - but don't think it's valid - lol

let's try another idea...

can't you just test the high order dword (as though they were integers - without using the FPU) ?
only if they are equal do you need to compare the remaining bits
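
A rough sketch of the integer idea, in C instead of MASM. One caveat that is not spelled out in the post: comparing the raw bits as integers only gives the right order directly for non-negative values, because IEEE doubles are sign-magnitude. The `order_key` mapping below is one common way around that and is an addition of this sketch, as are both function names. It assumes 64-bit IEEE doubles and no NaNs, and it treats -0.0 as smaller than +0.0.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: turn a double's bit pattern into a key whose
   unsigned integer order matches the numeric order (NaNs excluded). */
uint64_t order_key(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);                 /* read the bits safely */
    if (u & 0x8000000000000000ULL)
        return ~u;                            /* negative: reverse the order */
    return u | 0x8000000000000000ULL;         /* non-negative: above all negatives */
}

/* Compare two doubles without the FPU: high dword first, low dword
   only when the high dwords are equal. Returns -1, 0 or 1. */
int cmp_by_dwords(double a, double b)
{
    uint64_t ka = order_key(a), kb = order_key(b);
    uint32_t ha = (uint32_t)(ka >> 32), hb = (uint32_t)(kb >> 32);

    if (ha != hb)
        return (ha < hb) ? -1 : 1;            /* decided by the high dword */

    uint32_t la = (uint32_t)ka, lb = (uint32_t)kb;
    if (la != lb)
        return (la < lb) ? -1 : 1;            /* tie broken by the low dword */

    return 0;
}
```

Whether this beats a single minsd per element is a separate question; the sketch only shows that the comparison itself can be done on integers.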

jj2007

Quote from: Antariy on October 25, 2010, 01:52:57 PM
What if you use MAXPD and MINPD?

Alex

Alex, you are a real friend :bg
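.Repeat
    minsd xmm2, REAL8 ptr [edx]   ; xmm2 = running minimum
    maxsd xmm3, REAL8 ptr [edx]   ; xmm3 = running maximum
    add edx, 8                    ; advance to the next REAL8
    dec ecx                       ; count down
.Until Sign?                      ; stop once ecx drops below zero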

Under 1000 cycles on the Pentium - thanxalot :U
Note that maxpd raises an exception when its memory operand is not 16-byte aligned; maxsd has no such requirement.

@Dave: > is it safe to assume all the values are valid (i.e. no NANs) ?
Yes, the array is composed of valid REAL8 numbers
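
For comparison, here is the same one-pass min/max written as a plain C loop. This is only a sketch: the function name `min_max_scalar` is made up, and starting the minimum at DBL_MAX (about 1.79e308) and the maximum at -DBL_MAX is an assumption about how the accumulators are seeded. The ternaries use the same selection rule as minsd and maxsd (keep the accumulator if the comparison holds, otherwise take the new value), and the array is assumed to be NaN-free as stated above.

```c
#include <stddef.h>
#include <float.h>

/* Hypothetical helper: scalar min and max of n doubles in one pass. */
void min_max_scalar(const double *a, size_t n,
                    double *out_min, double *out_max)
{
    double mn = DBL_MAX;     /* start above every finite value */
    double mx = -DBL_MAX;    /* start below every finite value */

    for (size_t i = 0; i < n; i++) {
        double x = a[i];
        mn = (mn < x) ? mn : x;   /* like minsd: accumulator if smaller, else x */
        mx = (mx > x) ? mx : x;   /* like maxsd: accumulator if larger, else x */
    }

    *out_min = mn;
    *out_max = mx;
}
```

There is no compare-and-branch per element here, which is the main difference from the fcom and comisd versions in the timings.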

brethren

AMD Turion(tm) 64 X2 Mobile Technology TL-52 (SSE3)
1517    cycles for fCmpFpu
1089    cycles for fCmpXmm
1094    cycles for fCmpXmmNf

1516    cycles for fCmpFpu
1088    cycles for fCmpXmm
1100    cycles for fCmpXmmNf

85       bytes for fCmpFpu
78       bytes for fCmpXmm
78       bytes for fCmpXmmNf

--- ok ---

RuiLoureiro

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
3091    cycles for fCmpFpu
1670    cycles for fCmpXmm
1718    cycles for fCmpXmmNf

5766    cycles for fCmpFpu
1912    cycles for fCmpXmm
1812    cycles for fCmpXmmNf

85       bytes for fCmpFpu
78       bytes for fCmpXmm
78       bytes for fCmpXmmNf

--- ok ---

dioxin

If doing it with the FPU, wouldn't it make more sense (and give more speed) to use the FCOMI and FCMOV instructions to avoid the slow interaction with the flags via ax?

Paul.

jj2007

Quote from: dioxin on October 25, 2010, 08:22:26 PM
If doing it with the FPU, wouldn't it make more sense (and give more speed) to use the FCOMI and FCMOV instructions to avoid the slow interaction with the flags via ax?

Paul.


Yes it does - thanks Paul :U
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1151    cycles for fCmpFpu  (fcomi)
1727    cycles for fCmpFpu2 (fcom)
1162    cycles for fCmpXmm  (comisd)
725     cycles for fMinMax  (minsd)
721     cycles for fMinMax1 (minsd)

1326    cycles for fCmpFpu  (fcomi)
1727    cycles for fCmpFpu2 (fcom)
1158    cycles for fCmpXmm  (comisd)
731     cycles for fMinMax  (minsd)
743     cycles for fMinMax1 (minsd)

1326    cycles for fCmpFpu  (fcomi)
1727    cycles for fCmpFpu2 (fcom)
1163    cycles for fCmpXmm  (comisd)
725     cycles for fMinMax  (minsd)
721     cycles for fMinMax1 (minsd)

75       bytes for fCmpFpu
85       bytes for fCmpFpu2
77       bytes for fCmpXmm
60       bytes for fMinMax
66       bytes for fMinMax1


Note that the first loop is consistently faster, no idea why.

oex

I get an:

error A2070: invalid instruction operands

on lines 236, 254 and 256.... Why would that happen? It is an SSE2 instruction and I have SSE2

   movsd xmm0, qword ptr fMinMaxHigh-4   ; about 1.79e308

   movsd REAL8 ptr [eax], xmm0

   movsd REAL8 ptr [eax], xmm1

Antariy

Quote from: oex on October 25, 2010, 10:51:27 PM
I get an:

error A2070: invalid instruction operands

on lines 236, 254 and 256.... Why would that happen? It is an SSE2 instruction and I have SSE2

   movsd xmm0, qword ptr fMinMaxHigh-4   ; about 1.79e308

   movsd REAL8 ptr [eax], xmm0

   movsd REAL8 ptr [eax], xmm1

You are probably using ML 6.15? This is a bug in it: it confuses the integer MOVSD (the string instruction) with the SSE2 MOVSD.
Just download ML 8 - it works.



Alex

Antariy

Jochen, here are the results:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
2246    cycles for fCmpFpu  (fcomi)
5878    cycles for fCmpFpu2 (fcom)
1657    cycles for fCmpXmm  (comisd)
712     cycles for fMinMax  (minsd)
708     cycles for fMinMax1 (minsd)

2368    cycles for fCmpFpu  (fcomi)
5856    cycles for fCmpFpu2 (fcom)
1705    cycles for fCmpXmm  (comisd)
712     cycles for fMinMax  (minsd)
925     cycles for fMinMax1 (minsd)

2290    cycles for fCmpFpu  (fcomi)
5740    cycles for fCmpFpu2 (fcom)
1662    cycles for fCmpXmm  (comisd)
700     cycles for fMinMax  (minsd)
706     cycles for fMinMax1 (minsd)

75       bytes for fCmpFpu
85       bytes for fCmpFpu2
77       bytes for fCmpXmm
60       bytes for fMinMax
66       bytes for fMinMax1


Hardware is still faster  :bg



Alex

oex

Ah kk yep used 6.15 ty Alex....

http://www.masm32.com/board/index.php?topic=12719.msg98468#msg98468


AMD Sempron(tm) Processor 3100+ (SSE3)
1263    cycles for fCmpFpu  (fcomi)
1667    cycles for fCmpFpu2 (fcom)
1144    cycles for fCmpXmm  (comisd)
408     cycles for fMinMax  (minsd)
413     cycles for fMinMax1 (minsd)

1321    cycles for fCmpFpu  (fcomi)
1670    cycles for fCmpFpu2 (fcom)
1128    cycles for fCmpXmm  (comisd)
409     cycles for fMinMax  (minsd)
413     cycles for fMinMax1 (minsd)

1333    cycles for fCmpFpu  (fcomi)
1677    cycles for fCmpFpu2 (fcom)
1126    cycles for fCmpXmm  (comisd)
408     cycles for fMinMax  (minsd)
409     cycles for fMinMax1 (minsd)

75       bytes for fCmpFpu
85       bytes for fCmpFpu2
77       bytes for fCmpXmm
60       bytes for fMinMax
66       bytes for fMinMax1

clive

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
4539    cycles for fCmpFpu  (fcomi)
5228    cycles for fCmpFpu2 (fcom)
3785    cycles for fCmpXmm  (comisd)
1257    cycles for fMinMax  (minsd)
1278    cycles for fMinMax1 (minsd)

4159    cycles for fCmpFpu  (fcomi)
5313    cycles for fCmpFpu2 (fcom)
3664    cycles for fCmpXmm  (comisd)
1265    cycles for fMinMax  (minsd)
1260    cycles for fMinMax1 (minsd)

4158    cycles for fCmpFpu  (fcomi)
5338    cycles for fCmpFpu2 (fcom)
3663    cycles for fCmpXmm  (comisd)
1257    cycles for fMinMax  (minsd)
1263    cycles for fMinMax1 (minsd)

75       bytes for fCmpFpu
85       bytes for fCmpFpu2
77       bytes for fCmpXmm
60       bytes for fMinMax
66       bytes for fMinMax1

--- ok ---
It could be a random act of randomness. Those happen a lot as well.