What kind of representation for floats should be used?

Started by gabor, May 17, 2005, 10:55:06 AM


AeroASM

Quote from: Greg on May 19, 2005, 06:21:26 PM
One thing to keep in mind is that 64-bit Windows does not allow FPU or MMX code in 64-bit mode, only SSE/SSE2/SSE3. Which I think was a dumb decision.

It had to be done some time, and better sooner rather than later, so that developers do not attempt to write 64-bit FPU code.

gabor

I would say AeroASM is right. There are times when all the old and obsolete stuff must be thrown away. (I'm not saying that the FPU is an old technology. BTW, I would suggest totally different things to be canceled.)

Okay this topic is becoming a topic in the Soap Box.

And yes, I finally chose to use the real4 type, but I am also considering using 64-bit fixed point: a 32-bit integer part and a 32-bit fraction is quite comfortable. What da ya think?

Greets, gábor

raymond

Quote: but I am also considering using 64-bit fixed point: a 32-bit integer part and a 32-bit fraction is quite comfortable. What da ya think?

That would be fine with 32-bit registers if you limit yourself to additions and subtractions. However, you can forget multiplications (except with integers), and divisions (unless the quotient would be less than 1), with 32-bit registers.
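
To make the difference concrete, here is a minimal sketch of unsigned 32.32 fixed-point arithmetic, written in Python rather than MASM so it can be run anywhere. The helper names (`to_fix`, `fix_add`, `fix_mul`, `fix_to_float`) are made up for illustration. Addition is a single 64-bit add (an add/adc pair on a 32-bit CPU), while a full multiplication has to be assembled from four 32x32 partial products, which is presumably the extra work being warned about here.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_fix(int_part, frac_part):
    """Pack an unsigned 32.32 value: high dword = integer, low dword = fraction."""
    return ((int_part & MASK32) << 32) | (frac_part & MASK32)

def fix_add(a, b):
    # One 64-bit addition; wraps around on overflow.
    return (a + b) & MASK64

def fix_mul(a, b):
    # Split each operand into 32-bit halves, as a 32-bit CPU would have to.
    ah, al = a >> 32, a & MASK32
    bh, bl = b >> 32, b & MASK32
    # Four 32x32 -> 64-bit partial products, realigned to the 32.32 format.
    # The lowest 32 bits of the fraction product are truncated.
    result = ((ah * bh) << 32) + ah * bl + al * bh + ((al * bl) >> 32)
    return result & MASK64

def fix_to_float(x):
    # For display only: convert the raw 32.32 value to a float.
    return x / 2**32

a = to_fix(1, 0x80000000)   # 1.5
b = to_fix(2, 0x80000000)   # 2.5
print(fix_to_float(fix_add(a, b)))  # 4.0
print(fix_to_float(fix_mul(a, b)))  # 3.75
```

The multiplication silently wraps if the integer parts multiply to more than 32 bits, so a real implementation would also need an overflow check.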

Raymond
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

gabor

Now I'm in trouble again. I used REAL4 numbers and then I bumped into this:


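REAL4number  REAL4 0.7          ; single-precision constant
Text         db 40 dup(0)       ; buffer for the converted string

fld REAL4 PTR [REAL4number]     ; load the value into ST(0)
invoke FpuFLtoA, 0, 12, ADDR Text, SRC1_FPU or SRC2_DIMM   ; convert ST(0), 12 decimal places
PrintString Text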

The result was: Text =  0.699999988079!!! Why is that so?

Should I use REAL8 or REAL10? My problem is that I need to store a lot of real numbers, so I would like to use the smallest possible format.
The range I need to represent is about 1.0E-6 to 1.0E+6. It should fit into a REAL8, shouldn't it?

I used the FPULIB and the DEBUG includes and libraries.

Mark Jones

Gabor, floating-point arithmetic intrinsically produces a small error, because most decimal fractions cannot be stored exactly in binary. The absolute error is smallest near 0 and grows with the exponent. This is normal FP behavior. Do a Google search for "exponent-mantissa math" if you need an explanation of how this works. More bits give a wider value range and better overall accuracy, but there will still be a small error. If you want precise integer and fractional data, maybe your idea of a 32-bit integer part and a 32-bit fraction would be better suited. (I think you said that earlier.)  :wink

Remember to round! (You might be able to keep the floating-point version if you rounded at, say, the thousandths position. :)
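
A small sketch of both points, in Python rather than MASM so it runs without FPULIB. It assumes Python's `float` is an IEEE double (a REAL8), and the helper `as_real4` is a made-up name that rounds a value to the nearest REAL4 via the standard `struct` module. It shows the value actually stored for 0.7 and what rounding the printed result does to it.

```python
import struct

def as_real4(x):
    """Round a Python float (a REAL8) to the nearest REAL4 and return it."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

stored = as_real4(0.7)

# Twelve decimals expose the representation error of the REAL4.
print(f"{stored:.12f}")   # 0.699999988079

# Rounding to three decimals on output hides it again.
print(f"{stored:.3f}")    # 0.700
```

The stored value does not change; only the number of digits requested on output decides whether the error becomes visible.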
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

raymond

Quote: The result was: Text =  0.699999988079!!! Why is that so?

In order to fully understand why that is so, you must first be familiar with the floating point data format. I would suggest you look at the following:

http://www.ray.masmcode.com/tutorial/fpuchap2.htm#floats

Then, look at a further explanation given with the description of the "fld" instruction at:

http://www.ray.masmcode.com/tutorial/fpuchap4.htm#fld

If you then need further clarification, I will definitely try to provide it. :clap:

Raymond

gabor

Raymond, Mark, thanks!

I've read some of those documents, and I now have a clearer view. Now I know that those precision problems arose because I switched the rounding mode to truncation. I used truncation because I wanted to transfer the integer part and the fraction part separately into 2 dwords... I found my error there, so I am free of the error of 0.6999 instead of 0.7.

However, can you confirm this for me: if I want to use real numbers in the range of 1.0E-6 .. 1.0E+6, so that 1 000 000.000 001 should be a valid number, then I should not use REAL4?
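
A quick way to check this, sketched in Python under the assumption that its `float` is an IEEE double (a REAL8). The helpers `as_real4` and `real4_spacing` are made-up names built on the standard `struct` module. A REAL4 has a 24-bit significand, roughly 7 significant decimal digits, whereas 1 000 000.000 001 needs 13; a REAL8 has 53 bits, roughly 15 to 16 digits.

```python
import struct

def as_real4(x):
    """Round a Python float (a REAL8) to the nearest REAL4 and return it."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def real4_spacing(x):
    """Gap between the REAL4 nearest to x and the next larger REAL4."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    nxt = struct.unpack('<f', struct.pack('<I', bits + 1))[0]
    return nxt - as_real4(x)

value = 1000000.000001

print(as_real4(value))        # 1000000.0  (the fraction is lost in a REAL4)
print(real4_spacing(1.0e6))   # 0.0625     (adjacent REAL4 values near 1e6)
print(value)                  # 1000000.000001  (a REAL8 keeps it)
```

So for this range, holding the integer part and a six-decimal fraction in one value would seem to call for at least REAL8, or for the 32.32 fixed-point layout discussed earlier in the thread.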

Greets, Gábor