
What is it worth?

Started by Gunther, October 25, 2010, 10:03:19 PM


Gunther

The IBM supercomputer at the Los Alamos laboratory in New Mexico is one of the fastest machines in the world. It is used there for nuclear weapons calculations and the simulation of nuclear tests. The machine delivers approximately 1 petaflop; peta stands for 10^15, and flops stands for floating-point operations per second. Twenty years ago, the Cray-2 was the fastest machine, at 1 gigaflop. From 2001 to 2004, the Japanese Earth Simulator in Yokohama was the fastest machine, with 37 teraflops; tera stands for 10^12.

I'll try to illustrate these numbers a bit. Let's start with the gigaflop machines; a usual PC with an average Intel or AMD processor can reach that range. We can try to print out the numbers that a gigaflop computer produces in one second. Using small letters, we can print 100 rows and 5 columns (500 numbers) on one side of a normal printer sheet; using both sides gives 1000 numbers per sheet. A package of printer paper has a height of approximately 10 cm, so 10 packages are 1 m. In one second, a gigaflop computer produces enough numbers for a paper stack 100 m high; that's more than 1/4 of the height of the Empire State Building.

A teraflop computer is 1000 times faster; it produces a paper stack 100 km high in the same time. That's approximately the distance between New York and Philadelphia. If such a computer ran for 1 hour (3600 seconds), the stack would be 360 000 km high; that is approximately the distance between the Earth and the Moon.

A petaflop computer is again 1000 times faster; it produces a paper stack of 100 000 km per second. In 25 minutes that gives a stack of 150 000 000 km, which is the distance between the Sun and the Earth. The floating-point speed of modern computers has indeed reached astronomical dimensions.

The Japanese Earth Simulator solves, for example, a linear system of 1 000 000 equations in 1 000 000 unknowns in 5 hours. That's impressive. But how reliable are the results? The calculations are usually done with REAL8 numbers (double), a 64-bit word, which gives approximately 16 decimal digits.

The attached test program adds the 5 elements of a double array and calculates the array sum. Speed isn't the main question here; we only want the right result (which is 137, as you can see without any computer). It's always the same vector, but with a different element order. The results are mostly wrong. But that's not all. In the new 64-bit world (both Windows and Unix), practically all floating-point operations are done with XMM registers; the old FPU has nothing to do there. That's very dangerous, because the FPU results of the test program are not as far off as the XMM results. That has to do with the fact that the FPU uses the internal 80-bit format for calculations, which isn't possible with XMM registers.

The program is written for gcc with a bit of inline assembly (Intel syntax). It shouldn't be too hard to port it to VC.

Gunther
Forgive your enemies, but never forget their names.

Antariy

I got these results:

Sum 1 (FPU) = 136.00 Sum 1 (XMM) = 0.00
Sum 2 (FPU) = 137.00 Sum 2 (XMM) = 17.00
Sum 3 (FPU) = 136.00 Sum 3 (XMM) = 120.00
Sum 4 (FPU) = 139.00 Sum 4 (XMM) = 147.00
Sum 5 (FPU) = 137.00 Sum 5 (XMM) = 137.00
Sum 6 (FPU) = 134.00 Sum 6 (XMM) = -10.00

Right Sum   = 137.00




Alex

Gunther

Alex,

yes, that's exactly the problem. If you have a look at the source, you'll find that the results are mostly wrong. But now I have to check your pretty nice RAM utility.

Gunther

Gunther

Are you joking, Alex? Is that really all that VC can compute?

Gunther

Antariy

Quote from: Gunther on October 25, 2010, 10:42:05 PM
Are you joking Alex? That should be all what VC can compute?

Many apologies  :green2 The previous version was bugged. It's too late, and I'm tired.

Well, here is the correct version.

Results:

Sum 1 (FPU) = 0.00 Sum 1 (XMM) = 0.00
Sum 2 (FPU) = 17.00 Sum 2 (XMM) = 17.00
Sum 3 (FPU) = 120.00 Sum 3 (XMM) = 120.00
Sum 4 (FPU) = 147.00 Sum 4 (XMM) = 147.00
Sum 5 (FPU) = 137.00 Sum 5 (XMM) = 137.00
Sum 6 (FPU) = -10.00 Sum 6 (XMM) = -10.00

Right Sum   = 137.00




Alex

Antariy

In MSVC10.ZIP:

..........
    fstp       qword ptr [esp] <--- really nice thing :)
    fld qword ptr [esp]
    pop ecx
..........


raymond

Such results should certainly NOT be a surprise. They are entirely predictable. One should always be aware of the precision available with any instrument.

How would you like to try to accurately measure thicknesses of a few micrometers, added onto a block of concrete, with an ordinary ruler???

The FPU has a maximum precision of 64 bits, equivalent to some 19 decimal digits, when used in extended double precision. The XMM registers use only double precision, with 53 bits of significand, equivalent to some 16 decimal digits. Thus, if you mix large numbers with 20+ decimal digits (as in this case, 1e20) with small numbers, some precision loss is bound to happen, and more of it with the less precise instrument. Try your program with 1e25 instead of 1e20 and the FPU won't fare any better than the XMM. :(

Are you sure that the IBM super computer is not running a few giga flops faster than you reported? :bg

Edit: One thing I noticed was the size of the EXE which was some 4x larger than the source code. So much for bloating with these HLLs.
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

Twister

What if we increase the precision in software, instead of letting the CPU handle the whole calculation?

I do remember someone talking about this, but with strings. It could go up to 5.7697487348734872348734873482734892347289237483258752395728 x 10^456.

jj2007

Quote from: raymond on October 26, 2010, 03:35:07 AM
The FPU has a maximum precision of 64 bits, equivalent to some 19 decimal digits, when used in extended double precision. The XMM registers use only double precision, with 53 bits of significand, equivalent to some 16 decimal digits. Thus, if you mix large numbers with 20+ decimal digits (as in this case, 1e20) with small numbers, some precision loss is bound to happen, and more of it with the less precise instrument. Try your program with 1e25 instead of 1e20 and the FPU won't fare any better than the XMM. :(

Exactly. Here is what you get using the FPU:
QuoteFPU, Real8
Sum=128.0000
Sum=129.0000
Sum=136.0000
Sum=131.0000
Sum=137.0000
Sum=134.0000

Note that the results are much closer to 137 than Alex's second VC version attached above (while the first version, posted as reply #1, yields the same results as the MB code below - why is the second version less precise?). Most probably, VC uses only the 53-bit precision mode of the FPU.

Just for fun, I also added a version in which the variables are REAL10, but it doesn't change anything, because fld V1 yields exactly the same value as fld R101. It is the subsequent steps that cheat the FPU, i.e. 1.0e20 + 17 = 1.00...2, - 10.0 = 1.00...0, etc.

Only V5 and R105 yield the correct result: 1.0e20 + (-1.0e20) = 0 (exactly), then +17 - 10 + 130 = 137.

include \masm32\MasmBasic\MasmBasic.inc
.data
V1 REAL8 1.0e20, 17.0, -10.0, 130.0, -1.0e20
V2 REAL8 1.0e20, -10.0, 130.0, -1.0e20, 17.0
V3 REAL8 1.0e20, 17.0, -1.0e20, -10.0, 130.0
V4 REAL8 1.0e20, -10.0, -1.0e20, 130.0, 17.0
V5 REAL8 1.0e20, -1.0e20, 17.0, -10.0, 130.0 ; this one yields the correct result
V6 REAL8 1.0e20, 17.0, 130.0, -1.0e20, -10.0

R101 REAL10 1.0e20, 17.0, -10.0, 130.0, -1.0e20
R102 REAL10 1.0e20, -10.0, 130.0, -1.0e20, 17.0
R103 REAL10 1.0e20, 17.0, -1.0e20, -10.0, 130.0
R104 REAL10 1.0e20, -10.0, -1.0e20, 130.0, 17.0
R105 REAL10 1.0e20, -1.0e20, 17.0, -10.0, 130.0
R106 REAL10 1.0e20, 17.0, 130.0, -1.0e20, -10.0

Init
Print "FPU, Real8", CrLf$
Print Str$("Sum=%f\n", V1+V1[8]+V1[16]+V1[24]+V1[32])
Print Str$("Sum=%f\n", V2+V2[8]+V2[16]+V2[24]+V2[32])
Print Str$("Sum=%f\n", V3+V3[8]+V3[16]+V3[24]+V3[32])
Print Str$("Sum=%f\n", V4+V4[8]+V4[16]+V4[24]+V4[32])
Print Str$("Sum=%f\n", V5+V5[8]+V5[16]+V5[24]+V5[32])
Print Str$("Sum=%f\n\n", V6+V6[8]+V6[16]+V6[24]+V6[32])

Print "FPU, Real10", CrLf$
Print Str$("Sum=%f\n", R101+R101[10]+R101[20]+R101[30]+R101[40])
Print Str$("Sum=%f\n", R102+R102[10]+R102[20]+R102[30]+R102[40])
Print Str$("Sum=%f\n", R103+R103[10]+R103[20]+R103[30]+R103[40])
Print Str$("Sum=%f\n", R104+R104[10]+R104[20]+R104[30]+R104[40])
Print Str$("Sum=%f\n", R105+R105[10]+R105[20]+R105[30]+R105[40])
Print Str$("Sum=%f\n\n", R106+R106[10]+R106[20]+R106[30]+R106[40])

Inkey Str$("Your puter has run %3f hours since the last boot, give it a break!", Timer()/3600000)
Exit
end start

raymond

BCD is one way to go for best accuracy, limited only to the amount of memory available. Even then, the extent of fractional errors must be fully understood to estimate the accuracy of the least significant digits.

The advantage of BCD is that conversion to ascii is not a problem if the result must be displayed in readable form by the average human.

Gunther

Raymond,

Quote from: raymond on October 26, 2010, 05:48:12 PM
BCD is one way to go for best accuracy,
Right, it especially avoids conversion errors between the decimal and binary systems. But usual BCD arithmetic won't help much with our problem. What's dangerous? For example, subtracting two numbers which are approximately equal will lead to a significant loss of accuracy, and that loss is increased by rounding after every operation. What would we need? A long accumulator to hold the interim results, with rounding only when the calculation is finished. We could round up and round down, and we would get the result inside an interval (the idea comes from interval arithmetic). That would save us a lot of numerical surprises.

Quote from: raymond on October 26, 2010, 04:35:07 AM
Are you sure that the IBM super computer is not running a few giga flops faster than you reported

Maybe it runs faster, but that doesn't change anything. The question isn't speed, but accuracy.

Quote from: raymond on October 26, 2010, 04:35:07 AM
One thing I noticed was the size of the EXE which was some 4x larger than the source code. So much for bloating with these HLLs.

Raymond, it's clear that a standalone assembly language application can beat every HLL implementation in size and speed. But is that really necessary for such a small test program? On the other hand, the program runs under Windows, Linux, and BSD without modifications, and 20 KB isn't really bloatware.

I'm trying to write a package for arbitrarily accurate floating-point operations (with interval rounding); it will certainly be written in assembly language. But I would need a bit of help.

Gunther

Antariy

Quote from: raymond on October 26, 2010, 03:35:07 AM
Edit: One thing I noticed was the size of the EXE which was some 4x larger than the source code. So much for bloating with these HLLs.

:bg

Yes, I agree.

The new version attached is updated for compiling with MSVC. The results are also slightly closer to being right, because I reinitialize the FPU in the main function. When optimization is on, the compiler tries to keep all parameters in registers, so even though MSVC doesn't support 80-bit precision, we can get it manually, because no losses occur while the numbers are not stored to memory. But this is not a good approach, of course; it's just a note.

The attached archive contains the updated source (© Gunther), where I inserted FINIT in the main function.

Also, the executable file is smaller than the source now  :bg

Results for this program:

Sum 1 (FPU) = 136.00 Sum 1 (XMM) = 0.00
Sum 2 (FPU) = 137.00 Sum 2 (XMM) = 17.00
Sum 3 (FPU) = 136.00 Sum 3 (XMM) = 120.00
Sum 4 (FPU) = 139.00 Sum 4 (XMM) = 147.00
Sum 5 (FPU) = 137.00 Sum 5 (XMM) = 137.00
Sum 6 (FPU) = 134.00 Sum 6 (XMM) = -10.00

Right Sum   = 137.00



The FPU results are "better", but that has no real meaning - they are just closer, still not right.



Alex

dioxin

There's no reason binary can't give results just as exact as BCD.
The only advantage of BCD is the conversion to/from readable numbers. But there's a huge reduction in calculation speed when calculating in BCD, so for non-trivial calculations it's usually better to use binary throughout and convert to readable digits at the end, if needed.

Paul.