I mentioned working on a PI program under the "Assembler is Irrelevant" topic posted by Mant. It can calculate PI to 1 million digits in 2.82 seconds on my Athlon 64 2.2 GHz. I posted the program on the pi-hacks newsgroup about a month ago. I wrote the bulk of the code in C, with the parts that needed to be optimized written in assembler.
If you want to play with it, you can download it from the following link. The name of the file is chud.zip
http://groups.yahoo.com/group/pi-hacks/files/
have fun
No Yahoo ID = no download for me.
Here's the attached program.
EDIT: modified the .zip due to a bug in downloading it from pi-hacks yahoo newsgroup.
[attachment deleted by admin]
Thanks. I ran Chudp4 and this was its output:
Processor Name: Intel(R) Pentium(R) 4 CPU 2.40GHz
Processor Speed: 2400 MHz
Number of processors: 1
#terms=7, depth=4
.......
total time = 0.000
pi =
0.3141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117068e1
If I specify a command line option it crashes with a Windows error report request. If I don't, it only prints PI out to what I've shown above and then stops. At least it displays some digits; the K7 version displays nothing after total time.
Some explanations. I should have made the code nicer: currently it outputs PI to the console window instead of a file, and it doesn't print any help information when you try to run it.
chudk7 is for AMD K7 processors and up. If you run it on an Intel processor it won't work, since it uses AMD-specific instructions.
And vice versa for the chudp4 version. So run the appropriate version for the processor in your system.
You need to pass in one parameter to tell the program how many digits to compute:
chudp4 1048576 > pi.txt - will compute pi to 1 million digits and output the result to pi.txt
chudk7 1048576 > pi.txt - will compute pi to 1 million digits and output the result to pi.txt
If you don't specify anything on the command line it defaults to computing 100 digits.
enjoy.
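For anyone curious how the "one parameter, default 100 digits" behaviour described above could look in C, here is a minimal sketch. It is not the code from chud.zip: the names parse_digits and DEFAULT_DIGITS are made up for illustration, and rejecting bad input with -1 is my own assumption about what a friendlier version might do.

```c
#include <assert.h>
#include <stdlib.h>

#define DEFAULT_DIGITS 100L   /* default taken from the post above */

/* Hypothetical helper: returns the requested digit count,
   DEFAULT_DIGITS when no argument is given, or -1 when the
   argument is not a positive decimal integer. */
long parse_digits(int argc, char **argv)
{
    char *end;
    long n;

    assert(argc >= 1 && argv != NULL);
    if (argc < 2)
        return DEFAULT_DIGITS;

    n = strtol(argv[1], &end, 10);
    if (end == argv[1] || *end != '\0' || n <= 0)
        return -1;            /* caller can print a help message here */
    return n;
}
```

A main() would call parse_digits(argc, argv) first and print a usage line when it returns -1.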
I figured out the difference in the files. My problem was that my work computer is a P4 but my home computer is an AMD64, so I was getting them confused.
Here's the header for the 1,000,000 digit output, just for performance reference:
Processor Name: Intel(R) Pentium(R) 4 CPU 2.40GHz
Processor Speed: 2400 MHz
Number of processors: 1
#terms=73938, depth=18
..................................................
total time = 4.578
Thanks for the help.
I get 3.375 seconds on an Athlon 64 3000+.
I found a possible bug...
This was the output from CHUDK7 on my AMD at home:
Processor Name: AMD Athlon(tm) 64 Processor 3200+
Number of processors: 1
#terms=73938, depth=18
..................................................
total time = 2.953
And this is the output (clipped) from CHUDP4 on my AMD at home:
Processor Name: AMD Athlon(tm) 64 Processor 3200+
Processor Speed: 2000 MHz
Number of processors: 1
#terms=73938, depth=18
..................................................
total time = 4.063
pi =
0.3141592653589793238<snip>
I try:
C:\pi>chudk7 1000000 > pi7.txt
C:\pi>chudp4 1000000 > pi4.txt
and get:
k7:
Processor Name: AMD Athlon(tm) 64 Processor 3200+
Number of processors: 1
#terms=70513, depth=18
..................................................
total time = 2.812
p4:
Processor Name: AMD Athlon(tm) 64 Processor 3200+
Processor Speed: 2000 MHz
Number of processors: 1
#terms=70513, depth=18
..................................................
total time = 3.812
pi =
0.3141592653589793<a lot>
The k7 version seems bugged..
Also, which algorithm did you use? --EDIT: The exe name suggests Chudnovsky...
Anyway, good job, SuperPi needs 45 seconds to do it here.
I have an Athlon 64 at home and both the K7 and P4 versions print out the processor speed correctly. I get the speed from the registry and round up. It doesn't need to know the processor speed to run correctly. I had planned on doing code that was CPU speed dependent, but I never added it.
K7
Processor Name: AMD Athlon(tm) 64 X2 Dual Core Processor 4200+
Processor Speed: 2200 MHz
Number of processors: 2
P4
Processor Name: AMD Athlon(tm) 64 X2 Dual Core Processor 4200+
Processor Speed: 2200 MHz
Number of processors: 2
I use the Chudnovsky formula to compute the value of PI. I use binary splitting to calculate the Chud formula quickly. Currently, binary splitting is the fastest known method for computing Chud. It was primarily created to speed up factorials (there are 3 factorials in the Chud formula). Binary splitting uses a lot more memory than standard methods, but it makes up for it in speed. Chud calculates 14 digits of PI per iteration. The "depth=" field printed out is how deep the binary splitting went (it's recursive). If you want more detail let me know.
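To make the binary splitting idea a bit more concrete, here is a rough sketch in plain C. It is not the program's code: it uses doubles instead of big integers, so it only works for a handful of terms, and every name in it (chud_terms, split_depth, chud_bs, chud_pi, pqt) is made up for illustration. The term count assumes about 14.18 digits per term, and the depth assumes the range is halved until single terms remain, counting levels. With those assumptions the helpers give 7 terms at depth 4 for 100 digits and 73938 terms at depth 18 for 1048576 digits, which agrees with the "#terms=" and "depth=" lines posted in this thread.

```c
#include <assert.h>

/* Decimal digits gained per series term: log10(640320^3 / 1728). */
#define DIGITS_PER_TERM 14.181647462725477

typedef struct { double p, q, t; } pqt;

/* Assumed rule: number of series terms needed for `digits` digits. */
long chud_terms(long digits)
{
    assert(digits > 0);
    return (long)(digits / DIGITS_PER_TERM);
}

/* Levels of recursion when [a, b) is halved down to single terms. */
int split_depth(long a, long b)
{
    long m;
    int l, h;
    if (b - a <= 1)
        return 1;
    m = a + (b - a) / 2;
    l = split_depth(a, m);
    h = split_depth(m, b);
    return 1 + (l > h ? l : h);
}

/* Binary splitting over terms a .. b-1. A real version would use
   big integers for p, q and t; doubles are used here only as a demo. */
pqt chud_bs(long a, long b)
{
    pqt r, l, h;
    long m;
    assert(a >= 0 && b > a);
    if (b - a == 1) {
        if (a == 0) {
            r.p = 1.0;
            r.q = 1.0;
        } else {
            r.p = (6.0 * a - 5.0) * (2.0 * a - 1.0) * (6.0 * a - 1.0);
            r.q = (double)a * a * a * 10939058860032000.0; /* 640320^3 / 24 */
        }
        r.t = r.p * (13591409.0 + 545140134.0 * a);
        if (a & 1)
            r.t = -r.t;       /* alternating sign of the series */
        return r;
    }
    m = a + (b - a) / 2;
    l = chud_bs(a, m);
    h = chud_bs(m, b);
    r.p = l.p * h.p;
    r.q = l.q * h.q;
    r.t = h.q * l.t + l.p * h.t;
    return r;
}

/* pi = Q * 426880 * sqrt(10005) / T; sqrt done by Newton's method. */
double chud_pi(long terms)
{
    double root = 100.0;
    int i;
    pqt r;
    assert(terms >= 1 && terms <= 10);  /* doubles overflow beyond a few terms */
    r = chud_bs(0, terms);
    for (i = 0; i < 6; i++)
        root = 0.5 * (root + 10005.0 / root);
    return r.q * 426880.0 * root / r.t;
}
```

Calling chud_pi(2) already fills a double, since two terms carry about 28 digits. The point of the recursion is that the two halves are combined with a few large multiplications instead of many small ones, which is where a fast big-number multiply pays off.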
The K7 version also doesn't print the result.
And... what is the Chudnovsky formula? (Sorry for the stupid question... I couldn't find it on Google.)
Quote from: EduardoS on October 12, 2006, 03:16:32 PM
The K7 version also doesn't print the result.
I think I figured out the problem on my way to work this morning. I have 2 parts to the code: a C part and the asm part. The asm part doesn't use any instructions newer than a K7. However, I remember using a switch for the C compiler telling it the target was a K8 (I have a K8). So I am guessing it is generating K8-specific code for the main part. I don't have the code in front of me, so I won't be able to check until I get home.
Quote from: EduardoS on October 12, 2006, 03:16:32 PM
And... what is the Chudnovsky formula? (Sorry for the stupid question... I couldn't find it on Google.)
A good place to check is MathWorld. The webpage below has quite a large number of algorithms for computing PI.
http://mathworld.wolfram.com/PiFormulas.html
search for "Chud"
you can also view the formula directly by going to this link:
http://mathworld.wolfram.com/images/equations/PiFormulas/inline216.gif
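In case the image link stops working, this is the Chudnovsky series as it is usually written (typed out here from the standard form, not copied from the program):

```latex
\frac{1}{\pi} = 12 \sum_{k=0}^{\infty}
  \frac{(-1)^k \, (6k)! \, (13591409 + 545140134\,k)}
       {(3k)! \, (k!)^3 \, 640320^{3k + 3/2}}
```

The three factorials mentioned earlier are (6k)!, (3k)! and k!, and each extra term adds roughly 14 decimal digits.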
Quote from: Mark_Larson on October 12, 2006, 05:28:27 PM
I think I figured out the problem on my way to work this morning. I have 2 parts to the code: a C part and the asm part. The asm part doesn't use any instructions newer than a K7. However, I remember using a switch for the C compiler telling it the target was a K8 (I have a K8). So I am guessing it is generating K8-specific code for the main part. I don't have the code in front of me, so I won't be able to check until I get home.
I'm using a K8 here, with SSE3, on x64 Windows...
Quote from: EduardoS on October 12, 2006, 07:12:30 PM
I'm using a K8 here, with SSE3, on x64 Windows...
Ah, I don't have 64-bit Windows. I wonder if that is the problem. I didn't want to install it, because when I first got my Athlon it wasn't that stable and didn't have support for a number of things. I had planned on adding a 64-bit Linux and doing dual boot.
Have you had any issues running 64-bit Windows?
Any issues with any 32-bit programs you compile and then run?
I used GCC for the compiler since I feel it does a better job optimizing for the C part of the code.
Wistrik, do you also have 64-bit Windows?
I tried XP x64 and now Vista x64. The problems:
- 16-bit programs simply don't work;
- BIG compatibility problems with drivers and programs which depend on these drivers (mostly on Vista);
- Debuggers don't work well.
Everything else goes well...
Mark,
No, I have standard 32-bit Windows XP Professional SP2. It's running on an AMD64 processor but not in native 64-bit mode.
Quote from: Wistrik on October 12, 2006, 08:34:23 PM
Mark,
No, I have standard 32-bit Windows XP Professional SP2. It's running on an AMD64 processor but not in native 64-bit mode.
Do you have a K7 or K8? If you have a K7 you probably have the issue I mentioned earlier with the gcc compile switch.
EDIT: My bad, Wistrik. You said AMD64 and that's K8. I was busy at work when I read your reply and missed that detail. It still works on my K8, so I am going to try the version attached to the forum in case it got corrupted. If that's not it, I'll try to make a special debug version with no optimizations turned on to see if that is it.
EDIT2: It was the zip file being corrupted. I didn't have a copy at work when I posted this, so I downloaded it from the pi-hacks newsgroup. I just downloaded that version to my computer and got the same error as y'all saw. So I zipped up my current versions of chudp4 and chudk7. It should work now. Let me know if you have any problems.
Working now:
Processor Name: AMD Athlon(tm) 64 Processor 3200+
Processor Speed: 2000 MHz
Number of processors: 1
#terms=70513, depth=18
..................................................
total time = 2.828
pi = ...
Note 1: Your 2.2GHz AMD needs 2.82 seconds, shouldn't it be a bit faster than my 2.0GHz?
Note 2: Why is the output result 0.31...e1 instead of 3.1...e0?
Mark,
ChudK7 now works on my home computer. Here's my output, clipped for brevity:
Processor Name: AMD Athlon(tm) 64 Processor 3200+
Processor Speed: 2000 MHz
Number of processors: 1
#terms=73938, depth=18
..................................................
total time = 2.969
pi =
0.31415926535897932<snip>e1
Edit: I'm not running with anything overclocked, but I do have a WinXP processor driver from AMD that varies processor speed in proportion to load, so it tends to run slower (and cooler) when the computer is 'idling', and full speed when I'm doing something like playing media or running a game.
Quote from: Wistrik on October 12, 2006, 11:37:59 PM
Edit: I'm not running with anything overclocked, but I do have a WinXP processor driver from AMD that varies processor speed in proportion to load, so it tends to run slower (and cooler) when the computer is 'idling', and full speed when I'm doing something like playing media or running a game.
COOL!!! where can I get a copy!!?!?!? :)
Quote from: EduardoS on October 12, 2006, 11:02:21 PM
Note 1: Your 2.2GHz AMD needs 2.82 seconds, shouldn't it be a bit faster than my 2.0GHz?
There's gonna be variation from running it several times, as well as what you have running in the background. I usually have a zillion windows open doing all sorts of stuff.
Quote from: EduardoS on October 12, 2006, 11:02:21 PM
Note 2: Why is the output result 0.31...e1 instead of 3.1...e0?
Floating point notation. That's the format I picked. For consistency with C I should have gone with the second format.
Mark,
Click here (http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_871_9706,00.html) for AMD's Athlon64 utilities. At the top of the list is a dual-core optimizer I don't use, but you might find handy. Further down look for AMD Athlon™ 64/FX Processor Driver for Windows XP and Windows Server 2003 Version (x86 and x64 exe) 1.3.2.16 ; this is the driver I use.
You might find other utilities by first going to the information page for a particular processor, then clicking on any drivers links near the bottom. I did that by selecting the AMD Athlon64 from the list of processors.
Thanks, my electric bill has been bad lately. Anything to cut back on power :) This is actually my first AMD processor. After I got a P4, I swore never to buy an Intel processor again unless they made major changes to how they do things.
I am looking at alternate methods to calculate PI that might eventually lead to a faster solution than using Chud's formula. One of the formulas I am looking at uses Sin() to compute PI. You can check out this webpage (search for "sin"):
http://numbers.computation.free.fr/Constants/Pi/iterativePi.html
It's a series to compute PI in which you can choose how many digits of precision you gain per iteration. Chud does 14 digits of PI per iteration. You can extend the sin algorithm to use 100 digits of precision per iteration or a million. It's completely flexible.
The coefficients in front of the sin terms in the formula can be easily calculated using a formula, thus making it easy to extend the equation.
n = index of the coefficient (n = 1 gives 1/6, n = 2 gives 3/40, n = 3 gives 15/336 = 5/112).
numerator = 1;
for (int i = 1; i <= n; i++)
    numerator *= (2*i - 1);
denominator = (2*n + 1);
for (int i = 1; i <= n; i++)
    denominator *= (2*i);
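The loops above can be wrapped into a small self-contained C function. This is only a sketch: the names fraction, gcd_ull and sin_coefficient are made up, the result is reduced to lowest terms (the loops alone give 15/336 for n = 3), and the limit n <= 10 is an arbitrary guard I added to stay far away from 64-bit overflow.

```c
#include <assert.h>

typedef struct { unsigned long long num, den; } fraction;

/* Greatest common divisor, used to reduce the fraction. */
static unsigned long long gcd_ull(unsigned long long a, unsigned long long b)
{
    while (b != 0) {
        unsigned long long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Coefficient in front of sin^(2n+1)(a_k):
   (1*3*5*...*(2n-1)) / ((2*4*6*...*2n) * (2n+1)), reduced. */
fraction sin_coefficient(int n)
{
    fraction c = {1, 1};
    unsigned long long g;
    int i;

    assert(n >= 0 && n <= 10);
    for (i = 1; i <= n; i++) {
        c.num *= (unsigned long long)(2 * i - 1);
        c.den *= (unsigned long long)(2 * i);
    }
    c.den *= (unsigned long long)(2 * n + 1);

    g = gcd_ull(c.num, c.den);
    c.num /= g;
    c.den /= g;
    return c;
}
```

For n = 1, 2, 3 this returns 1/6, 3/40 and 5/112, the same fractions that appear in the iteration formulas in this post.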
Examples using 3, 5, 7, and 9 digits per iteration:
3 digits:
a_{k+1} = a_k + sin(a_k)
5 digits:
a_{k+1} = a_k + sin(a_k) + (1/6) sin^3(a_k)
7 digits:
a_{k+1} = a_k + sin(a_k) + (1/6) sin^3(a_k) + (3/40) sin^5(a_k)
9 digits:
a_{k+1} = a_k + sin(a_k) + (1/6) sin^3(a_k) + (3/40) sin^5(a_k) + (5/112) sin^7(a_k)
You'll notice that in the above algorithms you are really only calculating one sin per iteration, since all of the terms use sin(a_k).
You can calculate sin using the following formula:
sin(n) = n - n^3/3! + n^5/5! - n^7/7! + n^9/9! - ...
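Here is a small double-precision sketch of the two pieces so far: the power series for sin, and the simplest iteration a_{k+1} = a_k + sin(a_k). It only illustrates the idea; the real thing would need big-number arithmetic. The function names taylor_sin and pi_by_sine, and the fixed 30 series terms, are my own choices for the example.

```c
#include <assert.h>

/* Sum the first `terms` terms of x - x^3/3! + x^5/5! - ... */
double taylor_sin(double x, int terms)
{
    double term = x;   /* current term, starts at x^1 / 1! */
    double sum = x;
    int k;

    assert(terms >= 1);
    for (k = 1; k < terms; k++) {
        /* next term = previous * (-x^2) / ((2k) * (2k + 1)) */
        term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
        sum += term;
    }
    return sum;
}

/* Repeat a = a + sin(a), starting from a guess near pi. */
double pi_by_sine(double a, int iterations)
{
    int k;

    assert(iterations >= 0);
    for (k = 0; k < iterations; k++)
        a += taylor_sin(a, 30);   /* 30 terms is plenty for a double near 3 */
    return a;
}
```

Starting from a = 3.0, a few iterations are enough to reach the limit of double precision, e.g. pi_by_sine(3.0, 4).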
Sin is an expensive operation, which is why no one has done this method before. But what if we do just ONE iteration and calculate one sin, and have enough sin terms in the formula to reach the number of digits of precision of PI we are looking for?
Each additional sin term in the formula above adds 2 digits of precision, and you start with 1 digit of precision. So if you wanted to calculate 1048576 digits of PI you would have 524,288 sin terms in your formula.
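The single-iteration idea can be tried out in double precision with the sketch below. It is an illustration under assumptions, not a big-number implementation: the name pi_one_step is made up, the caller has to supply sin(a0) (computed once, by whatever method), and the coefficients are generated with the recurrence c_j = c_{j-1} * (2j-1)^2 / (2j * (2j+1)), which reproduces 1/6, 3/40, 5/112.

```c
#include <assert.h>

/* One step of a0 + sin(a0) + (1/6) sin^3(a0) + (3/40) sin^5(a0) + ...
   using `terms` sin powers. sin_a0 must be sin(a0), supplied by the caller. */
double pi_one_step(double a0, double sin_a0, int terms)
{
    double coeff = 1.0;      /* coefficient of sin^(2j+1), starting at j = 0 */
    double power = sin_a0;   /* sin^(2j+1)(a0) */
    double sum = a0;
    int j;

    assert(terms >= 1);
    for (j = 0; j < terms; j++) {
        if (j > 0) {
            coeff *= (2.0 * j - 1.0) * (2.0 * j - 1.0)
                     / ((2.0 * j) * (2.0 * j + 1.0));
            power *= sin_a0 * sin_a0;
        }
        sum += coeff * power;
    }
    return sum;
}
```

With a0 = 3.0 and sin(3.0) = 0.1411200080598672, calling pi_one_step with 1, 2, 3, ... terms lets you watch how many digits each extra term buys before the double runs out of precision.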
I run a P4 at work and I'm not impressed. My home system is an order of magnitude faster (perceived performance) even though the clock speed is similar. It helps that I have little to no hard drive bottleneck in my home system (dual 250Gb SATA drives in RAID 0).
Yes, the trigonometric functions are spendy. In my old Commodore 64 I took a shortcut with SIN (in 6502 Assembly) by creating a table of precalculated values, so instead of feeding the angle into SIN, I used it as an index into the table. I seem to recall that SIN repeats every 90 degrees or so, like the sign bit flips or something. It's been awhile. You can use that to create a smaller table.
Edit: I notice the website author mentions accuracy with trig functions. I thought about that while writing the above. The problem is that high level languages only take their math functions out to so many digits of precision, so if you're striving for millions of digits, inaccuracy might sneak in while using the HLL functions. Or maybe I'm barking up the wrong tree...
Quote from: Wistrik on October 13, 2006, 03:05:19 PM
Yes, the trigonometric functions are spendy. In my old Commodore 64 I took a shortcut with SIN (in 6502 Assembly) by creating a table of precalculated values, so instead of feeding the angle into SIN, I used it as an index into the table. I seem to recall that SIN repeats every 90 degrees or so, like the sign bit flips or something. It's been awhile. You can use that to create a smaller table.
The values we pass to SIN will range between 0 and PI, because you are passing the current computed value of PI into the SIN function.
Quote from: Wistrik on October 13, 2006, 03:05:19 PM
Edit: I notice the website author mentions accuracy with trig functions. I thought about that while writing the above. The problem is that high level languages only take their math functions out to so many digits of precision, so if you're striving for millions of digits, inaccuracy might sneak in while using the HLL functions. Or maybe I'm barking up the wrong tree...
Sorry I didn't go into more detail with the SIN function above. It's actually an infinite series that you can run as many times as you need to get as many digits of precision as you need. So if we were computing PI to a million digits, we'd want a million digits of precision for the SIN. To be able to do the above formula you have to be able to support large integer arithmetic (which I already have for my PI code). So I have routines that can do multiplications, divisions, subtractions, additions, etc., all on numbers that are millions of digits long.
http://mathworld.wolfram.com/images/equations/Sine/equation4.gif
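For readers who haven't seen large integer arithmetic before, here is a minimal schoolbook multiplication in C. To be clear, this is not the routine from the PI program and it is far too slow for millions of digits (it is quadratic in the number of limbs); it only shows the basic representation. The names big_mul and LIMB_BASE, and the choice of 9 decimal digits per 32-bit limb, are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIMB_BASE 1000000000u   /* each limb holds 9 decimal digits */

/* Multiply a (na limbs) by b (nb limbs) into out (na + nb limbs).
   Limbs are stored least significant first; out must not overlap a or b. */
void big_mul(const uint32_t *a, size_t na,
             const uint32_t *b, size_t nb, uint32_t *out)
{
    size_t i, j;

    assert(na > 0 && nb > 0);
    for (i = 0; i < na + nb; i++)
        out[i] = 0;

    for (i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (j = 0; j < nb; j++) {
            /* fits in 64 bits: below 1e18 + 2e9 */
            uint64_t cur = out[i + j] + (uint64_t)a[i] * b[j] + carry;
            out[i + j] = (uint32_t)(cur % LIMB_BASE);
            carry = cur / LIMB_BASE;
        }
        out[i + nb] = (uint32_t)carry;   /* slot not yet written by earlier rows */
    }
}
```

For example, multiplying the single limb 999999999 by itself yields the two limbs {1, 999999998}, i.e. 999999998000000001.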
Hi Mark,
I tried it on a K7; it crashes after printing the time and "pi =". I guess it is due to the compiler-generated K8-specific code (SSE2) you talked about.
A little off-topic: how long does your math lib take to multiply, and to take the square root of, a number with 1 million decimal digits?
Quote from: EduardoS on October 13, 2006, 10:18:15 PM
Hi Mark,
I tried it on a K7; it crashes after printing the time and "pi =". I guess it is due to the compiler-generated K8-specific code (SSE2) you talked about.
I'll get to it when I get a free moment. I am going to make 2 programs, chudk7 (uses the k7 compiler switch) and chudk8 (uses the k8 compiler switch), and time them on my K8. If there isn't a big performance difference, I am simply gonna use the k7 version.
Quote from: EduardoS on October 13, 2006, 10:18:15 PM
A little off-topic: how long does your math lib take to multiply, and to take the square root of, a number with 1 million decimal digits?
I don't have the code in front of me, but I had done timings on a 1,048,576 digit number times a 1,048,576 digit number, and it was taking 8.59 milliseconds.
I am looking at doing some stuff to speed it up.
Quote from: Mark_Larson on October 16, 2006, 03:13:22 PM
I don't have the code in front of me, but I had done timings on a 1,048,576 digit number times a 1,048,576 digit number, and it was taking 8.59 milliseconds.
That's very fast. Combined with an equally fast square root, it would blow away your chudk7.exe :bdg