Just noticed that Dave's nickname dedndave is being used by a new member on Project Euler but with some upper case letters, i.e. DednDave (although his e-mail address is with all lower case letters).
Could it be his cyber twin? :eek :dazzled: :dance:
no, Raymond
that is me :P
didn't realize my email was visible - thanks for the heads-up
I wasn't too sure since upper case letters were being used, and you never use them on this forum.
And don't worry Dave. Your e-mail is NOT visible. Remember I'm one of the admins having access to all the info.
oohhhhhh
then, you probably saw my first post, too
As I have replied to someone else on that forum, we are studying various possibilities to open up locked threads for new members. Stay tuned and be patient.
i was disappointed to see the previous discussion for that particular problem
hopefully, they aren't all like that
What is "that particular problem"?
Euler problem #48
QuoteThe series, 11 + 22 + 33 + ... + 1010 = 10405071317.
Find the last ten digits of the series, 11 + 22 + 33 + ... + 10001000.
http://projecteuler.net/index.php?section=problems&id=48
once you have provided the correct answer, you are allowed to see the forum for that problem
unfortunately, the discussion spans from 2004 to 2008, when the thread was locked
for the most part, they discussed their favorite languages
and showed how few lines and how short the source text was - lol
i am more interested in methodology than i am in syntax
Well that was straight forward. It just surprised me how many people computed it to full precision, and the level of complexity they used. This would make a great interview question.
it is certainly more efficient to tally only the low 10 digits
however, it is easier to write the code to go full precision
especially in my case, as i already had the bignum exponentiation and integer to ascii routines on hand
this is about all i had to write to get the full string...
loop00: INVOKE uExpInt,ebx,ebx,offset Accum,sizeof Accum
add ecx,7
shr ecx,2
call AddAcc
inc ebx
cmp ebx,1000
jbe loop00
INVOKE uArb2StrL9,offset Result,sizeof Result,offset Buff,sizeof Buff
;
;
;
AddAcc PROC
mov esi,offset Accum
clc
mov edi,offset Result
AddAc0: mov eax,[esi]
adc [edi],eax
dec ecx
lea esi,[esi+4]
lea edi,[edi+4]
jnz AddAc0
ret
AddAcc ENDP
i think i spent more time calculating how many digits there would be, which took a few minutes - lol
Quote from: dedndave
i think i spent more time calculating how many digits there would be, which took a few minutes - lol
I didn't have a library to hand, so jammed it out in C using 64-bit math, for both the power and accumulation. The assembler version is a bit more bulky, but you're probably hiding behind several hundred lines of library code.
Ok so how many digits would you have with n=10000, the 10 least significant digits should be 6237204500, I think.
i am gonna guess about 30,000 digits :bg
1000010000 = (104)10000 = 1040000 = 40000 digits
As a side note, in the post previous to yours, someone also suggested to look into modular exponentiation.
And everything could have been a lot easier (in assembly) if only the last 9 digits had been required. 10-digit remainders don't all fit in 32-bit!
Yes, it dawned on me driving about town that 10000 has 4 zeros, so multiplying it 10000 times would be up at 40001 digits.
It was pretty clear 32-bit didn't work, the 64-bit method I used should work for values of n up to 100000000. Or some 800000001 digits
my last ten digits agree with yours, Clive :bg
I didn't get too aggressive with this, another optimization would be to note that modulo 10 powers contribute nothing to the sum
unsigned __int64 computepuzzle48(void)
{
__asm
{
xor eax,eax ; accumulator
xor edx,edx
mov ecx,1 ; power
loopouter:
push ecx
push edx
push eax
xor edx,edx ; edx:eax = ecx
mov eax,ecx
or ecx,ecx ; paranoid, ecx = 0
jz skip2
mov ebx,ecx
dec ecx
jz skip2
loopinner:
; edx:eax *= ebx
push edx
push eax
mov eax,edx
xor edx,edx
mul ebx
mov edi,eax
pop eax
pop edx
mul ebx
add edx,edi
; edx:eax modulo 10000000000 (max 429496729599999)
cmp edx,2 ; cheap test for below 10000000000
jb skip1
push edx
push eax
mov edi,100000
div edi
xor edx,edx
div edi
xor edx,edx
mul edi
mul edi
mov edi,eax
pop eax
sub eax,edi
mov edi,edx
pop edx
sbb edx,edi
skip1:
dec ecx
jnz loopinner
skip2:
; edx:eax holds power to 10 least significant digits
; edx:eax += (accumulate)
pop ecx
add eax,ecx
pop ecx
adc edx,ecx
; edx:eax modulo 10000000000 (max 429496729599999)
cmp edx,2 ; cheap test for below 10000000000
jb skip3
push edx
push eax
mov edi,100000
div edi
xor edx,edx
div edi
xor edx,edx
mul edi
mul edi
mov edi,eax
pop eax
sub eax,edi
mov edi,edx
pop edx
sbb edx,edi
skip3:
pop ecx
inc ecx
cmp ecx,1000
jb loopouter
; edx:eax Answer, 10 least significant digits
}
}
In those early days, I used plain BCD's for this problem, doing computations only with the last 10 digits. No need for the conversion of a big number at the end. Still runs in a decent time: 0.12 sec on my i7.
I'll try and see if using modular exponentiation and BCD's will improve that. What was your timing Dave with "big numbers"?
for N=1000, it was about 78 mS, as i recall
that yielded the full length result
for N=10000, i had time to go to the bathroom and grab a fresh soda :lol
it also yielded the full length result
For the C version N=1000 was about 47 ms, N=10000 was about 5.8 s, N=100000 was about 571.3 s (3031782500 ?)
For the assembler version N=10000 was about 4.2 s, N=100000 was about 430.9 s but I had to adjust the balance of the modulus to keep results within 32-bits.
1.6 GHz Atom Netbook
Quoteanother optimization would be to note that modulo 10 powers contribute nothing to the sum
well, it would be easy to add 1010101 (decimal) for N=1, 10, 100, 1000
i think there may be other simple shortcuts, as well - those that apply to powers of 2, for example
i looked at factoring, it doesn't seem like it would be very efficient
and - it may be faster, but it would require a lot more code
no - i think there is some other trick, here
most of the euler problems have "hidden" solutions :bg
My observation was that 10**10, 20**20, .. 110*110, 120*120, etc contribute nothing to the least significant digits. Thus eliminating ~10% (non linear, but whatever) of the work.
my mistake, Clive
you're right (again - lol)
my head may have been thinking exponents, but my hands were stuck on squares :P
Quote from: dedndave
no - i think there is some other trick, here, most of the euler problems have "hidden" solutions
Yes, there are probably a lot of tricks and short cuts, the site has a lot of interesting problems. Some languages are better at this than others, at least with the postings it seems that some languages handle large numbers in a much more fluid manner, and permit brute force solutions with very simple coding.
Doing it in assembler is a particular pig because you have to think in binary and decimal, and the scope of the numbers. Raymond's BCD suggestion at least keeps the decimal/decades in focus. I've done BCD on the 6502/68K/x86, but not sure how much effort has gone into optimizing those instructions over the years.
The trick with this one was to truncate the computation of unused digits, allowing the solution to scale 10**2 for each additional decade in N.
well - the challenge part is finding the efficient method
the fun part is coding it in assembler :bg
for this problem, my first instinct was to make it a horner's rule solution
that would apply better if it were
11 x 22 x 33 x 44...... :bg
anyways, binary exponentiation looks promising, and we only need to go as far as the low-order 34 bits in the result
it is a 2-step process
1) exponentiation loop
2) conversion to decimal
horner could be used for step 2
but, from my experience, there are faster algorithms for converting values that are smaller than 2 x CPU register width
On a Phenom II X4 820, 2.8 GHz
For the C solution, skipping modulo 10 powers
N=10000, 1.512 seconds, 6237204500
N=100000, 149.378 seconds, 3031782500
N=1000000, 14955.012 seconds, 4077562500 ?
I could split the power computations into threads?
So Raymond, what kind of results are you getting with BCD, or other methodologies ?
that's pretty fast :P
64-bit ?
not sure i can play on the same field with a 32-bit machine
Quote from: dedndavethat's pretty fast :P
64-bit ? not sure i can play on the same field with a 32-bit machine
The machine is 64-bit, the C is MSVC 12.00, 32-bit, using 64-bit integers (unsigned __int64). It's about 2.8x the speed of the Atom, so figure 43000 seconds for N=1000000, say 12 hours or so on the netbook.
The run time for the previous assembler, with a slight modification to balance the mod 10000000000 a bit better so the first divide fits the result in 32-bits, but without the mod 10 power fix is only 8-9% slower.
Modular exponentiation is definitely faster. Got the timing down from 120 ms to 12 ms for the 1000 limit.
Result for the 10000 limit confirmed. Timing 150 ms.
Result for the 100000 limit different from clive: 4131782500, timing 1.88 sec.
Result for the 1000000 limit slightly different from clive: 5077562500, timing 23.4 sec.
I wonder what Dave would get on the last two.
All these should be significantly faster with a 64-bit program using integers instead of BCD's.
i am working on an FPU version, Ray - seeing as i do not have a 64-bit machine :P
however, i am not using packed BCD values
i am using 64-bit integers
i don't recall what the different timings are for FPU integer math, so my method may not be that efficient
i will post code if i get it working - lol
I have a pretty high confidence in the limit numbers, the algorithm is sound, and I'm not seeing any overflow. I've pulled some x86-64 compilers, and built both 64-bit and 128-bit integer variants to run on the Phenom. My exponential/power computation is brute force rather than efficient/optimized, hence the 100x increase with each decade, as apposed to the 10x, but it should be immune to the potential overflow of other methods. I could generate a table of n**n computation to see where our solutions diverge.
I downloaded a 64-bit version of MSVC, which surprisingly didn't support __int128, so I pulled gcc/mingw64. Microsoft are so screwed.
Dave doesn't your Prescott do 64-bit? or did you get one of the crippled ones? I had Win32 running on mine, but it would dual boot to 32 or 64-bit Linux. My 2005 Prescott box died a while back, I had to transplant it's guts to an AMD X2 refurbished donor box, and picked up Quad Phenom to have a more current dev box.
actually, the prescott i have is em64t compatible
however, i have no idea how to execute 64-bit instructions - or even if i can - with 32-bit XP
my current solution is to use the 64-bit integer capability of the FPU
i am not too familiar with using integers with the FPU, and even less familiar with its' packed BCD capability - lol
but, i am making headway
i looked at multiple-precision multiplication and became disinterested :P
i have done it before, the Ling Long Kai Fang routines use 64 x 64 multiply with 32-bit registers
Clive, your numbers are probably correct. I must have a minor bug with the BCD's which affects the higher significant digits. I'll try and find it.
The reason I think my BCD results may be wrong is that I wrote another algo using integers to get the lower 9 digits (i.e. <32 bits) and results agree with all your lower 9-digit results.
As expected, timings with integer math and modular exponentiation are considerably lower:
1000 0.2 ms
10000 2.7 ms
100000 35 ms
1000000 250 ms
Dave, I'm still wondering why you would have chosen #48 as the first one to solve. It seems rather random.
Once you solve #12, you may like the PDF I wrote for that problem.
actually, one of the other members mentioned that specific problem
when i looked at it, i thought it was a horner's rule problem
as i mentioned, i had already written routines to exponentiate and convert the bignums
so, it looked easy and i gave it a whirl :P
it has been many years since i took the math courses in college
i have forgotten most of what i learned because i rarely have to use it to design circuits
i use derivatives once in a while
now and then, i have to understand how something works that is based on higher math
only enough to grasp the underlying principal, though :P
at any rate, many of the euler problems are a little over my head
it just dawned on me.....
this does not apply to the euler problem because 10 digits are required and Nmax is only 1000
but, it does apply to these extended versions that you guys have created
the one thing that has remained constant in all these problems is that the result is limited to 10 digits
10 digits may be fully expressed with 34 binary bits
that means that only the lower 17 bits of N will effect the result (squareroot)
for the Nmax=1,000,000 problem, only the lower 5 digits of N will effect the 10 digit result
in other words, there are several repeated calculations :bg
notice the result ends in 0's
you could perform the calculation for N=1 to 99,999, then multiply by 100 :U
Most (if not all) of the first 150-200 problems are accessible with high school math knowledge, a little logic, pattern spotting, or maybe a little extra knowledge about how some numbers can be computed.
My knowledge is very little above high school math so that many of the problems above #250 are also way above my head. Some of the exceptions above #300 are:
309, 311, 315, 321, 327, 333, 341, most of which I had a hand in their development.
That last one (#341) was fun since my algo ran 100 times faster than that of the original proposer who is an excellent problem solver.
i may look at them - after a bit :P
too many irons in the fire, at the moment
another revelation - lol
i can use 80-bit extended reals, rather than 64-bit integers
so long as i keep everything under 64 bits, there should be no loss in precision
this being the "native language" of the FPU should be faster
Clive,
Found my stupid error. My initial algo was based on a maximum limit having at most 4 digits. I simply forgot to adjust that when I started raising the limit above 10000 and it was affecting all the odd numbers. All my results now agree with yours.
It's surprizing that I was still getting the last 8 digits correct with that error. Dave may have something regarding a shortcut. It may be worth investigating.
Quoteso long as i keep everything under 64 bits
Under is the correct word. Qwords are treated as signed by the FPU.
Well this was the original 5 min solution. I could probably do it with doubles, or extended precision. I seemed more viable than computing all the digits. There are probably far better ways of doing the power/exponent function, I was just worried about keeping the number within the scope of a 64-bit number, as it gets out of control very quickly, obviously.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
int i, j;
unsigned __int64 x, y;
x = 0;
for(i=1; i<=1000; i++)
{
y = i;
for(j=1; j<i; j++)
{
y *= i;
y %= 10000000000;
}
x += y;
}
x %= 10000000000;
printf("%4d %010I64d %010I64d\n",i-1, y, x);
}
ok - i am going to give up some code before i am done perfecting it
normally, i don't do that because someone else will come along and do it and claim it as theirs - lol
but, in this case, it seems to be just the 3 of us and i don't feel too worried
the game plan is to use right-to-left binary exponentiation
we only need 34 bits in the result, but i am designing it to use 42 bits
because we have room and i think we could get an 11th digit if we wanted it
my idea is to use all 8 registers of the FPU stack and rotate them like a revolver cylinder
(http://operatorchan.org/k/arch/src/k225479_627-8shot.jpg)
that way, we eliminate memory accesses inside the loop
i am designing the loops from the inside out
here is what i have so far...
jmp short loop_entry
;-------------------------------------------------------------------
mul_operand_to_accumulator:
fincstp
fincstp
;st(0) = exponentiation accumulator
;st(1) = 42-bit modulo divisor = 4,398,046,511,104
;st(2) = addition accumulator
;st(3) = 21-bit modulo divisor = 2,097,152
;st(4) = 1
;st(5) = operand (N)
;st(6) = temporary operand (N^2^x)
;st(7) = 21-bit modulo divisor = 2,097,152
fmul st(6)
fprem ;lower 42 bits of exponentiation accumulator
fdecstp
fdecstp
;-------------------------------------------------------------------
square_temp_operand:
;st(0) = temporary operand (N^2^x)
;st(1) = 21-bit modulo divisor
;st(2) = exponentiation accumulator
;st(3) = 42-bit modulo divisor = 4,398,046,511,104
;st(4) = addition accumulator
;st(5) = 21-bit modulo divisor = 2,097,152
;st(6) = 1
;st(7) = operand (N)
fmul st(0)
fprem ;lower 21 bits of temporary operand
;-------------------------------------------------------------------
loop_entry:
shr eax,1
jc mul_operand_to_accumulator
jnz square_temp_operand
;-------------------------------------------------------------------
fincstp
fincstp
fincstp
fincstp
;st(0) = addition accumulator
;st(1) = 21-bit modulo divisor = 2,097,152
;st(2) = 1
;st(3) = operand (N)
;st(4) = temporary operand (N^2^x)
;st(5) = 21-bit modulo divisor
;st(6) = exponentiation accumulator
;st(7) = 42-bit modulo divisor = 4,398,046,511,104
fadd st(6)
fprem ;lower 21 bits of addition accumulator
;-------------------------------------------------------------------
i was doing it with integers, but switched to extended reals and had to start over :'(
at the end, there, i see i can re-arrange the registers and eliminate a couple rotations
well - the loop assembles....
....and runs with no exceptions....
....but i don't get the right answer, yet
i am in "debug" mode, now :P
ok - the loop works, to some degree
and, it's very fast
i am starting to count clock cycles instead of mS :U
something in the order of 2,000,000 cycles
if you prefer a time, it is in the order of 667 microseconds on my 3 GHz P4
however, i am having a precision problem - i haven't sorted it out, yet
the time shouldn't change signifigantly if fixed unless i have to extend precision via adding instructions
using fincstp and fdecstp works like a charm
Quote from: dedndave on August 14, 2011, 08:31:35 PM
ok - i am going to give up some code before i am done perfecting it
normally, i don't do that because someone else will come along and do it and claim it as theirs - lol
but, in this case, it seems to be just the 3 of us and i don't feel too worried
Hi,
Well actually I was trying a BCD version. And while I eventually
got the N = 10 value correct, I can't duplicate Clive's numbers for
N = 10,000. So I assume my N = 1,000 is bad as well. Pegs the
CPU at 100% nicely though. And it probably should be thrown
out and started over if I wanted reasonable performance.
N = 1,000 second(s), N = 10,000 minutes.
Cheers,
Steve N.
welcome to the fray, Steve :bg
for 1,000, the first digit is a 9 and the last one is a 0
this code is somewhat correct
the problem is lack of precision
i may get to play with it today
uArb2StrL9 is my binary-to-ascii routine
i will write a special/faster version once i fix the loop
Nmax EQU 1000
.DATA
fpTzero real10 0.0
fpTone real10 1.0
fpTmod21 real10 2097152.0
fpTmod42 real10 4398046511104.0
.DATA?
nQResult dq ?
szBuffer db 24 dup(?)
.CODE
_main PROC
;st(0) = temporary operand (N^(2^x)) start with Nmax
;st(1) = 21-bit modulo divisor = 2,097,152
;st(2) = exponentiation accumulator start with 1
;st(3) = 42-bit modulo divisor = 4,398,046,511,104
;st(4) = 1
;st(5) = operand (N) start with Nmax
;st(6) = additive accumulator (result) start with 0
;st(7) = 42-bit modulo divisor = 4,398,046,511,104
finit
mov edx,Nmax
fld fpTmod42
push edx
fld fpTzero
fild dword ptr [esp]
fld fpTone
fld fpTmod42
fld fpTone
fld fpTmod21
fild dword ptr [esp]
fwait
pop eax
jmp short exp_loop_entry
top_of_outer_loop:
fdecstp
mov eax,edx
fsub st(0),st(7)
fdecstp
fdecstp
ffree st(7)
fld st(1)
fdecstp
ffree st(7)
fld st(4)
jmp short exp_loop_entry
mul_operand_to_accumulator:
fincstp
fincstp
fmul st(0),st(6)
fprem ;lower 42 bits of exponentiation accumulator
fdecstp
fdecstp
square_temp_operand:
fmul st(0),st(0)
fprem ;lower 21 bits of temporary operand
exp_loop_entry:
shr eax,1
jc mul_operand_to_accumulator
jnz square_temp_operand
;add exponentiated accumulator to additive accumulator
fdecstp
fdecstp
dec edx
fadd st(0),st(4)
fprem ;lower 42 bits of additive accumulator
jnz top_of_outer_loop
fwait
fistp nQResult
ffree st(6)
ffree st(5)
ffree st(4)
ffree st(3)
ffree st(2)
ffree st(1)
ffree st(0)
INVOKE uArb2StrL9,offset nQResult,sizeof nQResult,offset szBuffer,sizeof szBuffer
sub ecx,10
jbe skip_it
add edx,ecx
skip_it:
print edx
print chr$(13,10)
inkey
exit
_main ENDP
Quote from: dedndave on August 15, 2011, 01:12:47 PM
welcome to the fray, Steve :bg
for 1,000, the first digit is a 9 and the last one is a 0
Hi,
Thanks. Stomped another bug, and I matched Clive's
10,000 value and the digits you mentioned for N = 1,000.
Regards,
Steve
nice work :U
i found that mine works perfectly.....
up to Nmax = 7 (http://l.yimg.com/us.yimg.com/i/mesg/emoticons7/24.gif)
by changing the modulo divisors to 234, i get the correct result for Nmax=10
the problem comes when i have to multiply a 34-bit temporary operand onto a 34-bit exponential accumulator
( 234-1 )2 MOD 234
i'll have to extend the precision a little for that one operation
i need 68 bits of precision, just long enough to take the modulus - lol
i can do a sort of "partial multiply"
On a 3GHz Phenom II x4 the following straight forward single threaded (using 1 core) PowerBASIC program gives these results:
N=1,000 time= 4.28 ms Answer = 9110846700
N=10,000 time= 449.42 ms Answer = 6237204500
N=100,000 time= 49900.84 ms Answer = 3031782500
N=1,000,000 time= 5504076.67 ms Answer = 4077562500
These are the same results as Clive and on a similar CPU but significantly faster than his c version.
I also have a better algorithm but it's not quite finished. It's heading for about 1.5ms for N=1000 and 3.3s for N=1,000,000
This would be about 10x slower than the times quoted earlier by Raymond:
Quote1000 0.2 ms
10000 2.7 ms
100000 35 ms
1000000 250 ms
..so I might not bother finishing it!
Paul.
%CPUfreq = 3000000000
%MaxValue = 100000
#DIM ALL
FUNCTION PBMAIN () AS LONG
LOCAL Sum,k, x AS QUAD
LOCAL t,r AS LONG
PRINT "N=";%MaxValue
TIX x
FOR r = 1 TO %MaxValue
k = 1
FOR t = 1 TO r
k = (k * r) MOD 10000000000
NEXT
sum = (sum + k) MOD 10000000000
NEXT
TIX END x
PRINT "Answer =";sum
PRINT "Tix";x
PRINT "time="x/%CPUfreq*1000;"ms"
WAITKEY$
END FUNCTION
hi Paul - good to see you
those are very impressive times
it would be interesting to see how the compiler implements the algo
Quoteit would be interesting to see how the compiler implements the algo
The above program FOR/NEXT loop compiles as follows:
CPU Disasm
Address Hex dump Command Comments
002010DA C7C7 01000000 MOV EDI,1 ; FOR r = 1 ..
002010E0 DB05 943A2000 FILD DWORD PTR DS:[203A94] ; k=1
002010E6 DDD9 FSTP ST(1)
002010E8 89BD 30FFFFFF MOV DWORD PTR SS:[EBP-0D0],EDI
002010EE C7C6 01000000 MOV ESI,1 ; FOR t = 1 ..
002010F4 E9 1E000000 JMP 00201117
002010F9 90 NOP
002010FA 90 NOP
002010FB 90 NOP
002010FC 90 NOP ;pad to align
002010FD 90 NOP
002010FE 90 NOP
002010FF 90 NOP
00201100 DF2D 983A2000 FILD QWORD PTR DS:[203A98] ; 10,000,000,000
00201106 897D 94 MOV DWORD PTR SS:[EBP-6C],EDI ; r
00201109 DB45 94 FILD DWORD PTR SS:[EBP-6C]
0020110C D8CA FMUL ST,ST(2) ; k * r
0020110E E8 B5190000 CALL 00202AC8 ; MOD 10,000,000,000
00201113 DDD9 FSTP ST(1) ; k = result
00201115 FFC6 INC ESI ; NEXT t
00201117 8BC6 MOV EAX,ESI
00201119 3B85 30FFFFFF CMP EAX,DWORD PTR SS:[EBP-0D0]
0020111F ^ 7E DF JLE SHORT 00201100
00201121 DF2D 983A2000 FILD QWORD PTR DS:[203A98] ; 10,000,000,000
00201127 D9C1 FLD ST(1) ; k
00201129 D8C3 FADD ST,ST(3) ; sum + k
0020112B E8 98190000 CALL 00202AC8 ; MOD 10,000,000,000
00201130 DDDA FSTP ST(2)
00201132 FFC7 INC EDI ; NEXT r
00201134 81FF 40420F00 CMP EDI,0F4240
0020113A ^ 7E A4 JLE SHORT 002010E0
00202AC9 D9EE FLDZ
00202ACB D8DA FCOMP ST(2) ; FLOAT 10000000000.00000000
00202ACD 9B WAIT
00202ACE DFE0 FSTSW AX
00202AD0 F6C4 40 TEST AH,40
00202AD3 75 0C JNE SHORT 00202AE1
00202AD5 D9F8 FPREM ; MOD
00202AD7 9B WAIT
00202AD8 DFE0 FSTSW AX
00202ADA F6C4 04 TEST AH,04
00202ADD ^ 75 F6 JNE SHORT 00202AD5
00202ADF D9C9 FXCH ST(1)
00202AE1 DDD8 FSTP ST
00202AE3 58 POP EAX
00202AE4 C3 RETN
QuoteThis would be about 10x slower than the times quoted earlier by Raymond:
Quote
1000 0.2 ms
10000 2.7 ms
100000 35 ms
1000000 250 ms
you never can tell - lol
he might have you beat for the small one, but yours may win on the big ones :bg
It can be sped up a little (4-5%) by a more efficient conversion to ASM but the huge majority of the time (about 80% of the total) is spent on the fprem instruction so it's not going to get a lot faster unless the algorithm is changed or the fprem is replaced with something better.
Paul.
FOR r = 1 TO %MaxValue
!fld1 'k=1
!fild r
!fild TenBillion
!mov eax,r 't
#ALIGN 16
lp1:
!fld st(1) 'replicate r
!fmul st(0),st(3) 'r*k
!fprem 'MOD 10,000,000,000
skip:
!fstp st(3) 'mov result back to k
!dec eax
!jne short lp1
!fcompp 'dump 2 items off stack
!fistp k
sum = sum + k
NEXT
Paul,
That 250 ms for 1000000 was to generate only the last 9 digits using 32-bit integers. A similar time would be expected to get the last 10-18 digits using 64-bit integers on a 64-bit machine (but not on a 32-bit machine).
As a side note, I'm really surprised that this thread is getting so much attention. 449 views at last count! :dazzled:
And I will be away without access to a computer nor to the internet starting tomorrow morning until the end of this week. :(
we'll save your seat, Ray :bg
Hi,
Probably off-topic, or maybe a by the way kind of thing. But
the second to last bug fix in my code that got me from N = 9
to N = 10 showed some unexpected behavior with the BCD
instructions. (Well it surprised me a bit.)
; Off by 16 code contained the following.
ADD AL,[DI] ; If there is any existing data.
XOR AH,AH
AAA ; Adjust after addition.
; The working code is now.
XOR AH,AH ; Required as AAA increments AH, not a put result.
ADD AL,[DI] ; If there is any existing data.
AAA ; ASCII Adjust after Addition.
Pickey, pickey, pickey.
Cheers,
Steve N.
that's because AAA works by using the AC and CF flags, i think
XOR clears them :P
as for my algo, it really needs to be coded in SSE/SSE2
but then, it would have to be compared with the other algo, also written using SSE/SSE2 - lol
not sure i want to spend the time to learn SSE at this point
QuoteThat 250 ms for 1000000 was to generate only the last 9 digits
I see. In that case, below is my solution.
Here are the results it gives:
N= 10 time= 0.003ms Answer = 405071317
N= 100 time= 0.030ms Answer = 9027641920
N= 1000 time= 0.447ms Answer = 9110846700 14.9x
N= 10000 time= 6.092ms Answer = 6237204500 13.6x
N= 100000 time= 76.722ms Answer = 3031782500 12.6x
N= 1000000 time= 926.138ms Answer = 4077562500 12.1x
N= 10000000 time= 10,878.353ms Answer = 4535362500 11.7x
N= 100000000 time= 125,184.016ms Answer = 9113362500 11.5x
N= 1000000000 time=1,412,944.585ms Answer = 4893362500 11.3x longer than the one above
It's not quite linear but isn't bad.
The program is in PowerBASIC but uses ASM to do the complicated (x * y) MOD 10000000000.
Paul.
%CPUfreq = 3000000000
%MaxValue = 1000000
#DIM ALL
FUNCTION PBMAIN () AS LONG
LOCAL Sum,k,j, x, TenBillion, temp AS QUAD
LOCAL t,r AS LONG
LOCAL prod3,prod2,prod1 AS LONG
LOCAL TwoToThePower34Mod1E10, TwoToThePower64Mod1E10 AS QUAD
TenBillion = 10000000000
TwoToThePower34Mod1E10 = 7179869184
TwoToThePower64Mod1E10 = 3709551616
TIX x
sum = 0
FOR r = 1 TO %MaxValue
k = 1
t = r
temp = r
WHILE t > 0
IF (t AND 1) THEN
'k = (k * temp) MOD 10000000000. k and temp are 34 bit (10 digits)
!lea eax,k
!call DoCalc
END IF
'temp = (temp x temp) MOD 1E10. temp is 10 digits (34 bits)
!lea eax,temp
!call DoCalc
SHIFT RIGHT t,1
WEND
sum = (sum + k) MOD 10000000000
NEXT
TIX END x
PRINT "N=";%MaxValue ,;
PRINT "time=";FORMAT$(x/%CPUfreq*1000,"###,###,###.000");"ms " ;
PRINT "Answer =";sum
WAITKEY$
EXIT FUNCTION
#ALIGN 16
DoCalc:
'on entry eax = address of the variable to update
'var = (var * temp) MOD 1E10 (all values 34 bit)
'could maybe be done in FPU in 2 blocks instead, might be quicker.
!pushad 'save registers
!mov esi,[eax] 'get the variable into edi:esi
!mov edi,[eax+4]
!mov eax,temp 'do low x low dword first
!mul esi
!mov ebx,eax 'store in ecx:ebx
!mov ecx,edx
!mov eax,temp[4] 'do low x high dword
!mul esi 'now finished with esi so I can reuse it as the high dword of result
!add ecx,eax 'add to result so far
!mov esi,edx
!adc esi,0 'carry into high register
!mov eax,temp 'do high x low dword
!mul edi
!add ecx,eax 'add to result so far
!adc esi,edx 'carry into high register
!mov eax,temp[4] 'high x high only a few bits in this one so it never overflows to the next register
!mul edi
!add esi,eax
!mov prod1,ebx 'store the full result
!mov prod2,ecx
!mov prod3,esi
!mov edi,ecx 'this will be the surplus bits (low part)
!shr edi,2 'shift bits into place
!and ecx,&h3 'remove bits 34 upwards
!mov eax,TwoToThePower34Mod1E10[4] 'correct for the 1st block of overflow bits
!mul edi
!xchg eax,edi
!mul dword ptr TwoToThePower34Mod1E10
!add edx,edi
!add ebx,eax
!adc ecx,edx
!mov eax,TwoToThePower64Mod1E10[4] 'correct for the 2nd block of overflow bits
!mul esi
!xchg eax,esi
!mul dword ptr TwoToThePower64Mod1E10
!add edx,esi
!add ebx,eax
!adc ecx,edx
'now store the result
!mov prod1,ebx
!mov prod2,ecx
'this sum may have overflowed a few bits due to all the adds and needs to be reduced mod 1e10
!fild TenBillion
!fild qword ptr prod1
!fprem
!mov eax,[esp+28] 'get address of answer
!fistp qword ptr [eax] 'save answer
!fcomp 'discard unused item off the stack
!popad 'restore registers
!ret
END FUNCTION
that beats my best effort using the FPU :U
'this sum may have overflowed a few bits due to all the adds and needs to be reduced mod 1e10
!fild TenBillion
!fild qword ptr prod1
!fprem
couldn't you simply AND off any excess bits ?
10 billion can be fully expressed with 34 bits - just whack off those above
Dave,
you can't just discard the most significant bits. The problem would be trivial if you could!
The first bit beyond the 10 digits is bit 34. 2^34=17,179,869,184
If you chop off that 1 bit then you effectively subtract 17,179,869,184 from the number and the values of all the lower digits are affected.
It would be different in decimal where discarding the 11th digit would be the same as subtracting 10,000,000,000 but all of those zeros mean that none of the lower digits would be affected.
Paul.
ahhhhhh
that's where i am going wrong !!!!
in the algo i have been playing with, i truncated bits - not decimal digits
let me see if it helps :P
well - it does help - it increases the range
however, not far enough
no matter how i slice it, i think my algo needs more precision
if we only needed 9 digits, mine would work
of course, if we only needed 9 digits, the whole problem is easier :bg
Quote from: dedndave on August 16, 2011, 01:53:05 PM
that's because AAA works by using the AC and CF flags, i think
XOR clears them :P
Hi,
That would explain it. I wonder why they did that? (Rhetorical.)
Reading a text book explains it. Must have saved some transistors.
I should have known that.
Regards,
Steve
ADD AL,[DI] ; If there is any existing data.
MOV AH,0
AAA ; Adjust after addition.
:bg
actually, the fact that AAA, DAA, AAS, DAS use the flags makes them usable in "trick" schemes
An improved attempt, Phenom II X4 2.8 GHz, in C
10000 6237204500 8 ms
100000 3031782500 95 ms
1000000 4077562500 1142 ms
10000000 4535362500 13411 ms
100000000 9113362500 154577 ms
2nd run
10000 6237204500 7 ms
100000 3031782500 94 ms
1000000 4077562500 1150 ms
10000000 4535362500 13472 ms
100000000 9113362500 154893 ms
1000000000 4893362500 1751345 ms
10000000000 2693362500 19500530 ms
Hi,
Woke up too early Monday morning, and thought up a way
to speed things up (maybe). After coding up a few disasters,
I convinced myself it wouldn't work. Then woke up too early
again today and decided it should work after all. After undisastering
the thing and adding some timing code I tried it out. Yipee!
Another insomnia event put to rest.
The after and before results follow. The old code was timed
using the TIME command in a BATch file. Code sizes in *.EXE
thrown in for good luck. Even though they are not strictly
comparable.
Regards,
Steve N.
Calculating sum of series S N**N N = 1, 1000.
Glorious second version.
18 Aug 2011 8:47:24
0000000000 0000000001 0000000000 0000000000 0000000000 9????????0
8:47:24
Calculating sum of series S N**N N = 1, 10,000.
Glorious second version.
18 Aug 2011 8:47:29
0000000000 0000000001 0000000000 0000000000 0000000000 6237204500
8:47:30
Current time is: 8:53:02.38
Calculating sum of series S N**N N = 1, 1000.
0000000000 0000001000 0000000000 9????????0
Current time is: 8:53:04.17
Calculating sum of series S N**N N = 1, 10,000.
0000000000 0000010000 0000000000 6237204500
Current time is: 8:56:22.80
RTC EXE 1225 1-07-10 3:10p
SUMS_EX EXE 1376 8-15-11 8:32a
SUMS_EX2 EXE 1856 8-18-11 8:47a
Improved attempt:
N= 10,000,000,000 time=3,638,023ms Answer = 2693362500
It's a slightly improved algorithm to the one above and then threaded to use all 4 cores of the Phenom II x4 3GHz cpu.
There is actually a huge short cut that can be made for the big numbers. Once you get to 1,000,000 things repeat so you can just calculate for n=1,000,000 and then a simple multiply and add will cover any number of whole millions beyond that.
e.g. 1,000,000 gives 4,077,562,500
The bit that repeats = 3,384,200,000 so:
10,000,000 = 4,077,562,500 + 9*3,384,200,000 = 34,535,362,500
100,000,000 = 4,077,562,500 + 99*3,384,200,000 = 339,113,362,500
1,000,000,000 = 4,077,562,500 + 999*3,384,200,000 = 3,384,893,362,500
10,000,000,000 = 4,077,562,500 + 9,999*3,384,200,000 = 33,842,693,362,500
100,000,000,000 = 4,077,562,500 + 99,999*3,384,200,000 = 338,420,693,362,500
At this point the 10 digits stay fixed if we increase the range by 10x more.
Paul.
i noticed that phenomenon earlier
http://www.masm32.com/board/index.php?topic=17223.msg144280#msg144280
with a little thought, i believe it could be the basis of a greatly simplified and faster algorithm :U
start small - extract a digit - multiply up
once you have the minimum amount of required information, you are done :bg
we could be on the verge of discovering a new algebraic law :U
Computing the modulo with a modulus exceeding 32 bits using 32-bit registers is not immediately evident. It would actually be impossible if such a modulus was itself a prime or had a prime factor exceeding 32 bits. Nevertheless, my algo now seems to be working properly. My timing for the 1000000 limit is 1050 ms (about 4x slower than using a modulus smaller than 32 bits). That timing should be the same for finding the last 10 up to the last 14 digits which would not involve products exceeding 96 bits.
QuoteThere is actually a huge short cut that can be made for the big numbers.
That may be true as long as the limit is a power of 10. What if the given limit is a random number?
glad to see you back, Ray :U
Hi,
Oh goodie. I modified my program and tried 100,000 and
1,000,000. 100,000 seems okay, 1,000,000 is not.
Regards,
Steve N.
Edit: Got it working. It was some extra code to prevent
an error that wasn't modified correctly with the last upgrade.
If it had been left out, there would not have been tne
incorrect result. Got to love it.
SRN
Raymond,
QuoteThat may be true as long as the limit is a power of 10. What if the given limit is a random number?
Do the first million, do the next million and multiply by the number of whole millions needed then do the small number of digits left over and add the three values.
Paul.