The MASM Forum Archive 2004 to 2012

General Forums => The Colosseum => Topic started by: raymond on August 12, 2011, 03:04:56 AM

Title: Dave's nickname hijacked??
Post by: raymond on August 12, 2011, 03:04:56 AM
Just noticed that Dave's nickname dedndave is being used by a new member on Project Euler but with some upper case letters, i.e. DednDave (although his e-mail address is with all lower case letters).

Could it be his cyber twin? :eek :dazzled: :dance:
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 12, 2011, 03:33:03 AM
no, Raymond
that is me   :P

didn't realize my email was visible - thanks for the heads-up
Title: Re: Dave's nickname hijacked??
Post by: raymond on August 12, 2011, 03:39:09 AM
I wasn't too sure since upper case letters were being used, and you never use them on this forum.

And don't worry Dave. Your e-mail is NOT visible. Remember I'm one of the admins having access to all the info.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 12, 2011, 03:40:27 AM
oohhhhhh
then, you probably saw my first post, too
Title: Re: Dave's nickname hijacked??
Post by: raymond on August 12, 2011, 03:45:59 AM
As I have replied to someone else on that forum, we are studying various possibilities to open up locked threads for new members. Stay tuned and be patient.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 12, 2011, 04:06:28 AM
i was disappointed to see the previous discussion for that particular problem
hopefully, they aren't all like that
Title: Re: Dave's nickname hijacked??
Post by: MichaelW on August 12, 2011, 06:12:10 AM
What is "that particular problem"?
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 12, 2011, 11:44:30 AM
Euler problem #48
QuoteThe series, 11 + 22 + 33 + ... + 1010 = 10405071317.

Find the last ten digits of the series, 11 + 22 + 33 + ... + 10001000.

http://projecteuler.net/index.php?section=problems&id=48

once you have provided the correct answer, you are allowed to see the forum for that problem
unfortunately, the discussion spans from 2004 to 2008, when the thread was locked
for the most part, they discussed their favorite languages
and showed how few lines and how short the source text was - lol
i am more interested in methodology than i am in syntax
Title: Re: Dave's nickname hijacked??
Post by: clive on August 12, 2011, 04:48:57 PM
Well that was straight forward. It just surprised me how many people computed it to full precision, and the level of complexity they used. This would make a great interview question.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 12, 2011, 05:10:01 PM
it is certainly more efficient to tally only the low 10 digits
however, it is easier to write the code to go full precision
especially in my case, as i already had the bignum exponentiation and integer to ascii routines on hand

this is about all i had to write to get the full string...
loop00: INVOKE  uExpInt,ebx,ebx,offset Accum,sizeof Accum
        add     ecx,7
        shr     ecx,2
        call    AddAcc
        inc     ebx
        cmp     ebx,1000
        jbe     loop00

        INVOKE  uArb2StrL9,offset Result,sizeof Result,offset Buff,sizeof Buff
;
;
;
AddAcc  PROC

        mov     esi,offset Accum
        clc
        mov     edi,offset Result

AddAc0: mov     eax,[esi]
        adc     [edi],eax
        dec     ecx
        lea     esi,[esi+4]
        lea     edi,[edi+4]
        jnz     AddAc0

        ret

AddAcc  ENDP


i think i spent more time calculating how many digits there would be, which took a few minutes - lol
Title: Re: Dave's nickname hijacked??
Post by: clive on August 12, 2011, 08:50:01 PM
Quote from: dedndave
i think i spent more time calculating how many digits there would be, which took a few minutes - lol

I didn't have a library to hand, so jammed it out in C using 64-bit math, for both the power and accumulation. The assembler version is a bit more bulky, but you're probably hiding behind several hundred lines of library code.

Ok so how many digits would you have with n=10000, the 10 least significant digits should be 6237204500, I think.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 12, 2011, 10:07:13 PM
i am gonna guess about 30,000 digits   :bg
Title: Re: Dave's nickname hijacked??
Post by: raymond on August 13, 2011, 01:21:30 AM
1000010000 = (104)10000 = 1040000 = 40000 digits

As a side note, in the post previous to yours, someone also suggested to look into modular exponentiation.

And everything could have been a lot easier (in assembly) if only the last 9 digits had been required. 10-digit remainders don't all fit in 32-bit!
Title: Re: Dave's nickname hijacked??
Post by: clive on August 13, 2011, 01:50:13 AM
Yes, it dawned on me driving about town that 10000 has 4 zeros, so multiplying it 10000 times would be up at 40001 digits.

It was pretty clear 32-bit didn't work, the 64-bit method I used should work for values of n up to 100000000. Or some 800000001 digits
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 13, 2011, 02:31:24 AM
my last ten digits agree with yours, Clive   :bg
Title: Re: Dave's nickname hijacked??
Post by: clive on August 13, 2011, 02:59:45 AM
I didn't get too aggressive with this, another optimization would be to note that modulo 10 powers contribute nothing to the sum

unsigned __int64 computepuzzle48(void)
{
  __asm
  {
    xor   eax,eax ; accumulator
    xor   edx,edx

    mov   ecx,1 ; power

loopouter:

    push  ecx
    push  edx
    push  eax

    xor   edx,edx ; edx:eax = ecx
    mov   eax,ecx

    or    ecx,ecx ; paranoid, ecx = 0
    jz    skip2

    mov   ebx,ecx
    dec   ecx
    jz    skip2

loopinner:

; edx:eax *= ebx

    push  edx
    push  eax
    mov   eax,edx
    xor   edx,edx
    mul   ebx
    mov   edi,eax
    pop   eax
    pop   edx
    mul   ebx
    add   edx,edi

; edx:eax modulo 10000000000 (max 429496729599999)

    cmp   edx,2 ; cheap test for below 10000000000
    jb    skip1

    push  edx
    push  eax
    mov   edi,100000
    div   edi
    xor   edx,edx
    div   edi
    xor   edx,edx
    mul   edi
    mul   edi
    mov   edi,eax
    pop   eax
    sub   eax,edi
    mov   edi,edx
    pop   edx
    sbb   edx,edi

skip1:

    dec   ecx
    jnz   loopinner

skip2:

; edx:eax holds power to 10 least significant digits

; edx:eax += (accumulate)

    pop   ecx
    add   eax,ecx
    pop   ecx
    adc   edx,ecx

; edx:eax modulo 10000000000 (max 429496729599999)

    cmp   edx,2 ; cheap test for below 10000000000
    jb    skip3

    push  edx
    push  eax
    mov   edi,100000
    div   edi
    xor   edx,edx
    div   edi
    xor   edx,edx
    mul   edi
    mul   edi
    mov   edi,eax
    pop   eax
    sub   eax,edi
    mov   edi,edx
    pop   edx
    sbb   edx,edi

skip3:

    pop   ecx

    inc   ecx
    cmp   ecx,1000
    jb    loopouter

; edx:eax Answer, 10 least significant digits

  }
}
Title: Re: Dave's nickname hijacked??
Post by: raymond on August 13, 2011, 03:39:31 AM
In those early days, I used plain BCD's for this problem, doing computations only with the last 10 digits. No need for the conversion of a big number at the end. Still runs in a decent time: 0.12 sec on my i7.

I'll try and see if using modular exponentiation and BCD's will improve that. What was your timing Dave with "big numbers"?
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 13, 2011, 03:56:29 AM
for N=1000, it was about 78 mS, as i recall
that yielded the full length result

for N=10000, i had time to go to the bathroom and grab a fresh soda   :lol
it also yielded the full length result
Title: Re: Dave's nickname hijacked??
Post by: clive on August 13, 2011, 01:50:50 PM
For the C version N=1000 was about 47 ms, N=10000 was about 5.8 s, N=100000 was about 571.3 s (3031782500 ?)
For the assembler version N=10000 was about 4.2 s, N=100000 was about 430.9 s but I had to adjust the balance of the modulus to keep results within 32-bits.
1.6 GHz Atom Netbook
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 13, 2011, 03:39:46 PM
Quoteanother optimization would be to note that modulo 10 powers contribute nothing to the sum

well, it would be easy to add 1010101 (decimal) for N=1, 10, 100, 1000

i think there may be other simple shortcuts, as well - those that apply to powers of 2, for example
i looked at factoring, it doesn't seem like it would be very efficient
and - it may be faster, but it would require a lot more code

no - i think there is some other trick, here
most of the euler problems have "hidden" solutions   :bg
Title: Re: Dave's nickname hijacked??
Post by: clive on August 13, 2011, 04:33:48 PM
My observation was that 10**10, 20**20, .. 110*110, 120*120, etc contribute nothing to the least significant digits. Thus eliminating ~10% (non linear, but whatever) of the work.

Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 13, 2011, 04:46:34 PM
my mistake, Clive
you're right (again - lol)

my head may have been thinking exponents, but my hands were stuck on squares   :P
Title: Re: Dave's nickname hijacked??
Post by: clive on August 13, 2011, 05:31:33 PM
Quote from: dedndave
no - i think there is some other trick, here, most of the euler problems have "hidden" solutions

Yes, there are probably a lot of tricks and short cuts, the site has a lot of interesting problems. Some languages are better at this than others, at least with the postings it seems that some languages handle large numbers in a much more fluid manner, and permit brute force solutions with very simple coding.

Doing it in assembler is a particular pig because you have to think in binary and decimal, and the scope of the numbers. Raymond's BCD suggestion at least keeps the decimal/decades in focus. I've done BCD on the 6502/68K/x86, but not sure how much effort has gone into optimizing those instructions over the years.

The trick with this one was to truncate the computation of unused digits, allowing the solution to scale 10**2 for each additional decade in N.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 13, 2011, 06:16:18 PM
well - the challenge part is finding the efficient method
the fun part is coding it in assembler   :bg

for this problem, my first instinct was to make it a horner's rule solution
that would apply better if it were
11 x 22 x 33 x 44......   :bg

anyways, binary exponentiation looks promising, and we only need to go as far as the low-order 34 bits in the result

it is a 2-step process
1) exponentiation loop
2) conversion to decimal

horner could be used for step 2
but, from my experience, there are faster algorithms for converting values that are smaller than 2 x CPU register width
Title: Re: Dave's nickname hijacked??
Post by: clive on August 13, 2011, 08:56:04 PM
On a Phenom II X4 820, 2.8 GHz

For the C solution, skipping modulo 10 powers
N=10000, 1.512 seconds, 6237204500
N=100000, 149.378 seconds, 3031782500
N=1000000, 14955.012  seconds, 4077562500 ?

I could split the power computations into threads?

So Raymond, what kind of results are you getting with BCD, or other methodologies ?
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 13, 2011, 09:26:25 PM
that's pretty fast   :P
64-bit ?
not sure i can play on the same field with a 32-bit machine
Title: Re: Dave's nickname hijacked??
Post by: clive on August 14, 2011, 02:06:24 AM
Quote from: dedndavethat's pretty fast   :P
64-bit ? not sure i can play on the same field with a 32-bit machine

The machine is 64-bit, the C is MSVC 12.00, 32-bit, using 64-bit integers (unsigned __int64). It's about 2.8x the speed of the Atom, so figure 43000 seconds for N=1000000, say 12 hours or so on the netbook.

The run time for the previous assembler, with a slight modification to balance the mod 10000000000 a bit better so the first divide fits the result in 32-bits, but without the mod 10 power fix is only 8-9% slower.
Title: Re: Dave's nickname hijacked??
Post by: raymond on August 14, 2011, 04:34:59 AM
Modular exponentiation is definitely faster. Got the timing down from 120 ms to 12 ms for the 1000 limit.

Result for the 10000 limit confirmed. Timing 150 ms.
Result for the 100000 limit different from clive: 4131782500, timing 1.88 sec.
Result for the 1000000 limit slightly different from clive: 5077562500, timing 23.4 sec.

I wonder what Dave would get on the last two.

All these should be significantly faster with a 64-bit program using integers instead of BCD's.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 14, 2011, 06:06:59 AM
i am working on an FPU version, Ray - seeing as i do not have a 64-bit machine   :P
however, i am not using packed BCD values
i am using 64-bit integers
i don't recall what the different timings are for FPU integer math, so my method may not be that efficient
i will post code if i get it working - lol
Title: Re: Dave's nickname hijacked??
Post by: clive on August 14, 2011, 02:52:20 PM
I have a pretty high confidence in the limit numbers, the algorithm is sound, and I'm not seeing any overflow. I've pulled some x86-64 compilers, and built both 64-bit and 128-bit integer variants to run on the Phenom. My exponential/power computation is brute force rather than efficient/optimized, hence the 100x increase with each decade, as apposed to the 10x, but it should be immune to the potential overflow of other methods. I could generate a table of n**n computation to see where our solutions diverge.

I downloaded a 64-bit version of MSVC, which surprisingly didn't support __int128, so I pulled gcc/mingw64. Microsoft are so screwed.

Dave doesn't your Prescott do 64-bit? or did you get one of the crippled ones? I had Win32 running on mine, but it would dual boot to 32 or 64-bit Linux. My 2005 Prescott box died a while back, I had to transplant it's guts to an AMD X2 refurbished donor box, and picked up Quad Phenom to have a more current dev box.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 14, 2011, 03:55:54 PM
actually, the prescott i have is em64t compatible
however, i have no idea how to execute 64-bit instructions - or even if i can - with 32-bit XP

my current solution is to use the 64-bit integer capability of the FPU
i am not too familiar with using integers with the FPU, and even less familiar with its' packed BCD capability - lol
but, i am making headway

i looked at multiple-precision multiplication and became disinterested   :P
i have done it before, the Ling Long Kai Fang routines use 64 x 64 multiply with 32-bit registers
Title: Re: Dave's nickname hijacked??
Post by: raymond on August 14, 2011, 04:10:55 PM
Clive, your numbers are probably correct. I must have a minor bug with the BCD's which affects the higher significant digits. I'll try and find it.

The reason I think my BCD results may be wrong is that I wrote another algo using integers to get the lower 9 digits (i.e. <32 bits) and results agree with all your lower 9-digit results.

As expected, timings with integer math and modular exponentiation are considerably lower:
1000 0.2 ms
10000 2.7 ms
100000 35 ms
1000000 250 ms

Dave, I'm still wondering why you would have chosen #48 as the first one to solve. It seems rather random.
Once you solve #12, you may like the PDF I wrote for that problem.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 14, 2011, 04:32:02 PM
actually, one of the other members mentioned that specific problem
when i looked at it, i thought it was a horner's rule problem
as i mentioned, i had already written routines to exponentiate and convert the bignums
so, it looked easy and i gave it a whirl   :P

it has been many years since i took the math courses in college
i have forgotten most of what i learned because i rarely have to use it to design circuits
i use derivatives once in a while
now and then, i have to understand how something works that is based on higher math
only enough to grasp the underlying principal, though   :P

at any rate, many of the euler problems are a little over my head
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 14, 2011, 05:03:38 PM
it just dawned on me.....

this does not apply to the euler problem because 10 digits are required and Nmax is only 1000
but, it does apply to these extended versions that you guys have created

the one thing that has remained constant in all these problems is that the result is limited to 10 digits
10 digits may be fully expressed with 34 binary bits
that means that only the lower 17 bits of N will effect the result (squareroot)

for the Nmax=1,000,000 problem, only the lower 5 digits of N will effect the 10 digit result
in other words, there are several repeated calculations   :bg
notice the result ends in 0's
you could perform the calculation for N=1 to 99,999, then multiply by 100   :U
Title: Re: Dave's nickname hijacked??
Post by: raymond on August 14, 2011, 05:22:40 PM
Most (if not all) of the first 150-200 problems are accessible with high school math knowledge, a little logic, pattern spotting, or maybe a little extra knowledge about how some numbers can be computed.

My knowledge is very little above high school math so that many of the problems above #250 are also way above my head. Some of the exceptions above #300 are:
309, 311, 315, 321, 327, 333, 341, most of which I had a hand in their development.
That last one (#341) was fun since my algo ran 100 times faster than that of the original proposer who is an excellent problem solver.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 14, 2011, 06:06:35 PM
i may look at them - after a bit   :P
too many irons in the fire, at the moment

another revelation - lol
i can use 80-bit extended reals, rather than 64-bit integers
so long as i keep everything under 64 bits, there should be no loss in precision
this being the "native language" of the FPU should be faster
Title: Re: Dave's nickname hijacked??
Post by: raymond on August 14, 2011, 07:52:34 PM
Clive,
Found my stupid error. My initial algo was based on a maximum limit having at most 4 digits. I simply forgot to adjust that when I started raising the limit above 10000 and it was affecting all the odd numbers. All my results now agree with yours.

It's surprizing that I was still getting the last 8 digits correct with that error. Dave may have something regarding a shortcut. It may be worth investigating.

Quoteso long as i keep everything under 64 bits

Under is the correct word. Qwords are treated as signed by the FPU.
Title: Re: Dave's nickname hijacked??
Post by: clive on August 14, 2011, 08:18:34 PM
Well this was the original 5 min solution. I could probably do it with doubles, or extended precision. I seemed more viable than computing all the digits. There are probably far better ways of doing the power/exponent function, I was just worried about keeping the number within the scope of a 64-bit number, as it gets out of control very quickly, obviously.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int i, j;
  unsigned __int64 x, y;

  x = 0;

  for(i=1; i<=1000; i++)
  {
    y = i;

    for(j=1; j<i; j++)
    {
      y *= i;
      y %= 10000000000;
    }

    x += y;
  }

  x %= 10000000000;

  printf("%4d %010I64d %010I64d\n",i-1, y, x);
}
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 14, 2011, 08:31:35 PM
ok - i am going to give up some code before i am done perfecting it
normally, i don't do that because someone else will come along and do it and claim it as theirs - lol
but, in this case, it seems to be just the 3 of us and i don't feel too worried

the game plan is to use right-to-left binary exponentiation
we only need 34 bits in the result, but i am designing it to use 42 bits
because we have room and i think we could get an 11th digit if we wanted it

my idea is to use all 8 registers of the FPU stack and rotate them like a revolver cylinder
(http://operatorchan.org/k/arch/src/k225479_627-8shot.jpg)
that way, we eliminate memory accesses inside the loop

i am designing the loops from the inside out
here is what i have so far...
        jmp short loop_entry

;-------------------------------------------------------------------

mul_operand_to_accumulator:
        fincstp
        fincstp

;st(0) = exponentiation accumulator
;st(1) = 42-bit modulo divisor = 4,398,046,511,104
;st(2) = addition accumulator
;st(3) = 21-bit modulo divisor = 2,097,152
;st(4) = 1
;st(5) = operand (N)
;st(6) = temporary operand (N^2^x)
;st(7) = 21-bit modulo divisor = 2,097,152

        fmul    st(6)
        fprem                     ;lower 42 bits of exponentiation accumulator
        fdecstp
        fdecstp

;-------------------------------------------------------------------

square_temp_operand:

;st(0) = temporary operand (N^2^x)
;st(1) = 21-bit modulo divisor
;st(2) = exponentiation accumulator
;st(3) = 42-bit modulo divisor = 4,398,046,511,104
;st(4) = addition accumulator
;st(5) = 21-bit modulo divisor = 2,097,152
;st(6) = 1
;st(7) = operand (N)

        fmul    st(0)
        fprem                     ;lower 21 bits of temporary operand

;-------------------------------------------------------------------

loop_entry:
        shr     eax,1
        jc      mul_operand_to_accumulator

        jnz     square_temp_operand

;-------------------------------------------------------------------

        fincstp
        fincstp
        fincstp
        fincstp

;st(0) = addition accumulator
;st(1) = 21-bit modulo divisor = 2,097,152
;st(2) = 1
;st(3) = operand (N)
;st(4) = temporary operand (N^2^x)
;st(5) = 21-bit modulo divisor
;st(6) = exponentiation accumulator
;st(7) = 42-bit modulo divisor = 4,398,046,511,104

        fadd    st(6)
        fprem                     ;lower 21 bits of addition accumulator

;-------------------------------------------------------------------


i was doing it with integers, but switched to extended reals and had to start over   :'(

at the end, there, i see i can re-arrange the registers and eliminate a couple rotations
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 15, 2011, 12:57:48 AM
well - the loop assembles....
....and runs with no exceptions....
....but i don't get the right answer, yet

i am in "debug" mode, now   :P
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 15, 2011, 04:43:17 AM
ok - the loop works, to some degree
and, it's very fast
i am starting to count clock cycles instead of mS   :U
something in the order of 2,000,000 cycles
if you prefer a time, it is in the order of 667 microseconds on my 3 GHz P4

however, i am having a precision problem - i haven't sorted it out, yet
the time shouldn't change signifigantly if fixed unless i have to extend precision via adding instructions

using fincstp and fdecstp works like a charm
Title: Re: Dave's nickname hijacked??
Post by: FORTRANS on August 15, 2011, 01:03:13 PM
Quote from: dedndave on August 14, 2011, 08:31:35 PM
ok - i am going to give up some code before i am done perfecting it
normally, i don't do that because someone else will come along and do it and claim it as theirs - lol
but, in this case, it seems to be just the 3 of us and i don't feel too worried

Hi,

   Well actually I was trying a BCD version.  And while I eventually
got the N = 10 value correct, I can't duplicate Clive's numbers for
N = 10,000.  So I assume my N = 1,000 is bad as well.  Pegs the
CPU at 100% nicely though.  And it probably should be thrown
out and started over if I wanted reasonable performance.
N = 1,000 second(s), N = 10,000 minutes.

Cheers,

Steve N.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 15, 2011, 01:12:47 PM
welcome to the fray, Steve   :bg
for 1,000, the first digit is a 9 and the last one is a 0

this code is somewhat correct
the problem is lack of precision
i may get to play with it today

uArb2StrL9 is my binary-to-ascii routine
i will write a special/faster version once i fix the loop
Nmax    EQU 1000

        .DATA

fpTzero  real10 0.0
fpTone   real10 1.0
fpTmod21 real10 2097152.0
fpTmod42 real10 4398046511104.0

        .DATA?

nQResult dq ?
szBuffer db 24 dup(?)

        .CODE

_main   PROC

;st(0) = temporary operand (N^(2^x))                  start with Nmax
;st(1) = 21-bit modulo divisor = 2,097,152
;st(2) = exponentiation accumulator                   start with 1
;st(3) = 42-bit modulo divisor = 4,398,046,511,104
;st(4) = 1
;st(5) = operand (N)                                  start with Nmax
;st(6) = additive accumulator (result)                start with 0
;st(7) = 42-bit modulo divisor = 4,398,046,511,104

        finit
        mov     edx,Nmax
        fld     fpTmod42
        push    edx
        fld     fpTzero
        fild dword ptr [esp]
        fld     fpTone
        fld     fpTmod42
        fld     fpTone
        fld     fpTmod21
        fild dword ptr [esp]
        fwait
        pop     eax
        jmp short exp_loop_entry

top_of_outer_loop:
        fdecstp
        mov     eax,edx
        fsub    st(0),st(7)
        fdecstp
        fdecstp
        ffree   st(7)
        fld     st(1)
        fdecstp
        ffree   st(7)
        fld     st(4)
        jmp short exp_loop_entry

mul_operand_to_accumulator:
        fincstp
        fincstp
        fmul    st(0),st(6)
        fprem                     ;lower 42 bits of exponentiation accumulator
        fdecstp
        fdecstp

square_temp_operand:
        fmul    st(0),st(0)
        fprem                     ;lower 21 bits of temporary operand

exp_loop_entry:
        shr     eax,1
        jc      mul_operand_to_accumulator

        jnz     square_temp_operand

;add exponentiated accumulator to additive accumulator

        fdecstp
        fdecstp
        dec     edx
        fadd    st(0),st(4)
        fprem                     ;lower 42 bits of additive accumulator
        jnz     top_of_outer_loop

        fwait
        fistp   nQResult
        ffree   st(6)
        ffree   st(5)
        ffree   st(4)
        ffree   st(3)
        ffree   st(2)
        ffree   st(1)
        ffree   st(0)
        INVOKE  uArb2StrL9,offset nQResult,sizeof nQResult,offset szBuffer,sizeof szBuffer
        sub     ecx,10
        jbe     skip_it

        add     edx,ecx

skip_it:
        print   edx
        print   chr$(13,10)
        inkey
        exit

_main   ENDP
Title: Re: Dave's nickname hijacked??
Post by: FORTRANS on August 15, 2011, 01:44:25 PM
Quote from: dedndave on August 15, 2011, 01:12:47 PM
welcome to the fray, Steve   :bg
for 1,000, the first digit is a 9 and the last one is a 0

Hi,

   Thanks.  Stomped another bug, and I matched Clive's
10,000 value and the digits you mentioned for N = 1,000.

Regards,

Steve
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 15, 2011, 01:56:01 PM
nice work   :U

i found that mine works perfectly.....
up to Nmax = 7  (http://l.yimg.com/us.yimg.com/i/mesg/emoticons7/24.gif)
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 15, 2011, 02:39:57 PM
by changing the modulo divisors to 234, i get the correct result for Nmax=10
the problem comes when i have to multiply a 34-bit temporary operand onto a 34-bit exponential accumulator
( 234-1 )2 MOD 234
i'll have to extend the precision a little for that one operation
i need 68 bits of precision, just long enough to take the modulus - lol
i can do a sort of "partial multiply"
Title: Re: Dave's nickname hijacked??
Post by: dioxin on August 15, 2011, 10:11:04 PM
On a 3GHz Phenom II x4 the following straight forward single threaded (using 1 core) PowerBASIC program gives these results:
N=1,000     time=       4.28 ms  Answer = 9110846700
N=10,000    time=     449.42 ms  Answer = 6237204500
N=100,000   time=   49900.84 ms  Answer = 3031782500
N=1,000,000 time= 5504076.67 ms  Answer = 4077562500


These are the same results as Clive and on a similar CPU but significantly faster than his c version.

I also have a better algorithm but it's not quite finished. It's heading for about 1.5ms for N=1000 and 3.3s for N=1,000,000
This would be about 10x slower than the times quoted earlier by Raymond:
Quote1000 0.2 ms
10000 2.7 ms
100000 35 ms
1000000 250 ms

..so I might not bother finishing it!

Paul.
%CPUfreq = 3000000000
%MaxValue = 100000

#DIM ALL

FUNCTION PBMAIN () AS LONG
LOCAL Sum,k, x AS QUAD
LOCAL t,r AS LONG

PRINT "N=";%MaxValue
TIX x

FOR r = 1 TO %MaxValue
    k = 1
    FOR t = 1 TO r
        k = (k * r) MOD 10000000000

    NEXT

    sum = (sum + k) MOD 10000000000
NEXT

TIX END x
PRINT "Answer =";sum
PRINT "Tix";x
PRINT "time="x/%CPUfreq*1000;"ms"
WAITKEY$

END FUNCTION
 
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 15, 2011, 10:57:37 PM
hi Paul - good to see you

those are very impressive times
it would be interesting to see how the compiler implements the algo
Title: Re: Dave's nickname hijacked??
Post by: dioxin on August 15, 2011, 11:50:50 PM
Quoteit would be interesting to see how the compiler implements the algo
The above program FOR/NEXT loop compiles as follows:
CPU Disasm
Address   Hex dump          Command                                  Comments
002010DA    C7C7 01000000   MOV EDI,1                               ; FOR r = 1 ..
002010E0    DB05 943A2000   FILD DWORD PTR DS:[203A94]              ; k=1
002010E6    DDD9            FSTP ST(1)
002010E8    89BD 30FFFFFF   MOV DWORD PTR SS:[EBP-0D0],EDI
002010EE    C7C6 01000000   MOV ESI,1                               ; FOR t = 1 ..
002010F4    E9 1E000000     JMP 00201117
002010F9    90              NOP
002010FA    90              NOP
002010FB    90              NOP
002010FC    90              NOP                                     ;pad to align
002010FD    90              NOP
002010FE    90              NOP
002010FF    90              NOP
00201100    DF2D 983A2000   FILD QWORD PTR DS:[203A98]              ; 10,000,000,000             
00201106    897D 94         MOV DWORD PTR SS:[EBP-6C],EDI           ; r
00201109    DB45 94         FILD DWORD PTR SS:[EBP-6C]
0020110C    D8CA            FMUL ST,ST(2)                           ; k * r
0020110E    E8 B5190000     CALL 00202AC8                           ; MOD 10,000,000,000
00201113    DDD9            FSTP ST(1)                              ; k = result
00201115    FFC6            INC ESI                                 ; NEXT t
00201117    8BC6            MOV EAX,ESI
00201119    3B85 30FFFFFF   CMP EAX,DWORD PTR SS:[EBP-0D0]
0020111F  ^ 7E DF           JLE SHORT 00201100
00201121    DF2D 983A2000   FILD QWORD PTR DS:[203A98]              ; 10,000,000,000
00201127    D9C1            FLD ST(1)                               ; k
00201129    D8C3            FADD ST,ST(3)                           ; sum + k
0020112B    E8 98190000     CALL 00202AC8                           ; MOD 10,000,000,000
00201130    DDDA            FSTP ST(2)
00201132    FFC7            INC EDI                                 ; NEXT r
00201134    81FF 40420F00   CMP EDI,0F4240
0020113A  ^ 7E A4           JLE SHORT 002010E0


00202AC9    D9EE            FLDZ
00202ACB    D8DA            FCOMP ST(2)                              ; FLOAT 10000000000.00000000
00202ACD    9B              WAIT
00202ACE    DFE0            FSTSW AX
00202AD0    F6C4 40         TEST AH,40
00202AD3    75 0C           JNE SHORT 00202AE1
00202AD5    D9F8            FPREM                                    ; MOD
00202AD7    9B              WAIT
00202AD8    DFE0            FSTSW AX
00202ADA    F6C4 04         TEST AH,04
00202ADD  ^ 75 F6           JNE SHORT 00202AD5
00202ADF    D9C9            FXCH ST(1)
00202AE1    DDD8            FSTP ST
00202AE3    58              POP EAX
00202AE4    C3              RETN
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 15, 2011, 11:58:03 PM
QuoteThis would be about 10x slower than the times quoted earlier by Raymond:
Quote
1000 0.2 ms
10000 2.7 ms
100000 35 ms
1000000 250 ms

you never can tell - lol

he might have you beat for the small one, but yours may win on the big ones   :bg
Title: Re: Dave's nickname hijacked??
Post by: dioxin on August 16, 2011, 12:23:30 AM
It can be sped up a little (4-5%) by a more efficient conversion to ASM but the huge majority of the time (about 80% of the total)  is spent on the fprem instruction so it's not going to get a lot faster unless the algorithm is changed or the fprem is replaced with something better.

Paul.
FOR r = 1 TO %MaxValue

!fld1               'k=1
!fild r
!fild TenBillion

!mov eax,r          't
#ALIGN 16
lp1:
!fld st(1)          'replicate r
!fmul st(0),st(3)   'r*k
!fprem              'MOD 10,000,000,000
skip:
!fstp st(3)         'mov result back to k

!dec eax
!jne short lp1

!fcompp             'dump 2 items off stack
!fistp k

    sum = sum + k
NEXT
Title: Re: Dave's nickname hijacked??
Post by: raymond on August 16, 2011, 12:56:31 AM
Paul,
That 250 ms for 1000000 was to generate only the last 9 digits using 32-bit integers. A similar time would be expected to get the last 10-18 digits using 64-bit integers on a 64-bit machine (but not on a 32-bit machine).

As a side note, I'm really surprised that this thread is getting so much attention. 449 views at last count! :dazzled:

And I will be away without access to a computer nor to the internet starting tomorrow morning until the end of this week. :(
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 16, 2011, 01:04:25 AM
we'll save your seat, Ray   :bg
Title: Re: Dave's nickname hijacked??
Post by: FORTRANS on August 16, 2011, 12:17:58 PM
Hi,

   Probably off-topic, or maybe a by the way kind of thing.  But
the second to last bug fix in my code that got me from N = 9
to N = 10 showed some unexpected behavior with the BCD
instructions.  (Well it surprised me a bit.)


; Off by 16 code contained the following.
        ADD     AL,[DI] ; If there is any existing data.
        XOR     AH,AH
        AAA             ; Adjust after addition.

; The working code is now.
        XOR     AH,AH   ; Required as AAA increments AH, not a put result.
        ADD     AL,[DI] ; If there is any existing data.
        AAA             ; ASCII Adjust after Addition.


   Pickey, pickey, pickey.

Cheers,

Steve N.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 16, 2011, 01:53:05 PM
that's because AAA works by using the AC and CF flags, i think
XOR clears them   :P

as for my algo, it really needs to be coded in SSE/SSE2
but then, it would have to be compared with the other algo, also written using SSE/SSE2 - lol
not sure i want to spend the time to learn SSE at this point
Title: Re: Dave's nickname hijacked??
Post by: dioxin on August 16, 2011, 06:31:27 PM
QuoteThat 250 ms for 1000000 was to generate only the last 9 digits

I see. In that case, below is my solution.

Here are the results it gives:
N= 10         time=        0.003ms Answer =  405071317
N= 100        time=        0.030ms Answer = 9027641920
N= 1000       time=        0.447ms Answer = 9110846700  14.9x
N= 10000      time=        6.092ms Answer = 6237204500  13.6x
N= 100000     time=       76.722ms Answer = 3031782500  12.6x
N= 1000000    time=      926.138ms Answer = 4077562500  12.1x
N= 10000000   time=   10,878.353ms Answer = 4535362500  11.7x
N= 100000000  time=  125,184.016ms Answer = 9113362500  11.5x
N= 1000000000 time=1,412,944.585ms Answer = 4893362500  11.3x longer than the one above

It's not quite linear but isn't bad.

The program is in PowerBASIC but uses ASM to do the complicated (x * y) MOD 10000000000.

Paul.
%CPUfreq = 3000000000
%MaxValue = 1000000

#DIM ALL

FUNCTION PBMAIN () AS LONG
LOCAL Sum,k,j, x, TenBillion, temp AS QUAD
LOCAL t,r AS LONG
LOCAL prod3,prod2,prod1 AS LONG
LOCAL TwoToThePower34Mod1E10, TwoToThePower64Mod1E10 AS QUAD

TenBillion = 10000000000
TwoToThePower34Mod1E10 = 7179869184
TwoToThePower64Mod1E10 = 3709551616

TIX x

sum = 0

FOR r = 1 TO %MaxValue
    k = 1
    t = r
    temp = r
    WHILE  t > 0
        IF (t AND 1) THEN
            'k = (k * temp)  MOD 10000000000. k and temp are 34 bit (10 digits)
            !lea eax,k
            !call DoCalc

        END IF

        'temp = (temp x temp) MOD 1E10.  temp is 10 digits (34 bits)
        !lea eax,temp
        !call DoCalc

        SHIFT RIGHT t,1
    WEND

    sum = (sum + k) MOD 10000000000
NEXT

TIX END x

PRINT "N=";%MaxValue ,;
PRINT "time=";FORMAT$(x/%CPUfreq*1000,"###,###,###.000");"ms " ;
PRINT "Answer =";sum

WAITKEY$
EXIT FUNCTION


#ALIGN 16
DoCalc:
        'on entry eax = address of the variable to update
        'var = (var * temp) MOD 1E10   (all values 34 bit)
        'could maybe be done in FPU in 2 blocks instead, might be quicker.

        !pushad             'save registers

        !mov esi,[eax]      'get the variable into edi:esi
        !mov edi,[eax+4]

        !mov eax,temp       'do low x low dword first
        !mul esi
        !mov ebx,eax        'store in ecx:ebx
        !mov ecx,edx

        !mov eax,temp[4]    'do low x high dword
        !mul esi            'now finished with esi so I can reuse it as the high dword of result
        !add ecx,eax        'add to result so far
        !mov esi,edx
        !adc esi,0          'carry into high register



        !mov eax,temp       'do high x low dword
        !mul edi
        !add ecx,eax        'add to result so far
        !adc esi,edx        'carry into high register


        !mov eax,temp[4]    'high x high only a few bits in this one so it never overflows to the next register
        !mul edi
        !add esi,eax

        !mov prod1,ebx      'store the full result
        !mov prod2,ecx
        !mov prod3,esi

        !mov edi,ecx        'this will be the surplus bits (low part)
        !shr edi,2          'shift bits into place

        !and ecx,&h3        'remove bits 34 upwards


        !mov eax,TwoToThePower34Mod1E10[4]         'correct for the 1st block of overflow bits
        !mul edi
        !xchg eax,edi
        !mul dword ptr TwoToThePower34Mod1E10
        !add edx,edi

        !add ebx,eax
        !adc ecx,edx


        !mov eax,TwoToThePower64Mod1E10[4]         'correct for the 2nd block of overflow bits
        !mul esi
        !xchg eax,esi
        !mul dword ptr TwoToThePower64Mod1E10
        !add edx,esi

        !add ebx,eax
        !adc ecx,edx

        'now store the result
        !mov prod1,ebx
        !mov prod2,ecx

        'this sum may have overflowed a few bits due to all the adds and needs to be reduced mod 1e10
        !fild TenBillion
        !fild qword ptr prod1
        !fprem


        !mov eax,[esp+28]       'get address of answer
        !fistp qword ptr [eax]  'save answer

        !fcomp              'discard unused item off the stack

        !popad              'restore registers
        !ret


END FUNCTION

                     
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 16, 2011, 06:35:41 PM
that beats my best effort using the FPU   :U

        'this sum may have overflowed a few bits due to all the adds and needs to be reduced mod 1e10
        !fild TenBillion
        !fild qword ptr prod1
        !fprem


couldn't you simply AND off any excess bits ?
10 billion can be fully expressed with 34 bits - just whack off those above
Title: Re: Dave's nickname hijacked??
Post by: dioxin on August 16, 2011, 06:51:50 PM
Dave,
you can't just discard the most significant bits. The problem would be trivial if you could!
The first bit beyond the 10 digits is bit 34. 2^34=17,179,869,184
If you chop off that 1 bit then you effectively subtract 17,179,869,184 from the number and the values of all the lower digits are affected.
It would be different in decimal where discarding the 11th digit would be the same as subtracting 10,000,000,000 but all of those zeros mean that none of the lower digits would be affected.

Paul.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 16, 2011, 06:57:29 PM
ahhhhhh
that's where i am going wrong !!!!

in the algo i have been playing with, i truncated bits - not decimal digits
let me see if it helps   :P
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 16, 2011, 07:02:54 PM
well - it does help - it increases the range
however, not far enough
no matter how i slice it, i think my algo needs more precision

if we only needed 9 digits, mine would work
of course, if we only needed 9 digits, the whole problem is easier   :bg
Title: Re: Dave's nickname hijacked??
Post by: FORTRANS on August 16, 2011, 08:51:56 PM
Quote from: dedndave on August 16, 2011, 01:53:05 PM
that's because AAA works by using the AC and CF flags, i think
XOR clears them   :P

Hi,

   That would explain it. I wonder why they did that?  (Rhetorical.)
Reading a text book explains it.  Must have saved some transistors.
I should have known that.

Regards,

Steve
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 16, 2011, 10:18:28 PM
        ADD     AL,[DI] ; If there is any existing data.
        MOV     AH,0
        AAA             ; Adjust after addition.

:bg

actually, the fact that AAA, DAA, AAS, DAS use the flags makes them usable in "trick" schemes
Title: Re: Dave's nickname hijacked??
Post by: clive on August 17, 2011, 12:44:59 AM
An improved attempt, Phenom II X4 2.8 GHz, in C
     10000 6237204500        8 ms
    100000 3031782500       95 ms
   1000000 4077562500     1142 ms
  10000000 4535362500    13411 ms
 100000000 9113362500   154577 ms


2nd run
          10000 6237204500        7 ms
         100000 3031782500       94 ms
        1000000 4077562500     1150 ms
       10000000 4535362500    13472 ms
      100000000 9113362500   154893 ms
     1000000000 4893362500  1751345 ms
    10000000000 2693362500 19500530 ms
Title: Re: Dave's nickname hijacked??
Post by: FORTRANS on August 18, 2011, 02:19:18 PM
Hi,

   Woke up too early Monday morning, and thought up a way
to speed things up (maybe).  After coding up a few disasters,
I convinced myself it wouldn't work.  Then woke up too early
again today and decided it should work after all.  After undisastering
the thing and adding some timing code I tried it out.  Yipee!
Another insomnia event put to rest.

   The after and before results follow.  The old code was timed
using the TIME command in a BATch file.  Code sizes in *.EXE
thrown in for good luck.  Even though they are not strictly
comparable.

Regards,

Steve N.


Calculating sum of series S N**N N = 1, 1000.
Glorious second version.

18 Aug 2011  8:47:24

0000000000 0000000001 0000000000 0000000000 0000000000 9????????0

  8:47:24

Calculating sum of series S N**N N = 1, 10,000.
Glorious second version.

18 Aug 2011  8:47:29

0000000000 0000000001 0000000000 0000000000 0000000000 6237204500

  8:47:30

Current time is:  8:53:02.38

Calculating sum of series S N**N N = 1, 1000.
0000000000 0000001000 0000000000 9????????0

Current time is:  8:53:04.17

Calculating sum of series S N**N N = 1, 10,000.
0000000000 0000010000 0000000000 6237204500

Current time is:  8:56:22.80

RTC      EXE      1225   1-07-10  3:10p
SUMS_EX  EXE      1376   8-15-11  8:32a
SUMS_EX2 EXE      1856   8-18-11  8:47a
Title: Re: Dave's nickname hijacked??
Post by: dioxin on August 19, 2011, 03:28:45 PM
Improved attempt:
N= 10,000,000,000              time=3,638,023ms Answer = 2693362500
It's a slightly improved algorithm to the one above and then threaded to use all 4 cores of the Phenom II x4 3GHz cpu.

There is actually a huge short cut that can be made for the big numbers. Once you get to 1,000,000 things repeat so you can just calculate for n=1,000,000 and then a simple multiply and add will cover any  number of whole millions beyond that.
e.g. 1,000,000 gives  4,077,562,500
The bit that repeats = 3,384,200,000 so:
10,000,000 =  4,077,562,500 + 9*3,384,200,000 = 34,535,362,500
100,000,000 =  4,077,562,500 + 99*3,384,200,000 = 339,113,362,500
1,000,000,000 =  4,077,562,500 + 999*3,384,200,000 = 3,384,893,362,500
10,000,000,000 =  4,077,562,500 + 9,999*3,384,200,000 = 33,842,693,362,500
100,000,000,000 =  4,077,562,500 + 99,999*3,384,200,000 = 338,420,693,362,500
At this point the 10 digits stay fixed if we increase the range by 10x more.

Paul.
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 19, 2011, 07:04:53 PM
i noticed that phenomenon earlier

http://www.masm32.com/board/index.php?topic=17223.msg144280#msg144280

with a little thought, i believe it could be the basis of a greatly simplified and faster algorithm   :U

start small - extract a digit - multiply up
once you have the minimum amount of required information, you are done   :bg
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 19, 2011, 07:34:38 PM
we could be on the verge of discovering a new algebraic law   :U
Title: Re: Dave's nickname hijacked??
Post by: raymond on August 25, 2011, 02:10:00 AM
Computing the modulo with a modulus exceeding 32 bits using 32-bit registers is not immediately evident. It would actually be impossible if such a modulus was itself a prime or had a prime factor exceeding 32 bits. Nevertheless, my algo now seems to be working properly. My timing for the 1000000 limit is 1050 ms (about 4x slower than using a modulus smaller than 32 bits). That timing should be the same for finding the last 10 up to the last 14 digits which would not involve products exceeding 96 bits.

QuoteThere is actually a huge short cut that can be made for the big numbers.

That may be true as long as the limit is a power of 10. What if the given limit is a random number?
Title: Re: Dave's nickname hijacked??
Post by: dedndave on August 25, 2011, 03:18:45 AM
glad to see you back, Ray   :U
Title: Re: Dave's nickname hijacked??
Post by: FORTRANS on August 25, 2011, 05:24:46 PM
Hi,

   Oh goodie.  I modified my program and tried 100,000 and
1,000,000.  100,000 seems okay, 1,000,000 is not.

Regards,

Steve N.

Edit:  Got it working.  It was some extra code to prevent
an error that wasn't modified correctly with the last upgrade.
If it had been left out, there would not have been tne
incorrect result.  Got to love it.

SRN
Title: Re: Dave's nickname hijacked??
Post by: dioxin on August 26, 2011, 11:26:38 AM
Raymond,
QuoteThat may be true as long as the limit is a power of 10. What if the given limit is a random number?
Do the first million, do the next million and multiply by the number of whole millions needed then do the small number of digits left over and add the three values.

Paul.