Dave's nickname hijacked??

clive · August 13, 2011, 02:59:45 AM

I didn't get too aggressive with this, another optimization would be to note that modulo 10 powers contribute nothing to the sum

Code Select

unsigned __int64 computepuzzle48(void)
{
  __asm
  {
    xor   eax,eax ; accumulator
    xor   edx,edx

    mov   ecx,1 ; power

loopouter:

    push  ecx
    push  edx
    push  eax

    xor   edx,edx ; edx:eax = ecx
    mov   eax,ecx

    or    ecx,ecx ; paranoid, ecx = 0
    jz    skip2

    mov   ebx,ecx
    dec   ecx
    jz    skip2

loopinner:

; edx:eax *= ebx

    push  edx
    push  eax
    mov   eax,edx
    xor   edx,edx
    mul   ebx
    mov   edi,eax
    pop   eax
    pop   edx
    mul   ebx
    add   edx,edi

; edx:eax modulo 10000000000 (max 429496729599999)

    cmp   edx,2 ; cheap test for below 10000000000
    jb    skip1

    push  edx
    push  eax
    mov   edi,100000
    div   edi
    xor   edx,edx
    div   edi
    xor   edx,edx
    mul   edi
    mul   edi
    mov   edi,eax
    pop   eax
    sub   eax,edi
    mov   edi,edx
    pop   edx
    sbb   edx,edi

skip1:

    dec   ecx
    jnz   loopinner

skip2:

; edx:eax holds power to 10 least significant digits

; edx:eax += (accumulate)

    pop   ecx
    add   eax,ecx
    pop   ecx
    adc   edx,ecx

; edx:eax modulo 10000000000 (max 429496729599999)

    cmp   edx,2 ; cheap test for below 10000000000
    jb    skip3

    push  edx
    push  eax
    mov   edi,100000
    div   edi
    xor   edx,edx
    div   edi
    xor   edx,edx
    mul   edi
    mul   edi
    mov   edi,eax
    pop   eax
    sub   eax,edi
    mov   edi,edx
    pop   edx
    sbb   edx,edi

skip3:

    pop   ecx

    inc   ecx
    cmp   ecx,1000
    jb    loopouter

; edx:eax Answer, 10 least significant digits

  }
}

raymond · August 13, 2011, 03:39:31 AM

In those early days, I used plain BCD's for this problem, doing computations only with the last 10 digits. No need for the conversion of a big number at the end. Still runs in a decent time: 0.12 sec on my i7.

I'll try and see if using modular exponentiation and BCD's will improve that. What was your timing Dave with "big numbers"?

dedndave · August 13, 2011, 03:56:29 AM

for N=1000, it was about 78 mS, as i recall
that yielded the full length result

for N=10000, i had time to go to the bathroom and grab a fresh soda :lol
it also yielded the full length result

clive · August 13, 2011, 01:50:50 PM

For the C version N=1000 was about 47 ms, N=10000 was about 5.8 s, N=100000 was about 571.3 s (3031782500 ?)
For the assembler version N=10000 was about 4.2 s, N=100000 was about 430.9 s but I had to adjust the balance of the modulus to keep results within 32-bits.
1.6 GHz Atom Netbook

dedndave · August 13, 2011, 03:39:46 PM

Quoteanother optimization would be to note that modulo 10 powers contribute nothing to the sum

well, it would be easy to add 1010101 (decimal) for N=1, 10, 100, 1000

i think there may be other simple shortcuts, as well - those that apply to powers of 2, for example
i looked at factoring, it doesn't seem like it would be very efficient
and - it may be faster, but it would require a lot more code

no - i think there is some other trick, here
most of the euler problems have "hidden" solutions :bg

clive · August 13, 2011, 04:33:48 PM

My observation was that 10**10, 20**20, .. 110*110, 120*120, etc contribute nothing to the least significant digits. Thus eliminating ~10% (non linear, but whatever) of the work.

dedndave · August 13, 2011, 04:46:34 PM

my mistake, Clive
you're right (again - lol)

my head may have been thinking exponents, but my hands were stuck on squares :P

clive · August 13, 2011, 05:31:33 PM

Quote from: dedndave
no - i think there is some other trick, here, most of the euler problems have "hidden" solutions

Yes, there are probably a lot of tricks and short cuts, the site has a lot of interesting problems. Some languages are better at this than others, at least with the postings it seems that some languages handle large numbers in a much more fluid manner, and permit brute force solutions with very simple coding.

Doing it in assembler is a particular pig because you have to think in binary and decimal, and the scope of the numbers. Raymond's BCD suggestion at least keeps the decimal/decades in focus. I've done BCD on the 6502/68K/x86, but not sure how much effort has gone into optimizing those instructions over the years.

The trick with this one was to truncate the computation of unused digits, allowing the solution to scale 10**2 for each additional decade in N.

dedndave · August 13, 2011, 06:16:18 PM

well - the challenge part is finding the efficient method
the fun part is coding it in assembler :bg

for this problem, my first instinct was to make it a horner's rule solution
that would apply better if it were
1¹ x 2² x 3³ x 4⁴...... :bg

anyways, binary exponentiation looks promising, and we only need to go as far as the low-order 34 bits in the result

it is a 2-step process
1) exponentiation loop
2) conversion to decimal

horner could be used for step 2
but, from my experience, there are faster algorithms for converting values that are smaller than 2 x CPU register width

clive · August 13, 2011, 08:56:04 PM

On a Phenom II X4 820, 2.8 GHz

For the C solution, skipping modulo 10 powers
N=10000, 1.512 seconds, 6237204500
N=100000, 149.378 seconds, 3031782500
N=1000000, 14955.012 seconds, 4077562500 ?

I could split the power computations into threads?

So Raymond, what kind of results are you getting with BCD, or other methodologies ?

dedndave · August 13, 2011, 09:26:25 PM

that's pretty fast :P
64-bit ?
not sure i can play on the same field with a 32-bit machine

clive · August 14, 2011, 02:06:24 AM

Quote from: dedndavethat's pretty fast :P
64-bit ? not sure i can play on the same field with a 32-bit machine

The machine is 64-bit, the C is MSVC 12.00, 32-bit, using 64-bit integers (unsigned __int64). It's about 2.8x the speed of the Atom, so figure 43000 seconds for N=1000000, say 12 hours or so on the netbook.

The run time for the previous assembler, with a slight modification to balance the mod 10000000000 a bit better so the first divide fits the result in 32-bits, but without the mod 10 power fix is only 8-9% slower.

raymond · August 14, 2011, 04:34:59 AM

Modular exponentiation is definitely faster. Got the timing down from 120 ms to 12 ms for the 1000 limit.

Result for the 10000 limit confirmed. Timing 150 ms.
Result for the 100000 limit different from clive: 4131782500, timing 1.88 sec.
Result for the 1000000 limit slightly different from clive: 5077562500, timing 23.4 sec.

I wonder what Dave would get on the last two.

All these should be significantly faster with a 64-bit program using integers instead of BCD's.

dedndave · August 14, 2011, 06:06:59 AM

i am working on an FPU version, Ray - seeing as i do not have a 64-bit machine :P
however, i am not using packed BCD values
i am using 64-bit integers
i don't recall what the different timings are for FPU integer math, so my method may not be that efficient
i will post code if i get it working - lol

clive · August 14, 2011, 02:52:20 PM

I have a pretty high confidence in the limit numbers, the algorithm is sound, and I'm not seeing any overflow. I've pulled some x86-64 compilers, and built both 64-bit and 128-bit integer variants to run on the Phenom. My exponential/power computation is brute force rather than efficient/optimized, hence the 100x increase with each decade, as apposed to the 10x, but it should be immune to the potential overflow of other methods. I could generate a table of n**n computation to see where our solutions diverge.

I downloaded a 64-bit version of MSVC, which surprisingly didn't support __int128, so I pulled gcc/mingw64. Microsoft are so screwed.

Dave doesn't your Prescott do 64-bit? or did you get one of the crippled ones? I had Win32 running on mine, but it would dual boot to 32 or 64-bit Linux. My 2005 Prescott box died a while back, I had to transplant it's guts to an AMD X2 refurbished donor box, and picked up Quad Phenom to have a more current dev box.

News:

Dave's nickname hijacked??