News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Dave's nickname hijacked??

Started by raymond, August 12, 2011, 03:04:56 AM

Previous topic - Next topic

dedndave

actually, the prescott i have is em64t compatible
however, i have no idea how to execute 64-bit instructions - or even if i can - with 32-bit XP

my current solution is to use the 64-bit integer capability of the FPU
i am not too familiar with using integers with the FPU, and even less familiar with its' packed BCD capability - lol
but, i am making headway

i looked at multiple-precision multiplication and became disinterested   :P
i have done it before, the Ling Long Kai Fang routines use 64 x 64 multiply with 32-bit registers

raymond

Clive, your numbers are probably correct. I must have a minor bug with the BCD's which affects the higher significant digits. I'll try and find it.

The reason I think my BCD results may be wrong is that I wrote another algo using integers to get the lower 9 digits (i.e. <32 bits) and results agree with all your lower 9-digit results.

As expected, timings with integer math and modular exponentiation are considerably lower:
1000 0.2 ms
10000 2.7 ms
100000 35 ms
1000000 250 ms

Dave, I'm still wondering why you would have chosen #48 as the first one to solve. It seems rather random.
Once you solve #12, you may like the PDF I wrote for that problem.
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

dedndave

actually, one of the other members mentioned that specific problem
when i looked at it, i thought it was a horner's rule problem
as i mentioned, i had already written routines to exponentiate and convert the bignums
so, it looked easy and i gave it a whirl   :P

it has been many years since i took the math courses in college
i have forgotten most of what i learned because i rarely have to use it to design circuits
i use derivatives once in a while
now and then, i have to understand how something works that is based on higher math
only enough to grasp the underlying principal, though   :P

at any rate, many of the euler problems are a little over my head

dedndave

it just dawned on me.....

this does not apply to the euler problem because 10 digits are required and Nmax is only 1000
but, it does apply to these extended versions that you guys have created

the one thing that has remained constant in all these problems is that the result is limited to 10 digits
10 digits may be fully expressed with 34 binary bits
that means that only the lower 17 bits of N will effect the result (squareroot)

for the Nmax=1,000,000 problem, only the lower 5 digits of N will effect the 10 digit result
in other words, there are several repeated calculations   :bg
notice the result ends in 0's
you could perform the calculation for N=1 to 99,999, then multiply by 100   :U

raymond

Most (if not all) of the first 150-200 problems are accessible with high school math knowledge, a little logic, pattern spotting, or maybe a little extra knowledge about how some numbers can be computed.

My knowledge is very little above high school math so that many of the problems above #250 are also way above my head. Some of the exceptions above #300 are:
309, 311, 315, 321, 327, 333, 341, most of which I had a hand in their development.
That last one (#341) was fun since my algo ran 100 times faster than that of the original proposer who is an excellent problem solver.
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

dedndave

i may look at them - after a bit   :P
too many irons in the fire, at the moment

another revelation - lol
i can use 80-bit extended reals, rather than 64-bit integers
so long as i keep everything under 64 bits, there should be no loss in precision
this being the "native language" of the FPU should be faster

raymond

Clive,
Found my stupid error. My initial algo was based on a maximum limit having at most 4 digits. I simply forgot to adjust that when I started raising the limit above 10000 and it was affecting all the odd numbers. All my results now agree with yours.

It's surprizing that I was still getting the last 8 digits correct with that error. Dave may have something regarding a shortcut. It may be worth investigating.

Quoteso long as i keep everything under 64 bits

Under is the correct word. Qwords are treated as signed by the FPU.
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

clive

Well this was the original 5 min solution. I could probably do it with doubles, or extended precision. I seemed more viable than computing all the digits. There are probably far better ways of doing the power/exponent function, I was just worried about keeping the number within the scope of a 64-bit number, as it gets out of control very quickly, obviously.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int i, j;
  unsigned __int64 x, y;

  x = 0;

  for(i=1; i<=1000; i++)
  {
    y = i;

    for(j=1; j<i; j++)
    {
      y *= i;
      y %= 10000000000;
    }

    x += y;
  }

  x %= 10000000000;

  printf("%4d %010I64d %010I64d\n",i-1, y, x);
}
It could be a random act of randomness. Those happen a lot as well.

dedndave

ok - i am going to give up some code before i am done perfecting it
normally, i don't do that because someone else will come along and do it and claim it as theirs - lol
but, in this case, it seems to be just the 3 of us and i don't feel too worried

the game plan is to use right-to-left binary exponentiation
we only need 34 bits in the result, but i am designing it to use 42 bits
because we have room and i think we could get an 11th digit if we wanted it

my idea is to use all 8 registers of the FPU stack and rotate them like a revolver cylinder

that way, we eliminate memory accesses inside the loop

i am designing the loops from the inside out
here is what i have so far...
        jmp short loop_entry

;-------------------------------------------------------------------

mul_operand_to_accumulator:
        fincstp
        fincstp

;st(0) = exponentiation accumulator
;st(1) = 42-bit modulo divisor = 4,398,046,511,104
;st(2) = addition accumulator
;st(3) = 21-bit modulo divisor = 2,097,152
;st(4) = 1
;st(5) = operand (N)
;st(6) = temporary operand (N^2^x)
;st(7) = 21-bit modulo divisor = 2,097,152

        fmul    st(6)
        fprem                     ;lower 42 bits of exponentiation accumulator
        fdecstp
        fdecstp

;-------------------------------------------------------------------

square_temp_operand:

;st(0) = temporary operand (N^2^x)
;st(1) = 21-bit modulo divisor
;st(2) = exponentiation accumulator
;st(3) = 42-bit modulo divisor = 4,398,046,511,104
;st(4) = addition accumulator
;st(5) = 21-bit modulo divisor = 2,097,152
;st(6) = 1
;st(7) = operand (N)

        fmul    st(0)
        fprem                     ;lower 21 bits of temporary operand

;-------------------------------------------------------------------

loop_entry:
        shr     eax,1
        jc      mul_operand_to_accumulator

        jnz     square_temp_operand

;-------------------------------------------------------------------

        fincstp
        fincstp
        fincstp
        fincstp

;st(0) = addition accumulator
;st(1) = 21-bit modulo divisor = 2,097,152
;st(2) = 1
;st(3) = operand (N)
;st(4) = temporary operand (N^2^x)
;st(5) = 21-bit modulo divisor
;st(6) = exponentiation accumulator
;st(7) = 42-bit modulo divisor = 4,398,046,511,104

        fadd    st(6)
        fprem                     ;lower 21 bits of addition accumulator

;-------------------------------------------------------------------


i was doing it with integers, but switched to extended reals and had to start over   :'(

at the end, there, i see i can re-arrange the registers and eliminate a couple rotations

dedndave

well - the loop assembles....
....and runs with no exceptions....
....but i don't get the right answer, yet

i am in "debug" mode, now   :P

dedndave

ok - the loop works, to some degree
and, it's very fast
i am starting to count clock cycles instead of mS   :U
something in the order of 2,000,000 cycles
if you prefer a time, it is in the order of 667 microseconds on my 3 GHz P4

however, i am having a precision problem - i haven't sorted it out, yet
the time shouldn't change signifigantly if fixed unless i have to extend precision via adding instructions

using fincstp and fdecstp works like a charm

FORTRANS

Quote from: dedndave on August 14, 2011, 08:31:35 PM
ok - i am going to give up some code before i am done perfecting it
normally, i don't do that because someone else will come along and do it and claim it as theirs - lol
but, in this case, it seems to be just the 3 of us and i don't feel too worried

Hi,

   Well actually I was trying a BCD version.  And while I eventually
got the N = 10 value correct, I can't duplicate Clive's numbers for
N = 10,000.  So I assume my N = 1,000 is bad as well.  Pegs the
CPU at 100% nicely though.  And it probably should be thrown
out and started over if I wanted reasonable performance.
N = 1,000 second(s), N = 10,000 minutes.

Cheers,

Steve N.

dedndave

welcome to the fray, Steve   :bg
for 1,000, the first digit is a 9 and the last one is a 0

this code is somewhat correct
the problem is lack of precision
i may get to play with it today

uArb2StrL9 is my binary-to-ascii routine
i will write a special/faster version once i fix the loop
Nmax    EQU 1000

        .DATA

fpTzero  real10 0.0
fpTone   real10 1.0
fpTmod21 real10 2097152.0
fpTmod42 real10 4398046511104.0

        .DATA?

nQResult dq ?
szBuffer db 24 dup(?)

        .CODE

_main   PROC

;st(0) = temporary operand (N^(2^x))                  start with Nmax
;st(1) = 21-bit modulo divisor = 2,097,152
;st(2) = exponentiation accumulator                   start with 1
;st(3) = 42-bit modulo divisor = 4,398,046,511,104
;st(4) = 1
;st(5) = operand (N)                                  start with Nmax
;st(6) = additive accumulator (result)                start with 0
;st(7) = 42-bit modulo divisor = 4,398,046,511,104

        finit
        mov     edx,Nmax
        fld     fpTmod42
        push    edx
        fld     fpTzero
        fild dword ptr [esp]
        fld     fpTone
        fld     fpTmod42
        fld     fpTone
        fld     fpTmod21
        fild dword ptr [esp]
        fwait
        pop     eax
        jmp short exp_loop_entry

top_of_outer_loop:
        fdecstp
        mov     eax,edx
        fsub    st(0),st(7)
        fdecstp
        fdecstp
        ffree   st(7)
        fld     st(1)
        fdecstp
        ffree   st(7)
        fld     st(4)
        jmp short exp_loop_entry

mul_operand_to_accumulator:
        fincstp
        fincstp
        fmul    st(0),st(6)
        fprem                     ;lower 42 bits of exponentiation accumulator
        fdecstp
        fdecstp

square_temp_operand:
        fmul    st(0),st(0)
        fprem                     ;lower 21 bits of temporary operand

exp_loop_entry:
        shr     eax,1
        jc      mul_operand_to_accumulator

        jnz     square_temp_operand

;add exponentiated accumulator to additive accumulator

        fdecstp
        fdecstp
        dec     edx
        fadd    st(0),st(4)
        fprem                     ;lower 42 bits of additive accumulator
        jnz     top_of_outer_loop

        fwait
        fistp   nQResult
        ffree   st(6)
        ffree   st(5)
        ffree   st(4)
        ffree   st(3)
        ffree   st(2)
        ffree   st(1)
        ffree   st(0)
        INVOKE  uArb2StrL9,offset nQResult,sizeof nQResult,offset szBuffer,sizeof szBuffer
        sub     ecx,10
        jbe     skip_it

        add     edx,ecx

skip_it:
        print   edx
        print   chr$(13,10)
        inkey
        exit

_main   ENDP

FORTRANS

Quote from: dedndave on August 15, 2011, 01:12:47 PM
welcome to the fray, Steve   :bg
for 1,000, the first digit is a 9 and the last one is a 0

Hi,

   Thanks.  Stomped another bug, and I matched Clive's
10,000 value and the digits you mentioned for N = 1,000.

Regards,

Steve

dedndave

nice work   :U

i found that mine works perfectly.....
up to Nmax = 7