Adding up fractions with FPU

cobold · May 02, 2009, 04:16:03 PM

Hello,

I know that I'm slightly off-topic here, because my question is more related to mathematics than to assembler.
But I hope that someone here could help me:
I make fractions in the following way: 1/1, 1/2, 1/3, 1/4 and so on, then I add them in a loop using FPU like this:
sum = 1 + 1/2 + 1/3 + 1/4 and so on.

Assembler related question: Any comments to enhance the code? (I think I don't use FPU optimally)

Math related question: At the beginning, sum increases quickly then more and more slowly. That's logical, because the added fractions become smaller and smaller.
BUT if you add up (endlessly or at least up to max of 64bit value) will sum increase endlessly too, or will it stop to increase because of computational accuracy?

Code Select


;...
;...
.data
    RunTimeFmt  db "%ut:%02uh:%02um:%02us",0
    fmtsum      db "sum = %.15f",13,10,0
    fmterg      db " 1/ %.f = %.16f",9,0
align 8    
    c64     dq 0
    term    real8 0.0
    erg     real8 ?
    sum     real8 0.0

.code
start:
_c64loop:
        add dword ptr [c64],1
        adc dword ptr [c64+4],0
        
        fld1                    ; term := term+1
        fadd term
        fstp term
        
        fld1
        fdiv term
        fstp erg                ; erg := 1/term
        ;invoke crt_printf,ADDR fmterg,term,erg
        
        fld sum
        fadd erg
        fstp sum                ; sum := sum+erg
        ;invoke crt_printf,ADDR fmtsum,sum
        ;getkey
        
        cmp dword ptr[c64],0
        jb _c64loop
        cmp dword ptr[c64+4],01h
        jne _c64loop
        
    print "Ende ",13,10        
    ;invoke GetTickCount
    ;sub eax,msec
    ;invoke RunTime,eax
    invoke crt_printf,ADDR fmterg,term,erg
    invoke crt_printf,ADDR fmtsum,sum
exit
end start

0t:00h:00m:52s 1/ 4294967296 = 0.0000000002328306 sum = 22.757925442928084

thks & regards

cobold

dedndave · May 02, 2009, 05:27:17 PM

If you don't require all that accuracy, you may not want to use the FPU at all.

But, at first glance, I might be tempted to take a different approach altogether
to see if it is practical.

If you know the last value, then you know the common denominator.
Remember in school, to add fractions, we converted them all to a common
denominator, then add. By knowing in advance what that denominator is,
you might be able to gain a signifigant increase in speed by converting
them along the way, and adding numerators together in an accumulated value.

Just something to think about

Hmmmmm where is that FACTORIAL instruction when you need one - lol
you might google that and see what pops up

some guy smarter than me may have made a nice routine for that

raymond · May 02, 2009, 07:00:58 PM

QuoteMath related question:

In the decimal system, the reciprocal of a number would be infinitely precise only if the divisor is a power of 2 and/or 5 (the two prime factors of 10). In the binary system, it could be infinitely precise only if the divisor is a power of 2 (the only prime factor of 2).

Using the FPU, the precision will be limited to the Precision Control setting of the "Control Word". The best precision will be obtained with extended double precision where reciprocals will become 0 if the divisor exceeds approx. 10^-4950 (2^-16446. However, you only have a maximum of 64 significant binary digits with the maximum precision control. Once your rounded binary fraction becomes smaller than the last binary digit of your sum, that sum will stop increasing on the FPU unless you modify your algo into more than one loop.

CAUTION: When you open a program under Windows as the OS, a clean FPU is provided for the program. However, its precision control is only set for double precision, i.e. 64 bits of which only 53 bits for the mantissa precision. To get full precision (80 bits of which 64 bits is for the mantissa precision), you either use finit or modify the Control Word to extended double precision.

The FPU is always more efficient if you can keep all required data in its registers instead of in memory. All transfers from memory to the FPU registers take some time.

Most FPU operations can run in parallel with the ALU. Some FPU operations can also be performed while it's performing some other operations. If you can perform some basic CPU operations while the FPU is busy, it's like free time. Without all the printing, your algo could look as follows. (I also hate typing when it would not serve any practical purpose).

Code Select

.data
    c64     dd 0,0
    sum     real8,0

.code
start:
    finit
    fldz        ;S (sum)
    fld1        ;X (divisor)
_c64loop:
    fld1
    fdiv st,st(1)
    fld1
    faddp st(2),st    ;increment divisor, done while division being performed
    add c64,1
    .if CARRY?
        jmp   printit
        ;instructions to increment the counter beyond 32 bits coud be added here
        ;such as add c64[4],1, cmp c64[4],?, ja printit
    .endif
    faddp st(2),st
    jmp  _c64loop

printit:
    faddp st(2),st    ;last summation which was skipped
    fstp st            ;get rid of the divisor
    fstp sum
;ready for printing

Note: the above has not been tested

redskull · May 02, 2009, 07:28:43 PM

As a side note, 'mathematically', this is a divergent series that will never end; however, if you alternate the signs (subtract every even denomenator instead of adding), the infinite series converges to ln(2). If you add the squares of the numbers, you get pi^2/6. Just felt like being a dork :bg

-ac

dedndave · May 02, 2009, 07:37:14 PM

see - i knew there were some guys smarter than me :bg
(that guy in redskull's avatar pic ain't one of em - lol)

i have $10 says Ray's code works first time

MichaelW · May 02, 2009, 10:07:23 PM

My version is a little different than Raymond's.

Code Select


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      sum   REAL8 0.0
      denom dd    0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    N=1

    REPEAT 20

      fldz              ; s
      fld1              ; d,s
      mov ebx, N
    @@:
      fld1              ; 1,d,s
      fdiv  st, st(1)   ; 1/d,d,s
      faddp st(2), st   ; d,s
      fld1              ; 1,d,s
      faddp st(1), st   ; d,s
      dec ebx
      jnz @B

      fistp denom
      fstp sum
      dec denom
      invoke crt_printf, chr$("%d",9,"%f",10), denom, sum

      N=N+1

    ENDM

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

And this compares the cycle counts for the inner-loop code from my and cobold's version:

Code Select


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      term  real8 0.0
      erg   real8 ?
      sum   real8 0.0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    fldz              ; s
    fld1              ; d,s

    counter_begin 1000, HIGH_PRIORITY_CLASS

      ;-------------------------------------------------------------
      ; The REPEAT was necessary to get the cycle count without the
      ; FDIV instruction up to something that could reasonably be
      ; measured.
      ;-------------------------------------------------------------

      REPEAT 10

        fld1              ; 1,d,s
        fdiv  st, st(1)   ; 1/d,d,s
        faddp st(2), st   ; d,s
        fld1              ; 1,d,s
        faddp st(1), st   ; d,s

        ;----------------------------------------
        ; Make sure the coprocessor is finished.
        ;----------------------------------------

        fwait

      ENDM

    counter_end
    print ustr$(eax), " cycles",13,10

    finit

    counter_begin 1000, HIGH_PRIORITY_CLASS

      REPEAT 10

        fld1                    ; term := term+1
        fadd term
        fstp term

        fld1
        fdiv term
        fstp erg                ; erg := 1/term

        fld sum
        fadd erg
        fstp sum                ; sum := sum+erg

        fwait

      ENDM

    counter_end
    print ustr$(eax), " cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Results on my P3:

Code Select


304 cycles
379 cycles

And the results with the FDIV instructions commented out:

Code Select


46 cycles
77 cycles

So it looks like the most effective optimization would need to somehow eliminate the FDIV instruction.

cobold · May 03, 2009, 08:33:39 PM

Thanks to everbody!

chrisw · May 08, 2009, 02:30:44 PM

without implementing it, i'd assume that it is faster (and more precise) to sum up resp. multiply up the nominator and the demoninator of the final result seperatedly and divide at the end of the calculation.

Something like that:

sum(n) = 1/1 + 1/2 + 1/3 + 1/4 + ... + 1/n = ((...((1*2+1)*3+2)*4+6)* ... )*n+(n-1)!) / n!

which reads in some pseudocode:

nom = 1
denom = 1

for i=2 to n
nom = nom * i + denom
denom = denom * i

sum = nom/denom

I'd expect some parallel execution of the arithhmetic operations.

To improve the load of the execution units in the ALU, the loop may be unrolled and more than one parallel result could be calculated. Something like that:

nom1 = 1
denom1 = 1

nom2 = 1
denom2 = 2

for i=3 to (n-1) step 2

nom1 = nom1 * i + denom1
denom1 = denom1 * i

nom2 = nom2 * (i+1) + denom2
denom2 = denom2 * ( i+1)

sum = (nom1*denom2 + nom2*denom1)/(denom1*denom2)

regards,
Christian

MichaelW · May 08, 2009, 05:17:52 PM

This isn't well tested, and I suspect that the FPU code could be further optimized.

Code Select


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      sum REAL8 0.0
      i   dd    2
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    fld1              ; d=1
    fld1              ; n=1,d=1
  @@:
    fild i            ; i,n,d
    fmul st(1), st    ; i,n=n*i,d
    fstp st           ; n=n*i,d
    fadd st, st(1)    ; n=n*i+d,d
    fild i            ; i,n=n*i+d,d
    fmulp st(2), st   ; n=n*i+d,d=d*i
    inc i
    cmp i, 20
    jna @B
    fdiv st, st(1)    ; n/d,d
    fstp sum          ; d
    fstp st           ; empty

    invoke crt_printf, chr$("%f",10), sum, 10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Code Select


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      term  real8 0.0
      erg   real8 ?
      sum   real8 0.0
      i     dd    0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    fldz              ; s
    fld1              ; d,s

    counter_begin 1000, HIGH_PRIORITY_CLASS

      ;-------------------------------------------------------------
      ; The REPEAT was necessary to get the cycle count without the
      ; FDIV instruction up to something that could reasonably be
      ; measured.
      ;-------------------------------------------------------------

      REPEAT 10

        fld1              ; 1,d,s
        fdiv  st, st(1)   ; 1/d,d,s
        faddp st(2), st   ; d,s
        fld1              ; 1,d,s
        faddp st(1), st   ; d,s

      ENDM

      ;----------------------------------------
      ; Make sure the coprocessor is finished.
      ;----------------------------------------

      fwait

    counter_end
    print ustr$(eax), " cycles",13,10

    finit

    counter_begin 1000, HIGH_PRIORITY_CLASS

      REPEAT 10

        fld1                    ; term := term+1
        fadd term
        fstp term

        fld1
        fdiv term
        fstp erg                ; erg := 1/term

        fld sum
        fadd erg
        fstp sum                ; sum := sum+erg

      ENDM

      fwait

    counter_end
    print ustr$(eax), " cycles",13,10

    finit
    fld1
    fld1

    counter_begin 1000, HIGH_PRIORITY_CLASS

      REPEAT 10

        fild i            ; i,n,d
        fmul st(1), st    ; i,n=n*i,d
        fstp st           ; n=n*i,d
        fadd st, st(1)    ; n=n*i+d,d
        fild i            ; i,n=n*i+d,d
        fmulp st(2), st   ; n=n*i+d,d=d*i

      ENDM

      fwait

    counter_end
    print ustr$(eax), " cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Code Select


317 cycles
398 cycles
112 cycles

I have doubts that unrolling the loop could help much because the loop overhead is so small compared to the instructions in the loop.

chrisw · May 08, 2009, 06:50:46 PM

Quote
I have doubts that unrolling the loop could help much because the loop overhead is so small compared to the instructions in the loop.

i thought of an unrolling and calculation of 2 or 4 results in parallel not to reduce the loop overhead but to break down the dependency chain and achieve more parallelism that way (and/or calculation in SSE2 in single or double precision with 2 or 4 calculations in parallel)

redskull · May 08, 2009, 07:08:25 PM

For high numbers of n, you can cut out all the middle-men and just simply use the euler constant plus the digamma of n+1; it's almost certainly quicker to approximate the digamma at one number than do millions of additions. Of course, that's not really the same problem, but I would think it would be the fastest way to get the partial sum.

-ac

chrisw · May 09, 2009, 10:00:37 AM

That's true for sure, redskull, but the really interesting part of the problem is not to get a value of the harmonic series but to accelerate the assembly program to calculate it :toothy

MichaelW · May 09, 2009, 02:30:42 PM

The problem with accumulating the numerator and denominator values in the loop is that they will eventually become too large. Stored in a REAL8 both values are interpreted as +INFINITY starting when the loop counter reaches 171. Retaining them in the FPU provides more range, but even so when the loop counter reaches 1755 the result of the divide operation stored as a REAL8 is interpreted as a Quiet NaN. Even if the result were stored as a REAL10, it would become invalid long before a result calculated entirely within the loop would.

Code Select


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

_FPCLASS_SNAN equ 0001h  ; signaling NaN
_FPCLASS_QNAN equ 0002h  ; quiet NaN
_FPCLASS_NINF equ 0004h  ; negative infinity
_FPCLASS_NN   equ 0008h  ; negative normal
_FPCLASS_ND   equ 0010h  ; negative denormal
_FPCLASS_NZ   equ 0020h  ; -0
_FPCLASS_PZ   equ 0040h  ; +0
_FPCLASS_PD   equ 0080h  ; positive denormal
_FPCLASS_PN   equ 0100h  ; positive normal
_FPCLASS_PINF equ 0200h  ; positive infinity

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      sum   REAL8 0.0
      i     dd    2
      n     REAL8 0.0
      d     REAL8 0.0
      skipn dd    0
      skipd dd    0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    fld1              ; d=1
    fld1              ; n=1,d=1
  @@:
    fild i            ; i,n,d
    fmul st(1), st    ; i,n=n*i,d
    fstp st           ; n=n*i,d
    fadd st, st(1)    ; n=n*i+d,d
    fild i            ; i,n=n*i+d,d
    fmulp st(2), st   ; n=n*i+d,d=d*i

    fst n
    fxch st(1)
    fst d
    fxch st(1)

    invoke crt__fpclass, n
    .IF eax != _FPCLASS_PN && skipn == 0
      push eax
      print ustr$(i), 9
      pop eax
      print uhex$(eax),13,10
      inc skipn
    .ENDIF

    invoke crt__fpclass, d
    .IF eax != _FPCLASS_PN && skipd == 0
      push eax
      print ustr$(i), 9
      pop eax
      print uhex$(eax),13,10
      inc skipd
    .ENDIF

    inc i
    cmp i, 1755
    jna @B

    fdiv st, st(1)    ; n/d,d
    fstp sum          ; d
    fstp st           ; empty

    dec i
    print ustr$(i),9
    invoke crt__fpclass, sum
    print uhex$(eax),13,10

    invoke crt_printf, chr$("%f",10), sum, 10
    invoke crt_printf, chr$("%f",10), n, 10
    invoke crt_printf, chr$("%f",10), d, 10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Code Select


171     00000200
171     00000200
1755    00000002
-1.#IND00
1.#INF00
1.#INF00

News:

Adding up fractions with FPU