Hello,
I know that I'm slightly off-topic here, because my question is more related to mathematics than to assembler.
But I hope that someone here could help me:
I make fractions in the following way: 1/1, 1/2, 1/3, 1/4 and so on, then I add them in a loop using FPU like this:
sum = 1 + 1/2 + 1/3 + 1/4 and so on.
Assembler related question: Any comments to enhance the code? (I think I don't use FPU optimally)
Math related question: At the beginning, sum increases quickly then more and more slowly. That's logical, because the added fractions become smaller and smaller.
BUT if you add up (endlessly or at least up to max of 64bit value) will sum increase endlessly too, or will it stop to increase because of computational accuracy?
;...
;...
.data
RunTimeFmt db "%ut:%02uh:%02um:%02us",0
fmtsum db "sum = %.15f",13,10,0
fmterg db " 1/ %.f = %.16f",9,0
align 8
c64 dq 0
term real8 0.0
erg real8 ?
sum real8 0.0
.code
start:
_c64loop:
add dword ptr [c64],1
adc dword ptr [c64+4],0
fld1 ; term := term+1
fadd term
fstp term
fld1
fdiv term
fstp erg ; erg := 1/term
;invoke crt_printf,ADDR fmterg,term,erg
fld sum
fadd erg
fstp sum ; sum := sum+erg
;invoke crt_printf,ADDR fmtsum,sum
;getkey
cmp dword ptr[c64],0
jb _c64loop
cmp dword ptr[c64+4],01h
jne _c64loop
print "Ende ",13,10
;invoke GetTickCount
;sub eax,msec
;invoke RunTime,eax
invoke crt_printf,ADDR fmterg,term,erg
invoke crt_printf,ADDR fmtsum,sum
exit
end start
0t:00h:00m:52s 1/ 4294967296 = 0.0000000002328306 sum = 22.757925442928084
thks & regards
cobold
If you don't require all that accuracy, you may not want to use the FPU at all.
But, at first glance, I might be tempted to take a different approach altogether
to see if it is practical.
If you know the last value, then you know the common denominator.
Remember in school, to add fractions, we converted them all to a common
denominator, then add. By knowing in advance what that denominator is,
you might be able to gain a signifigant increase in speed by converting
them along the way, and adding numerators together in an accumulated value.
Just something to think about
Hmmmmm where is that FACTORIAL instruction when you need one - lol
you might google that and see what pops up
some guy smarter than me may have made a nice routine for that
QuoteMath related question:
In the decimal system, the reciprocal of a number would be infinitely precise only if the divisor is a power of 2 and/or 5 (the two prime factors of 10). In the binary system, it could be infinitely precise only if the divisor is a power of 2 (the only prime factor of 2).
Using the FPU, the precision will be limited to the Precision Control setting of the "Control Word". The best precision will be obtained with extended double precision where reciprocals will become 0 if the divisor exceeds approx. 10
-4950 (2
-16446. However, you only have a maximum of 64 significant binary digits with the maximum precision control. Once your rounded binary fraction becomes smaller than the last binary digit of your sum, that sum will stop increasing on the FPU unless you modify your algo into more than one loop.
CAUTION: When you open a program under Windows as the OS, a clean FPU is provided for the program. However, its precision control is only set for double precision, i.e. 64 bits of which only 53 bits for the mantissa precision. To get full precision (80 bits of which 64 bits is for the mantissa precision), you either use finit or modify the Control Word to extended double precision.
The FPU is always more efficient if you can keep all required data in its registers instead of in memory. All transfers from memory to the FPU registers take some time.
Most FPU operations can run in parallel with the ALU. Some FPU operations can also be performed while it's performing some other operations. If you can perform some basic CPU operations while the FPU is busy, it's like free time. Without all the printing, your algo could look as follows. (I also hate typing when it would not serve any practical purpose).
.data
c64 dd 0,0
sum real8,0
.code
start:
finit
fldz ;S (sum)
fld1 ;X (divisor)
_c64loop:
fld1
fdiv st,st(1)
fld1
faddp st(2),st ;increment divisor, done while division being performed
add c64,1
.if CARRY?
jmp printit
;instructions to increment the counter beyond 32 bits coud be added here
;such as add c64[4],1, cmp c64[4],?, ja printit
.endif
faddp st(2),st
jmp _c64loop
printit:
faddp st(2),st ;last summation which was skipped
fstp st ;get rid of the divisor
fstp sum
;ready for printing
Note: the above has not been tested
As a side note, 'mathematically', this is a divergent series that will never end; however, if you alternate the signs (subtract every even denomenator instead of adding), the infinite series converges to ln(2). If you add the squares of the numbers, you get pi^2/6. Just felt like being a dork :bg
-ac
see - i knew there were some guys smarter than me :bg
(that guy in redskull's avatar pic ain't one of em - lol)
i have $10 says Ray's code works first time
My version is a little different than Raymond's.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
sum REAL8 0.0
denom dd 0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
N=1
REPEAT 20
fldz ; s
fld1 ; d,s
mov ebx, N
@@:
fld1 ; 1,d,s
fdiv st, st(1) ; 1/d,d,s
faddp st(2), st ; d,s
fld1 ; 1,d,s
faddp st(1), st ; d,s
dec ebx
jnz @B
fistp denom
fstp sum
dec denom
invoke crt_printf, chr$("%d",9,"%f",10), denom, sum
N=N+1
ENDM
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
And this compares the cycle counts for the inner-loop code from my and cobold's version:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
term real8 0.0
erg real8 ?
sum real8 0.0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke Sleep, 3000
fldz ; s
fld1 ; d,s
counter_begin 1000, HIGH_PRIORITY_CLASS
;-------------------------------------------------------------
; The REPEAT was necessary to get the cycle count without the
; FDIV instruction up to something that could reasonably be
; measured.
;-------------------------------------------------------------
REPEAT 10
fld1 ; 1,d,s
fdiv st, st(1) ; 1/d,d,s
faddp st(2), st ; d,s
fld1 ; 1,d,s
faddp st(1), st ; d,s
;----------------------------------------
; Make sure the coprocessor is finished.
;----------------------------------------
fwait
ENDM
counter_end
print ustr$(eax), " cycles",13,10
finit
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 10
fld1 ; term := term+1
fadd term
fstp term
fld1
fdiv term
fstp erg ; erg := 1/term
fld sum
fadd erg
fstp sum ; sum := sum+erg
fwait
ENDM
counter_end
print ustr$(eax), " cycles",13,10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Results on my P3:
304 cycles
379 cycles
And the results with the FDIV instructions commented out:
46 cycles
77 cycles
So it looks like the most effective optimization would need to somehow eliminate the FDIV instruction.
Thanks to everbody!
without implementing it, i'd assume that it is faster (and more precise) to sum up resp. multiply up the nominator and the demoninator of the final result seperatedly and divide at the end of the calculation.
Something like that:
sum(n) = 1/1 + 1/2 + 1/3 + 1/4 + ... + 1/n = ((...((1*2+1)*3+2)*4+6)* ... )*n+(n-1)!) / n!
which reads in some pseudocode:
nom = 1
denom = 1
for i=2 to n
nom = nom * i + denom
denom = denom * i
sum = nom/denom
I'd expect some parallel execution of the arithhmetic operations.
To improve the load of the execution units in the ALU, the loop may be unrolled and more than one parallel result could be calculated. Something like that:
nom1 = 1
denom1 = 1
nom2 = 1
denom2 = 2
for i=3 to (n-1) step 2
nom1 = nom1 * i + denom1
denom1 = denom1 * i
nom2 = nom2 * (i+1) + denom2
denom2 = denom2 * ( i+1)
sum = (nom1*denom2 + nom2*denom1)/(denom1*denom2)
regards,
Christian
This isn't well tested, and I suspect that the FPU code could be further optimized.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
sum REAL8 0.0
i dd 2
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
fld1 ; d=1
fld1 ; n=1,d=1
@@:
fild i ; i,n,d
fmul st(1), st ; i,n=n*i,d
fstp st ; n=n*i,d
fadd st, st(1) ; n=n*i+d,d
fild i ; i,n=n*i+d,d
fmulp st(2), st ; n=n*i+d,d=d*i
inc i
cmp i, 20
jna @B
fdiv st, st(1) ; n/d,d
fstp sum ; d
fstp st ; empty
invoke crt_printf, chr$("%f",10), sum, 10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
term real8 0.0
erg real8 ?
sum real8 0.0
i dd 0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke Sleep, 3000
fldz ; s
fld1 ; d,s
counter_begin 1000, HIGH_PRIORITY_CLASS
;-------------------------------------------------------------
; The REPEAT was necessary to get the cycle count without the
; FDIV instruction up to something that could reasonably be
; measured.
;-------------------------------------------------------------
REPEAT 10
fld1 ; 1,d,s
fdiv st, st(1) ; 1/d,d,s
faddp st(2), st ; d,s
fld1 ; 1,d,s
faddp st(1), st ; d,s
ENDM
;----------------------------------------
; Make sure the coprocessor is finished.
;----------------------------------------
fwait
counter_end
print ustr$(eax), " cycles",13,10
finit
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 10
fld1 ; term := term+1
fadd term
fstp term
fld1
fdiv term
fstp erg ; erg := 1/term
fld sum
fadd erg
fstp sum ; sum := sum+erg
ENDM
fwait
counter_end
print ustr$(eax), " cycles",13,10
finit
fld1
fld1
counter_begin 1000, HIGH_PRIORITY_CLASS
REPEAT 10
fild i ; i,n,d
fmul st(1), st ; i,n=n*i,d
fstp st ; n=n*i,d
fadd st, st(1) ; n=n*i+d,d
fild i ; i,n=n*i+d,d
fmulp st(2), st ; n=n*i+d,d=d*i
ENDM
fwait
counter_end
print ustr$(eax), " cycles",13,10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
317 cycles
398 cycles
112 cycles
I have doubts that unrolling the loop could help much because the loop overhead is so small compared to the instructions in the loop.
Quote
I have doubts that unrolling the loop could help much because the loop overhead is so small compared to the instructions in the loop.
i thought of an unrolling and calculation of 2 or 4 results in parallel not to reduce the loop overhead but to break down the dependency chain and achieve more parallelism that way (and/or calculation in SSE2 in single or double precision with 2 or 4 calculations in parallel)
For high numbers of n, you can cut out all the middle-men and just simply use the euler constant plus the digamma of n+1; it's almost certainly quicker to approximate the digamma at one number than do millions of additions. Of course, that's not really the same problem, but I would think it would be the fastest way to get the partial sum.
-ac
That's true for sure, redskull, but the really interesting part of the problem is not to get a value of the harmonic series but to accelerate the assembly program to calculate it :toothy
The problem with accumulating the numerator and denominator values in the loop is that they will eventually become too large. Stored in a REAL8 both values are interpreted as +INFINITY starting when the loop counter reaches 171. Retaining them in the FPU provides more range, but even so when the loop counter reaches 1755 the result of the divide operation stored as a REAL8 is interpreted as a Quiet NaN. Even if the result were stored as a REAL10, it would become invalid long before a result calculated entirely within the loop would.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
_FPCLASS_SNAN equ 0001h ; signaling NaN
_FPCLASS_QNAN equ 0002h ; quiet NaN
_FPCLASS_NINF equ 0004h ; negative infinity
_FPCLASS_NN equ 0008h ; negative normal
_FPCLASS_ND equ 0010h ; negative denormal
_FPCLASS_NZ equ 0020h ; -0
_FPCLASS_PZ equ 0040h ; +0
_FPCLASS_PD equ 0080h ; positive denormal
_FPCLASS_PN equ 0100h ; positive normal
_FPCLASS_PINF equ 0200h ; positive infinity
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
sum REAL8 0.0
i dd 2
n REAL8 0.0
d REAL8 0.0
skipn dd 0
skipd dd 0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
fld1 ; d=1
fld1 ; n=1,d=1
@@:
fild i ; i,n,d
fmul st(1), st ; i,n=n*i,d
fstp st ; n=n*i,d
fadd st, st(1) ; n=n*i+d,d
fild i ; i,n=n*i+d,d
fmulp st(2), st ; n=n*i+d,d=d*i
fst n
fxch st(1)
fst d
fxch st(1)
invoke crt__fpclass, n
.IF eax != _FPCLASS_PN && skipn == 0
push eax
print ustr$(i), 9
pop eax
print uhex$(eax),13,10
inc skipn
.ENDIF
invoke crt__fpclass, d
.IF eax != _FPCLASS_PN && skipd == 0
push eax
print ustr$(i), 9
pop eax
print uhex$(eax),13,10
inc skipd
.ENDIF
inc i
cmp i, 1755
jna @B
fdiv st, st(1) ; n/d,d
fstp sum ; d
fstp st ; empty
dec i
print ustr$(i),9
invoke crt__fpclass, sum
print uhex$(eax),13,10
invoke crt_printf, chr$("%f",10), sum, 10
invoke crt_printf, chr$("%f",10), n, 10
invoke crt_printf, chr$("%f",10), d, 10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
171 00000200
171 00000200
1755 00000002
-1.#IND00
1.#INF00
1.#INF00