It was back in March or April that I started working on re-building some of the assembly and systems level skills that I've had on the back burner for several years now. I've always liked numbers and I've always understood that they can become quite large although I've never really grasped the true meaning of things like +/- infinity. At any rate, I started working on multiple-precision routines to see how large I could get the Fibonacci series to grow on the little PC I was working with at the time. Well, I was really proud because I had written a dos based console app that could write the first 150000 terms into the Bit-Bucket, the nul device that is, in about 3 hours and a few odd minutes.
Enter Ed Beroset ... I searched on the net for someone who seemed to know something about this ole guy Fibonacci and the other fellow named Lucas and I found Ed! Turns out that he even knew about some fancy tricks with log's that could be used to figure out how many digits a particular Fibonacci term would have. So, I sent him a copy of my 'fine' program and asked him for comments. He was wonderful and in a few days sent me back a tiny program that did the same thing I was doing in about 3 minutes! It was written using 16-bit code where I had used 32-bits and what I had imagined would be pretty quick code! Ah, it was pretty and I don't even want to think about how many weeks it took me to get that first working model tuned to the point where it didn't take a day to do what I was trying to accomplish!
Well, I certainly learned a lot from his code and I changed my ways! Instead of using binary arithmetic and going thru the drudgery of converting all those bits back into base 10 digits I decided to give BCD a try in 32-bit mode. Ed had used unpacked BCD and the 16-bit instructions and his program flew! I searched around on the net and found an algorithm that I had seen implemented in hardware that did the same thing ... it was amazing and I've been trying to finish my program ever since I found it and actually got it working!
Now, I'm near the limit of what I can do in 32-bit ALU mode with instructions that still work on those ancient 386's. How close to that limit, I really don't know. When I get there I'll probably know. The algorithm I'm using is reasonable when I'm actually writing every term. The conversion cost, even when working with numbers stored using a base of 1,000,000,000 so you can easily convert 9 digits at a time, is more costly than doing the math in BCD as you add each term. At least, that's what I've experienced. With help and suggestions from many of you I can now produce the first 150000 terms in just under 10 seconds when dumping them into the Bit-Bucket. I've also verified that the output is actually correct by using md5sums to validate the huge output files. The time to write the files is variable and much more dependent on the state of the system and file system so that's why I use the Bit-Bucket for my standard elapsed time test.
Finally, enter MichaelW and his fantastic timers.asm ... I've built a little test here that produces the first 300 terms of the Fibonacci series and displays the last. Even created an ad-hoc bcd$ macro that works with the print macro like hex$ and chr$. I've really been learning quite a bit ... but I'm *still* not quite doing it *right* ... I know because it just doesn't feel right yet!
If anyone would like to have a look and help me refine this algorithm I would certainly be grateful. I know the world doesn't need yet another Fibonacci program but, for me it's the experience of making it better incrementally that's been very benificial for me. I am probably close to my own limits of being able to see new ways to approach further improvements ... at least, without suggestions or pointers. Anyway, here's the algorithm in its current state.
;****************************************************************************
; BCD addition algorithm courtesy Douglas W. Jones at University Of Iowa
; Details available at http://www.cs.uiowa.edu/~jones/bcd/bcd.html#packed
; This is my implementation of his posted C algorithm.
;****************************************************************************
; Quick and dirty implementation ... uses globals and preserves nothing!
; Called with:
; esi -> valid BCD source operands
; edi -> valid BCD destination operands
; wksize = number of dwords used in both source and destination
; ddsize = maximum number of dwords available for each
; Returns with:
; destination = source + destination
; wksize incremented if another dword was needed to store final carry
; carry clear if no errors occured
; carry flag set on error
; all registers destroyed
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AddBCD proc C
mov eax,wksize ; Number of dwords currently used
mov ecx,eax
shl eax,2
neg ecx ; Loop counts up from minus wksize
add esi,eax ; Point 1 dword past most significant
add edi,eax ; source and destination
nocry: mov eax,[esi+4*ecx] ; Start with no carry-in for least
mov ebx,[edi+4*ecx] ; significant source and destination
mov edx,eax ;
add eax,66666666h ; eax = source + excess_6
jc foul ; No carry for 6's + 9's, only F's
add eax,ebx ; eax = dst+src+excess_6
jc cry1 ; Do carry-out if set
nocry1: xor edx,ebx
xor edx,eax
not edx
and edx,11111110h ; Leave a 1 for every BCD digit
mov ebx,edx ; that did not produce a carry-out
shr edx,2
shr ebx,3
or edx,ebx
or edx,60000000h ; Leave a 6 for each digit that did
sub eax,edx ; not carry and remove excess_6
mov [edi+4*ecx],eax ; Store valid result in destination
add ecx,1 ; Count up to 0 and do each dword
jnz nocry ; Remember, no carry-in for the next
jmp short ok
cryx: add eax,ebx ; Source was 99999999 + carry-in
jmp short cry1
cry: mov eax,[esi+4*ecx]
mov ebx,[edi+4*ecx]
mov edx,eax
add eax,66666667h ; Excess_6 + carry-in
jc cryx
add eax,ebx
jnc nocry1 ; No carry-out or carry-in for next
cry1: xor edx,ebx
xor edx,eax
not edx
and edx,11111110h ; Same as above except all digits
mov ebx,edx ; can be handled with mask because
shr edx,2 ; the most significant digit DID
shr ebx,3 ; produce a carry and doesn't need
or edx,ebx ; to be adjusted to remove excess_6
sub eax,edx
mov [edi+4*ecx],eax ; Store valid result in destination
add ecx,1
jnz cry ; Do carry-in for next
mov eax,wksize ; Send carry to the next dword
cmp eax,ddsize
jae foul ; Oops! Not enough dwords available
add wksize,1 ; Bump number of dwords being used
mov dword ptr [esi],0 ; Clear next source
mov dword ptr [edi],1 ; Store the carry and return
ok: clc
ret
foul: stc
ret
AddBCD endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
And here are the results of 3 currently identical timing loops that are included in the zip.
C:\ASM\test>bcdtime
28826 AddBCD: 0222232244629420445529739893461909967206666939096499764990979600
28838 Test A: 0222232244629420445529739893461909967206666939096499764990979600
28813 Test B: 0222232244629420445529739893461909967206666939096499764990979600
Press enter to exit...
Test A and B are intened for anyone to modify and play with. Right now, they are identical to AddBCD. As you can see the timings won't be identical because I'm only looping 10000 times since the algorithm is so large. I'm looking for differences that make it up into the hundreds or thousands range ... should be possible and I'll keep working on it too. Just wanted to post it and give my little spiel about the wonders of BCD for very simplistic applications like this.
[attachment deleted by admin]