News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Dear Math and FPU-Experts - help needed!

Started by cobold, August 16, 2008, 03:47:03 AM

Previous topic - Next topic

Mark_Larson

 I modified the code to get rid of the BCD.  I have an array size of 1024 bytes.  I add the ENTIRE array size together every loop ( very inefficient).  And it is only taking me 2.980164767 micros to run it and find the 2749th Fib.  I will add a check to see if both the current values are 0, and then exit the loop early, that only works since the number grows, so if both quadwords we just read from F0 and F1 are a 0, then we exit.

here is the modified code.  I changed it to doing 32-bit adds, since we need to handle the carry in a 64-bit register.

there are NO memory variables at all.  F0 and F1 are both 8192 byte arrays.
   mov qword [f0], 1 ; f0 = 1
   mov qword [f1], 1 ; f1 = 1

mov ebx,NUM_FIBS ; number of fibs = 2
outer_loop:
;yasm automatically uses OFFSET so using it will break the code
; mov esi, OFFSET f0
mov rsi,f0 ; esi = f0
; mov edi, OFFSET f1
mov rdi,f1 ; edi = f1

;   xor eax,eax ; num_digits = 0 ( we're 0 based for the #)
; mov ecx,eax ; loop_cnt = num_digits = 0 ( we're 0 based for the #)

   pxor mm7,mm7 ; carry =  0

mov eax,ARR_SIZE_BYTES
do_adds:
movd mm0, [rsi] ; f0
movd mm1, [rdi] ; f1
paddq mm0,mm7 ; add f0 + carry, 64-bit add :)
paddq mm1,mm0 ; add f0 + carry + f1
movq [rdi],mm1
movq mm7,mm1
psrlq mm7,32

add rsi,4 ; we deal with 8 bytes at a time
add rdi,4
sub eax,4
jnz do_adds


sub ebx,1 ; number of fibs -= 1
jnz outer_loop



BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

raymond

QuoteThis is not hard.

I could possibly agree with that. BUT, can your approach solve this particular problem within 1 second?
As you can see, it's not simply a matter of finding F2749.

QuoteIt turns out that F541, which contains 113 digits, is the first Fibonacci number for which the last nine digits are 1-9 pandigital (contain all the digits 1 to 9, but not necessarily in order). And F2749, which contains 575 digits, is the first Fibonacci number for which the first nine digits are 1-9 pandigital.

Given that Fk is the first Fibonacci number for which the first nine digits AND the last nine digits are 1-9 pandigital, find k.
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

raymond

Quotethere are NO memory variables at all.  F0 and F1 are both 8192 byte arrays.

I still have a problem with such a statement. If those 8192 byte arrays are not memory, what are they?
My previous quote was that my program was using less than 150 bytes of TOTAL memory, all of it being in the .data section.
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

cobold

Quote from: codewarp on August 23, 2008, 10:58:36 PM

OK, let's think outside the box...  Fibonocci calculation is just simple addition.  A String is a sequence of bytes that can be an unlimited integer if you use it that way.  All you need is to be able to ADD two strings together like big integers, corresponding bytes, propagating the carry on up.  The sum needs more space only when the top bit is set.  Using my own dynamic Strings, my calculation of fib(#2749) using this approach takes 580,000 clock cycles in C++, and results in a string result of 476 hexdigits starting with 100c38d67c9f4... and ending ...479ee63a70b0d9.  I am counting fib( ) 1, 2, 3, 4, 5, 6, as 1, 2, 3, 5, 8, 13, and so on.

codewarp - I know that - I've written code to add in binary for as many bits you want, but .....

Quote from: codewarp on August 23, 2008, 10:58:36 PM
Also, BCD representation does not add anything to this process, except the ability to read off the answer in decimal.  But you can pull 9 decimal digits at a time, from a 2000 bit integer that you divide by one billion (decimal) and take the remainder.  This is not hard.
You should test which Fibonacci-number has both it's first and last 9 DECIMAL-digits pandigital, seriously - you pull 9 decimal digits from a 2000 bit integer? With 2000 bit you can represent a decimal number of 602 digits - the solution of the problem I mentioned is a number with 68855 dec-digits. I think the code you need to determine which 2000-bit-chunk to use would require more time than starting right away with BCD's. Anyway can you please explain the background of "dividing" 2000bit binary by one billion? Do you mean the result is a binary number that has 9-decimal-digits? Sorry, but I didn't understand exactly what you meant.

So how do you test a sequence of DECIMAL digits in a BINARY number - that was the reason why I came to using BCD for this specific problem.
Also, those that think Fibonocci is uninteresting and a waste of time should find out about the Golden Section and its vast applications in nature and engineering.

Quote

Mark_Larson

Quote from: raymond on August 24, 2008, 02:22:55 AM
Quotethere are NO memory variables at all.  F0 and F1 are both 8192 byte arrays.

I still have a problem with such a statement. If those 8192 byte arrays are not memory, what are they?
My previous quote was that my program was using less than 150 bytes of TOTAL memory, all of it being in the .data section.


you are doing a malloc() for your arrays, correct?  if not you aren't going to have enough digits.  I avoided the malloc and did the array in the .data section.  That's what I meant.  if you're arrays can only be 150 bytes, then you won't be able to solve the problem.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson

Quote from: raymond on August 24, 2008, 02:13:10 AM
QuoteThis is not hard.

I could possibly agree with that. BUT, can your approach solve this particular problem within 1 second?
As you can see, it's not simply a matter of finding F2749.

I should be able to get it that fast.  Why don't we try something different.  Why don't you post the code you have, and I'll add in my stuff, since I know how it works?
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson

I fixed a few bugs, so it should be working now.  I got the "early out" also done.  And it runs a lot faster.  There are still stuff I can do to make it faster, but it's good enough for now.  I also got rid of the arrays, and now I am doing an aligned malloc.  It's 16-byte aligned.  You will need to have any memory that is allocated also aligned to a 16-byte boundary, if you want to use my code.  The new malloced size is 512k for both buffers.  If it gets bigger than 512k digits, then we can always re-allocate at runtime.  It took me 245.244360902 nanoseconds to compute fib 2749 with the early out code.  The original code took 706385.992481203 nanoseconds.  As you can see, a huge speed up.  I am still playing with different ways to do the early out code.

here is my data, and equates.  We already have the first 2 fibs calculated, so I subtract two from which FIB we are calculating.  The f0 and f1 have to be a quad word on my system due to it being 64-bit.  Yours will be a dword.  I am not going to show how I allocate the memory since it'll be confusing.  They don't use the stack anymore to pass parameters.  That is why I am staying with 64-bit linux.  It's a lot cooler than 32-bit.

%define NUM_FIBS 2749-2
%define ARR_SIZE_BYTES 512*1024
align 16
f0 dq 0
f1 dq 0


all the instructions on the FAR Left of the screen are what I added to check if we were adding two 0's together and if so exit the loop.  I have two loops.  One loop adds up the two values f0 and f1.  The second loop resets my variables when we move onto the next pair of values.  In my code, the outer loop runs NUM_FIBS times.  On your code you would keep running until you found the pandigital for both ends.



section .code


mov rsi,[f0]
mov rdi,[f1]
       mov qword [rsi], 1 ; f0 = 1
       mov qword [rdi], 1 ; f1 = 1

mov ecx,NUM_FIBS ; number of fibs = 2

outer_loop:
;in yasm you can't use the keyword 'OFFSET'.  So the below to lines are how the code would look in MASM
; mov esi, OFFSET f0
; mov edi, OFFSET f1

;not the fastest way to do this, unroll it twice.
test ecx,1
je swap
mov rsi,[f0] ; esi = f0
mov rdi,[f1] ; edi = f1
jmp done_swap
swap:
mov rsi,[f1] ; esi = f1
mov rdi,[f0] ; edi = f0
done_swap:

   pxor mm7,mm7 ; carry =  0

mov eax,ARR_SIZE_BYTES
do_adds:
movd mm0, [rsi] ; f0
movd mm1, [rdi] ; f1

paddq mm7,mm0 ; add f0 + carry, 64-bit add :)
paddq mm7,mm1 ; add f0 + carry + f1

;check to see if mm7 is a 0, if so exit the loop
movq mm6,mm7
movd [rdi],mm7
movd edx,mm7
psrlq mm6,32
movd ebx,mm6
or edx,ebx
je do_outer_loop
psrlq mm7,32

add rsi,4 ; we deal with 8 bytes at a time
add rdi,4
sub eax,4
jnz do_adds

do_outer_loop:
sub ecx,1 ; number of fibs -= 1
jnz outer_loop


BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

raymond

Quoteyou are doing a malloc() for your arrays, correct?

WRONG!!!
I said and repeat once more less than 150 bytes of TOTAL memory. I do NOT allocate any memory outside my .data section.

Quoteif you're arrays can only be 150 bytes, then you won't be able to solve the problem.

Wrong again. Those 150 bytes of TOTAL memory even include all the other variables I need. I DON'T use any LOCAL variables either and DON'T use the stack to store anything. (To be more exact, I use 108 bytes of TOTAL memory with some room to spare.)
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

Mark_Larson

  can I see your code?

EDIT: you are only saving the tail and head?
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

codewarp

Quote from: cobold on August 24, 2008, 09:14:36 AM

You should test which Fibonacci-number has both it's first and last 9 DECIMAL-digits pandigital, seriously - you pull 9 decimal digits from a 2000 bit integer? With 2000 bit you can represent a decimal number of 602 digits - the solution of the problem I mentioned is a number with 68855 dec-digits. I think the code you need to determine which 2000-bit-chunk to use would require more time than starting right away with BCD's. Anyway can you please explain the background of "dividing" 2000bit binary by one billion? Do you mean the result is a binary number that has 9-decimal-digits? Sorry, but I didn't understand exactly what you meant.

So how do you test a sequence of DECIMAL digits in a BINARY number - that was the reason why I came to using BCD for this specific problem.

Ok, fair enough, I missed the reference to the decimal requirement.  But for that, I would just change the addition into large integer representation where each successive 32-bits holds values 0 to 999,999,999.  Addition with that is only slightly more complicated by the "carry" propagation, occurring when any of the 9-digit counters wraps.  But the test for leading and trailing 9 digits becomes rather simple, in comparison to the binary case, as you pointed out.  Extracting 9 digits out of 32-bits is certainly much easier than the load of computing the entire thing in bcd.

Also, it seems to me that limiting such algorithms to 150 bytes might be some kind of line in the sand, but it also leads to methods that are unscalable to arbitrary numerical complexity.

raymond

QuoteEDIT: you are only saving the tail and head?

I was sure you would finally solve the enigma. It's too late tonight but I will provide more explanations tomorrow night (I have a busy day tomorrow). Meanwhile, here's my entire data section.

.data
      tailsrc     dd    0
      taildes     dd    0
      headsrc     dd    0
      headdes     dd    0
                  db    8 dup(0)
      headbuf     db    24 dup(0)
      headbuf1    db    24 dup(0)
      tailbuf     db    1,11 dup(0)
      tailbuf1    db    1,11 dup(0)
      fibnum      dd    2
      time0       dd    0
      bytecnt     dd    1


I would guess that the melodrama has lasted long enough. In a few days, I will also divulge the details of my improved program (0.22 sec.).
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

dsouza123

Figured out a way, practically instant. :bg

Roughly 250 bytes memory used for variables, not optimized.

This is just the first9 not integrated with the last9,
for a day I have been working on a version with both together
but it has some elusive bug.

So I have attached the streamlined code with only the first9 test.

[attachment deleted by admin]

dsouza123

A more advanced version of the instant code.

This one calculates both first9 and last9.

It shows both values, but it still does exactly the number of
iterations k needed for both pandigital.

The first9 and last9 digits are correct, they matched the values derived
from a previous executable that takes 21 seconds.

The next version will be open ended only stopping
when both first9 and last9 pass the pandigital tests.

[attachment deleted by admin]

raymond

codewarp mentionned to "think outside the box".

Although Fibonacci numbers may contain an infinite number of digits, the problem specifies to find the first one where the last 9 digits (least significant) are pandigital AND the first 9 digits (most significant) are also pandigital. Therefore, there is no necessity to know ALL the digits in between.

The last 9 digits (least significant) are easy to compute in numerous ways. It's evident that adding binary numbers would be the fastest. However, each number must also be checked for pandigitality. Converting a 9-digit binary number to a decimal string may thus be significantly slower than using the slower BCDs for the additions of at most 9 digits (unless someone knows how to test for pandigitality without converting to decimals).

Adding packed BCDs is also faster than adding the same number of digits using unpacked BCDs. However, unless a fast algo is developed to check for pandigitality of packed BCDs, the unpacking process may exceed the time savings of the additions.

As for extracting the first 9 decimal digits of Fibonacci numbers generated from binary additions, I am still waiting to see valid code which will not need to compute the entire number. In my original program, I retained only the most significant 18 digits. When the addition overflowed to 19 digits, I would check the value of the least significant of those digits (of both numbers) and use it to round the remaining digits before dropping those last digits. (N.B. For this specific problem, retaining 15 digits is the minimum if rounding is done properly.)

Finally, there is no need to check the pandigitality of both the first and last 9 digits of each Fibonacci number. If one of them is not pandigital, no use checking the other.

As for my data section,
- The tailsrc/taildes are the addresses of the source and destination buffers for the last 9 digits. The content of the source is added to the content of the destination, overwriting the latter. The addresses are interchanged before computing each Fibonacci number. (The same applies for the headscr/headdes variables related to the first 9 digits.)
- The headbuf/headbuf1 buffers are for computing the first 9 digits using up to 19 digits. One of them becomes the source, the other one becomes the destination. The tailbuf/tailbuf1 buffers are for computing the last 9 digits. (The extra declared bytes are mainly for alignment and/or simplified code.)
- fibnum is the Fibonacci number of the data currently in the destination buffers.
- dest0 is to hold the TickCount at the start of the program.
- bytecnt is to hold the digit count for the source buffer of the last 9 digits until it exceeds 9 (a copy of the digits is then made to the buffers for the first 9 digits), and then to hold the digit count for the source buffer of the first 9 digits until the end of the program.

The improved program in a few days!!!
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

dsouza123

Fully working instant version, fraction of a second.

Tests for both first9, last9 pandigital,
both needed for loop exit condition.

Output is three Messageboxes showing first9, last9, k.

Source and executable in FibB06_g.zip

[attachment deleted by admin]