
extra large integer division

Started by Geryon, January 03, 2005, 11:04:42 AM


Candy

I was thinking you might be able to take off several bits at a time if you precompute a certain amount.

You split up the source into pieces of K bits each (K ~= log2(sqrt(N))). For all possible bit patterns a piece of K bits can assume (2^K of them), you precalculate which multiple of the divisor you can subtract (a number between 0 and 2^K preferred).

For each set of K bits in the source, shift your index right until you hit a 1, add to the quotient the amount you can at least subtract, subtract the amount you actually removed (precise calculation), and continue parsing. This modified method should run in order sqrt(N) * 2^sqrt(N), which imho should be faster than most methods mentioned up to now. It's not memory efficient though: for K=4 (the probable end solution) it's quite a memory hog, with 16 precalculated values between 0 and 16, each with a precise multiple of the divisor (128-bit) attached. You could keep each in a separate array, though I doubt it'd be cache friendly.

HTH, Candy

gluespill

Polizei,

Quote: Factorial is just multiplying ...
I did not say factorial, I said factoring.
A factorial, like 3!, is 3 * 2 * 1, and does indeed involve only multiplication.
Factoring, on the other hand, requires division.
I was trying to answer your post in which you asked:
Quote: Well guys, what's the idea of dividing 128bit values or dividing by 128bit bla-bla-bla...
I was just giving you an example of why I would want to divide by an extra large integer.
I have other reasons if you're interested, or, if you like, I could explain what factoring is.

And it would be difficult to answer your question appropriately without knowing whether you are asking:
1) why someone would have any practical reason for dividing by 128 or more bits, or
2) why one would attempt to do so on a computer that cannot hold numbers that large in a single register?
Or are you hinting that
3) you know of some method that can be employed to divide large numbers without using a CPU, or perhaps a method that does not involve division but can arrive at the same answer as a division operation?

Thanks for the insight on multiplying 64b*64b, and then dividing the 128b result by a 64b divisor.  But my personal interests lie in dividing by up to 17 million bits with full precision.

Cheers.




gluespill

Candy,

Your idea sounds similar to a method to which I have given considerable thought.
Is your first step critical, the one where you calculate the approximate value of K?
If so, how would you perform the square root without already having an algorithm capable of dividing large integers?
You are also performing a logarithm calculation in your initial step.  Are you just looking the log up in a small lookup table?
When you refer to the 'source', are you referring to the dividend or to the divisor?

I know it can be difficult to get an entire idea or concept into a single post, at least for me.  But I must say your proposal looks interesting.

laytuhr


Candy

Quote from: gluespill on February 03, 2005, 12:14:31 AM
Your idea sounds similar to a method to which I have given considerable thought. 
Is your first step critical, the one where you calculate the approximate value of K?
That one is the step that makes it differ from a simple linear approach (which would take about 128 steps). With K=4, this one takes 16 (2^K) steps of precalculation and 32 (128/K) steps of iteration.
Quote
If so, how would you perform the square root without already having an algorithm capable of dividing large integers?
You can decide the optimal value of K in advance. Only for inputs with lots of binary zeroes at the start (i.e., actually smaller integers) would a smaller K be useful. But then again, it wouldn't make much of a difference either way, since this is about large integer division; I'm expecting him not to use it for small integers, since he asked about large ones explicitly.
Quote
You are also performing a logarithm calculation in your initial step.  Are you just looking up the log in a small lookup table?
It's an initial step you still take on paper, since you decide the value of K in advance for faster code. Especially since K=4 is optimal (more or less) for 64 and 128 bits, with K=4 possibly doing slightly better than K=5 even on 256-bit values.

The running time of the algorithm is (with K = the value decided upon, N = the number of bits) 2^K * precalc time + N/K * division time. The concept of the algorithm is that by choosing a small K (explicitly small) you can cut the division time to a fraction 1/K while not adding much precalc time. If you have a constant divisor you can optimize for it in a similar way to most encryption routines: you "key schedule" the divisor, and can then feed any dividend to it to be divided in 1/K of the original time. Of course, if you are going to do lots (and I mean lots, like 2^12 or more) of divisions with the same divisor, you can precalculate with a larger K for quicker runs. Don't take K > 8, though, or the precalc step will quite probably outrun the division step.
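To illustrate that formula, here is a toy cost model (my assumption: one precalc step and one division step each cost a single unit; with those units the minimum for N = 128 indeed falls at K = 4, matching the post):

```c
/* Toy model of the running time above: 2^K precalc steps plus N/K
   division steps, with assumed (illustrative) unit costs for both. */
double total_cost(int K, int N) {
    const double t_pre = 1.0, t_div = 1.0;  /* assumed per-step costs */
    return (double)(1 << K) * t_pre + (double)N / K * t_div;
}
```

For N = 128 this gives 130, 68, ~50.7, 48, ~57.6 for K = 1..5, so under these assumptions the total is smallest at K = 4.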

Quote
When you refer to the 'Source', are you refering to the dividend or to the divisor?

You can replace 'source' with 'dividend'. English isn't my native language, so occasionally I pick the wrong word.

Candy

in an example:

first step:
a97093b7 /03a8238a987ac8b6e \ 4
           2a5c24edc       
           ----------
           102613bbc7ac8b6e

second step:

a97093b7 /03a8238a987ac8b6e \ 58
           2a5c24edc       
           ----------
           102613bbc7ac8b6e
           0fe28dd928
           ----------
            04385e29fac8b6e

Total completion:
a97093b7 /03a8238a987ac8b6e \ 58660494 leaving 0d5e49a2
           2a5c24edc       
           ----------
           102613bbc7ac8b6e
           0fe28dd928
           ----------
            04385e29fac8b6e
            03f8a3764a
            ----------
             03fbab3b0c8b6e
             02a5c24edc
             ----------
              155e8ec308b6e
               1484a1e329
              ----------
               0d9ecdfdfb6e
               0c935af695
               ----------
                10b7307666e
                0fe28dd928
                ----------
                 0d4a29d3ee
                 0c935af695
                  ----------
                 00b6cedd59 -> end of algorithm. See if there's a last fit
                 00a97093b7
                 ----------
                  00d5e49a2

Table used for computing it:
0000000000   0 00
00a97093b7   1 01
01fc51bb25   3 02
02a5c24edc   4 03
03f8a3764a   6 04
04a2140a01   7 05
05f4f5316f   9 06
069e65c526   a 07
07f146ec94   c 08
089ab7804b   d 09
09ed98a7b9   f 0a
0a97093b70  10 0b
0be9ea62de  12 0c
0c935af695  13 0d
0de63c1e03  15 0e
0e8facb1ba  16 0f
0fe28dd928  18 10
108bfe6cdf  19 11
11dedf944d  1b 12
1288502804  1c 13
13db314f72  1e 14
1484a1e329  1f 15
15d7830a97  21 16
1680f39e4e  22 17
17d3d4c5bc  24 18
...


In each step, it takes the first two nibbles of the previous result and looks up the row corresponding to them in the table (the third value in the table is the index). It subtracts the first value of that row from the previous result, shifted left to match up with the value, then shifts the result left 4 bits and adds the second value in the row to it. It does this until it cannot continue because the shift is 0. You can apply this algorithm to larger integers in the same way. If the quotient is low, the algorithm is slightly slower than the plain subtracting algorithm, yet if it is larger, it is substantially faster.
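As an illustration of the same K-bits-at-a-time idea, here is my own simplified C sketch: K = 4, a 64-bit dividend, a 32-bit divisor, and a plain table of the 16 multiples 0*d .. 15*d instead of the offset table above (so it shows the spirit of the method, not Candy's exact table layout):

```c
#include <stdint.h>

/* Divide a 64-bit dividend by a nonzero 32-bit divisor, K = 4 bits per
   step, using a precalculated table of the 16 multiples 0*d .. 15*d. */
void div64by32_k4(uint64_t dividend, uint32_t divisor,
                  uint64_t *quot, uint32_t *rem) {
    uint64_t mult[16];                      /* the 2^K precalculated multiples */
    for (int m = 0; m < 16; m++)
        mult[m] = (uint64_t)m * divisor;

    uint64_t r = 0, q = 0;
    for (int step = 0; step < 16; step++) { /* N/K = 64/4 = 16 steps */
        /* shift the next K dividend bits into the running remainder */
        r = (r << 4) | ((dividend >> (60 - 4 * step)) & 0xF);
        /* find the largest multiple that still fits (linear scan for clarity) */
        int m = 15;
        while (mult[m] > r)
            m--;
        r -= mult[m];                       /* one lookup + one subtract */
        q = (q << 4) | (uint64_t)m;         /* m is the next quotient nibble */
    }
    *quot = q;
    *rem = (uint32_t)r;
}
```

The linear scan is where Candy's table pays off: indexing directly on the top bits of the remainder replaces the search with a single lookup.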

The code used to generate the table (in C, sorry dudes) :

#include <stdio.h>

int main() {
   int i, k = 0;              /* k counts how many divisors fit so far */
   unsigned long long t = 0;  /* t is the current multiple of the divisor */
   for (i = 0; i < 32; i++) {
      /* advance until bits 32..36 of the next multiple reach index i */
      while ((((t + 0xA97093B7ULL) >> 32) & 0x1F) != i) {
         t += 0xA97093B7ULL;
         k++;
      }
      printf("%02x%08x  %02x %02x\n", (unsigned int)((t >> 32) & 0xFF),
             (unsigned int)(t & 0xFFFFFFFF), k, i);
   }
   return 0;
}


Additional sidenotes:

The table for the computation must hold at least 2 * 2^K entries, because of edge cases and because it doesn't always pick the optimal step, just the quickest one to decide upon. It might leave a 1 at the start of the next sequence.

It requires the divisor to have a 1 in the top bit it uses for dividing. You can make it start at any bit, so it is not a really big problem, but it is something to handle.
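A sketch of that normalization requirement (my illustration for a 32-bit divisor: count how far to shift left so the top bit becomes 1; the same shift would be applied to the dividend and undone on the remainder afterwards):

```c
#include <stdint.h>

/* Number of left shifts needed to bring a nonzero 32-bit divisor's
   highest set bit into bit 31, as the table method wants. */
unsigned norm_shift32(uint32_t d) {
    unsigned s = 0;
    while (!(d & 0x80000000u)) {  /* assumes d != 0 */
        d <<= 1;
        s++;
    }
    return s;
}
```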

I expect calculating the factors to actually take slightly longer than the division itself, since factor creation adds the divisor once or twice per entry, while the division step is a table lookup and a single subtract. The time complexity is 1.5 * 2^K * time(factor_create) + N/K * time(factor_subtract).

Hope it's any good in actual use :)

Note, at the end because of the edge cases a single factor may still be left in. Please do check for it (the example also has it).

[edit] I can't do maths. Forgot to add an E somewhere :)[/edit]
[later edit] and I can't even do proper algorithm design. Made an off-by-8 error. [/later edit]

Hugge

Hi guys...

I'm not following your discussion (don't have time to read it all), but here's a very good paper (pdf) describing a division algorithm. My implementation of it is many times faster than my multiplication algo (Karatsuba, divide and conquer).

Hugge

[attachment deleted by admin]

Candy

Quote from: Hugge on February 05, 2005, 12:35:07 PM
Hi guys...

I'm not following your discussion (don't have time to read it all), but here's a very good paper (pdf) describing a division algorithm. My implementation of it is many times faster than my multiplication algo (Karatsuba, divide and conquer).

Hugge

It's identical aside from two speedups in my algorithm:

1. You don't make an accurate estimate. Instead, you "guess" that you can fit a precomputed amount into the quotient, such that you reduce the first non-zero digit to either 1 or 0. You then proceed to the next digit, which saves you from doing an incremental compare on that digit (you don't increase your guess, you just live with an inaccurate one). This also allows the algorithm to modify the quotient by up to 2 digits, or more in case of overflow, when it is increased.
2. Since you don't have to do this, and calculate K bits at a time, you can also use any radix you like. However, you have to recompute the precalculated table each time you switch divisor; if you switch with each division, that's impractical. Since in most cases (except this particular factorization case) you want the remainder rather than the quotient, it's still useful. Most cases from my point of view, that is, which is mainly encryption and hashing of long numbers.

You can invert these two changes to make it faster for constantly switching divisors, should you find that precomputing the table is impractical for any K > 1. That would give you back the algorithm as described in the paper.

I'm off implementing it in a C++ library for myself :)

bitRAKE

How might modular division combined with Chinese Remainder Theorem perform?

windog

Hi everyone,
      A bit late to this thread, but I have some assembly code I found a while back that divides a 128-bit number by a 64-bit number and leaves the results in the registers. The code is below; you'll have to do a little work to use it in MASM.
The WinDog

PROCEDURE CpuDiv128
;In:
;   edx:ecx:ebx:eax = 128-bit dividend
;   edi:esi = 64-bit divisor
;Out:
;  ebx:eax = quotient
;  edx:ecx = remainder
;Modifies:
;  ebp
;Note: the top 64 bits of the dividend (edx:ecx) must be smaller than
;  the divisor, or the quotient will not fit in 64 bits.
    mov     ebp,64          ; 64 bits to process

CpuDiv128_l1:
    add     eax,eax         ; shift left edx:ecx:ebx:eax 1 bit
    adc     ebx,ebx
    adc     ecx,ecx
    adc     edx,edx
    jc      CpuDiv128_l3    ; bit 64 shifted out: divisor certainly fits
    inc     eax             ; tentative quotient bit
    sub     ecx,esi         ; edx:ecx -= edi:esi
    sbb     edx,edi
    jnb     CpuDiv128_l2
    dec     eax             ; didn't fit: clear the quotient bit
    add     ecx,esi         ; edx:ecx += edi:esi (to restore)
    adc     edx,edi
    jmp     CpuDiv128_l2

CpuDiv128_l3:
    inc     eax             ; subtract unconditionally
    sub     ecx,esi
    sbb     edx,edi

CpuDiv128_l2:
    dec     ebp
    jnz     CpuDiv128_l1
    ret
ENDPROCEDURE
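For reference, here is my C sketch of the same restoring loop (like the routine above, it assumes the top 64 bits of the dividend are smaller than the divisor so the quotient fits in 64 bits; note the explicit carry check, needed when the divisor has its top bit set):

```c
#include <stdint.h>

/* Restoring division: 128-bit dividend (hi:lo) / 64-bit divisor,
   one quotient bit per pass, mirroring the register algorithm above.
   Precondition: hi < divisor, so the quotient fits in 64 bits. */
void div128by64(uint64_t hi, uint64_t lo, uint64_t divisor,
                uint64_t *quot, uint64_t *rem) {
    uint64_t q = 0, r = hi;          /* edx:ecx starts as the top half */
    for (int i = 63; i >= 0; i--) {
        unsigned carry = (unsigned)(r >> 63); /* bit shifted out of r */
        r = (r << 1) | ((lo >> i) & 1);       /* shift next dividend bit in */
        q <<= 1;
        if (carry || r >= divisor) { /* subtract whenever it must fit */
            r -= divisor;            /* wrapping subtract is correct here */
            q |= 1;
        }
    }
    *quot = q;
    *rem = r;
}
```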

AeroASM

Quote from: gluespill on February 02, 2005, 11:52:14 PM
Polizei,

But my personal interests lie in dividing by up to 17 million bits with full precision.


Are you prime-searching or cracking RSA? ( :wink )

gluespill

AeroASM,
I am prime searching for the most part.  How did you guess?  You must be doing some yourself.

As for cracking RSA, no, I'm not trying to.  But a factoring algo that could factor, say, 20,000 decimal digit (67,000 bit) numbers in a matter of hours or less would certainly put a hurt on some of the fairly recent encryption routines.  I'm constantly trying to come up with better and faster ways of factoring.  Right now it looks like parallel processing on a very large scale is the only feasible way of factoring numbers greater than 1024 bits in any reasonable amount of time.  Must be why the US govt requires that networks of this type be registered as 'supercomputers.'

I like to solve, not crack.  I'll leave the cracking to those with the interest.  And even though RSA will eventually melt under a good algo, there are already better encryption routines in use.  I think the encryptors are always trying to stay a step ahead of the crackers.

windog

Hi gluespill,
     Noticed you live in Grand Rapids, MI.  That's just a stone's throw away from where I live, Holland, MI.
How long have you lived there?

AeroASM

Quote from: gluespill on February 26, 2005, 05:06:34 AM
AeroASM,
I am prime searching for the most part. How did you guess? You must be doing some yourself.

Nope, I had a fanatical ambition a few months ago (when I first got the hang of asm) to find the next Mersenne prime, then I realised it would probably blow my laptop out of this world.

About RSA: factoring 20,000 decimal digits is hoping for a bit too much, especially in under a year. The really important world bank transactions only use 500 digits (I think). However, the fastest algorithm to factor a random 100-digit number ran in 15 seconds on the Cray.

Candy

Quote from: windog on March 01, 2005, 09:19:58 PM
That's just a stones through away from where I live, Holland, MI.
*waves* Greetings from Holland, Europe :)

aaustin

I've been working on a factoring algorithm. I've been programming a modulus function for use with a GCD function and I'm trying to speed it up with P4 code.

I have a better idea for precalculating: you can simply use division itself to approximate a certain-size multiplier. For instance, say you wish to divide really large numbers by each other using standard 32-bit division. You start by taking the high 32 bits of the divisor, then take the high 64 or 63 bits of the dividend, depending on whether or not the high dword of the dividend is larger than the high dword of the divisor. You then increment the divisor's high 32 bits by 1 (and if that wraps to zero, replace the division with a 32-bit shift); this ensures that the multiplier times the divisor is not more than the dividend. Then you divide the 63/64-bit dividend by the 32-bit divisor, and the result is a 32-bit multiplier you can use to save time, multiplying the full divisor by it before subtracting.

The divisor should initially be placed (virtually) shifted so that its highest bit is at the same position as the highest bit of the low dword of the 63/64-bit dividend. You can also align the dividend and divisor before any operations begin to avoid excessive shifting, or avoid shifting completely by simply taking the top two unaligned dwords of the dividend as the 63/64-bit dividend. That removes the need for shifting entirely, but is only efficient for half the approximation size (i.e. 32/2 = 16 bits at a time on average), and the unshifted high dword must be less than the divisor to avoid a division exception.

I'm trying to use P4 code, and I'm trying to see if extended-precision FP can do a full 64-bit division, as this would speed things up even more. Unfortunately I'm still looking for my CD with the Intel help files on it in a stack of about 100 unlabeled CDs, so I'm not sure if FP would be usable or not.
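The "increment the truncated divisor by 1 so the multiplier can't overshoot" trick can be sketched like this (my illustration, collapsed to a single 64-bit remainder and divisor; estimate_quotient is a hypothetical helper name, not from the post):

```c
#include <stdint.h>

/* Under-estimate floor(r / d) using only the top 32 bits of the divisor.
   Rounding the truncated divisor up by 1 guarantees the estimate never
   exceeds the true quotient, so multiply-and-subtract can't go negative. */
uint64_t estimate_quotient(uint64_t r, uint64_t d) {
    uint64_t dtop = (d >> 32) + 1;  /* divisor rounded up at 32 bits */
    return (r >> 32) / dtop;        /* one cheap division gives the guess */
}
```

The estimate may fall short of the true quotient, which an outer loop would make up on later passes; the point is that it never overshoots, so no restore step is needed.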