News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Large Integer Square Root optimization

Started by dsouza123, November 29, 2005, 11:40:07 PM

Previous topic - Next topic

dsouza123

Optimize !

What can be done to optimize the provided large unsigned integer square root routine ?

The routine can be scaled to handle even larger integers, as written 32 dword/128 byte/1024 bit unsigned integers.
Just by allocating bigger integers and their length holding variables.

It uses no floating point, no mul or div instructions, (well one mul after the calc to help determine if the root is from a square).

1 Million square root calculations using 1024 bit integers in 157 seconds (0:02:37) of CPU time on a 1.2 Ghz AMD Athlon.

The main code for the calculation, majority of time spent here.

;
;   Fairly close Delphi/Pascal code using a 16 bit / 2 byte number.
;   The sp filling routine is the only difference,
;   The MASM code doesn't shift, pulls in place.
;
;   dd := 65535;                        (* 16 bit number     *)
;   rm := 0;
;   rt := 0;
;   for i := 0 to 7 do                  (* 0 to (bits / 2)-1 *)
;     sp[i] := (dd shr (i*2)) and 3;    (* 2 bits at a time  *)
;
;   for i := 7 downto 0 do begin
;     rt :=  (rt shl 1);                (* rt * 2            *)
;     rm :=  (rm shl 2) + sp[i];        (* rm * 4  +  sp[i]  *)
;
;     if (rt <  rm) then begin
;       rm := rm - rt - 1;
;       rt := rt + 2;
;     end;
;   end;
;   rt := rt shr 1;                     (* rt / 2  when done *)
;

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; for k = ksp downto 0
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

  mov ebx, ksp
  mov k, ebx
  .while  k < 0ffffffffh   ; ? maybe ?   sdword ptr k >= 0

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; brt * 2
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

  mov edx, jbr
  dec edx
  mov ecx, jbr
  .repeat
     mov  eax, [brt+edx*4]
     shld [brt+ecx*4], eax, 1
     dec  edx
  .untilcxz

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; brm * 4   + bsp[k]
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

  mov edx, jbr
  dec edx
  mov ecx, jbr
  .repeat
     mov  eax, [brm+edx*4]
     shld [brm+ecx*4], eax, 2
     dec  edx
  .untilcxz
  mov ecx, k
  mov al, [bsp+ecx]
  or  byte ptr brm, al   ; brm 7..2  bsp[k] 1..0
;  add byte ptr brm, al

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; compare  brt with brm    > = <   3, 2, 1
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

  mov j, 2
  mov ecx, jbr
  inc ecx
  mov edx, jbr
  .repeat
    mov esi, [brt+edx*4]
    mov edi, [brm+edx*4]
    .if esi > edi
       mov j, 3
    .elseif esi < edi
       mov j, 1
    .endif
  .untilcxz (j != 2)

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; brm = brm - brt - 1
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

  .if eax == 1
    mov esi, 0
    stc
    .repeat
      mov eax, [brt+esi*4]
      sbb [brm+esi*4], eax
      inc esi
    .until esi > jbr
 
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; brt = brt + 2
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    add brt, 2
  .endif
 
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; endfor
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    dec k
  .endw


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; brt / 2   Square root calc almost done just scale down brt (shr 1)
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

  mov edx, 0
  mov ecx, 1
  .while edx <= jbr
     mov  eax, [brt+ecx*4]
     shrd [brt+edx*4], eax, 1
     inc  edx
     inc  ecx
  .endw



[attachment deleted by admin]

dsouza123

An optimization that reduces the size of the data evaluated in upto 30 iterations of the k loop.

After analyzing the content of brt and brm during the main loop k = ksp downto 0
found that 30 full size iterations could be replaced by ones that work with a single dword for data size.


  .if ksp < 30
     mov zsp, 0
  .else
     mov ecx, ksp
     sub ecx, 29
     mov zsp, ecx
  .endif   

;   for k := ksp downto zsp do begin
;      brt * 2
;      brm * 4 + bsp[k]
;      if brt < brm then begin
;        brm = brm - brt - 1
;        brt = brt + 2
;      end
;   end
;   brt / 2

   mov ecx, ksp
   mov k, ecx
   .while (ksp >= ecx) && (ecx >= zsp)
      shl brt, 1
      shl brm, 2
      mov al, [bsp+ecx]
      add byte ptr brm, al
      .if brt < brm
         mov edx, brt
         inc edx
         sub brm, edx
         add brt, 2
      .endif
      dec ecx
   .endw

  .if ksp < 30
     shr brt, 1
     mov k, 0ffffffffh
  .endif


For a given number of bits of bsp, for example 60,
there are 32 bits max for brt and brm.  So bits of brt,brm = bits bsp / 2 + 2.

So the first 60 bits of bsp can be replaced by the above 32 bit data size code.
If the number is 60 bits or less the full size code is skipped, if not it gets used afterwards.

I have more ideas to cut down on the size of the data evaluated in each iteration,
but haven't done any coding.
--------------------------------------------------------
A general question, what is efficient code to replace the following

hi_limit := 5000;
lo_limit := 0;
for i := hi_limit downto lo_limit begin
end;

or when the lo_limit is non zero such as
lo_limit := 10;

The best I could come up with was
.while (hi_limit >= i) && (i >= lo_limit)
  the body of the loop
  dec i
.endw


For lo_limit := 0 tried
.while (i >= 0)
  dec i
.endw

but when i == 0 and is decremented it becomes 0FFFFFFFFh
which as an unsigned dword is still >= 0.

dsouza123

Corrected a bug that ignored the result of brt < brm.
It was skipping the brm - (brt + 1) and brt + 2.

The new time was 3:35 instead of 2:37.

Added the 30 iteration optimization.
The time dropped down to 3:10.

Added code at the end that displays bdd, brt, brm in hex.

[attachment deleted by admin]

dioxin

dsouza123 ,
   what is the purpose of your code?

   Do you want your version optimising or do you just want a fast, large number square root?

   Will it need decimal digits in and out or is it part of a larger package which will do the decimal conversion elsewhere so it's just a binary square root that's needed?

   I have a rough square root program (mostly in inline ASM in PowerBASIC), binary only, which finds the 512 bit root of a 1024 bit number in 23usec  on a 1.9GHz Athlon.
   This version is not fully optimised, I think there's at least 1 significant improvement possible.

   For comparison your 3:10 code takes 1:55 on my CPU so mine runs in about 60% of the time it takes yours for the same thing.
   So, I'd expect my code to run in around 40s on yours with the same 1,000,000 iterations.

   My version doesn't set out to find integers only but can be set to give an arbitrary precision and will calculate as long as needed to get the desired accuracy.
               As it happens, if you need decimal in and decimal out then the same method I used can operate directly on decimal digits but I'm not sure if it'll be quicker to convert from decimal to binary, do the SQRT and convert back to decimal or to calculate directly in decimal.

   I think I also have some code lying around somewhere for converting large binary numbers into text so the answer can be read by a human.

   Be warned, it'll not be MASM ready if I post it (inline ASM in PowerBASIC) so if you're interested you'll need to tidy it up.

Paul.

dsouza123

Mostly interested in fast square root but
also other large number operations.

I've been working on large integer code, currently focused on sqrt,
first in delphi to get it to work and now recoded in assembly for speed.
Hope to get efficient addition, subtraction, shift, compare.
The square root routine has subtraction, shift and compare,
not sure how optimimal they are.

Picked binary unsigned integer because it is a format that the CPU/ALU
can work with natively.  Also experimented in delphi with binary ascii,
the shifting was instant, and sometimes the compare
but the subtraction was more involved.

Since I hadn't found any code for MASM, I wanted to make it available
for others who might be interested.

Having the number in binary is fine.
Converting to/from other formats is a separate issue,
which is straight forward for hex ascii, already have convert to hex,
not much harder for decimal ascii, though I am sure
there are many algorithms and optimizations for the conversions.

Decimal ascii to binary and back for large integers could be a good thread !

I am interested in seeing your large number square root code,
the more routines available gives more opportunities to learn and optimize !

dsouza123

Added :
Decimal ascii string to binary
Hex ascii string to binary
Write string to logfile.

Alternatives codes, commented out, for the various routines.
Examples table lookup for bin2hex and hex2bin.

Here is the decimal string to binary conversion.  Converts one word at a time.

; =================== decimal string szbdd4 to binary bdd

  invoke lstrlen,ADDR szbdd4      ;  309 (dec str)  0..308 + 0 byte
  mov jds, eax
  dec jds

  xor eax, eax    ; get first dec char  str(0)
  mov al, szbdd4
  and al, 15
  add bdd, eax    ; convert and store in lo dword of bdd array

  mov wcnt, 1     ; word count 0..1
  mov esi, 1      ; set index to next char  str(1)

  .while (jds >= esi) ; for each  dec char
     mov eax, esi
     and eax, 7
     .if eax == 0     ; every 8 dec chars uses 2 words
                      ; done to prevent full array width (bdd) multiply * 10
                      ; only does wcnt words
                      ; it does have some margin
                      ; really 12 dec chars = 10 hex chars = 5 bytes   hex chars * 1.2 = dec chars
                      ; this   16 dec chars = 16 hex chars = 8 bytes

        cmp wcnt, 63         ; dwords 0..31   words 0..63
        setc al
        add wcnt, eax
        add wcnt, eax
     .endif

     xor ecx, ecx     ; current dec char  to  dword carry
     mov cl, byte ptr [szbdd4+esi]
     and cl, 15

     mov edi, 0
     .repeat
       xor eax, eax                   ; bdd * 10   one word at a time
       mov ax, word ptr [bdd+edi*2]
       lea eax, [eax*4 + eax]         ; * 5
       add eax, eax                   ; * 2       ; slower shl eax, 1
       add eax, ecx                   ; + carry  (1st time is dec char)
       mov j, eax
       mov word ptr [bdd+edi*2], ax   ; store  lo word in bdd
       mov cx, word ptr j+2           ; carry  hi word
       inc edi
     .until edi > wcnt
     
     inc esi
  .endw

[attachment deleted by admin]

dsouza123

Added:
Binary to binary ascii string.

Squashed a bug that caused wrong results.
It was a missing  dec edx  in the  brt < brm routine.

Compared the results to another program, in ubasic, they agree.

[attachment deleted by admin]

dsouza123

Optimizations:
Quick two dword code.
Only use as many dwords as needed for brt*2, brm*4, compare, brm - (brt + 1).

New time 1:49.

[attachment deleted by admin]

dsouza123

Optimization:
Combined  brt*2 and brm*4 in one loop.

New time 1:30.

[attachment deleted by admin]

dioxin

#9
Dsouza123,
   sorry for the delay but I found a small error in the routine I had and I didn't have time to correct it until now.
   The attached program is in PowerBASIC but the important bits for extracting the SQRT are all inline ASM so conversion to MASM should not be too difficult. The BASIC parts are just to get the numbers in and out in the right format.

   The source code as shown calculates the square root of a 1024 bit number in around 78,000 clks on my Athlon which I reckon is over 30% faster than the numbers you are posting.
   This is NOT optimised code. I can think of at least 2 significant improvements but they would make the code more difficult for you to follow.
   One is to scan forward until the first non-zero bit is found and then begin calculating. At the moment all the leading zeroes are calculated unnecessarily.
   The other is to subtract only the DWORDs that are needed in the subtraction rather than the whole number. This value increases as the calculation progresses but in the early stages most of the bits don't take part in the calculation so time could be saved.


   As posted, the code will run from a command line with the HEX number you want as a parameter. e.g.
bigsqrt 1f234e 
   will calculate the square root of 1f234e. If no parameter is given then the 1024 bit number you used in your last posting is used instead... but because of the restriction of this methed that the top 2 bits must be 0 the number is padded to 2 DWORDs bigger, that's why it takes longer to calculate than the figure above.
   The remainder I get is identical to yours so it may well be that the methods are just slightly different implementations of the same thing. I haven't examined your code to check.


   Feel free to modify it yourself if you think it may be useful.

Paul.

[attachment deleted by admin]

dsouza123

Thanks Dioxin

Will examine it and try to convert it to MASM. 
Having multiple large number SQRT routines to pick from can come in handy. 

I can't think of any more significant optimizations for the code I've been posting,
just small items/low level, add instead of inc (not sure if it makes a difference),
removing stalls, using all six registers, using a register instead of variable where possible,
replacing if > max then max code with cmp, adc. 
Not sure how many seconds these small optimizations could remove.

I am curious what the algorithm for your SQRT code is.
The one I've used is.

Put the number in binary, grouping two bits together  ie the byte E5 becomes the 4 pairs 11 10 01 01.

root = 0
remainder = 0
Loop for each pair high to low
   root * 2
   remainder * 4 + current pair
   if root < remainder
     remainder - ( root + 1)
     root + 2
   endif
endLoop
root \ 2
if remainder == 0
  root is from a square  ie root*root == number

aaustin

If you need some help, here is karatsuba square root algorithm.

for number a3*b^3 + a2*b^2 + a1*b + a0 where b is a power of two and b > a[n] => 0 and b > a3 => b/4.


function for SqrtRem(a3*b^3 + a2*b^2 + a1*b + a0)

SqrtRem(a3*b+a2) => (S', R')
DivRem(R'*b + a1, 2*S') => (Q, U)
SqrtWholePart = S'*b + Q
SquareRootRemainder = U*b + a0 - Q^2
if SquareRootRemainder < 0 then
SquareRootRemainder += 2*SqrtWholePart - 1
SqrtWholePart-- (decrement)
end if
return(SqrtWholePart, SquareRootRemainder)

hope that helps.

Almost forgot: to do this you may need to shift the number by an even power of two. Then you can easily remove the even power of two from the answer. The reason for this is the a3 => b/4 restriction.

dioxin

   <<I am curious what the algorithm for your SQRT code is.>>

   I learned the method when I was 12 for use on a hand operated adding machine.

   It's harder to explain than to demonstrate and gives square roots quite quickly.
   It just happens that the exact same method works even better in binary.

   It goes something like this:

   Enter the number you want the root of in register 1
   Enter the lowest odd number in the highest digit in register 2.

   repeat
      Rotate handle to subtract 2 from 1
   
      If the bell rings (which signals underflow) then
         add the number back in
         reduce the current "odd digit" by 1,
         increase the next lower digit by 1
         shift register 1 left

      else
         add 2 to the subtractor
      endif

   until precision is good enough.

   The answer actually appears in a third register which counts the number of subtractions that took place for each

digit

   Let's try a simple example, sqrt(2) to 3 decimal digits

20000000   the number I want the square root of, decimal point is implied after first digit
-1      the initial subtractor, 1 in the same column as the first digit
subtract   digit1-count=1 (i.e. I've done 1 subtract so far)
10000000
-3      add 2 to the current digit in the subtractor
subtract
-20000000   Ding! Underflow, the bell rings so:
add      add it back in
10000000
-21      reduce current subtractor digit by 1 and set next lower digit to 1 (so 3000 becomes 2100)

      Now shift register1 left one place (same as shifting register 2 right 1 place)
10000000
- 21
      Since 1 subtract succeded without overflow the first digit of the result is 1

   Now repeat for the next digit:
10000000
- 21
subtract
07900000   digit2-count=1 since 1 subtract has taken place
- 23      add 2 to the current subtractor digit
subtract
05600000   digit2-count is now 2 since we've done 2 subtracts
- 25      add 2 to the current subtractor digit
subtract
03100000   digit2-count = 3
- 27      add 2 to the current subtractor digit
subtract
00400000   digit2-count = 4
- 29      add 2 to the current subtractor digit
subtract
-02500000   Ding! Underflow, the bell rings so:
add      add it back in
00400000
- 281      reduce current subtractor digit by 1 and set next lower digit to 1 (so 2900 becomes 2810)
      Now shift register1 left one place (same as shifting register 2 right 1 place)
00400000
-  281

   Since we did 4 successful subtractions on the second column, the 2nd digit of the answer is 4
   So the answer so far is 1.4, not accurate enough so continue with next digit

   Now repeat for the next digit:
0040000
-  281
subtract
0011900   digit3-count=1 since 1 subtract has taken place
-  283      add 2 to the current subtractor digit
subtract
-0016400   Ding! Underflow, the bell rings so:
add      add it back in
0011900
-  2821      reduce current subtractor digit by 1 and set next lower digit to 1 (so 2830 becomes 2821)
      Now shift register1 left one place (same as shifting register 2 right 1 place)
0011900
-   2821
   Since we did 1 successful subtractions on the 3rd column, the 3rd digit of the answer is 1
   So the answer so far is 1.41, not accurate enough so continue with next digit

   Now repeat for the next digit:
0011900
-   2821
subtract
0009079   digit4-count=1 since 1 subtract has taken place
-   2823   add 2 to the current subtractor digit
subtract
0006256   digit4-count=2 since 2 subtracts have taken place
-   2825   add 2 to the current subtractor digit
subtract
0003431   digit4-count=3 since 3 subtracts have taken place   
-   2827   add 2 to the current subtractor digit
subtract
0000606   digit4-count=4 since 4 subtracts have taken place
-   2829
Subtract
-0002223   Ding! Underflow, the bell rings so:

   at this point we could carry on to gain unlimited accuracy but we now have the 3rd decimal place we need
   the third digit is 4 so the final result is 1.414



   Like I said, it's a lot easy to have demonstrated than to explain.

   The thing is, the exact same method works in binary but, because there can NEVER be more than 1 subtraction with any

digit then each subtraction fails (and give a zero in the answer) or succedes (giving a 1 in the answer)

   You may also notice that register 2 in the above example is becoming 2.828..=sqrt(8)=2 x sqrt(2)
   This means we can actually not bother to count the number of subtractions when we do it in Binary and we can still easily extract the correct answer for register 2.


   I hope that helps!

Paul.





dsouza123

Thanks aaustin and Paul.

aaustin, I'll have to study that method further.

Paul, I've studied your code, it far smaller than i imagined, it does so much in so little code.
still have to comprehend it better, then will MASMize it.


Optimizations:
brm - (brt + 1) used more registers and partial unroll. Time 1:12.
did a suggested opcode replacement rcl instead of shld in brt*2.
New time 1:02.

[attachment deleted by admin]