Optimize !
What can be done to optimize the provided large unsigned integer square root routine ?
The routine can be scaled to handle even larger integers, as written 32 dword/128 byte/1024 bit unsigned integers.
Just by allocating bigger integers and their length holding variables.
It uses no floating point, no mul or div instructions, (well one mul after the calc to help determine if the root is from a square).
1 Million square root calculations using 1024 bit integers in 157 seconds (0:02:37) of CPU time on a 1.2 Ghz AMD Athlon.
The main code for the calculation, majority of time spent here.
;
; Fairly close Delphi/Pascal code using a 16 bit / 2 byte number.
; The sp filling routine is the only difference,
; The MASM code doesn't shift, pulls in place.
;
; dd := 65535; (* 16 bit number *)
; rm := 0;
; rt := 0;
; for i := 0 to 7 do (* 0 to (bits / 2)-1 *)
; sp[i] := (dd shr (i*2)) and 3; (* 2 bits at a time *)
;
; for i := 7 downto 0 do begin
; rt := (rt shl 1); (* rt * 2 *)
; rm := (rm shl 2) + sp[i]; (* rm * 4 + sp[i] *)
;
; if (rt < rm) then begin
; rm := rm - rt - 1;
; rt := rt + 2;
; end;
; end;
; rt := rt shr 1; (* rt / 2 when done *)
;
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; for k = ksp downto 0
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov ebx, ksp
mov k, ebx
.while k < 0ffffffffh ; ? maybe ? sdword ptr k >= 0
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; brt * 2
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov edx, jbr
dec edx
mov ecx, jbr
.repeat
mov eax, [brt+edx*4]
shld [brt+ecx*4], eax, 1
dec edx
.untilcxz
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; brm * 4 + bsp[k]
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov edx, jbr
dec edx
mov ecx, jbr
.repeat
mov eax, [brm+edx*4]
shld [brm+ecx*4], eax, 2
dec edx
.untilcxz
mov ecx, k
mov al, [bsp+ecx]
or byte ptr brm, al ; brm 7..2 bsp[k] 1..0
; add byte ptr brm, al
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; compare brt with brm > = < 3, 2, 1
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov j, 2
mov ecx, jbr
inc ecx
mov edx, jbr
.repeat
mov esi, [brt+edx*4]
mov edi, [brm+edx*4]
.if esi > edi
mov j, 3
.elseif esi < edi
mov j, 1
.endif
.untilcxz (j != 2)
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; brm = brm - brt - 1
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.if eax == 1
mov esi, 0
stc
.repeat
mov eax, [brt+esi*4]
sbb [brm+esi*4], eax
inc esi
.until esi > jbr
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; brt = brt + 2
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
add brt, 2
.endif
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; endfor
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
dec k
.endw
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; brt / 2 Square root calc almost done just scale down brt (shr 1)
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov edx, 0
mov ecx, 1
.while edx <= jbr
mov eax, [brt+ecx*4]
shrd [brt+edx*4], eax, 1
inc edx
inc ecx
.endw
[attachment deleted by admin]
An optimization that reduces the size of the data evaluated in upto 30 iterations of the k loop.
After analyzing the content of brt and brm during the main loop k = ksp downto 0
found that 30 full size iterations could be replaced by ones that work with a single dword for data size.
.if ksp < 30
mov zsp, 0
.else
mov ecx, ksp
sub ecx, 29
mov zsp, ecx
.endif
; for k := ksp downto zsp do begin
; brt * 2
; brm * 4 + bsp[k]
; if brt < brm then begin
; brm = brm - brt - 1
; brt = brt + 2
; end
; end
; brt / 2
mov ecx, ksp
mov k, ecx
.while (ksp >= ecx) && (ecx >= zsp)
shl brt, 1
shl brm, 2
mov al, [bsp+ecx]
add byte ptr brm, al
.if brt < brm
mov edx, brt
inc edx
sub brm, edx
add brt, 2
.endif
dec ecx
.endw
.if ksp < 30
shr brt, 1
mov k, 0ffffffffh
.endif
For a given number of bits of bsp, for example 60,
there are 32 bits max for brt and brm. So bits of brt,brm = bits bsp / 2 + 2.
So the first 60 bits of bsp can be replaced by the above 32 bit data size code.
If the number is 60 bits or less the full size code is skipped, if not it gets used afterwards.
I have more ideas to cut down on the size of the data evaluated in each iteration,
but haven't done any coding.
--------------------------------------------------------
A general question, what is efficient code to replace the following
hi_limit := 5000;
lo_limit := 0;
for i := hi_limit downto lo_limit begin
end;
or when the lo_limit is non zero such as
lo_limit := 10;
The best I could come up with was
.while (hi_limit >= i) && (i >= lo_limit)
the body of the loop
dec i
.endw
For lo_limit := 0 tried
.while (i >= 0)
dec i
.endw
but when i == 0 and is decremented it becomes 0FFFFFFFFh
which as an unsigned dword is still >= 0.
Corrected a bug that ignored the result of brt < brm.
It was skipping the brm - (brt + 1) and brt + 2.
The new time was 3:35 instead of 2:37.
Added the 30 iteration optimization.
The time dropped down to 3:10.
Added code at the end that displays bdd, brt, brm in hex.
[attachment deleted by admin]
dsouza123 ,
what is the purpose of your code?
Do you want your version optimising or do you just want a fast, large number square root?
Will it need decimal digits in and out or is it part of a larger package which will do the decimal conversion elsewhere so it's just a binary square root that's needed?
I have a rough square root program (mostly in inline ASM in PowerBASIC), binary only, which finds the 512 bit root of a 1024 bit number in 23usec on a 1.9GHz Athlon.
This version is not fully optimised, I think there's at least 1 significant improvement possible.
For comparison your 3:10 code takes 1:55 on my CPU so mine runs in about 60% of the time it takes yours for the same thing.
So, I'd expect my code to run in around 40s on yours with the same 1,000,000 iterations.
My version doesn't set out to find integers only but can be set to give an arbitrary precision and will calculate as long as needed to get the desired accuracy.
As it happens, if you need decimal in and decimal out then the same method I used can operate directly on decimal digits but I'm not sure if it'll be quicker to convert from decimal to binary, do the SQRT and convert back to decimal or to calculate directly in decimal.
I think I also have some code lying around somewhere for converting large binary numbers into text so the answer can be read by a human.
Be warned, it'll not be MASM ready if I post it (inline ASM in PowerBASIC) so if you're interested you'll need to tidy it up.
Paul.
Mostly interested in fast square root but
also other large number operations.
I've been working on large integer code, currently focused on sqrt,
first in delphi to get it to work and now recoded in assembly for speed.
Hope to get efficient addition, subtraction, shift, compare.
The square root routine has subtraction, shift and compare,
not sure how optimimal they are.
Picked binary unsigned integer because it is a format that the CPU/ALU
can work with natively. Also experimented in delphi with binary ascii,
the shifting was instant, and sometimes the compare
but the subtraction was more involved.
Since I hadn't found any code for MASM, I wanted to make it available
for others who might be interested.
Having the number in binary is fine.
Converting to/from other formats is a separate issue,
which is straight forward for hex ascii, already have convert to hex,
not much harder for decimal ascii, though I am sure
there are many algorithms and optimizations for the conversions.
Decimal ascii to binary and back for large integers could be a good thread !
I am interested in seeing your large number square root code,
the more routines available gives more opportunities to learn and optimize !
Added :
Decimal ascii string to binary
Hex ascii string to binary
Write string to logfile.
Alternatives codes, commented out, for the various routines.
Examples table lookup for bin2hex and hex2bin.
Here is the decimal string to binary conversion. Converts one word at a time.
; =================== decimal string szbdd4 to binary bdd
invoke lstrlen,ADDR szbdd4 ; 309 (dec str) 0..308 + 0 byte
mov jds, eax
dec jds
xor eax, eax ; get first dec char str(0)
mov al, szbdd4
and al, 15
add bdd, eax ; convert and store in lo dword of bdd array
mov wcnt, 1 ; word count 0..1
mov esi, 1 ; set index to next char str(1)
.while (jds >= esi) ; for each dec char
mov eax, esi
and eax, 7
.if eax == 0 ; every 8 dec chars uses 2 words
; done to prevent full array width (bdd) multiply * 10
; only does wcnt words
; it does have some margin
; really 12 dec chars = 10 hex chars = 5 bytes hex chars * 1.2 = dec chars
; this 16 dec chars = 16 hex chars = 8 bytes
cmp wcnt, 63 ; dwords 0..31 words 0..63
setc al
add wcnt, eax
add wcnt, eax
.endif
xor ecx, ecx ; current dec char to dword carry
mov cl, byte ptr [szbdd4+esi]
and cl, 15
mov edi, 0
.repeat
xor eax, eax ; bdd * 10 one word at a time
mov ax, word ptr [bdd+edi*2]
lea eax, [eax*4 + eax] ; * 5
add eax, eax ; * 2 ; slower shl eax, 1
add eax, ecx ; + carry (1st time is dec char)
mov j, eax
mov word ptr [bdd+edi*2], ax ; store lo word in bdd
mov cx, word ptr j+2 ; carry hi word
inc edi
.until edi > wcnt
inc esi
.endw
[attachment deleted by admin]
Added:
Binary to binary ascii string.
Squashed a bug that caused wrong results.
It was a missing dec edx in the brt < brm routine.
Compared the results to another program, in ubasic, they agree.
[attachment deleted by admin]
Optimizations:
Quick two dword code.
Only use as many dwords as needed for brt*2, brm*4, compare, brm - (brt + 1).
New time 1:49.
[attachment deleted by admin]
Optimization:
Combined brt*2 and brm*4 in one loop.
New time 1:30.
[attachment deleted by admin]
Dsouza123,
sorry for the delay but I found a small error in the routine I had and I didn't have time to correct it until now.
The attached program is in PowerBASIC but the important bits for extracting the SQRT are all inline ASM so conversion to MASM should not be too difficult. The BASIC parts are just to get the numbers in and out in the right format.
The source code as shown calculates the square root of a 1024 bit number in around 78,000 clks on my Athlon which I reckon is over 30% faster than the numbers you are posting.
This is NOT optimised code. I can think of at least 2 significant improvements but they would make the code more difficult for you to follow.
One is to scan forward until the first non-zero bit is found and then begin calculating. At the moment all the leading zeroes are calculated unnecessarily.
The other is to subtract only the DWORDs that are needed in the subtraction rather than the whole number. This value increases as the calculation progresses but in the early stages most of the bits don't take part in the calculation so time could be saved.
As posted, the code will run from a command line with the HEX number you want as a parameter. e.g.
bigsqrt 1f234e
will calculate the square root of 1f234e. If no parameter is given then the 1024 bit number you used in your last posting is used instead... but because of the restriction of this methed that the top 2 bits must be 0 the number is padded to 2 DWORDs bigger, that's why it takes longer to calculate than the figure above.
The remainder I get is identical to yours so it may well be that the methods are just slightly different implementations of the same thing. I haven't examined your code to check.
Feel free to modify it yourself if you think it may be useful.
Paul.
[attachment deleted by admin]
Thanks Dioxin
Will examine it and try to convert it to MASM.
Having multiple large number SQRT routines to pick from can come in handy.
I can't think of any more significant optimizations for the code I've been posting,
just small items/low level, add instead of inc (not sure if it makes a difference),
removing stalls, using all six registers, using a register instead of variable where possible,
replacing if > max then max code with cmp, adc.
Not sure how many seconds these small optimizations could remove.
I am curious what the algorithm for your SQRT code is.
The one I've used is.
Put the number in binary, grouping two bits together ie the byte E5 becomes the 4 pairs 11 10 01 01.
root = 0
remainder = 0
Loop for each pair high to low
root * 2
remainder * 4 + current pair
if root < remainder
remainder - ( root + 1)
root + 2
endif
endLoop
root \ 2
if remainder == 0
root is from a square ie root*root == number
If you need some help, here is karatsuba square root algorithm.
for number a3*b^3 + a2*b^2 + a1*b + a0 where b is a power of two and b > a[n] => 0 and b > a3 => b/4.
function for SqrtRem(a3*b^3 + a2*b^2 + a1*b + a0)
SqrtRem(a3*b+a2) => (S', R')
DivRem(R'*b + a1, 2*S') => (Q, U)
SqrtWholePart = S'*b + Q
SquareRootRemainder = U*b + a0 - Q^2
if SquareRootRemainder < 0 then
SquareRootRemainder += 2*SqrtWholePart - 1
SqrtWholePart-- (decrement)
end if
return(SqrtWholePart, SquareRootRemainder)
hope that helps.
Almost forgot: to do this you may need to shift the number by an even power of two. Then you can easily remove the even power of two from the answer. The reason for this is the a3 => b/4 restriction.
<<I am curious what the algorithm for your SQRT code is.>>
I learned the method when I was 12 for use on a hand operated adding machine.
It's harder to explain than to demonstrate and gives square roots quite quickly.
It just happens that the exact same method works even better in binary.
It goes something like this:
Enter the number you want the root of in register 1
Enter the lowest odd number in the highest digit in register 2.
repeat
Rotate handle to subtract 2 from 1
If the bell rings (which signals underflow) then
add the number back in
reduce the current "odd digit" by 1,
increase the next lower digit by 1
shift register 1 left
else
add 2 to the subtractor
endif
until precision is good enough.
The answer actually appears in a third register which counts the number of subtractions that took place for each
digit
Let's try a simple example, sqrt(2) to 3 decimal digits
20000000 the number I want the square root of, decimal point is implied after first digit
-1 the initial subtractor, 1 in the same column as the first digit
subtract digit1-count=1 (i.e. I've done 1 subtract so far)
10000000
-3 add 2 to the current digit in the subtractor
subtract
-20000000 Ding! Underflow, the bell rings so:
add add it back in
10000000
-21 reduce current subtractor digit by 1 and set next lower digit to 1 (so 3000 becomes 2100)
Now shift register1 left one place (same as shifting register 2 right 1 place)
10000000
- 21
Since 1 subtract succeded without overflow the first digit of the result is 1
Now repeat for the next digit:
10000000
- 21
subtract
07900000 digit2-count=1 since 1 subtract has taken place
- 23 add 2 to the current subtractor digit
subtract
05600000 digit2-count is now 2 since we've done 2 subtracts
- 25 add 2 to the current subtractor digit
subtract
03100000 digit2-count = 3
- 27 add 2 to the current subtractor digit
subtract
00400000 digit2-count = 4
- 29 add 2 to the current subtractor digit
subtract
-02500000 Ding! Underflow, the bell rings so:
add add it back in
00400000
- 281 reduce current subtractor digit by 1 and set next lower digit to 1 (so 2900 becomes 2810)
Now shift register1 left one place (same as shifting register 2 right 1 place)
00400000
- 281
Since we did 4 successful subtractions on the second column, the 2nd digit of the answer is 4
So the answer so far is 1.4, not accurate enough so continue with next digit
Now repeat for the next digit:
0040000
- 281
subtract
0011900 digit3-count=1 since 1 subtract has taken place
- 283 add 2 to the current subtractor digit
subtract
-0016400 Ding! Underflow, the bell rings so:
add add it back in
0011900
- 2821 reduce current subtractor digit by 1 and set next lower digit to 1 (so 2830 becomes 2821)
Now shift register1 left one place (same as shifting register 2 right 1 place)
0011900
- 2821
Since we did 1 successful subtractions on the 3rd column, the 3rd digit of the answer is 1
So the answer so far is 1.41, not accurate enough so continue with next digit
Now repeat for the next digit:
0011900
- 2821
subtract
0009079 digit4-count=1 since 1 subtract has taken place
- 2823 add 2 to the current subtractor digit
subtract
0006256 digit4-count=2 since 2 subtracts have taken place
- 2825 add 2 to the current subtractor digit
subtract
0003431 digit4-count=3 since 3 subtracts have taken place
- 2827 add 2 to the current subtractor digit
subtract
0000606 digit4-count=4 since 4 subtracts have taken place
- 2829
Subtract
-0002223 Ding! Underflow, the bell rings so:
at this point we could carry on to gain unlimited accuracy but we now have the 3rd decimal place we need
the third digit is 4 so the final result is 1.414
Like I said, it's a lot easy to have demonstrated than to explain.
The thing is, the exact same method works in binary but, because there can NEVER be more than 1 subtraction with any
digit then each subtraction fails (and give a zero in the answer) or succedes (giving a 1 in the answer)
You may also notice that register 2 in the above example is becoming 2.828..=sqrt(8)=2 x sqrt(2)
This means we can actually not bother to count the number of subtractions when we do it in Binary and we can still easily extract the correct answer for register 2.
I hope that helps!
Paul.
Thanks aaustin and Paul.
aaustin, I'll have to study that method further.
Paul, I've studied your code, it far smaller than i imagined, it does so much in so little code.
still have to comprehend it better, then will MASMize it.
Optimizations:
brm - (brt + 1) used more registers and partial unroll. Time 1:12.
did a suggested opcode replacement rcl instead of shld in brt*2.
New time 1:02.
[attachment deleted by admin]