Hi, I am currently working on writing a prog that does 32-by-32bit multiplication using the shift/add algorithm. I managed to get a 16-by-16bit multiplication working using DX-AX to store the result and storing the multiplicand in BX register.
But can't seem to be able to extend that to 32bit version. If someone can point me out in the right direction, that would be great. I'm using x86. Thanks
Post the code you have for the 16x16-bit version and where you've got with the 32x32-bit.
Are you trying to implement it with the 16-bit instruction set, or can you use the 32-bit instructions/registers of the 80386?
You could also probably optimize by having a 32x16 version, or switching the multiplier/multiplicand.
For non-homework assignments it's easier just to use the multiply instructions.
... and faster :P
welcome to the forum :U
well the assignment is to use shift/add to multiply two 32-bit numbers. I can only use the 16-bit registers in my emulator.
Well here is the code I have so far, I know it's not the best way to do things but I'm kinda new to assembly and did the best I could.
mov si, offset A
mov di, offset B
mov bx, offset product
mov dx, 0
mov ax, 0
mov bx, ds:di
mov cx, 0
BEGIN:
push bx
and bx, 1h
cmp bx, 0
JE SHIFT_IT
add dx, word ptr ds:si
SHIFT_IT:
pop bx
push dx
and dx, 01h
cmp dx, 0
JE DX_ZERO
shr ax, 1
or ax, 8000h
pop dx
shr dx, 1
shr bx, 1
inc cx
JMP BEGIN
DX_ZERO:
shr ax, 1
pop dx
shr dx, 1
shr bx, 1
inc cx
cmp bx, 0h
JE DONE_PROD
JMP BEGIN
DONE_PROD:
mov temp, 10h
sub temp, cx
mov cx, temp
LOOP_SET:
push dx
and dx, 1h
cmp dx, 0h
JE NO_ONE
shr ax, 1
or ax, 8000h
pop dx
shr dx, 1
JMP END_LOOP
NO_ONE:
shr ax, 1
pop dx
shr dx, 1
END_LOOP:
loop LOOP_SET
so at the end of all this the product is stored in DX-AX and i store them in the memory allocated for the product. I'm lost as to how I can extend this to 32bits. I'm using up all my registers here. If I do store the multiplicand in DX-AX and multiplier in BX-CX then where do i temporarily store my 64bit product? That is my main concern as to how I can extend this to 32 bits. This code works a charm for 16bit multiplication.
mov bx, ds:di
add dx, word ptr ds:si
try replacing those lines with
mov bx,[di]
add dx,[si]
the brackets tell the assembler you want the word at the address that is in DI or SI
the DS: segment override is not required, as DS is the default segment register for DI and SI in this addressing mode
WORD PTR is not required, because DX is a word size register - it can be no other size
that looks like a lot of code for a relatively simple operation :P
(i have only glanced through it briefly)
although, you are out of registers quick because the product is 64 bits
good effort on your part :U
Quote from: dedndave on June 28, 2011, 03:34:24 AM
mov bx, ds:di
add dx, word ptr ds:si
try replacing those lines with
mov bx,[di]
add dx,[si]
the brackets tell the assembler you want the word at the address that is in DI or SI
the DS: segment override is not required, as DS is the default segment register for DI and SI in this addressing mode
WORD PTR is not required, because DX is a word size register - it can be no other size
that looks like a lot of code for a relatively simple operation :P
(i have only glanced through it briefly)
although, you are out of registers quick because the product is 64 bits
good effort on your part :U
thanks for your help, I knew I had some redundant stuff that could be replaced with something faster or more efficient. I guess I could try making it more efficient. thing is I need to extend this to 32-by-32bit, any suggestions on how that can be achieved. whether it be pushing or storing results in temporary variables?
Here is a really crude ADD version. :bg
Note that if it exceeds DWORD range that it just rolls over EAX and keeps adding.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
slowmul PROTO :DWORD,:DWORD
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL var :DWORD
mov var, 1024*1024
invoke slowmul,var,1024
print ustr$(eax),13,10
invoke slowmul,1024,var
print ustr$(eax),13,10
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
slowmul proc arg1:DWORD,arg2:DWORD
xor eax, eax
mov ecx, arg1
.if ecx > arg2
mov ecx, arg1
mov edx, arg2
.else
mov edx, arg1
mov ecx, arg2
.endif
lbl0:
add eax, ecx
sub edx, 1
jnz lbl0
ret
slowmul endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
well - the product needs to be 64 bits, as i said
because you are limited to 16-bit registers, i would use the product variable as the accumulator
then, use 4 general registers as the shifter (which is the multiplicand or addend, depending how you look at it)
this requires that you zero out the product before starting
this isn't tested and may have bugs - lol
also, it may not be the fastest
but, it gives you an idea of where to start
.MODEL Small
.STACK 4096
;-------------------------------------------------------------
.DATA
A dd 12345678h
B dd 23456789h
;------------------------
.DATA?
product dw 4 dup(?)
count dw ?
;-------------------------------------------------------------
.CODE
ASSUME DS:DGROUP
_main PROC
mov ax,@DATA
mov ds,ax
mov si,offset A
mov di,offset B
mov bx,offset product
call LongMul
mov ax,4C00h
int 21h
_main ENDP
;-------------------------------------------------------------
LongMul PROC
push si
push di
push bp
xor ax,ax ;zero AX register
;zero out the product
mov [bx],ax
mov [bx+2],ax
mov [bx+4],ax
mov [bx+6],ax
;load the multiplicand into DX:AX:DI:CX
mov dx,ax
mov cx,[di]
mov di,[di+2]
;test bit in BP and initialize count
mov bp,1
mov count,2
top_of_loop:
test bp,[si]
jz next_bit
add [bx],cx
adc [bx+2],di
adc [bx+4],ax
adc [bx+6],dx
next_bit:
shl cx,1
rcl di,1
rcl ax,1
rcl dx,1
rol bp,1
jnc top_of_loop
add si,2
dec count
jnz top_of_loop
pop bp
pop di
pop si
ret
LongMul ENDP
;-------------------------------------------------------------
END _main
From the description of the original assignment, the term Russian Peasant Multiply comes to mind as it uses and ADD / div by 2 (shr reg, 1) technique and from memory its no big deal to do on a computer.
Somewhat later: This is a simple example of peasant multiply using SHIFT ADD. Not exhaustively tested but appears to be generating the correct numbers.
This is the URL for the method used. http://mathforum.org/dr.math/faq/faq.peasant.html
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
peasant_multiply PROTO :DWORD,:DWORD
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
invoke peasant_multiply,65536,16
print ustr$(eax),13,10
invoke peasant_multiply,9,9
print ustr$(eax),13,10
invoke peasant_multiply,1024,16
print ustr$(eax),13,10
invoke peasant_multiply,13,59
print ustr$(eax),13,10
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
peasant_multiply proc arg1:DWORD,arg2:DWORD
mov ecx, arg1 ; load 1st arg into ECX
mov edx, arg2 ; load 2nd arg into EDX
xor eax, eax ; zero EAX
jmp testit ; jump to test if arg2 is ODD or EVEN
lbl0:
add ecx, ecx ; double arg1
shr edx, 1 ; div arg2 by 2
cmp edx, 1 ; compare it to 1
jbe lbl1 ; exit loop if below or equal
testit:
test edx, 00000000000000000000000000000001b
je lbl0 ; jump back if its even
add eax, ecx ; accumulate ECX in EAX if EDX is odd
jmp lbl0
lbl1:
add eax, ecx ; add last ECX to EAX and exit
ret
peasant_multiply endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Quote from: delhibelly on June 28, 2011, 03:18:10 AM
well the assignment is to use shift/add to multiply two 32-bit numbers. I can only use the 16-bit registers in my emulator.
Wow, this is all great. Thanks a lot for your help. I'll look into what you guys said and try implementing something. My assignment is due tomorrow, I'll let you guys know what happens. Thanks a again.
i wanted a 64-bit to ASCII decimal routine in 16-bit, so i added it to the program
see attached...
Implementing the algorithm described here (http://courses.cs.vt.edu/~cs1104/BuildingBlocks/multiply.040.html), for a 64-bit result from two 32-bit multiplicands, you can use BSR to find the bit position of the leftmost 1 in the first multiplicand, SHLD to shift a 64-bit working copy of the second multiplicand into position, do a 64-bit addition of the shifted copy with the result using ADD followed by ADC, and use BTR to clear the leftmost 1 in the first multiplicand. Note that the 64-bit working copy is simply the second multiplicand zero-extended into the high-order DWORD, and it must be reloaded from the second multiplicand for each loop, discarding the previously shifted value. In my test the algorithm duplicated the results of MUL over almost the entire 32-bit range (actually 0 to 4294901760), and ran in ~150 cycles on a P3.
Quote from: dedndave on June 28, 2011, 07:08:23 PM
i wanted a 64-bit to ASCII decimal routine in 16-bit, so i added it to the program
see attached...
thanks. understood most of the code, works fine for me. Still working on my code though and think I am almost there.
Quote from: MichaelW on June 29, 2011, 04:32:16 AM
Implementing the algorithm described here (http://courses.cs.vt.edu/~cs1104/BuildingBlocks/multiply.040.html), for a 64-bit result from two 32-bit multiplicands, you can use BSR to find the bit position of the leftmost 1 in the first multiplicand, SHLD to shift a 64-bit working copy of the second multiplicand into position, do a 64-bit addition of the shifted copy with the result using ADD followed by ADC, and use BTR to clear the leftmost 1 in the first multiplicand. Note that the 64-bit working copy is simply the second multiplicand zero-extended into the high-order DWORD, and it must be reloaded from the second multiplicand for each loop, discarding the previously shifted value. In my test the algorithm duplicated the results of MUL over almost the entire 32-bit range (actually 0 to 4294901760), and ran in ~150 cycles on a P3.
Actually did come across your website earlier. algorithm is explained very well and I kind of did use it for my 16-bit result but was having trouble extended it to 64 bits as I ran out of registers.
Quote from: dedndave on June 28, 2011, 10:34:32 AM
well - the product needs to be 64 bits, as i said
because you are limited to 16-bit registers, i would use the product variable as the accumulator
then, use 4 general registers as the shifter (which is the multiplicand or addend, depending how you look at it)
this requires that you zero out the product before starting
Excellent solution. thanks.
You are limited to using 16-bit registers? Using 32-bit registers and the instructions that I listed would make the task much easier. Where your 16/16 code uses 48 instructions, my 32/64 code encapsulated in a procedure uses only 17 instructions.
This is a demo of the support code that I used:
;==============================================================================
include \masm32\include\masm32rt.inc
;==============================================================================
printf MACRO format:REQ, args:VARARG
IFNB <args>
invoke crt_printf, cfm$(format), args
ELSE
invoke crt_printf, cfm$(format)
ENDIF
EXITM <>
ENDM
;==============================================================================
bin$ MACRO val
ifndef __bin$_buff__
.data
align 4
__bin$_buff__ db 68 dup(0)
.code
endif
IF TYPE val EQ 8
mov eax, DWORD PTR val
mov edx, DWORD PTR val+4
ELSEIF TYPE val EQ 4
mov eax, DWORD PTR val
mov edx, 0
ELSE
ECHO -----------------------------------
ECHO bin$ arg must be DWORD or QWORD
ECHO -----------------------------------
.ERR
ENDIF
invoke crt__ui64toa, edx::eax, ADDR __bin$_buff__, 2
EXITM <eax>
ENDM
;==============================================================================
.data
p dq 0
q dq 0
.code
;==============================================================================
start:
;==============================================================================
xor ebx, ebx
.WHILE ebx < 100
printf( "%s\n", bin$(ebx) )
inc ebx
.ENDW
.WHILE ebx < 200
mov DWORD PTR p, ebx
mov DWORD PTR p+4, ebx
printf( "%s\n", bin$(p) )
inc ebx
.ENDW
printf( "\n%I64Xh\n", p )
printf( "%I64d\n\n", p )
inkey "Press any key to exit..."
exit
;==============================================================================
end start
@ MichaelW - I'm using the emu8086 which is emulator for 8086 processor. That is why I cannot access the 32 bit registers, hence running out of registers. If I could my life would be a whole lot easier and I could definitely use the code you have :bg
Quote from: dedndave on June 28, 2011, 11:09:39 AM
also, it may not be the fastest
Hi,
I scarfed up your code to check against my implementation
of the shift and add multiply. Finally got around to timing the
two fixed point multiplies I wrote for my arbitrary precision
set of routines. One is a shift and add as above and one is
using a multiply terms and accumulate algorithm.
Test with both multiply routines on the HP200LX.
W/WO display FixMulM ~7.8/1.0 Seconds FixMulA ~13.8/7.1 Seconds.
80186 MUL mem 41-43 cycles, RCL mem 5.
Regards.
Steve