Multiplication using shift/add 32-bit

delhibelly · June 29, 2011, 04:58:49 AM

Quote from: MichaelW on June 29, 2011, 04:32:16 AM
Implementing the algorithm described here, for a 64-bit result from two 32-bit multiplicands, you can use BSR to find the bit position of the leftmost 1 in the first multiplicand, SHLD to shift a 64-bit working copy of the second multiplicand into position, do a 64-bit addition of the shifted copy with the result using ADD followed by ADC, and use BTR to clear the leftmost 1 in the first multiplicand. Note that the 64-bit working copy is simply the second multiplicand zero-extended into the high-order DWORD, and it must be reloaded from the second multiplicand for each loop, discarding the previously shifted value. In my test the algorithm duplicated the results of MUL over almost the entire 32-bit range (actually 0 to 4294901760), and ran in ~150 cycles on a P3.

Actually did come across your website earlier. algorithm is explained very well and I kind of did use it for my 16-bit result but was having trouble extended it to 64 bits as I ran out of registers.

delhibelly · June 29, 2011, 05:02:03 AM

Quote from: dedndave on June 28, 2011, 10:34:32 AM
well - the product needs to be 64 bits, as i said
because you are limited to 16-bit registers, i would use the product variable as the accumulator
then, use 4 general registers as the shifter (which is the multiplicand or addend, depending how you look at it)
this requires that you zero out the product before starting

Excellent solution. thanks.

MichaelW · June 29, 2011, 08:11:25 AM

You are limited to using 16-bit registers? Using 32-bit registers and the instructions that I listed would make the task much easier. Where your 16/16 code uses 48 instructions, my 32/64 code encapsulated in a procedure uses only 17 instructions.

This is a demo of the support code that I used:

Code Select


;==============================================================================
    include \masm32\include\masm32rt.inc
;==============================================================================

printf MACRO format:REQ, args:VARARG
    IFNB <args>
        invoke crt_printf, cfm$(format), args
    ELSE
        invoke crt_printf, cfm$(format)
    ENDIF
    EXITM <>
ENDM

;==============================================================================

bin$ MACRO val
    ifndef __bin$_buff__
        .data
        align 4
            __bin$_buff__  db 68 dup(0)
        .code
    endif
    IF TYPE val EQ 8
        mov eax, DWORD PTR val
        mov edx, DWORD PTR val+4
    ELSEIF TYPE val EQ 4
        mov eax, DWORD PTR val
        mov edx, 0
    ELSE
        ECHO -----------------------------------
        ECHO bin$ arg must be DWORD or QWORD
        ECHO -----------------------------------
        .ERR
    ENDIF
    invoke crt__ui64toa, edx::eax, ADDR __bin$_buff__, 2
    EXITM <eax>
ENDM

;==============================================================================
    .data
        p dq 0
        q dq 0
    .code
;==============================================================================
start:
;==============================================================================

    xor ebx, ebx
    .WHILE ebx < 100
        printf( "%s\n", bin$(ebx) )
        inc ebx
    .ENDW

    .WHILE ebx < 200
        mov DWORD PTR p, ebx
        mov DWORD PTR p+4, ebx
        printf( "%s\n", bin$(p) )
        inc ebx
    .ENDW

    printf( "\n%I64Xh\n", p )
    printf( "%I64d\n\n", p )

    inkey "Press any key to exit..."
    exit
;==============================================================================
end start

delhibelly · June 29, 2011, 08:54:40 AM

@ MichaelW - I'm using the emu8086 which is emulator for 8086 processor. That is why I cannot access the 32 bit registers, hence running out of registers. If I could my life would be a whole lot easier and I could definitely use the code you have :bg

FORTRANS · September 23, 2011, 09:52:01 PM

Quote from: dedndave on June 28, 2011, 11:09:39 AM
also, it may not be the fastest

Hi,

I scarfed up your code to check against my implementation
of the shift and add multiply. Finally got around to timing the
two fixed point multiplies I wrote for my arbitrary precision
set of routines. One is a shift and add as above and one is
using a multiply terms and accumulate algorithm.

Test with both multiply routines on the HP200LX.
W/WO display FixMulM ~7.8/1.0 Seconds FixMulA ~13.8/7.1 Seconds.
80186 MUL mem 41-43 cycles, RCL mem 5.

Regards.

Steve

News:

Multiplication using shift/add 32-bit

delhibelly

delhibelly

MichaelW

delhibelly

FORTRANS