News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Simple multiply algorithm.

Started by hutch--, November 04, 2008, 01:09:12 PM

Previous topic - Next topic

hutch--

This is very uncontentious stuff but I thought someone may see a faster way to do it.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

umul proc num:DWORD, mult:DWORD

    xor edx, edx        ; clear EDX
    mov eax, [esp+4]    ; load number into EAX
    mov ecx, [esp+8]    ; load multiplier into ECX
    mul ecx             ; perform unsigned multiply

    ret 8

umul endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

KeepingRealBusy

Huch,

Just delete the xor edx,edx, it will be overwritten. This is not necessary for a multiply which yields the result in edx,eax. DIV needs to have edx cleared because the divide is edx,eax by the divisor yielding the quotient in eax, the remainder in edx (or overflow if the quotient exceeds 32 bits).

Dave.

Jimg

I may be having a senior moment, but what's wrong with
mul dword ptr [esp+8]
instead of using ecx?  Is that slower?

KeepingRealBusy

I thought it had to be an immediate or a reg, let me check.

Dave.

KeepingRealBusy

Jimg,

Wrong (me), right (you). It can be either reg, or mem, or immediate. Thus only occupying 2 regs, eax and edx. I'm not sure about the timing consequences of reg vs mem for the mul itself, but when you have to load the value into a reg to begin with, it just takes one more instruction, so your way should be faster.

Dave.

Jimg

And it seems to me that a macro that just inserted the instructions would be smaller and faster than invoking a proc, so Hutch must have had some ulterior motives for this whole thread.

KeepingRealBusy

Save all of the push of the arguments and  the eip and then the ret with the stack adjustment (all take cycles).

MichaelW

Running on a P3 and using an unsigned multiply I cannot find any coding that is significantly faster.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

umul proc num:DWORD, mult:DWORD

    xor edx, edx        ; clear EDX
    mov eax, [esp+4]    ; load number into EAX
    mov ecx, [esp+8]    ; load multiplier into ECX
    mul ecx             ; perform unsigned multiply

    ret 8

umul endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

umul1 proc num:DWORD, mult:DWORD

    mov eax, [esp+4]
    mov ecx, [esp+8]
    mul ecx

    ret 8

umul1 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

umul2 proc num:DWORD, mult:DWORD

    mov eax, [esp+4]
    mul DWORD PTR [esp+8]

    ret 8

umul2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

umul3 proc num:DWORD, mult:DWORD

    mov eax, [esp+4]
    imul eax, [esp+8]

    ret 8

umul3 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke umul, 123, 456
      invoke umul, 123, 456
      invoke umul, 123, 456
      invoke umul, 123, 456
    counter_end
    print ustr$(eax),13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke umul1, 123, 456
      invoke umul1, 123, 456
      invoke umul1, 123, 456
      invoke umul1, 123, 456
    counter_end
    print ustr$(eax),13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke umul2, 123, 456
      invoke umul2, 123, 456
      invoke umul2, 123, 456
      invoke umul2, 123, 456
    counter_end
    print ustr$(eax),13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke umul3, 123, 456
      invoke umul3, 123, 456
      invoke umul3, 123, 456
      invoke umul3, 123, 456
    counter_end
    print ustr$(eax),13,10

    inkey "Press any key to exit..."
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


30
29
30
24


eschew obfuscation

hutch--

These are the timings on my old PIV.


66
58
58
38
Press any key to exit...


Inlining the code is obviously faster which is what I have normally done for years but I wanted a callable procedure so I could put it into the masm32 library so others could use it as they were learning. The XOR EDX, EDX was to ensure the EDX register was not loaded with a random  value so you could reliably check for numbers larger than 4 gig. I tend to do the 1 to 10 range with combinations of shifts and LEA.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: MichaelW on November 04, 2008, 08:07:50 PM
Running on a P3 and using an unsigned multiply I cannot find any coding that is significantly faster.


Celeron M:
24
24
20
20

KeepingRealBusy

Hutch,

I still don't see where the XOR is needed. You are only loading EAX and ECX with a DWORD so you would never see if one of them was > 32 bits, in fact, the values are pushed on the stack where they are accessed as DWORDS. If someone pushed a 64 but number on the stack, then the loads would get strange results and the ret 8 would not return to the correct point. Once the mul is done, EDX will always be overwritten with the upper half of the quotient, which can be checked upon return from the function for a value > 32 bits, it makes no difference what was in EDX at the time of the call.

Dave.

jj2007

Is there something wrong with my code?? I always get zero cycles for the macro version... :dazzled:

24
24
20
20
0

123*456=56088

umul4 MACRO accu:REQ, mult:REQ
  ifdif <accu>, <eax>
   mov eax, accu
  endif
  imul eax, mult
ENDM

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

umul proc num:DWORD, mult:DWORD

    xor edx, edx        ; clear EDX
    mov eax, [esp+4]    ; load number into EAX
    mov ecx, [esp+8]    ; load multiplier into ECX
    mul ecx             ; perform unsigned multiply

    ret 8

umul endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

umul1 proc num:DWORD, mult:DWORD

    mov eax, [esp+4]
    mov ecx, [esp+8]
    mul ecx

    ret 8

umul1 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

umul2 proc num:DWORD, mult:DWORD

    mov eax, [esp+4]
    mul DWORD PTR [esp+8]

    ret 8

umul2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

umul3 proc num:DWORD, mult:DWORD

    mov eax, [esp+4]
    imul eax, [esp+8]

    ret 8

umul3 endp

umul4 MACRO accu:REQ, mult:REQ
  ifdif <accu>, <eax>
mov eax, accu
  endif
  imul eax, mult
ENDM

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 1000
    Loops = 10000000

    counter_begin Loops, HIGH_PRIORITY_CLASS
      invoke umul, 123, 456
      invoke umul, 123, 456
      invoke umul, 123, 456
      invoke umul, 123, 456
    counter_end
    print ustr$(eax),13,10

    counter_begin Loops, HIGH_PRIORITY_CLASS
      invoke umul1, 123, 456
      invoke umul1, 123, 456
      invoke umul1, 123, 456
      invoke umul1, 123, 456
    counter_end
    print ustr$(eax),13,10

    counter_begin Loops, HIGH_PRIORITY_CLASS
      invoke umul2, 123, 456
      invoke umul2, 123, 456
      invoke umul2, 123, 456
      invoke umul2, 123, 456
    counter_end
    print ustr$(eax),13,10

    counter_begin Loops, HIGH_PRIORITY_CLASS
      invoke umul3, 123, 456
      invoke umul3, 123, 456
      invoke umul3, 123, 456
      invoke umul3, 123, 456
    counter_end
    print ustr$(eax),13,10

    counter_begin Loops, HIGH_PRIORITY_CLASS
      umul4 123, 456
      umul4 123, 456
      umul4 123, 456
      umul4 123, 456
    counter_end
    print ustr$(eax),13,10,10
    print "123*456="
    umul4 123, 456
    print ustr$(eax),13,10

    ; inkey "Press any key to exit..."
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

KeepingRealBusy

JJ,

Please correct me if I'm wrong, but "ifdif..." should evaluate to an assembly time compare of the content of the parameter "accu" and the content of the EAX register. You need a run time check to see if this is true, but this check will involve the same memory access that the MOV would, and would have an extra jump to skip the MOV if EAX just happened to contain the value to begin with. What have you gained?

Dave.

jj2007

Quote from: KeepingRealBusy on November 04, 2008, 11:58:25 PM
JJ,

Please correct me if I'm wrong, but "ifdif..." should evaluate to an assembly time compare of the content of the parameter "accu" and the content of the EAX register. You need a run time check to see if this is true, but this check will involve the same memory access that the MOV would, and would have an extra jump to skip the MOV if EAX just happened to contain the value to begin with. What have you gained?

Dave.

Dave,
The ifdifi checks at assembly time if the argument is eax, and thus avoids an unnecessary mov eax, eax.

smul4 MACRO accu:REQ, mult:REQ
  ifdif <accu>, <eax>
mov eax, accu
  endif
  imul eax, mult
ENDM

...
   mov eax, 12345678h
   nop
   smul4 eax, 456h
   nop

   mov ecx, 12345678h
   nop
   smul4 ecx, 456h
   nop


Disassembly:

Address    Hex dump                       Command                                     Comments
00401039   ³. B8 78563412                 mov eax, 12345678
0040103E   ³. 90                          nop
0040103F   ³? 69C0 56040000               imul eax, eax, 456   <--- one instruction
00401045   ³. 90                          nop
00401046   ³? B9 78563412                 mov ecx, 12345678
0040104B   ³? 90                          nop
0040104C   ³. 8BC1                        mov eax, ecx            <--- additional mov
0040104E   ³? 69C0 56040000               imul eax, eax, 456

jj2007

Analogous, you can check twice for eax and edx when using the unsigned multiply (I use edx as the second register because it's trashed anyway):

umul4 MACRO accu:REQ, mult:REQ
  ifdif <accu>, <eax>
mov eax, accu
  endif
  ifdif <mult>, <edx>
mov edx, mult
  endif
  mul edx
ENDM


I might go for this version:

xmul MACRO accu:REQ, mult:REQ
  ifdif <accu>, <eax>
mov eax, accu
  endif
  imul eax, mult
  EXITM <eax>
ENDM

.data

V1u dd 456
V2s SDWORD 456
Result dd 0

.code
start:

  mov Result, xmul(V1u, V2s)
  push xmul(V1u, V2s)
  pop eax