Is there an FPU instruction to multiply two unsigned DWORD integers and store the result in 64-bit format? I'm looking for a faster alternative to the MUL instruction, which stores its result in EDX:EAX.
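.data
in1 dd 2
in2 dd 3
result dq 0
.code
fild in1 ; ST(0) = in1, read as a signed DWORD
fimul in2 ; ST(0) = ST(0) * in2, in2 also read as a signed DWORD
fistp result ; store ST(0) as a signed QWORD integer and pop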
Because you declared the variables as dd and dq, MASM will generate the right data size encoding.
In what format is the QWORD stored? I've tried moving the low DWORD into EAX and the high DWORD into EDX, and vice versa, but the values are different than what MUL yields.
According to the Intel docs, "The FIMUL instructions convert an integer source operand to double extended-precision floating-point format before performing the multiplication." I'm probably missing something here, but I don't want floating-point format, I want integer format.
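On the storage format question: FISTP with a QWORD destination writes an ordinary 64-bit two's-complement signed integer, low DWORD first, so the memory layout matches what EDX:EAX holds after MUL. The values only differ when the sign matters. A minimal C sketch of that split, where low_dword and high_dword are hypothetical helper names used only for illustration:

```c
#include <stdint.h>

/* Low DWORD of a 64-bit integer: the half MUL leaves in EAX, and the
   DWORD found at the lower address of the QWORD on little-endian x86. */
uint32_t low_dword(int64_t q)
{
    return (uint32_t)((uint64_t)q & 0xFFFFFFFFu);
}

/* High DWORD: the half MUL leaves in EDX, found 4 bytes above the low DWORD. */
uint32_t high_dword(int64_t q)
{
    return (uint32_t)((uint64_t)q >> 32);
}
```

For a product of 6 the halves are 0 (high) and 6 (low). For a negative product such as -2^32 the high half is FFFFFFFFh, which is where the FPU result departs from MUL.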
Hi Posit,
Also from the Intel documents:
Quote
Internally, the FPU holds all number in a uniform 80-bit extended format. Operands that may be represented in memory as 16-, 32-, or 64-bit integers, 32-, 64-, or 80-bit floating point numbers, or 18-digit packed BCD numbers, are automatically converted into extended format as they are loaded into the FPU registers.
http://www.website.masmforum.com/tutorials/fptute/fpuchap2.htm#real10
In his example Aero forgot to deal with the problem that FIMUL performs a signed multiply.
http://www.website.masmforum.com/tutorials/fptute/fpuchap9.htm
And I doubt that using the FPU will be faster on any processor.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
include timers.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
dword1 dd 2
dword2 dd 4
dword3 dd 2
dword4 dd 80000000h
result dq 0
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov eax, dword1
mul dword2
mov ebx, eax
print uhex$(edx)
print uhex$(ebx), 13, 10
fild dword1
fimul dword2
fabs
fistp result
print uhex$(DWORD PTR result+4)
print uhex$(DWORD PTR result), 13, 10
mov eax, dword3
mul dword4
mov ebx, eax
print uhex$(edx)
print uhex$(ebx),13,10
fild dword3
fimul dword4
fabs
fistp result
print uhex$(DWORD PTR result+4)
print uhex$(DWORD PTR result),13,10,13,10
LOOP_COUNT EQU 10000000
REPEAT_COUNT EQU 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
mov eax, dword1
mul dword2
mov DWORD PTR result, eax
mov DWORD PTR result+4, edx
ENDM
counter_end
print ustr$(eax)
print chr$(" cycles (* 10)",13,10)
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
fild dword1
fimul dword2
fabs
fistp result
ENDM
counter_end
print ustr$(eax)
print chr$(" cycles (* 10)",13,10)
mov eax, input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Result on my P3:
0000000000000008
0000000000000008
0000000100000000
0000000100000000
32 cycles (* 10)
94 cycles (* 10)
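The second pair of result lines agrees only because of the FABS: FILD reads 80000000h as -2^31, so the FPU product is -2^32, and clearing the sign gives the same magnitude as the unsigned product. A rough C model of that case, assuming the FPU arithmetic is exact for these operand sizes (fild_dword and fpu_product are hypothetical names, not MASM32 or Intel identifiers):

```c
#include <stdint.h>

/* Value FILD produces for a DWORD memory operand: the bits read as signed. */
int64_t fild_dword(uint32_t bits)
{
    return (int64_t)bits - ((bits & 0x80000000u) ? INT64_C(4294967296) : 0);
}

/* Model of FILD a / FIMUL b / FISTP: a signed 64-bit product. */
int64_t fpu_product(uint32_t a, uint32_t b)
{
    return fild_dword(a) * fild_dword(b);
}

/* What MUL leaves in EDX:EAX: the unsigned 64-bit product. */
uint64_t alu_product(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

With a = 2 and b = 80000000h, fpu_product returns -4294967296 (FFFFFFFF00000000h) while alu_product returns 4294967296 (0000000100000000h).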
Can FMUL somehow be used, or does it just yield a DWORD result?
FMUL also performs a signed multiplication, and it cannot handle integer operands.
http://www.website.masmforum.com/tutorials/fptute/fpuchap8.htm#fmul
I think for a speed increase you will need to use MMX, SSE, or SSE2.
Wait a minute, FMUL will work just fine if you load both integers into FPU registers.
.data
in1 dd 2
in2 dd 3
result dq 0
.code
fild in1 ; in1 is converted from a DWORD integer to a REAL10, ST(0) = 2.0
fild in2 ; in2 is converted from a DWORD integer to a REAL10, ST(0) = 3.0, ST(1) = 2.0
fmul ; ST(1) = ST(1) * ST(0), then pop, leaving ST(0) = 6.0
fistp result ; ST(0) is converted from a REAL10 to a QWORD integer and saved to result
Once an integer is loaded into an FPU register with FILD it is converted to REAL10 floating-point format. The values in FPU registers are in REAL10 format. The only time you have to worry about different formats is when loading from or storing to memory, or when accessing values in memory. FIMUL is for multiplying ST(0) by an integer located in memory.
AeroASM's code would work just fine; you don't need the FABS. In fact, you don't want the FABS.
.data
in1 dd 2
in2 dd 3
result dq 0
.code
fild in1 ; in1 is converted from a DWORD integer to a REAL10, ST(0) = 2.0
fimul in2 ; in2 is converted from a DWORD integer to a REAL10, ST(0) = ST(0) * 3.0, ST(0) = 6.0
fistp result ; ST(0) is converted from a REAL10 to a QWORD integer and saved to result
FABS clears the sign bit of the REAL10 value in ST(0). You don't want to do that here; it would cause errors for negative values, e.g. if in1 = -2 and in2 = 3.
No offense MichaelW, I just couldn't let that be.
Greg,
No offense taken. My goal here is to provide correct answers, and if I don't then I should be, and would prefer to be, corrected.
Regarding the FABS, from Posit's initial post, emphasis added:
Quote
Is there an FPU instruction to multiply two unsigned DWORD integers and store the result in 64-bit format?
AeroASM's code will not work over the full range of unsigned values. The FABS was an attempt to correct the result that, unfortunately, will also not work over the full range of unsigned values. At this point I can't think of any clean method of converting the value, and in any case the FPU version will still be slower than the ALU version.
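Two concrete cases illustrate the range problem. First, FFFFFFFFh is loaded as -1, so FFFFFFFFh * 2 comes out as 2 after FABS instead of 00000001FFFFFFFEh. Second, the largest unsigned product, FFFFFFFFh * FFFFFFFFh = FFFFFFFE00000001h, is larger than the biggest value a signed QWORD can hold, so FISTP could not store it even if the operands were loaded correctly. A small C sketch of both checks, assuming exact arithmetic; the function names are hypothetical:

```c
#include <stdint.h>

/* Reference: the unsigned 64-bit product MUL leaves in EDX:EAX. */
uint64_t mul_edx_eax(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Model of FILD a / FIMUL b / FABS / FISTP: operands read as signed,
   sign of the product cleared before it is stored. */
uint64_t fpu_fabs_product(uint32_t a, uint32_t b)
{
    int64_t sa = (int64_t)a - ((a & 0x80000000u) ? INT64_C(4294967296) : 0);
    int64_t sb = (int64_t)b - ((b & 0x80000000u) ? INT64_C(4294967296) : 0);
    int64_t p = sa * sb;              /* magnitude is at most 2^62, no overflow */
    return (uint64_t)(p < 0 ? -p : p);
}

/* Nonzero if the true unsigned product fits in a signed QWORD,
   which is the only 64-bit integer format FISTP can store. */
int fits_signed_qword(uint32_t a, uint32_t b)
{
    return mul_edx_eax(a, b) <= (uint64_t)INT64_MAX;
}
```

Here fpu_fabs_product(0xFFFFFFFF, 2) yields 2 while mul_edx_eax yields 0x1FFFFFFFE, and fits_signed_qword(0xFFFFFFFF, 0xFFFFFFFF) is 0.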
Regarding FMUL, it will not accept an integer memory operand. When I considered using FMUL it seemed to me that the extra instruction would make the code execute slower. Now that I test it, on a P3, the FMUL sequence and the FIMUL sequence both take 72 cycles (without the FABS). FMUL may be faster on other processors, but again, the FPU version will still be slower than the ALU version.
Hi MichaelW,
After re-reading the posts and running your code, I see where you were coming from. I was taking part of what you were saying in the wrong context.
You're right, for speed Posit is best off with the ALU code.