
FPU Question

Started by Posit, May 28, 2005, 07:23:13 AM


Posit

Is there an FPU instruction to multiply two unsigned DWORD integers and store the result in 64-bit format? I'm looking for a faster alternative to the MUL instruction, which stores its result in EDX:EAX.

AeroASM


.data
in1    dd 2
in2    dd 3
result dq 0
.code
fild  in1      ; ST(0) = 2.0
fimul in2      ; ST(0) = 2.0 * 3.0 = 6.0
fistp result   ; store ST(0) as a QWORD integer and pop


Because the variables are declared with dd and dq, MASM will encode the right operand size for each instruction.

Posit

In what format is the QWORD stored? I've tried moving the low DWORD into EAX and the high DWORD into EDX, and vice versa, but the values are different from what MUL yields.

According to the Intel docs, "The FIMUL instructions convert an integer source operand to double extended-precision floating-point format before performing the multiplication." I'm probably missing something here, but I don't want floating-point format; I want integer format.

MichaelW

Hi Posit,

Also from the Intel documents:
Quote
Internally, the FPU holds all numbers in a uniform 80-bit extended format. Operands that may be represented in memory as 16-, 32-, or 64-bit integers, 32-, 64-, or 80-bit floating point numbers, or 18-digit packed BCD numbers, are automatically converted into extended format as they are loaded into the FPU registers.

http://www.website.masmforum.com/tutorials/fptute/fpuchap2.htm#real10

In his example, AeroASM didn't deal with the fact that FIMUL performs a signed multiply, so a DWORD of 80000000h or above is treated as a negative number.

http://www.website.masmforum.com/tutorials/fptute/fpuchap9.htm

And I doubt that using the FPU will be faster on any processor.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
        dword1  dd 2
        dword2  dd 4
        dword3  dd 2
        dword4  dd 80000000h
        result  dq 0
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
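    ; ALU: unsigned 2 * 4, product in EDX:EAX
    mov   eax, dword1
    mul   dword2
    mov   ebx, eax                      ; keep low DWORD in EBX (preserved across calls)
    print uhex$(edx)
    print uhex$(ebx), 13, 10

    ; FPU: signed 2 * 4, FABS clears the sign of the product
    fild  dword1
    fimul dword2
    fabs
    fistp result                        ; store as a 64-bit integer
    print uhex$(DWORD PTR result+4)
    print uhex$(DWORD PTR result), 13, 10

    ; ALU: unsigned 2 * 80000000h
    mov   eax, dword3
    mul   dword4
    mov   ebx, eax
    print uhex$(edx)
    print uhex$(ebx),13,10

    ; FPU: FIMUL reads 80000000h as -2147483648, FABS drops the sign
    fild  dword3
    fimul dword4
    fabs
    fistp result
    print uhex$(DWORD PTR result+4)
    print uhex$(DWORD PTR result),13,10,13,10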
       
    LOOP_COUNT    EQU 10000000
    REPEAT_COUNT  EQU 10

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov   eax, dword1
        mul   dword2
        mov   DWORD PTR result, eax
        mov   DWORD PTR result+4, edx
      ENDM       
    counter_end
    print ustr$(eax)
    print chr$(" cycles (* 10)",13,10)

    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        fild  dword1
        fimul dword2
        fabs   
        fistp result
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" cycles (* 10)",13,10)

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Result on my P3:

0000000000000008
0000000000000008
0000000100000000
0000000100000000

32 cycles (* 10)
94 cycles (* 10)

eschew obfuscation

Posit

Can FMUL somehow be used, or does it just yield a DWORD result?

MichaelW

FMUL also performs a signed multiplication, and it cannot handle integer operands.

http://www.website.masmforum.com/tutorials/fptute/fpuchap8.htm#fmul

I think for a speed increase you will need to use MMX, SSE, or SSE2.



GregL

Wait a minute, FMUL will work just fine if you load both integers into FPU registers.

.data
in1    dd 2
in2    dd 3
result dq 0
.code
fild  in1     ; in1 is converted from a DWORD integer to a REAL10, ST(0) = 2.0
fild  in2     ; in2 is converted from a DWORD integer to a REAL10, ST(0) = 3.0, ST(1) = 2.0
fmul          ; assembled as FMULP ST(1), ST(0); after the pop ST(0) = 6.0
fistp result  ; ST(0) is converted from a REAL10 to a QWORD integer and saved to result


Once an integer is loaded into an FPU register with FILD, it is converted to REAL10 floating-point format; all values in FPU registers are in REAL10 format. The only time you have to worry about different formats is when loading from or storing to memory, or when an instruction takes a memory operand. FIMUL is for multiplying ST(0) by an integer located in memory.

AeroASM's code would work just fine; you don't need the FABS. In fact, you don't want the FABS.

.data
in1    dd 2
in2    dd 3
result dq 0
.code
fild  in1     ; in1 is converted from a DWORD integer to a REAL10, ST(0) = 2.0
fimul in2     ; in2 is converted from a DWORD integer to a REAL10, ST(0) = ST(0) * 3.0 = 6.0
fistp result  ; ST(0) is converted from a REAL10 to a QWORD integer and saved to result



FABS clears the sign bit of the REAL10 value in ST(0). You don't want that here; it would give wrong results for negative values. For example, if in1 = -2 and in2 = 3, the product -6 would be stored as 6.

No offense, MichaelW; I just couldn't let that go.


MichaelW

Greg,

No offense taken. My goal here is to provide correct answers, and if I don't then I should be, and would prefer to be, corrected.

Regarding the FABS, from Posit's initial post, emphasis added:
Quote
Is there an FPU instruction to multiply two unsigned DWORD integers and store the result in 64-bit format?

AeroASM's code will not work over the full range of unsigned values. The FABS was an attempt to correct the result, but unfortunately it also fails over the full range of unsigned values. At this point I can't think of any clean method of converting the value, and in any case the FPU version will still be slower than the ALU version.

Regarding FMUL, it will not accept an integer memory operand. When I considered using FMUL, it seemed to me that the extra instruction would make the code execute slower. Now that I have tested it on a P3, the FMUL sequence and the FIMUL sequence both take 72 cycles (without the FABS). FMUL may be faster on other processors, but again, the FPU version will still be slower than the ALU version.

GregL

Hi MichaelW,

After re-reading the posts and running your code, I see where you were coming from; I was taking part of what you said in the wrong context.

You're right, for speed Posit is best off with the ALU code.