FPU arithmetic and console output

Started by ptoulis, December 01, 2008, 09:16:06 AM

ptoulis

Hi,

I would like to make the assembly equivalent of the following simple C program:

#include <math.h>

int main(int argc, char **args) {
    int A = 0;
    double B;
    for (A = 0; A < 100000000; A++) {
        B = cos(A);
    }
}


To do this I would like to use the FCOS instruction, but I have run into some serious trouble.
After searching through MASM and The Campus I am at this point:
include \masm32\include\masm32rt.inc

.data
A dd 0.0
B dq 0.0
buffer db 100 dup (0),13,10,0
eax_buffer db 100 dup(0),13,10,0
message db '(cos) The value of eax = %d',13,10,0

.code

main:
    mov eax,A
   
loop_here:
    push eax
    invoke  wsprintf,ADDR eax_buffer,ADDR message,eax
    invoke  StdOut,ADDR eax_buffer
    pop eax
    cmp eax, 10
    je last
    add eax,1
    mov A,eax
    push eax
    fld A
    fcos
    fst B
    ffree st(5)
    invoke FloatToStr,B, addr buffer
    print addr buffer
    pop eax
    jmp loop_here

last:
    invoke ExitProcess,0

end main


In the output I have something like:
1 (cos) The value of eax=1
1 (cos) The value of eax=2 ...
So EAX is incremented properly and the loop exits, but the operand of FCOS never seems to change: it appears to 'get stuck' at the initialization value (0.0).

What should I do in order to compute cos(1), cos(2), cos(3), etc.?
Thanks!

Mirno

You're loading an integer (A) into the FPU, not a float, so use fild.

Mirno

herge


Hi ptoulis:

Simply Google FPU and radians.
The trig argument must be in radians.

Regards herge
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

MichaelW

ptoulis,

Instead of moving A into the EAX register, and then having to preserve it around the function calls, it would be easier to just use it directly. You can pass a DWORD variable to a function just as you would pass a 32-bit register, and where a function can (and probably will) change the value of EAX, it will not change A. You can also use A directly to control your loop:

  loop_label:
    inc A
    cmp A, 100000000
    jb loop_label           ;  jump if A is below 100000000


As Mirno pointed out, A is an integer so you should load it as an integer:

fild A

The instruction:

fld A

will load A into the FPU without complaint, but the bits of A will be interpreted as a 32-bit float, which will not produce the expected result.

If you store B with the fstp form of the instruction, it will store the result that fcos left in ST(0) to B and also pop that value off the FPU stack, leaving the stack empty.

Also, you should expect this loop to take a very long time to complete.
eschew obfuscation

raymond

You don't even have to "google" simply FPU. Look in the upper right corner of this forum's page and click on the "Forum Links and Website" and then select the "Floating Point Tute". Then look for the FPU tutorial on that website.

Once you've learned the basics of dealing with floating points, you can also look at the source code of the FPULIB (and its Help file) which comes with the MASM32 package and is also available from the same site as the above tutorial.

Don't expect to succeed with floating point computation unless you make the effort of learning the intricacies of the FPU and its assembly instructions (or rely strictly on the library).
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

ptoulis

Thanks for the feedback, guys! I should be using the fild instruction. Thanks also to MichaelW for the improvements.
Actually, the program takes about 4 seconds to run on my 2.6 GHz machine.
Oddly, the equivalent code in C takes about 8 seconds, roughly 2 times slower.
I ran gcc -S to get the assembly code and found that it calls the library _cos function.
My assembly program is not much better than what gcc creates, but it is still 2 times faster.

Is there a good explanation for this?



raymond

Quote: My assembly program is not that better from what gcc creates, but it is still 2 times faster.

Is there a good explanation for this?

You have not confirmed that your assembly program is producing the exact same results as your C program.

If they are producing the same results, we would have to look at the disassembled C code to find an explanation (possibly significant overhead in the _cos function or a different float conversion function, or ...).

However, if they are not producing the same result, the C library _cos function may be converting the input parameter from degrees to radians (an extra multiplication and division taking time), which is absent from your assembly code.

ptoulis

Quote from: raymond on December 03, 2008, 01:50:56 AM

If they are producing the same results, we would have to look at the disassembled C code to find an explanation (possibly significant overhead in the _cos function or a different float conversion function, or ...).

You are right raymond. Here is the C source code.
#include <math.h>

int main(int argc, char **args) {
    int i = 0;
    double c;
    for (i = 0; i < 100000000; i++) {
        c = cos(i);
    }
}


and here is the assembly code generated by GCC:
.file "test_gcc.c"
.def ___main; .scl 2; .type 32; .endef
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
pushl %ebp
movl %esp, %ebp
subl $40, %esp
andl $-16, %esp
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -20(%ebp)
movl -20(%ebp), %eax
call __alloca
call ___main
movl $0, -4(%ebp)
movl $0, -4(%ebp)
L2:
cmpl $99999999, -4(%ebp)
jg L3
fildl -4(%ebp)
fstpl (%esp)
call _cos
fstpl -16(%ebp)
leal -4(%ebp), %eax
incl (%eax)
jmp L2
L3:
leave
ret
.def _cos; .scl 2; .type 32; .endef


My assembly code is this:
include \masm32\include\masm32rt.inc

.data
A dd 1
B dq 0.0
buffer db 100 dup (0),13,10,0
eax_buffer db 100 dup(0),13,10,0
message db '(cos) The value of eax = %d',13,10,0

.code

main:
    mov eax,A
   
loop_here:
    cmp eax, 100000000
    je last
    add eax,1
    mov A,eax
    push eax
    call my_cos
    pop eax
    jmp loop_here

last:
    invoke ExitProcess,0

my_cos proc
    fild A
    fcos
    fst B
    ffree st(5)
    ret
my_cos endp

end main


I checked, and my code produces the same results as the C code.
But why is it 2x faster? Is the implementation of _cos in the C library slower?

MichaelW

It looks like your C code was compiled without any specific optimization. The effect of the optimization depends on the optimization flag and the particular code, but it could make a large difference in speed.

http://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/Optimize-Options.html#Optimize-Options

I didn't test the speed, but this is the assembly output I get using -O3:

.file "test.c"
.def ___main; .scl 2; .type 32; .endef
.text
.p2align 4,,15
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
pushl %ebx
xorl %ebx, %ebx
pushl %edx
andl $-16, %esp
call __alloca
call ___main
.p2align 4,,7
L6:
pushl %eax
pushl %eax
pushl %ebx
incl %ebx
fildl (%esp)
subl $4, %esp
fstpl (%esp)
call _cos
fstp %st(0)
addl $16, %esp
cmpl $99999999, %ebx
jle L6
movl -4(%ebp), %ebx
leave
ret
.def _cos; .scl 2; .type 32; .endef


MichaelW

This code compares the cycle count for the MSVCRT cos function to the cycle count for an asm version, in both cases including the overhead for the call and for storing the result to memory:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      rads  REAL8 0.0
      rval  REAL8 0.0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
cos_asm proc x:REAL8
    fld x
    fcos
    ret
cos_asm endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    fldpi
    push 6
    fild DWORD PTR [esp]
    add esp, 4
    fdiv
    fstp rads

    invoke cos_asm, rads
    fstp rval
    invoke crt_printf,chr$("%f%c"),rval,10

    invoke crt_cos, rads
    fstp rval
    invoke crt_printf,chr$("%f%c%c"),rval,10,10

    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke cos_asm, rads
      fstp rval
    counter_end
    print ustr$(eax)," cycles, cos_asm",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke crt_cos, rads
      fstp rval
    counter_end
    print ustr$(eax)," cycles, crt_cos",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


And the typical results on my P3:

0.866025
0.866025

122 cycles, cos_asm
361 cycles, crt_cos


I think at least some of the difference is due to the error checking in the CRT function, for example checking for a loss of precision when the argument is too large, but I'm not sure that this could account for a 3X difference.
