
make it fast, fast, fast ... fpu

Started by thomas_remkus, July 31, 2009, 02:43:37 AM


thomas_remkus

I'm hoping for some help. I'm working with some people on a sample project "http://pastebin.ca/1513115" and we're trying to see if this is the best we can get.  Fast is important.

The goal in the loop is to set two variable locations and then add them to get a sum. That's it. Are we missing anything?

Neo

Quote from: thomas_remkus on July 31, 2009, 02:43:37 AM
I'm hoping for some help. I'm working with some people on a sample project "http://pastebin.ca/1513115" and we're trying to see if this is the best we can get.  Fast is important.

The goal in the loop is to set two variable locations and then add them to get a sum. That's it. Are we missing anything?
... um, I'm not sure if you're missing anything, but I'm definitely missing a few things:

  • Why are you wanting the sum of the same two numbers to be fast?  Do you instead want the sum of two arrays or the total sum of a single array to be fast, or am I completely lost?  The reason being that it's a completely different thing to optimize for.
  • Why are you using the FPU if you want it to be fast?  Using SSE instructions instead would be much faster, especially if you want to sum up arrays, because you can add 4 adjacent numbers in one instruction; see the sketch below.
Some clarification on points like that would be useful.
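
For illustration, here's a minimal MASM-syntax sketch of the packed idea (the array names are made up, and it assumes an SSE-capable ML with .XMM enabled and 16-byte aligned data; untested):

    .686
    .XMM
    .data
        align 16
        arrA REAL4 1.0, 2.0, 3.0, 4.0
        arrB REAL4 10.0, 20.0, 30.0, 40.0
        arrC REAL4 4 dup(0.0)
    .code
        movaps xmm0, xmmword ptr arrA   ; load 4 packed singles from arrA
        addps  xmm0, xmmword ptr arrB   ; all 4 additions in one instruction
        movaps xmmword ptr arrC, xmm0   ; store the 4 sums to arrC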

dedndave

not gonna make it faster, but i don't see finit
in theory, you are supposed to finit at the beginning of a program that uses fpu
i am guessing that windows hands you the fpu in that state, already
but, neo is the guy to listen to - use sse, instead

thomas_remkus

Quote... um, I'm not sure if you're missing anything, but I'm definitely missing a few things:

    * Why are you wanting the sum of the same two numbers to be fast?  Do you instead want the sum of two arrays or the total sum of a single array to be fast, or am I completely lost?  The reason being that it's a completely different thing to optimize for.
    * Why are you using the FPU if you want it to be fast?  Using SSE instructions instead would be much faster, especially if you want to sum up arrays, because you can add 4 adjacent numbers in one instruction.

Some clarification on points like that would be useful.

It started out as a language perf test of sorts. We were comparing Python to C initially. Then we moved to other languages. Because we are working on a Linux box it's NASM for the asm flavor. I'm not a NASM person and I'm not studied in the art of FPU ... so I'm here for help.

We are trying to see how fast you can take values from variables, and add them together. The loop is just to give it weight for the output. We are not looking for an array of values or to unroll the calculations. Rather, how to get the instructions to be as fast as we can get them. Sort of an example of best technique.

SSE? So would we be able to put the value of A into one SSE place, the value of B into another, then add them together that way? For just the test we need to get the value back into a variable.

PLEASE NOTE, this is not for school or anything else. It's just about learning the best (meaning fastest) way to perform the operation within certain limits.

MichaelW

Using the FPU I can't see very many ways to do it.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      a   dq 123.333
      b   dq 1234533.987
      sum dq ?
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        fld a
        fadd b
        fstp sum
      ENDM
      fwait
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        fld a
        fld b
        fadd
        fstp sum
      ENDM
      fwait
    counter_end
    print ustr$(eax)," cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Running on a P3:

252 cycles
199 cycles


eschew obfuscation

Neo

Quote from: thomas_remkus on July 31, 2009, 03:22:28 AM
It started out as a language perf test of sorts. We were comparing Python to C initially. Then we moved to other languages. Because we are working on a Linux box it's NASM for the asm flavor. I'm not a NASM person and I'm not studied in the art of FPU ... so I'm here for help.

We are trying to see how fast you can take values from variables, and add them together. The loop is just to give it weight for the output. We are not looking for an array of values or to unroll the calculations. Rather, how to get the instructions to be as fast as we can get them. Sort of an example of best technique.
Neat idea.  I've done a bit of Java vs. VC++ vs. g++ vs. assembly performance comparison, in the form of 4 versions of the Code Cortex screensaver.  One thing to note about looking for speedup between languages is that the time for individual operations on primitives probably isn't very representative of the performance difference in a real application.  For example, a C compiler might realize that the addition is the same every time and decide to only do it once (or not at all because it's a constant value), whereas in the assembly, you've specified that you want the values loaded from memory, added with the FPU, and stored back to memory.
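
To make that concrete, a hoisting compiler could legally reduce the whole benchmark loop to the equivalent of something like this (hypothetical, for illustration only):

    fld a          ; loop-invariant work done once up front...
    fadd b
    fstp sum       ; ...with the now-empty loop removed entirely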

QuoteSSE? So would we be able to put the value of A into one SSE place, the value of B into another, then add them together that way? For just the test we need to get the value back into a variable.
SSE instructions let you do 4 operations on floats at the same time, so you could load in a vector of 4 floats, A, and add it to a vector of 4 floats, B, in slightly less time than doing the one addition with the FPU.  There are also SSE instructions that only do one operation at a time, and they'll run in about the same time as the 4-operation version.  SSE coding takes some effort to get comfortable with, but it can give big gains.
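
As a rough sketch of the scalar route for dq (double) variables like yours, assuming an SSE2 CPU and an SSE2-aware ML (.686/.XMM), and untested:

    movsd xmm0, varA   ; load the double A into the low qword of xmm0
    addsd xmm0, varB   ; scalar double add: xmm0 = A + B
    movsd varC, xmm0   ; store the sum back to a variable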

Cheers!  :U


thomas_remkus

MichaelW: I think I'm getting the opposite results from what you had.  Here's my current code. I need to see about getting an SSE version of this to understand the difference.

include \masm32\include\masm32rt.inc
.686

.data
    align 16
varA      dq 123.333
varB      dq 1234533.987
varC      dq 0

.code
start:
finit
print chr$("  fld then add: ")
invoke GetTickCount
mov ebx, eax
mov ecx, 429496729

__beginFA:
fld varA
fadd varB
fstp varC
dec ecx
jnz __beginFA

invoke GetTickCount
sub eax, ebx
print str$(eax)
print chr$(13, 10)

print chr$("  fld fld add:  ")
invoke GetTickCount
mov ebx, eax
mov ecx, 429496729

__beginFFA:
fld varA
fld varB
fadd
fstp varC
dec ecx
jnz __beginFFA

invoke GetTickCount
sub eax, ebx
print str$(eax)
print chr$(13, 10)

    xor eax, eax
    invoke ExitProcess, eax
end start


For some reason, when we tweak the NASM code it can run in 1/2 a second while these are about 7-9 times slower. But the NASM code (from what I can tell) is the same stuff. I thought I'd be able to get that 1/2 second on here too. It's really odd.

dedndave

hiya Thomas,
  our "Laboratory" sub-forum may be of great interest to you
MichaelW has written the timing macros we all currently use
(see the first post of the first thread in that sub-forum)
you will also find many, many SSE examples and timings in that sub-forum
MichaelW, lingo, drizz, jj2007, and a few others love to play with both timings and SSE
they tend to stay with MMX/SSE2 code for the most part

MichaelW

Thomas,

Here is a version that reports times in the same categories as the time command (see Performance Management Guide). Time measured with GetTickCount is what the time command would report as "real" time.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data

      sysTime1      dq ?
      sysTime2      dq ?
      kernelTime1   dq ?
      kernelTime2   dq ?
      userTime1     dq ?
      userTime2     dq ?

      dqJunk        dq ?

      a             dq 123.333
      b             dq 1234533.987
      sum           dq ?

      hProcess      dd ?

    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    invoke GetCurrentProcess
    mov hProcess, eax

    invoke Sleep, 4000

    ;--------------------------------------------
    ; All times are in 100-nanosecond intervals.
    ;--------------------------------------------

    invoke GetSystemTimeAsFileTime, ADDR sysTime1
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
                            ADDR kernelTime1, ADDR userTime1
     mov ebx, 429496729
     align 16
  @@:
    fld a
    fadd b
    fstp sum
    dec ebx
    jnz @B

    invoke GetSystemTimeAsFileTime, ADDR sysTime2
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
                            ADDR kernelTime2, ADDR userTime2
    fild sysTime2
    fild sysTime1
    fsub                  ; calculate system time delta
    fld8 100.0e-9         
    fmul                  ; adjust to seconds
    fstp sysTime1

    invoke crt_printf, cfm$("real\t%fs\n"), sysTime1

    fild userTime2
    fild userTime1
    fsub                  ; calculate user time delta
    fld8 100.0e-9
    fmul                  ; adjust to seconds
    fstp userTime1

    invoke crt_printf, cfm$("user\t%fs\n"), userTime1

    fild kernelTime2
    fild kernelTime1
    fsub                  ; calculate kernel time delta
    fld8 100.0e-9
    fmul                  ; adjust to seconds
    fstp kernelTime1

    invoke crt_printf, cfm$("sys\t%fs\n\n"), kernelTime1

    ;==================================================

    invoke GetSystemTimeAsFileTime, ADDR sysTime1
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
                            ADDR kernelTime1, ADDR userTime1
     mov ebx, 429496729
     align 16
  @@:
    fld a
    fld b
    fadd
    fstp sum
    dec ebx
    jnz @B

    invoke GetSystemTimeAsFileTime, ADDR sysTime2
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
                            ADDR kernelTime2, ADDR userTime2
    fild sysTime2
    fild sysTime1
    fsub
    fld8 100.0e-9
    fmul
    fstp sysTime1

    invoke crt_printf, cfm$("real\t%fs\n"), sysTime1

    fild userTime2
    fild userTime1
    fsub
    fld8 100.0e-9
    fmul
    fstp userTime1

    invoke crt_printf, cfm$("user\t%fs\n"), userTime1

    fild kernelTime2
    fild kernelTime1
    fsub
    fld8 100.0e-9
    fmul
    fstp kernelTime1

    invoke crt_printf, cfm$("sys\t%fs\n\n"), kernelTime1

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Result running on my 500MHz P3:

real    3.625213s
user    3.615198s
sys     0.000000s

real    2.733931s
user    2.723917s
sys     0.000000s


The MASM code is 7-9 times slower running on the same system as the NASM code, or 7-9 times slower running on a slower system?
eschew obfuscation

dedndave

well, obviously the nasm program is somehow different from the masm program
you should load em up in olly and see what the difference is
if you have a simple example pair of programs (one nasm, one masm), you may be able to get one of us to look at it
i am gonna venture a guess, here
due to some difference in assembler syntax, the nasm program is not doing the same stuff as the masm one

MichaelW

What else could NASM be doing with:

    mov ebx, 429496729
.begin:
    fld qword [a]
    fadd qword [b]
    fstp qword [c]
    dec ebx
    cmp ebx, 0
    jne .begin

eschew obfuscation

dedndave

well - i dunno - lol
he hasn't shown us any files
gotta be sumpin
if masm was such a bad assembler, we would not be using it to time algorithms
this forum would be called nasm32 - lol

what are your thoughts, Michael ?

BATSoftware

An assembler is an assembler - there are NO differences in the binary code generated among assemblers for the same target CPU. Compilers <> assemblers. With any assembler, you get the exact same instruction for a given mnemonic. The syntax of the mnemonic varies, but the result is constant. Something fishy with the timing routine, or your code got context switched out.

Rockoon

Quote from: BATSoftware on August 01, 2009, 01:35:25 PM
An assembler is an assembler - there are NO differences in the binary code generated among assemblers for the same target CPU. Compilers <> assemblers. With any assembler, you get the exact same instruction for a given mnemonic. The syntax of the mnemonic varies, but the result is constant. Something fishy with the timing routine, or your code got context switched out.

This isn't quite correct. There isn't a perfect correspondence between machine code and Intel-syntax assembler. Intel-syntax assembler isn't specific enough to let you choose the opcode you want to emit in all cases (for example, there are two ways to encode 'shl eax, 1').
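
For reference, the two byte sequences for that example (per the Intel opcode map) are:

    shl eax, 1    ; D1 E0     - the dedicated shift-by-1 form
    shl eax, 1    ; C1 E0 01  - the shift-by-imm8 form with a count of 1

Both lines are the identical Intel-syntax instruction; the assembler has to pick an encoding for you.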

When you get right down to it, there are a lot more opcodes than assembler instructions (I'm estimating 5x to 10x as many), and many of them are ambiguous with one another in some cases.

Then you get into the fact that some assemblers try to be 'smart' (this includes masm) and interchange between two otherwise equivalent opcodes because one happens to be shorter or longer than another (sometimes longer is desirable to maintain alignment, but usually 'smart' assemblers just pick shorter encodings while 'dumb' ones favor one or the other arbitrarily). This fact allows one to determine with considerable accuracy *which* assembler produced a given binary.
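
A concrete case of that 'smart' shortening (encodings again from the Intel opcode map):

    add eax, 1    ; 83 C0 01           - sign-extended imm8 form (3 bytes)
    add eax, 1    ; 05 01 00 00 00     - eax-specific imm32 form (5 bytes)
    add eax, 1    ; 81 C0 01 00 00 00  - generic r/m32, imm32 form (6 bytes)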

And then there are instruction capabilities that aren't addressed by the Intel syntax, which was the case for a very long time with instructions like 'aad' and 'aam' which, lo and behold, have actually supported bases other than 10 since the 8086/8088, but it wasn't until the 486 or so that assemblers started to recognize these other forms.

We assembly programmers like to think that we are dealing with the bare metal, but the reality is that we are not. Such an assembler simply doesn't exist.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.
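
(For anyone wondering what rcl/rcr buy you: the classic use is multi-word shifts through the carry flag, e.g. shifting a 64-bit value held in edx:eax left by one bit:

    shl eax, 1    ; low dword: the top bit falls into CF
    rcl edx, 1    ; high dword: CF rotates in as the new low bit

Compilers essentially never generate that pair on their own.)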

thomas_remkus

I'm finding that if I tell it that the subsystem is "console" versus "windows" then this is my main difference. My MASM was being compiled as "console" and the NASM as "windows". So when I changed to "windows" for the MASM then I got my desired performance. Here's the MASM code:

.586
.model flat, stdcall
option casemap:none

include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

MAX_LOOP_LIMIT equ 4294967295
                   
.data
varA dq 123.333
varB dq 1234533.987
varC dq 0

.code
start:
finit
mov ebx, MAX_LOOP_LIMIT

__begin:
fld varA
fadd varB
fstp varC
dec ebx
jnz __begin

    invoke ExitProcess, 0
end start


... and here's my NASM code ...

%include '..\..\..\inc\nasmx.inc'
%include '..\..\..\inc\win32\windows.inc'
%include '..\..\..\inc\win32\kernel32.inc'
%include '..\..\..\inc\win32\user32.inc'

entry    test_fpu

[section .data]
a:      dq 123.333
b:      dq 1234533.987
c:      dq 0

[section .text]
proc     test_fpu
finit
mov ebx, 4294967295

.begin:
fld qword [a]
fadd qword [b]
fstp qword [c]
dec ebx
jnz .begin

    invoke ExitProcess, dword NULL
endproc


My bat to create the MASM is:

@echo off
if exist code2.obj del code2.obj
\masm32\bin\ml /c /coff code2.asm

if exist code2.obj goto linkit
goto endend

:linkit
if exist code2.exe del code2.exe
\masm32\bin\link /subsystem:windows /libpath:c:\masm32\lib code2.obj
del code2.obj

:endend
pause


... and my NASM build is ...

@echo off
set file="DEMO1"
if exist %file%.obj del %file%.obj
if not exist %file%.asm goto errasm

..\..\..\bin\nasm -f win32 %file%.asm -o %file%.obj
if errorlevel 1 goto errasm

..\..\..\bin\GoLink.exe /entry _main DEMO1.obj kernel32.dll user32.dll
if errorlevel 1 goto errlink

if exist %file%.obj del %file%.obj
goto TheEnd

:errlink
echo _
echo Link error
pause
goto TheEnd

:errasm
echo _
echo Assembly Error
pause
goto TheEnd

:TheEnd
echo _


They are remarkably the same when it comes to the disasm. The only difference from the disasm that I see between the two is that the MASM has an additional JMP after the call to ExitProcess and the NASM does not. Does anyone know why there is a JMP after the CALL?