
make it fast, fast, fast ... fpu

Started by thomas_remkus, July 31, 2009, 02:43:37 AM


thomas_remkus

I'm hoping for some help. I'm working with some people on a sample project "http://pastebin.ca/1513115" and we're trying to see if this is the best we can get.  Fast is important.

The goal in the loop is to set two variable locations and then add them to get a sum. That's it. Are we missing anything?

Neo

Quote from: thomas_remkus on July 31, 2009, 02:43:37 AM
I'm hoping for some help. I'm working with some people on a sample project "http://pastebin.ca/1513115" and we're trying to see if this is the best we can get.  Fast is important.

The goal in the loop is to set two variable locations and then add them to get a sum. That's it. Are we missing anything?
... um, I'm not sure if you're missing anything, but I'm definitely missing a few things:

  • Why are you wanting the sum of the same two numbers to be fast?  Do you instead want the sum of two arrays or the total sum of a single array to be fast, or am I completely lost?  The reason being that it's a completely different thing to optimize for.
  • Why are you using the FPU if you want it to be fast?  Using SSE instructions instead would be much faster, especially if you want to sum up arrays, because you can add 4 adjacent numbers in one instruction; see the sketch below.
Some clarification on points like that would be useful.
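
For illustration, here's a minimal MASM-syntax sketch of the packed idea (the array names are made up, and it assumes an SSE-capable ML with .XMM enabled and 16-byte aligned data; untested):

    .686
    .XMM
    .data
        align 16
        arrA REAL4 1.0, 2.0, 3.0, 4.0
        arrB REAL4 10.0, 20.0, 30.0, 40.0
        arrC REAL4 4 dup(0.0)
    .code
        movaps xmm0, xmmword ptr arrA   ; load 4 packed singles from arrA
        addps  xmm0, xmmword ptr arrB   ; all 4 additions in one instruction
        movaps xmmword ptr arrC, xmm0   ; store the 4 sums to arrC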

dedndave

not gonna make it faster, but i don't see finit
in theory, you are supposed to finit at the beginning of a program that uses fpu
i am guessing that windows hands you the fpu in that state, already
but, neo is the guy to listen to - use sse, instead

thomas_remkus

Quote... um, I'm not sure if you're missing anything, but I'm definitely missing a few things:

    * Why are you wanting the sum of the same two numbers to be fast?  Do you instead want the sum of two arrays or the total sum of a single array to be fast, or am I completely lost?  The reason being that it's a completely different thing to optimize for.
    * Why are you using the FPU if you want it to be fast?  Using SSE instructions instead would be much faster, especially if you want to sum up arrays, because you can add 4 adjacent numbers in one instruction.

Some clarification on points like that would be useful.

It started out as a language perf test of sorts. We were comparing Python to C initially. Then we moved to other languages. Because we are working on a Linux box it's NASM for the asm flavor. I'm not a NASM person and I'm not studied in the art of FPU ... so I'm here for help.

We are trying to see how fast you can take values from variables, and add them together. The loop is just to give it weight for the output. We are not looking for an array of values or to unroll the calculations. Rather, how to get the instructions to be as fast as we can get them. Sort of an example of best technique.

SSE? So would we be able to put the value of A into one SSE place, the value of B into another, then add them together that way? For just the test we need to get the value back into a variable.

PLEASE NOTE, this is not for school or anything else. It's just about learning the best (meaning fastest) way to perform the operation within certain limits.

MichaelW

Using the FPU I can't see very many ways to do it.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      a   dq 123.333
      b   dq 1234533.987
      sum dq ?
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        fld a
        fadd b
        fstp sum
      ENDM
      fwait
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        fld a
        fld b
        fadd
        fstp sum
      ENDM
      fwait
    counter_end
    print ustr$(eax)," cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Running on a P3:

252 cycles
199 cycles


eschew obfuscation

Neo

Quote from: thomas_remkus on July 31, 2009, 03:22:28 AM
It started out as a language perf test of sorts. We were comparing Python to C initially. Then we moved to other languages. Because we are working on a Linux box it's NASM for the asm flavor. I'm not a NASM person and I'm not studied in the art of FPU ... so I'm here for help.

We are trying to see how fast you can take values from variables, and add them together. The loop is just to give it weight for the output. We are not looking for an array of values or to unroll the calculations. Rather, how to get the instructions to be as fast as we can get them. Sort of an example of best technique.
Neat idea.  I've done a bit of Java vs. VC++ vs. g++ vs. assembly performance comparison, in the form of 4 versions of the Code Cortex screensaver.  One thing to note about looking for speedup between languages is that the time for individual operations on primitives probably isn't very representative of the performance difference in a real application.  For example, a C compiler might realize that the addition is the same every time and decide to only do it once (or not at all because it's a constant value), whereas in the assembly, you've specified that you want the values loaded from memory, added with the FPU, and stored back to memory.
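
To make that concrete, a hoisting compiler could legally reduce the whole benchmark loop to the equivalent of something like this (hypothetical, for illustration only):

    fld a          ; loop-invariant work done once up front...
    fadd b
    fstp sum       ; ...with the now-empty loop removed entirely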

QuoteSSE? So would we be able to put the value of A into one SSE place, the value of B into another, then add them together that way? For just the test we need to get the value back into a variable.
SSE instructions let you do 4 operations on floats at the same time, so you could load in a vector of 4 floats, A, and add it to a vector of 4 floats, B, in slightly less time than doing the one addition with the FPU.  There are also SSE instructions that only do one operation at a time, and they'll run in about the same time as the 4-operation version.  SSE coding takes some effort to get comfortable with, but it can give big gains.
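
As a rough sketch of the scalar route for dq (double) variables like yours, assuming an SSE2 CPU and an SSE2-aware ML (.686/.XMM), and untested:

    movsd xmm0, varA   ; load the double A into the low qword of xmm0
    addsd xmm0, varB   ; scalar double add: xmm0 = A + B
    movsd varC, xmm0   ; store the sum back to a variable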

Cheers!  :U


thomas_remkus

MichaelW: I think I'm getting the opposite results from what you had.  Here's my current code. I need to see about getting an SSE version of this to understand the difference.

include \masm32\include\masm32rt.inc
.686

.data
    align 16
varA      dq 123.333
varB      dq 1234533.987
varC      dq 0

.code
start:
finit
print chr$("  fld then add: ")
invoke GetTickCount
mov ebx, eax
mov ecx, 429496729

__beginFA:
fld varA
fadd varB
fstp varC
dec ecx
jnz __beginFA

invoke GetTickCount
sub eax, ebx
print str$(eax)
print chr$(13, 10)

print chr$("  fld fld add:  ")
invoke GetTickCount
mov ebx, eax
mov ecx, 429496729

__beginFFA:
fld varA
fld varB
fadd
fstp varC
dec ecx
jnz __beginFFA

invoke GetTickCount
sub eax, ebx
print str$(eax)
print chr$(13, 10)

    xor eax, eax
    invoke ExitProcess, eax
end start


For some reason, when we tweak the NASM code it can run in 1/2 a second while these are about 7-9 times slower. But the NASM code (from what I can tell) is the same stuff. I thought I'd be able to get that 1/2 second on here too. It's really odd.

dedndave

hiya Thomas,
  our "Laboratory" sub-forum may be of great interest to you
MichaelW has written the timing macros we all currently use
(see the first post of the first thread in that sub-forum)
you will also find many, many SSE examples and timings in that sub-forum
MichaelW, lingo, drizz, jj2007, and a few others love to play with both timings and SSE
they tend to stay with MMX/SSE2 code for the most part

MichaelW

Thomas,

Here is a version that reports times in the same categories as the time command (see Performance Management Guide). Time measured with GetTickCount is what the time command would report as "real" time.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data

      sysTime1      dq ?
      sysTime2      dq ?
      kernelTime1   dq ?
      kernelTime2   dq ?
      userTime1     dq ?
      userTime2     dq ?

      dqJunk        dq ?

      a             dq 123.333
      b             dq 1234533.987
      sum           dq ?

      hProcess      dd ?

    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    invoke GetCurrentProcess
    mov hProcess, eax

    invoke Sleep, 4000

    ;--------------------------------------------
    ; All times are in 100-nanosecond intervals.
    ;--------------------------------------------

    invoke GetSystemTimeAsFileTime, ADDR sysTime1
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
                            ADDR kernelTime1, ADDR userTime1
     mov ebx, 429496729
     align 16
  @@:
    fld a
    fadd b
    fstp sum
    dec ebx
    jnz @B

    invoke GetSystemTimeAsFileTime, ADDR sysTime2
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
                            ADDR kernelTime2, ADDR userTime2
    fild sysTime2
    fild sysTime1
    fsub                  ; calculate system time delta
    fld8 100.0e-9         
    fmul                  ; adjust to seconds
    fstp sysTime1

    invoke crt_printf, cfm$("real\t%fs\n"), sysTime1

    fild userTime2
    fild userTime1
    fsub                  ; calculate user time delta
    fld8 100.0e-9
    fmul                  ; adjust to seconds
    fstp userTime1

    invoke crt_printf, cfm$("user\t%fs\n"), userTime1

    fild kernelTime2
    fild kernelTime1
    fsub                  ; calculate kernel time delta
    fld8 100.0e-9
    fmul                  ; adjust to seconds
    fstp kernelTime1

    invoke crt_printf, cfm$("sys\t%fs\n\n"), kernelTime1

    ;==================================================

    invoke GetSystemTimeAsFileTime, ADDR sysTime1
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
                            ADDR kernelTime1, ADDR userTime1
     mov ebx, 429496729
     align 16
  @@:
    fld a
    fld b
    fadd
    fstp sum
    dec ebx
    jnz @B

    invoke GetSystemTimeAsFileTime, ADDR sysTime2
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
                            ADDR kernelTime2, ADDR userTime2
    fild sysTime2
    fild sysTime1
    fsub
    fld8 100.0e-9
    fmul
    fstp sysTime1

    invoke crt_printf, cfm$("real\t%fs\n"), sysTime1

    fild userTime2
    fild userTime1
    fsub
    fld8 100.0e-9
    fmul
    fstp userTime1

    invoke crt_printf, cfm$("user\t%fs\n"), userTime1

    fild kernelTime2
    fild kernelTime1
    fsub
    fld8 100.0e-9
    fmul
    fstp kernelTime1

    invoke crt_printf, cfm$("sys\t%fs\n\n"), kernelTime1

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Result running on my 500MHz P3:

real    3.625213s
user    3.615198s
sys     0.000000s

real    2.733931s
user    2.723917s
sys     0.000000s


The MASM code is 7-9 times slower running on the same system as the NASM code, or 7-9 times slower running on a slower system?
eschew obfuscation

dedndave

well, obviously the nasm program is somehow different from the masm program
you should load em up in olly and see what the difference is
if you have a simple example pair of programs (one nasm, one masm), you may be able to get one of us to look at it
i am gonna venture a guess, here
due to some difference in assembler syntax, the nasm program is not doing the same stuff as the masm one

MichaelW

What else could NASM be doing with:

    mov ebx, 429496729
.begin:
    fld qword [a]
    fadd qword [b]
    fstp qword [c]
    dec ebx
    cmp ebx, 0
    jne .begin

eschew obfuscation

dedndave

well - i dunno - lol
he hasn't shown us any files
gotta be sumpin
if masm was such a bad assembler, we would not be using it to time algorithms
this forum would be called nasm32 - lol

what are your thoughts, Michael ?

BATSoftware

An assembler is an assembler - there are NO differences in the binary code generated among assemblers for the same target CPU. Compilers <> assemblers. With any assembler, you get the exact same instruction for a given mnemonic. The syntax of the mnemonic varies, but the result is constant. Something fishy with the timing routine, or your code got context switched out.

Rockoon

Quote from: BATSoftware on August 01, 2009, 01:35:25 PM
An assembler is an assembler - there are NO differences in the binary code generated among assemblers for the same target CPU. Compilers <> assemblers. With any assembler, you get the exact same instruction for a given mnemonic. The syntax of the mnemonic varies, but the result is constant. Something fishy with the timing routine, or your code got context switched out.

This isn't quite correct. There isn't a perfect correspondence between machine code and Intel-syntax assembler. Intel-syntax assembler isn't specific enough to let you choose the opcode you want to emit in all cases (for example, there are two ways to encode 'shl eax, 1').
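
For reference, the two byte sequences for that example (per the Intel opcode map) are:

    shl eax, 1    ; D1 E0     - the dedicated shift-by-1 form
    shl eax, 1    ; C1 E0 01  - the shift-by-imm8 form with a count of 1

Both lines are the identical Intel-syntax instruction; the assembler has to pick an encoding for you.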

When you get right down to it, there are a lot more opcodes than assembler instructions (I'm estimating 5x to 10x as many), and many of them are ambiguous with one another in some cases.

Then you get into the fact that some assemblers try to be 'smart' (this includes masm) and interchange between two otherwise equivalent opcodes because one happens to be shorter or longer than another (sometimes longer is desirable to maintain alignment, but usually 'smart' assemblers just pick shorter encodings while 'dumb' ones favor one or the other arbitrarily). This fact allows one to determine with considerable accuracy *which* assembler produced a given binary.
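
A concrete case of that 'smart' shortening (encodings again from the Intel opcode map):

    add eax, 1    ; 83 C0 01           - sign-extended imm8 form (3 bytes)
    add eax, 1    ; 05 01 00 00 00     - eax-specific imm32 form (5 bytes)
    add eax, 1    ; 81 C0 01 00 00 00  - generic r/m32, imm32 form (6 bytes)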

And then there are instruction capabilities that aren't addressed by the Intel syntax, which was the case for a very long time with instructions like 'aad' and 'aam' which, lo and behold, have actually supported bases other than 10 since the 8086/8088, but it wasn't until the 486 or so that assemblers started to recognize these other forms.

We assembly programmers like to think that we are dealing with the bare metal, but the reality is that we are not. Such an assembler simply doesn't exist.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.
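
(For anyone wondering what rcl/rcr buy you: the classic use is multi-word shifts through the carry flag, e.g. shifting a 64-bit value held in edx:eax left by one bit:

    shl eax, 1    ; low dword: the top bit falls into CF
    rcl edx, 1    ; high dword: CF rotates in as the new low bit

Compilers essentially never generate that pair on their own.)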

thomas_remkus

I'm finding that if I tell it that the subsystem is "console" versus "windows" then this is my main difference. My MASM was being compiled as "console" and the NASM as "windows". So when I changed to "windows" for the MASM then I got my desired performance. Here's the MASM code:

.586
.model flat, stdcall
option casemap:none

include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

MAX_LOOP_LIMIT equ 4294967295
                   
.data
varA dq 123.333
varB dq 1234533.987
varC dq 0

.code
start:
finit
mov ebx, MAX_LOOP_LIMIT

__begin:
fld varA
fadd varB
fstp varC
dec ebx
jnz __begin

    invoke ExitProcess, 0
end start


... and here's my NASM code ...

%include '..\..\..\inc\nasmx.inc'
%include '..\..\..\inc\win32\windows.inc'
%include '..\..\..\inc\win32\kernel32.inc'
%include '..\..\..\inc\win32\user32.inc'

entry    test_fpu

[section .data]
a:      dq 123.333
b:      dq 1234533.987
c:      dq 0

[section .text]
proc     test_fpu
finit
mov ebx, 4294967295

.begin:
fld qword [a]
fadd qword [b]
fstp qword [c]
dec ebx
jnz .begin

    invoke ExitProcess, dword NULL
endproc


My bat to create the MASM is:

@echo off
if exist code2.obj del code2.obj
\masm32\bin\ml /c /coff code2.asm

if exist code2.obj goto linkit
goto endend

:linkit
if exist code2.exe del code2.exe
\masm32\bin\link /subsystem:windows /libpath:c:\masm32\lib code2.obj
del code2.obj

:endend
pause


... and my NASM build is ...

@echo off
set file="DEMO1"
if exist %file%.obj del %file%.obj
if not exist %file%.asm goto errasm

..\..\..\bin\nasm -f win32 %file%.asm -o %file%.obj
if errorlevel 1 goto errasm

..\..\..\bin\GoLink.exe /entry _main DEMO1.obj kernel32.dll user32.dll
if errorlevel 1 goto errlink

if exist %file%.obj del %file%.obj
goto TheEnd

:errlink
echo _
echo Link error
pause
goto TheEnd

:errasm
echo _
echo Assembly Error
pause
goto TheEnd

:TheEnd
echo _


They are remarkably the same when it comes to the disasm. The only difference from the disasm that I see between the two is that the MASM has an additional JMP after the call to ExitProcess and the NASM does not. Does anyone know why there is a JMP after the CALL?