I'm hoping for some help. I'm working with some people on a sample project "http://pastebin.ca/1513115" and we're trying to see if this is the best we can get. Fast is important.
The goal in the loop is to set two variable locations and then add them to get a sum. That's it. Are we missing anything?
Quote from: thomas_remkus on July 31, 2009, 02:43:37 AM
I'm hoping for some help. I'm working with some people on a sample project "http://pastebin.ca/1513115" and we're trying to see if this is the best we can get. Fast is important.
The goal in the loop is to set two variable locations and then add them to get a sum. That's it. Are we missing anything?
... um, I'm not sure if you're missing anything, but I'm definitely missing a few things:
- Why are you wanting the sum of the same two numbers to be fast? Do you instead want the sum of two arrays or the total sum of a single array to be fast, or am I completely lost? The reason being that it's a completely different thing to optimize for.
- Why are you using the FPU if you want it to be fast? Using SSE instructions instead would be much faster, especially if you want to sum up arrays, because you can add 4 adjacent numbers in one instruction.
Some clarification on points like that would be useful.
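For reference, the packed idea looks something like this in NASM (a minimal sketch with made-up labels, assuming SSE support and 16-byte-aligned data):

section .data
align 16
vecA: dd 1.0, 2.0, 3.0, 4.0
vecB: dd 10.0, 20.0, 30.0, 40.0
vecS: dd 0.0, 0.0, 0.0, 0.0

section .text
    movaps xmm0, [vecA]    ; load four packed single-precision floats
    addps  xmm0, [vecB]    ; four additions in a single instruction
    movaps [vecS], xmm0    ; store the four sums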
not gonna make it faster, but i don't see finit
in theory, you are supposed to finit at the beginning of a program that uses fpu
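something like this, right at the entry point (just a sketch):

start:
    finit       ; put the fpu in a known initial state before any fpu code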
i am guessing that windows hands you the fpu in that state, already
but, neo is the guy to listen to - use sse, instead
Quote
... um, I'm not sure if you're missing anything, but I'm definitely missing a few things:
* Why are you wanting the sum of the same two numbers to be fast? Do you instead want the sum of two arrays or the total sum of a single array to be fast, or am I completely lost? The reason being that it's a completely different thing to optimize for.
* Why are you using the FPU if you want it to be fast? Using SSE instructions instead would be much faster, especially if you want to sum up arrays, because you can add 4 adjacent numbers in one instruction.
Some clarification on points like that would be useful.
It started out as a language perf test of sorts. We were comparing Python to C initially. Then we moved to other languages. Because we are working on a Linux box it's NASM for the asm flavor. I'm not a NASM person and I'm not studied in the art of FPU ... so I'm here for help.
We are trying to see how fast you can take values from variables, and add them together. The loop is just to give it weight for the output. We are not looking for an array of values or to unroll the calculations. Rather, how to get the instructions to be as fast as we can get them. Sort of an example of best technique.
SSE? So would we be able to put the value of A into one SSE place, the value of B into another, then add them together that way? For just the test we need to get the value back into a variable.
PLEASE NOTE, this is not for school or anything else. We're just trying to learn the best (meaning, fastest) way to perform the operation within certain limits.
Using the FPU I can't see very many ways to do it.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
    a   dq 123.333
    b   dq 1234533.987
    sum dq ?
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 4000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        fld a
        fadd b
        fstp sum
      ENDM
      fwait
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        fld a
        fld b
        fadd
        fstp sum
      ENDM
      fwait
    counter_end
    print ustr$(eax)," cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Running on a P3:
252 cycles
199 cycles
Quote from: thomas_remkus on July 31, 2009, 03:22:28 AM
It started out as a language perf test of sorts. We were comparing Python to C initially. Then we moved to other languages. Because we are working on a Linux box it's NASM for the asm flavor. I'm not a NASM person and I'm not studied in the art of FPU ... so I'm here for help.
We are trying to see how fast you can take values from variables, and add them together. The loop is just to give it weight for the output. We are not looking for an array of values or to unroll the calculations. Rather, how to get the instructions to be as fast as we can get them. Sort of an example of best technique.
Neat idea. I've done a bit of Java vs. VC++ vs. g++ vs. assembly performance comparison, in the form of 4 versions of the Code Cortex screensaver (http://www.codecortex.com/more/). One thing to note about looking for speedup between languages is that the time for individual operations on primitives probably isn't very representative of the performance difference in a real application. For example, a C compiler might realize that the addition is the same every time and decide to only do it once (or not at all because it's a constant value), whereas in the assembly, you've specified that you want the values loaded from memory, added with the FPU, and stored back to memory.
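To make that concrete, a hoisting compiler could effectively reduce the loop to something like this (a hypothetical MASM sketch, not the output of any particular compiler):

    fld a
    fadd b
    fstp sum            ; the loop-invariant addition is done once, up front
    mov ebx, 429496729
@@:
    dec ebx             ; only the empty counter loop is left to time
    jnz @B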
Quote
SSE? So would we be able to put the value of A into one SSE place, the value of B into another, then add them together that way? For just the test we need to get the value back into a variable.
SSE instructions let you do 4 operations on floats at the same time, so you could load in a vector of 4 floats, A, and add it to a vector of 4 floats, B, in slightly less time than doing the one addition with the FPU. There are also SSE instructions that do only one operation at a time, and they run in about the same time as the 4-operation versions. SSE coding takes some effort to get comfortable with, but it can give big gains.
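Since your values are 8-byte doubles, the scalar version would look something like this (a minimal NASM sketch, assuming an SSE2-capable CPU and a, b, and sum declared as dq, as in your code):

    movsd xmm0, [a]       ; load A into the low half of an XMM register
    addsd xmm0, [b]       ; scalar double-precision add: A + B
    movsd [sum], xmm0     ; store the result back to the variable

The packed forms (movapd/addpd) process two doubles per instruction, or four singles with movaps/addps.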
Cheers! :U
MichaelW: I think I'm getting the opposite results from what you had. Here's my current code. I need to see about getting an SSE version of this to understand the difference.
include \masm32\include\masm32rt.inc
.686

.data
align 16
    varA dq 123.333
    varB dq 1234533.987
    varC dq 0

.code
start:
    finit
    print chr$(" fld then add: ")
    invoke GetTickCount
    mov ebx, eax
    mov ecx, 429496729
__beginFA:
    fld varA
    fadd varB
    fstp varC
    dec ecx
    jnz __beginFA
    invoke GetTickCount
    sub eax, ebx
    print str$(eax)
    print chr$(13, 10)

    print chr$(" fld fld add: ")
    invoke GetTickCount
    mov ebx, eax
    mov ecx, 429496729
__beginFFA:
    fld varA
    fld varB
    fadd
    fstp varC
    dec ecx
    jnz __beginFFA
    invoke GetTickCount
    sub eax, ebx
    print str$(eax)
    print chr$(13, 10)

    xor eax, eax
    invoke ExitProcess, eax
end start
For some reason, when we tweak the NASM code it can run in half a second, while these are about 7-9 times slower. But the NASM code (from what I can tell) is the same stuff. I thought I'd be able to get that half second here too. It's really odd.
hiya Thomas,
our "Laboratory" sub-forum may be of great interest to you
MichaelW has written the timing macros we all currently use
(see the first post of the first thread in that sub-forum)
you will also find many, many SSE examples and timings in that sub-forum
MichaelW, lingo, drizz, jj2007, and a few others love to play with both timings and SSE
they tend to stay with MMX/SSE2 code for the most part
Thomas,
Here is a version that reports times in the same categories as the time command (see Performance Management Guide (http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/IBMp690/IBM/usr/share/man/info/en_US/a_doc_lib/aixbman/prftungd/2365c62.htm)). Time measured with GetTickCount is what the time command would report as "real" time.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
    sysTime1    dq ?
    sysTime2    dq ?
    kernelTime1 dq ?
    kernelTime2 dq ?
    userTime1   dq ?
    userTime2   dq ?
    dqJunk      dq ?
    a           dq 123.333
    b           dq 1234533.987
    sum         dq ?
    hProcess    dd ?
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke GetCurrentProcess
    mov hProcess, eax
    invoke Sleep, 4000

    ;--------------------------------------------
    ; All times are in 100-nanosecond intervals.
    ;--------------------------------------------

    invoke GetSystemTimeAsFileTime, ADDR sysTime1
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
           ADDR kernelTime1, ADDR userTime1
    mov ebx, 429496729
align 16
@@:
    fld a
    fadd b
    fstp sum
    dec ebx
    jnz @B
    invoke GetSystemTimeAsFileTime, ADDR sysTime2
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
           ADDR kernelTime2, ADDR userTime2

    fild sysTime2
    fild sysTime1
    fsub                    ; calculate system time delta
    fld8 100.0e-9
    fmul                    ; adjust to seconds
    fstp sysTime1
    invoke crt_printf, cfm$("real\t%fs\n"), sysTime1

    fild userTime2
    fild userTime1
    fsub                    ; calculate user time delta
    fld8 100.0e-9
    fmul                    ; adjust to seconds
    fstp userTime1
    invoke crt_printf, cfm$("user\t%fs\n"), userTime1

    fild kernelTime2
    fild kernelTime1
    fsub                    ; calculate kernel time delta
    fld8 100.0e-9
    fmul                    ; adjust to seconds
    fstp kernelTime1
    invoke crt_printf, cfm$("sys\t%fs\n\n"), kernelTime1

    ;==================================================

    invoke GetSystemTimeAsFileTime, ADDR sysTime1
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
           ADDR kernelTime1, ADDR userTime1
    mov ebx, 429496729
align 16
@@:
    fld a
    fld b
    fadd
    fstp sum
    dec ebx
    jnz @B
    invoke GetSystemTimeAsFileTime, ADDR sysTime2
    invoke GetProcessTimes, hProcess, ADDR dqJunk, ADDR dqJunk,
           ADDR kernelTime2, ADDR userTime2

    fild sysTime2
    fild sysTime1
    fsub                    ; calculate system time delta
    fld8 100.0e-9
    fmul                    ; adjust to seconds
    fstp sysTime1
    invoke crt_printf, cfm$("real\t%fs\n"), sysTime1

    fild userTime2
    fild userTime1
    fsub                    ; calculate user time delta
    fld8 100.0e-9
    fmul                    ; adjust to seconds
    fstp userTime1
    invoke crt_printf, cfm$("user\t%fs\n"), userTime1

    fild kernelTime2
    fild kernelTime1
    fsub                    ; calculate kernel time delta
    fld8 100.0e-9
    fmul                    ; adjust to seconds
    fstp kernelTime1
    invoke crt_printf, cfm$("sys\t%fs\n\n"), kernelTime1

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Results running on my 500 MHz P3:
real 3.625213s
user 3.615198s
sys 0.000000s
real 2.733931s
user 2.723917s
sys 0.000000s
The MASM code is 7-9 times slower running on the same system as the NASM code, or 7-9 times slower running on a slower system?
well, obviously the nasm program is somehow different from the masm program
you should load em up in olly and see what the difference is
if you have a simple example pair of programs (one nasm, one masm), you may be able to get one of us to look at it
i am gonna venture a guess, here
due to some difference in assembler syntax, the nasm program is not doing the same stuff as the masm one
What else could NASM be doing with:
    mov ebx, 429496729
.begin:
    fld qword [a]
    fadd qword [b]
    fstp qword [c]
    dec ebx
    cmp ebx, 0
    jne .begin
well - i dunno - lol
he hasn't shown us any files
gotta be sumpin
if masm was such a bad assembler, we would not be using it to time algorithms
this forum would be called nasm32 - lol
what are your thoughts, Michael ?
An assembler is an assembler - there are NO differences in the binary code generated among assemblers for the same target CPU. Compilers <> assemblers. With any assembler, you get the exact same instruction for a given mnemonic. The syntax of the mnemonic varies, but the result is constant. Something fishy with the timing routine, or your code got context switched out.
Quote from: BATSoftware on August 01, 2009, 01:35:25 PM
An assembler is an assembler - there are NO differences in the binary code generated among assemblers for the same target CPU. Compilers <> assemblers. With any assembler, you get the exact same instruction for a given mnemonic. The syntax of the mnemonic varies, but the result is constant. Something fishy with the timing routine, or your code got context switched out.
This isn't quite correct. There isn't a perfect correspondence between machine code and Intel-syntax assembler. Intel-syntax assembly isn't specific enough to let you choose the opcode you want to emit in all cases (for example, there are two ways to encode 'shl eax, 1').
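For example (byte listings shown for illustration):

D1 E0       shl eax, 1    ; short form, the count of 1 is implied by the opcode
C1 E0 01    shl eax, 1    ; imm8 form, with the count 01 encoded explicitly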
When you get right down to it, there are a lot more opcodes than assembler instructions (I'm estimating 5x to 10x as many), and many of them are ambiguous with another in some cases.
Then you get into the fact that some assemblers try to be 'smart' (this includes MASM) and interchange between two otherwise equivalent opcodes because one happens to be shorter or longer than another (sometimes longer is desirable to maintain alignment, but usually 'smart' assemblers just pick the shorter encoding, while 'dumb' ones favor one or the other arbitrarily). This fact allows one to determine with considerable accuracy *which* assembler produced a given binary.
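For instance, a register-to-register move can be encoded two ways, and which one you get is a fingerprint of the assembler (byte listings shown for illustration):

89 D8    mov eax, ebx    ; 'mov r/m32, r32' form
8B C3    mov eax, ebx    ; 'mov r32, r/m32' form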
And then there are instruction capabilities that aren't addressed by the Intel syntax, which was the case for a very long time with instructions like 'aad' and 'aam', which, lo and behold, have actually supported bases other than 10 since the 8086/8088, but it wasn't until the 486 or so that assemblers started to recognize these other forms.
We assembly programmers like to think that we are dealing with the bare metal, but the reality is that we are not. Such an assembler simply doesn't exist.
I'm finding that the main difference is whether I tell the linker the subsystem is "console" or "windows". My MASM program was being linked as "console" and the NASM one as "windows". When I changed the MASM build to "windows" I got the performance I wanted. Here's the MASM code:
.586
.model flat, stdcall
option casemap:none

include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

MAX_LOOP_LIMIT equ 4294967295

.data
    varA dq 123.333
    varB dq 1234533.987
    varC dq 0

.code
start:
    finit
    mov ebx, MAX_LOOP_LIMIT
__begin:
    fld varA
    fadd varB
    fstp varC
    dec ebx
    jnz __begin
    invoke ExitProcess, 0
end start
... and here's my NASM code ...
%include '..\..\..\inc\nasmx.inc'
%include '..\..\..\inc\win32\windows.inc'
%include '..\..\..\inc\win32\kernel32.inc'
%include '..\..\..\inc\win32\user32.inc'

entry test_fpu

[section .data]
    a: dq 123.333
    b: dq 1234533.987
    c: dq 0

[section .text]
proc test_fpu
    finit
    mov ebx, 4294967295
.begin:
    fld qword [a]
    fadd qword [b]
    fstp qword [c]
    dec ebx
    jnz .begin
    invoke ExitProcess, dword NULL
endproc
My bat to create the MASM is:
@echo off
if exist code2.obj del code2.obj
\masm32\bin\ml /c /coff code2.asm
if exist code2.obj goto linkit
goto endend
:linkit
if exist code2.exe del code2.exe
\masm32\bin\link /subsystem:windows /libpath:c:\masm32\lib code2.obj
del code2.obj
:endend
pause
... and my NASM build is ...
@echo off
set file="DEMO1"
if exist %file%.obj del %file%.obj
if not exist %file%.asm goto errasm
..\..\..\bin\nasm -f win32 %file%.asm -o %file%.obj
if errorlevel 1 goto errasm
..\..\..\bin\GoLink.exe /entry _main DEMO1.obj kernel32.dll user32.dll
if errorlevel 1 goto errlink
if exist %file%.obj del %file%.obj
goto TheEnd
:errlink
echo _
echo Link error
pause
goto TheEnd
:errasm
echo _
echo Assembly Error
pause
goto TheEnd
:TheEnd
echo _
They are remarkably similar in the disassembly. The only difference I see between the two is that the MASM version has an additional JMP after the call to ExitProcess and the NASM version does not. Does anyone know why there is a JMP after the CALL?
it may be a failsafe, in case the API call fails
or, it could be the start of the IAT jmp table
Quote from: thomas_remkus on August 02, 2009, 07:47:21 PM
Does anyone know why there is a JMP after the CALL?
If you are asking about something like this:
00401000  6A00            push    0
00401002  E801000000      call    fn_00401008
00401007  CC              int     3
00401008                fn_00401008:
00401008  FF2500204000    jmp     dword ptr [ExitProcess]
The jump routes the ExitProcess call to the actual function address.
http://en.wikipedia.org/wiki/Import_Address_Table#Import_Table