I have been trying to understand how a friend's NASM code was over 7 times faster than the same code ported to MASM. I was sure that I had everything correct, so I dug into it. When I execute the NASM code (from a DOS prompt) it runs VERY fast, about 1/2 a second. The same MASM code was taking me around 2.8-3.8 seconds. I found that the NASM build was set to "Windows GUI" and the MASM build I had set to "console". When I changed that in the MASM version, they both ran with the same performance. WHY??? What is it about the console setting that's so ugly? Is it because it's a subsystem and I have to pass through it all the time?
Here's the code I'm working with:
.586
.model flat, stdcall
option casemap:none

include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

MAX_LOOP_LIMIT equ 429496729

.data
varA dq 123.333
varB dq 1234533.987
varC dq 0

.code
start:
    finit                       ; initialize the FPU
    mov ebx, MAX_LOOP_LIMIT     ; loop counter
__begin:
    fld  varA                   ; push varA onto the FPU stack
    fadd varB                   ; add varB
    fstp varC                   ; store the result and pop
    dec  ebx
    jnz  __begin
    invoke ExitProcess, 0
end start
the console is a dog - lol
especially if it is in the process of outputting characters to the con window
although, i am not sure what MASM vs NASM has to do with it
Consoles are slow because they interact indirectly, via message passing to the console subsystem, rather than through direct kernel calls (one of the last things in NT to still work that way). When you specify the CONSOLE subsystem switch, the loader automatically initializes a console for you, whether you use it or not. If you change the WINDOWS-subsystem code to do an AllocConsole() call at startup, the speeds should match (the CONSOLE switch effectively just calls AllocConsole() for you at the beginning).
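Something like this minimal sketch (untested, just to illustrate the idea) - link it with /SUBSYSTEM:WINDOWS and it still creates its own console:

.586
.model flat, stdcall
option casemap:none
include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib
.code
start:
    invoke AllocConsole         ; create a console for this process ourselves
    ; ... the timing loop or other test code goes here ...
    invoke FreeConsole          ; detach from the console when done
    invoke ExitProcess, 0
end start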
-r
he is saying he sees a large discrepancy between console mode code assembled with MASM vs NASM
any clues about that, redskull?
The console should have no effect on the loop timing, but the processor speed could have a very large effect. The loop would take about 3.6s to execute on my 10-year-old 500 MHz P3, and I would not be surprised if a recent, high-end processor could do it in 10% of that time.
Quote from: dedndave on August 01, 2009, 04:04:09 AM
he is saying he sees a large discrepancy between console mode code assembled with MASM vs NASM
The way I read the OP, he's seeing the discrepancy between linking with /SUBSYSTEM:WINDOWS and /SUBSYSTEM:CONSOLE, not between MASM and NASM ("When I changed that in the MASM version, they both ran with the same performance."). If that isn't the case, I haven't the foggiest.
-r
I do most of my algo testing in console apps as it's easier and faster, but I also do some in GUI apps, and there is no speed difference whatsoever between console and GUI mode. Where CMD.EXE can be slower is when you are dumping data to the screen, but to a lesser degree you also slow up an algo in GUI mode by writing results to the screen while the algo is running.
I would think consoles would HAVE to be slower, if for no other reason than that it involves an LPC, and hence twice the kernel-mode switches. After all, isn't speed the whole reason they changed the GUI routines from this method to straight kernel calls in the first place? I doubt it would be noticeable, as console use is few and far between, but it would be interesting to test. Of course, all this is moot regarding this code, as it has no output at all (console or otherwise); the only possible place for a slowdown (all other things being equal) would be in the loading of the code, and the only thing different about loading the code is the allocation of a console when you specify /SUBSYSTEM:CONSOLE. I don't really see that taking 4 seconds, though... I wish M.R. over at Sysinternals would do a blog about the consoles; I've never actually seen any in-depth info about how they work.
-r
i think console mode is regarded as "unimportant" or "merely a tool"
the console window has several bugs and i doubt anyone at ms cares - lol
in a way, i suppose they are right, too
any "real" app is going to be gui
but, console gives you a way to run batches and scripts and test rudimentary things without a lot of code
Quote from: dedndave on August 01, 2009, 04:15:36 PM
...any "real" app is going to be gui...
I'm of the opinion that any 'real' app should be *both* (app depending, of course). There's no better program than one that you can start in an interactive GUI session when you want to, or start with 50 different command-line switches to do fully automated batch processing and pipe the output to a file while you sit around and drink coffee. The new Windows PowerShell is a great tool, but not enough cmdlets ship with it.
-r
I can't wait until everything is like Cooliris. If you don't know what that is, it's a free Firefox addon (http://www.cooliris.com) that lets you zoom through online pics in 3D. It's probably one of the coolest things I've seen. If Firefox etc. were smart, they'd buy it from the company and integrate it right in. Having little snapshot previews of your favorite sites and being able to zoom through them, choose one so it grows bigger, and interact... would be incredible. I'm trying to clone what they did to market it, but my OpenGL/DirectX skills are pretty bad, heh.
It's true. When I change the subsystem to "windows" I get the same performance. I was amazed at how different the outcome was between console and windows.
.586
.model flat, stdcall
option casemap:none

include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

MAX_LOOP_LIMIT equ 4294967295

.data
varA dq 123.333
varB dq 1234533.987
varC dq 0

.code
start:
    finit
    mov ebx, MAX_LOOP_LIMIT
__begin:
    fld  varA
    fadd varB
    fstp varC
    dec  ebx
    jnz  __begin
    invoke ExitProcess, 0
end start
It was hinted that if this were changed to an SSE2 implementation it would supposedly be 4-20 times faster.
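I haven't tried it yet, but I imagine the SSE2 version would look roughly like this (just my sketch, untested):

.686
.xmm
.model flat, stdcall
option casemap:none
include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

MAX_LOOP_LIMIT equ 4294967295

.data
varA dq 123.333
varB dq 1234533.987
varC dq 0

.code
start:
    movsd xmm0, varA            ; load the operands once, outside the loop
    movsd xmm1, varB
    mov ebx, MAX_LOOP_LIMIT
__begin:
    movsd xmm2, xmm0            ; copy varA
    addsd xmm2, xmm1            ; varA + varB as a scalar double add
    movsd varC, xmm2            ; store the result
    dec ebx
    jnz __begin
    invoke ExitProcess, 0
end start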
Do you have any quantitative times for just the executed instructions, or are your numbers 'guesstimations' from the time you run the .EXE (e.g., including the loading times)? Also, I'd be interested to see what happens if you link it as a WINDOWS app but include an AllocConsole() call at the very beginning. I find it almost unbelievable that the type of subsystem would actually affect the execution speed once the thread is off and running; after all, you're not even using the console.
-r
i played with it a little bit
when you link it for subsystem windows, the program takes just as long
because you have no screen output, you do not see anything happen when it is over
it does show up in the task manager, though
Most of the problem with the example code is that it does not isolate the console loading time from the test algo. Put in a key press to start the algo and time it properly, and you will see why the test code that was posted fails to compare console to GUI. Once a console is allocated, it barely uses any system resources, as it only dumps a bit of text on a screen. GUI display is a lot faster than console display, as a console does not need to be all that fast, but at the moment the assumptions are like chalk and cheese.
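Something along these lines (a quick sketch, untested, using GetTickCount for rough millisecond timing) is what I mean by putting in a key press and timing only the algo:

include \masm32\include\masm32rt.inc    ; masm32 runtime: includes, libs and macros

.data
varA dq 123.333
varB dq 1234533.987
varC dq 0

.code
start:
    inkey "Press any key to start the timed loop ..."
    invoke GetTickCount
    push eax                    ; start time in milliseconds
    finit
    mov ebx, 4294967295
__begin:
    fld  varA
    fadd varB
    fstp varC
    dec  ebx
    jnz  __begin
    invoke GetTickCount
    pop ecx
    sub eax, ecx                ; elapsed milliseconds for the loop only
    print str$(eax)," ms",13,10
    inkey
    exit
end start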
redskull:
I'm working from 'guesstimations' because I don't have any timing code. The AllocConsole() idea is interesting, so I'll try that.
hutch:
I'll put the keypress in there and see how that works. Uh, "chalk and cheese" ... fantastic. I think what you are saying is that once the console portion is loaded, all the other code should run with the same performance.
MichaelW wrote the timing macros we use
they are available in the first post of the first thread of the Laboratory sub-forum
INCLUDE \masm32\include\masm32rt.inc
.686
INCLUDE \masm32\macros\timers.asm
notice that .686 is needed prior to including the timers (i think .586 works, too)
you need to define a loop count
once you get the program running, try to adjust it so the loop test takes roughly 1/2 second (usually gives repeatable readings)
LOOP_COUNT = 100000
it is a good idea to restrict execution to a single core - this works on single-core and multi-core machines
INVOKE GetCurrentProcess
INVOKE SetProcessAffinityMask,eax,1
use HIGH_PRIORITY_CLASS for most testing
when it is done, the EAX register holds the cycle count
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
; place your test code here
counter_end
print str$(eax),9,"clock cycles",13,10
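put all the pieces together and you get something like this (a rough sketch - i have not run this exact one, so tweak as needed):

INCLUDE \masm32\include\masm32rt.inc
.686
INCLUDE \masm32\macros\timers.asm

LOOP_COUNT = 100000

.data
varA dq 123.333
varB dq 1234533.987
varC dq 0

.code
start:
    INVOKE GetCurrentProcess
    INVOKE SetProcessAffinityMask,eax,1     ; restrict to a single core

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        fld  varA                           ; the test code being timed
        fadd varB
        fstp varC
    counter_end
    print str$(eax),9,"clock cycles",13,10  ; EAX holds the cycle count

    inkey
    exit
end start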