why console subsystem slower

thomas_remkus · July 31, 2009, 07:13:31 PM

I have been trying to understand how a friend's NASM code was over 7 times faster than the same code ported to MASM. I was sure that I had everything correct, so I dug into it. When I execute the NASM code (from DOS) it runs VERY fast, at about 1/2 a second. The same MASM code was taking me like 2.8-3.8 seconds. I found that the NASM build was set to "Windows GUI" and the MASM build I had set to "console". When I changed that in the MASM build they both ran with the same perf. WHY??? What is it about the console setting that's so ugly? Is it because it's a subsystem and I have to pass through it all the time?

Here's the code I'm working with:

.586
.model flat, stdcall
option casemap:none

include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

MAX_LOOP_LIMIT equ 429496729
                   
.data
	varA		dq 123.333
	varB		dq 1234533.987
	varC		dq 0

.code
start:
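	finit ; reset the x87 FPU to its default state
	mov ebx, MAX_LOOP_LIMIT ; ebx = loop counter
	
	__begin:
		fld varA ; push varA onto the FPU stack
		fadd varB ; ST(0) = varA + varB
		fstp varC ; store the sum to varC and pop the stack
		dec ebx
		jnz __begin ; repeat until ebx reaches 0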
	
    invoke ExitProcess, 0
end start

dedndave · July 31, 2009, 08:46:11 PM

the console is a dog - lol
especially if it is in the process of outputting characters to the con window
although, i am not sure what masm vs nasm has to do with it

redskull · August 01, 2009, 03:46:53 AM

Consoles are slow because they interact indirectly via message passing to the subsystem, and not direct kernel calls (one of the last things in NT to do so). When you specify the console switch, it automatically initializes everything for you upon loading, whether you use it or not. If you change the WINDOWS code to do an AllocConsole() call, the speeds should match (the CONSOLE switch effectively just calls AllocConsole() for you in the beginning).

-r

dedndave · August 01, 2009, 04:04:09 AM

he is saying he sees a large discrepancy between console mode code assembled with masm vs nasm
any clues about that redskull ?

MichaelW · August 01, 2009, 05:12:14 AM

The console should have no effect on the loop timing, but the processor speed could have a very large effect. The loop would take about 3.6s to execute on my 10-year-old 500MHz P3, and I would not be surprised if a recent, high-end processor could do it in 10% of that time.

redskull · August 01, 2009, 12:17:59 PM

Quote from: dedndave on August 01, 2009, 04:04:09 AM
he is saying he sees a large discrepancy between console mode code assembled with masm vs nasm

The way I read the OP, he's seeing the discrepancy between linking with /SUBSYSTEM:WINDOWS and /SUBSYSTEM:CONSOLE, not MASM and NASM ("When I changed that in the MASM they both ran the same perf. "). If that isn't the case, I haven't the foggiest.

-r

hutch-- · August 01, 2009, 01:39:06 PM

I do most of my algo testing in console apps as it's easier and faster, but I also do some in GUI apps and there is no speed difference whatsoever between console and GUI mode. Where CMD.EXE can be slower is when you are dumping data on the screen, but to a lesser degree you also slow down an algo in GUI mode by writing results to the screen while the algo is running.

redskull · August 01, 2009, 03:59:08 PM

I would think consoles would HAVE to be slower, if for no other reason than that it involves an LPC, and hence twice the kernel mode switches. After all, isn't speed the whole reason they changed the GUI routines from this method to straight kernel calls in the first place? I doubt it would be noticeable, as console use is few and far between, but it would be interesting to test. Of course, all this is moot regarding this code, as it has no output at all (console or otherwise); the only possible place for a slowdown (all other things being equal) would be in the loading of the code, and the only thing different about loading the code is the allocation of a console when you specify /SUBSYSTEM:CONSOLE. I don't really see that taking 4 seconds, though.... I wish M.R. over at Sysinternals would do a blog about the consoles; I've never actually seen any in-depth info about how they work.

-r

dedndave · August 01, 2009, 04:15:36 PM

i think console mode is regarded as "unimportant" or "merely a tool"
the console window has several bugs and i doubt anyone at ms cares - lol
in a way, i suppose they are right, too
any "real" app is going to be gui
but, console gives you a way to run batches and scripts and test rudimentary things without a lot of code

redskull · August 01, 2009, 04:28:05 PM

Quote from: dedndave on August 01, 2009, 04:15:36 PM
...any "real" app is going to be gui...

I'm of the opinion that any 'real' app should be *both* (app depending, of course). There's no better program than one that you can start in an interactive GUI session when you want to, or start with 50 different command line switches to do fully automated batch processing and pipe the output to a file, while you sit around and drink coffee. The new Windows PowerShell is a great tool, but not enough cmdlets ship with it.

-r

ecube · August 01, 2009, 05:37:22 PM

I can't wait until everything is like cooliris. If you don't know what that is, it's a free firefox addon http://www.cooliris.com that lets you zoom through online pics in 3d. It's probably 1 of the coolest things i've seen. If firefox etc was smart they'd buy it from the company and integrate it right in. Having lil snapshot previews of your favorite sites and being able to zoom through, choose 1 where it grows bigger and interact...would be incredible. I'm trying to clone what they did to market it but my opengl/directx skills are pretty bad heh.

thomas_remkus · August 02, 2009, 07:53:06 PM

It's true. When I change the subsystem to "windows" I get the same perf. I was amazed at how different the outcome was between console/windows.

.586
.model flat, stdcall
option casemap:none

include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

MAX_LOOP_LIMIT equ 4294967295
                   
.data
	varA		dq 123.333
	varB		dq 1234533.987
	varC		dq 0

.code
start:
	finit
	mov ebx, MAX_LOOP_LIMIT
	
	__begin:
		fld varA
		fadd varB
		fstp varC
		dec ebx
		jnz __begin
	
    invoke ExitProcess, 0
end start

It was hinted that if this was changed to an SSE2 implementation then it's supposedly 4-20 times faster.
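For comparison, here is a minimal sketch of what a scalar SSE2 version of the same loop might look like. It assumes an assembler new enough to accept `.xmm` and the SSE2 mnemonics, and a CPU with SSE2. It still does one add per iteration, so it does not by itself demonstrate the 4-20x figure, which nobody has measured here.

```masm
.686
.xmm
.model flat, stdcall
option casemap:none

include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

MAX_LOOP_LIMIT equ 4294967295

.data
	varA		dq 123.333
	varB		dq 1234533.987
	varC		dq 0

.code
start:
	mov ebx, MAX_LOOP_LIMIT

	__begin:
		movsd xmm0, qword ptr varA ; load varA into the low 64 bits of xmm0
		addsd xmm0, qword ptr varB ; xmm0 = varA + varB (scalar double add)
		movsd qword ptr varC, xmm0 ; store the sum to varC
		dec ebx
		jnz __begin

	invoke ExitProcess, 0
end start
```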

redskull · August 02, 2009, 08:12:00 PM

Do you have any quantitative times for just the executed instructions, or are your numbers 'guesstimations' from the time you run the .EXE (e.g., including the loading times)? Also, I'd be interested to see what happens if you link it as a WINDOWS app, but include an AllocConsole() call in the very beginning. I find it almost unbelievable that the type of subsystem would actually affect the execution speed once the thread is off and running; after all, you're not even using the console.
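A minimal sketch of that experiment, assuming the MASM32 include paths from the original post: assemble as before but link with /SUBSYSTEM:WINDOWS, so the only console the process gets is the one it allocates itself. The loop is copied from the posted code. AllocConsole is a real kernel32 call, but calling it first is only the setup suggested here, not something measured in this thread.

```masm
.586
.model flat, stdcall
option casemap:none

include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

MAX_LOOP_LIMIT equ 4294967295

.data
	varA		dq 123.333
	varB		dq 1234533.987
	varC		dq 0

.code
start:
	invoke AllocConsole ; create a console by hand (link as /SUBSYSTEM:WINDOWS)
	finit
	mov ebx, MAX_LOOP_LIMIT ; ebx is preserved across stdcall API calls

	__begin:
		fld varA
		fadd varB
		fstp varC
		dec ebx
		jnz __begin

	invoke ExitProcess, 0
end start
```

If the console allocation is what costs the time, this build should behave like the /SUBSYSTEM:CONSOLE one; if it runs like the plain WINDOWS build, the cause is somewhere else.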

-r

dedndave · August 02, 2009, 08:31:10 PM

i played with it a little bit
when you link it for subsystem windows, the program takes just as long
because you have no screen output, you do not see anything happen when it is over
it shows up in the task manager, though

hutch-- · August 03, 2009, 01:36:05 AM

Most of the problem with the example code is that it does not isolate the console loading time from the test algo. Put a key press to start the algo and time it properly and you will see why the test code that was posted fails to compare console to GUI. Once a console is allocated, it barely uses any system resource as it only dumps a bit of text on a screen. GUI display is a lot faster than console display, as console does not need to be all that fast, but at the moment the assumptions are like chalk and cheese.
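One way to separate the loop from the load time, sketched under assumptions: instead of a key press, this version brackets the loop with GetTickCount and returns the elapsed milliseconds as the process exit code. The exit-code trick and the use of esi are illustrative choices, not something from the posted code. GetTickCount is coarse (typically around 10-16 ms resolution), which should still be adequate for a loop that runs for a good fraction of a second or more.

```masm
.586
.model flat, stdcall
option casemap:none

include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

MAX_LOOP_LIMIT equ 429496729

.data
	varA		dq 123.333
	varB		dq 1234533.987
	varC		dq 0

.code
start:
	finit
	invoke GetTickCount ; eax = milliseconds since boot
	mov esi, eax ; esi = start time (preserved across API calls)
	mov ebx, MAX_LOOP_LIMIT

	__begin:
		fld varA
		fadd varB
		fstp varC
		dec ebx
		jnz __begin

	invoke GetTickCount
	sub eax, esi ; eax = elapsed milliseconds for the loop only
	invoke ExitProcess, eax ; report the elapsed time as the exit code
end start
```

Built once per subsystem, the two exit codes can be compared directly. From a command prompt, `echo %ERRORLEVEL%` should show the value after the console build returns; the GUI build would probably need to be launched with `start /wait` so the prompt waits for it to finish.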
