I was reading a post today at: http://social.msdn.microsoft.com/Forums/is/vcgeneral/thread/897fffdd-1fb0-4105-953d-f7934875e0ae
Quote
That's naive. First, today's compilers can outrun the average assembly
language programmer every single time. Only the real experts can do
hand-made assembler that outdoes a modern optimizing compiler.
Further, most of your operating system code isn't worth optimizing. It's
simply not worth the effort to go to the trouble of hand-coding all of the
initialization and overhead code. The only time it's worth the trouble is
when you have inner loops or interrupt handlers that get executed over and
over and over.
Almost all of Windows -- and even more of Linux -- is written in C and C++.
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
I guess he isn't aware of the programmers that belong to this forum. :U
One would assume that NVidia would use an optimizing compiler, right?
.text:000002C9 mov ecx, [ebp+arg_0]
.text:000002CC push ecx
.text:000002CD call dword ptr ds:__imp__GetProcAddress@8 ; GetProcAddress(x,x)
.text:000002D3 mov ds:?g_nvapi_lpNvAPI_pepQueryInterface@@3P6APAXK@ZA, eax ; void * (*g_nvapi_lpNvAPI_pepQueryInterface)(ulong)
.text:000002D8 cmp ds:?g_nvapi_lpNvAPI_pepQueryInterface@@3P6APAXK@ZA, 0 ; void * (*g_nvapi_lpNvAPI_pepQueryInterface)(ulong)
.text:000002DF jnz short loc_2F3
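; (note how the result in eax is stored to memory and then re-read for the cmp - a simple test eax, eax would have been the obvious choice)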
That code is from nvapi.lib - maybe not too bad, but there is a lot of it, and since it ends up in my exe it makes me look bad :( :bg
It must be a case of Tim getting a bit out of practice, as Tim is an expert in coding assembler in his own right. Looking at where the post was made, it probably suffers from a dose of "political correctness" as well.
Well, Tim Roberts has been around for a LONG time and his opinion has to be respected. He is right that most compilers today can outrun an average assembly language programmer; I doubt there's any argument about that. However, there are some excellent programmers around, like Mark and Lingo, who can easily outpace the average compiler. That said, I usually don't bother with optimization techniques, since the ratio of time spent to benefit gained is normally small; a compiler, on the other hand, can optimize to no end at little to no extra cost, so it's obvious it will have the advantage.
I have to agree with Tim based on personal experience implementing fast code.
An HLL like C is more convenient because you can easily compile for other architectures very quickly... I like C despite its unpopularity amongst some of the new generation of programmers.
It allows you to add asm quite easily compared to other HLLs, which is why C is king champion for me ;)
Compilers really do a fantastic job optimizing high-level code, specifically C/C++, I can't speak about other languages but Java seems to have good optimizers too.
I'm afraid when they go the 'PC' route.. they've lost the plot, and degrade with time.
A question... Who writes the compilers, and in what language ?
Kind of defeats the purpose if they're written in C.. ??
JWASM isn't written in assembly - does that mean it's no good?
Neither MASM nor JWASM is written in assembler, and that is to their advantage: these tools must operate across different platforms and need to be portable.
RE: the argument that good (anything) is better than bad assembler - substitute the word assembler with any other language and it's still true. Good FORTRAN is better than bad C++, good Pascal is better than bad COBOL, etc etc etc ....
The assumption is that most don't bother to learn how to code in assembler, so their assembler code must be bad - the same old claptrap that's been around for the last 20 years: someone wants to promote some compiler, so they bad-mouth other language forms.
There is another assumption that is even more damaging to the average skills of programmers: "Don't code your own logic", use pre-canned functionality, DOT.WOT, MFC and so on. This has accounted for much of the nonsense we keep hearing, and it is the reason why programmers keep coming back to true low-level programming: once they have been bitten a few times, they want to be able to fix problems that pre-canned technology does not handle.
I think there are two issues here. The first is the content of Tim's argument:
Tim Roberts (http://social.msdn.microsoft.com/Forums/is/vcgeneral/thread/897fffdd-1fb0-4105-953d-f7934875e0ae)
Quote
That's naive. First, today's compilers can outrun the average assembly language programmer every single time.
Now take this example:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://www.masm32.com/board/index.php?topic=12460)
Init
NanoTimer() ; start the timer
Recall "\Masm32\include\Windows.inc", WinInc$() ; read in a fat text file, and translate it into a string array
xchg eax, ebx ; Recall returns the number of lines in eax - keep it in ebx
Print Str$("Reading and tokenising the %i lines of Windows.inc took", ebx), Str$(" %i µsecs\n\n", NanoTimer(µs))
Inkey Trim$(WinInc$(4)), CrLf$, "...", CrLf$, WinInc$(ebx-3)
Exit
end start
Reading and tokenising the 26902 lines of Windows.inc took 3619 µsecs
WINDOWS.INC for 32 bit MASM (Version 1.6 RELEASE January 2012)
...
echo WARNING Duplicate include file windows.inc
3.6 millisecs on a trusty old Celeron; that won't be beaten anytime soon by a C compiler. But 1. it's not portable, and 2. it is one single file I/O routine that has been optimised "by hand", wasting entire weekends. Here is another example, a simple memcopy algo from the Code sensitivity of timings (http://www.masm32.com/board/index.php?topic=11454.0) thread:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

Algo          memcpy  MemCo1    MemCo2   MemCoC3   MemCoP4  MemCoC2  MemCoL
Description   CRT     rep movs  movdqa   lps+hps   movdqa   movdqa   Masm32
                                dest-al  CeleronM  dest-al  src-al   library
                                psllq
Code size     ?       70        291      222       200      269      33
---------------------------------------------------------------------------
2048, d0s0-0    733    735    608    608    615    872    732
2048, d1s1-0   1100    821    649    649    643    653   4299
2048, d7s7-0    995    827    654    661    649    654   4324
2048, d7s8-1   1262   1339   1207    870    618    621   4319
2048, d7s9-2   1262   1341   1218    872    619    611   4340
2048, d8s7+1   1244   1333   1188   1213    620    916   1229
2048, d8s8-0    980    819    656    655    659    655    984
2048, d8s9-1   1228   1347   1210    870    613    621   1229
2048, d9s7+2   1584   1334   1176   1208    613    932   4029
2048, d9s8+1   1587   1333   1176   1209    618    929   4020
2048, d9s9-0   1101    821    660    659    659    661   4040
2048, d15s15    766    825    654    661    661    651   4031

Seven algos, a dozen alignment situations tested, and no clear winner. Plus, we got different results for each and every tested CPU. The attempt to optimise that in assembler is likely to fail. However, if you are writing the latest C compiler for a big software vendor, you may have a dozen people willing to throw in their expertise, and a lab with twenty different CPUs for testing, so in the end your C compiler might yield, on average, better results for
void *memmove(
void *dest,
const void *src,
size_t count
);
than the Masm32 hobbyists. The software vendor might be willing to invest this effort, because (Roberts) "The only time it's worth the trouble is when you have inner loops or interrupt handlers that get executed over and over and over". Memcopy is among the common candidates for an innermost loop.
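For the curious: the dNsM labels give the destination and source misalignment in bytes (d7s8-1: dest 7 bytes off, source 8, net difference -1), presumably from a 16-byte boundary, SSE oblige. A C harness to reproduce such an alignment sweep could look roughly like the sketch below - hypothetical, of course: the names, the __rdtsc() timing and plain memcpy as the candidate algo are my assumptions, not the thread's own timing macros.

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

#define BLOCK 2048   /* same block size as in the table above */
#define LOOPS 1000

static unsigned char srcbuf[BLOCK + 64];  /* padded so we can misalign on purpose */
static unsigned char dstbuf[BLOCK + 64];

/* average cycles for one copy, dest misaligned by d bytes, source by s */
static uint64_t time_copy(int d, int s)
{
    unsigned char *dst = (unsigned char *)(((uintptr_t)dstbuf + 15) & ~(uintptr_t)15) + d;
    unsigned char *src = (unsigned char *)(((uintptr_t)srcbuf + 15) & ~(uintptr_t)15) + s;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < LOOPS; i++)
        memcpy(dst, src, BLOCK);          /* swap in the algo under test here */
    return (__rdtsc() - t0) / LOOPS;
}

int main(void)
{
    static const int offs[][2] = { {0,0}, {1,1}, {7,7}, {7,8}, {7,9}, {8,7},
                                   {8,8}, {8,9}, {9,7}, {9,8}, {9,9}, {15,15} };
    for (size_t i = 0; i < sizeof offs / sizeof offs[0]; i++)
        printf("%d, d%ds%d%+d  %llu cycles\n", BLOCK, offs[i][0], offs[i][1],
               offs[i][0] - offs[i][1],
               (unsigned long long)time_copy(offs[i][0], offs[i][1]));
    return 0;
}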
Now we come to the second issue:
Tim Roberts (http://social.msdn.microsoft.com/Forums/is/vcgeneral/thread/897fffdd-1fb0-4105-953d-f7934875e0ae)
Only the real experts can do hand-made assembler that outdoes a modern optimizing compiler.
Yes we can :bg
Those who design and program the C compiler are real experts, and they code in assembler, of course. But so do we here, as hobbyists, and it seems that a handful of people who are fanatic enough to hang around in the only really active assembler forum on a Saturday night can be considered real experts, too.
I imagine many compilers and assemblers are written in combined languages -
for example, primarily written in C, with some routines and generated code written in ASM.
I think I remember Andreas saying that he wrote JwAsm in C for the sake of maintainability.
Quote from: jj2007 on February 19, 2012, 03:16:06 AM
Reading and tokenising the 26902 lines of Windows.inc...
3.6 millisecs on a trusty old Celeron, that won't be beaten anytime soon by a C compiler.
I'd like to test this with a C compiler. Tokenizing it how, exactly, and what sort of array are the tokens being stored in?
I am guessing a dynamic string array - BSTRs.
JJ,
you must be doing a memory copy operation as well. I have a test piece with a bare tokeniser that does nothing more than count the line feeds and load the start address of each line into an array, and I cannot get a timing on it when joining windows.inc and winextra.inc at about 50000 lines.
Quote from: MichaelW on February 19, 2012, 01:40:21 PM
Quote from: jj2007 on February 19, 2012, 03:16:06 AM
Reading and tokenising the 26902 lines of Windows.inc...
3.6 millisecs on a trusty old Celeron, that won't be beaten anytime soon by a C compiler.
I'd like to test this with a C compiler. Tokenizing it how, exactly, and what sort of array are the tokens being stored in?
Recall does the following:
- Open the file
- guess how many lines there could be, and heapalloc 8*guessedlines (the guess is generous, and the heapalloc could cost some cycles):
mov fSize, eax ; we need to guess the number of lines for MbArrayDim:
lea ebx, [eax+4] ; worst case is a file with only CrLfs, so we would need max filesize/2 lines
shr ebx, 1 ; lines=fsize/2+2
mov esi, eax ; bytes to read
add eax, 8+32+1 ; 8 for SSE2 correction, 32 for unrolled loop step, plus 1 for the 4095 bytes case
push edi ; no flags (we may get garbage, but no problem for ReadFile)
push eax ; byte count
call MbAllocP
- heapalloc enough space for the whole file
- read the file into this buffer
- tokenise, i.e. fill the first buffer with a) DWORD start address of string and b) DWORD len of string
- close the file and return #of lines
The tokenising of Windows.inc runs at 1.1 milliseconds on my Celeron; the other 2 ms are for the heapallocs and ReadFile. Recall can read Unix format (LF only), but if Windows format is specified, a LF without CR will be flagged in a special variable named BadLines.
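For whoever wants to try the C side, that single-pass scheme translates to roughly the sketch below - a minimal, hypothetical version with plain CRT file I/O and none of the SSE2 (the names and the Line struct are mine, not MasmBasic's), so expect it to be slower:

#include <stdio.h>
#include <stdlib.h>

typedef struct { char *ptr; unsigned len; } Line;  /* start address + len, as in Recall */

/* Read a text file and tokenise it in a single pass.
   Returns the number of lines; *lines gets the (ptr, len) array,
   which points into *buf. The caller frees both. */
static int recall_c(const char *name, Line **lines, char **buf)
{
    FILE *f = fopen(name, "rb");
    if (!f) return -1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    *buf = malloc(size + 1);                         /* whole file in one buffer */
    *lines = malloc((size / 2 + 2) * sizeof(Line));  /* generous guess: all CrLfs */
    fread(*buf, 1, size, f);
    fclose(f);
    (*buf)[size] = '\0';

    int n = 0;
    char *p = *buf, *start = p, *end = *buf + size;
    for (; p < end; p++) {
        if (*p == '\n') {                            /* LF terminates the line */
            (*lines)[n].ptr = start;
            (*lines)[n].len = (unsigned)(p - start - (p > start && p[-1] == '\r'));
            n++;
            start = p + 1;
        }
    }
    if (start < end) {                               /* last line without a LF */
        (*lines)[n].ptr = start;
        (*lines)[n].len = (unsigned)(end - start);
        n++;
    }
    return n;
}

int main(void)
{
    Line *lines; char *buf;
    int n = recall_c("\\Masm32\\include\\Windows.inc", &lines, &buf);
    printf("%d lines\n", n);
    free(lines); free(buf);
    return 0;
}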
I am curious to see what the C compiler comes up with. Thanks for testing this.
Quote from: hutch-- on February 19, 2012, 02:24:01 PM
JJ,
you must be doing a memory copy operation as well. I have a test piece with a bare tokeniser that does nothing more than count the line feeds and load the start address of each line into an array, and I cannot get a timing on it when joining windows.inc and winextra.inc at about 50000 lines.
See post above. How did you time your code, QPC or GetTickCount?
For the snippet below, I consistently get
That took 0.0062 seconds for 52357 lines - which means around 2 ms for the tokeniser. Bear in mind it's a single-core Celeron Yonah, not really the latest model :wink
include \masm32\MasmBasic\MasmBasic.inc ; download (http://www.masm32.com/board/index.php?topic=12460)
.data
Init
Recall "\Masm32\include\Windows.inc", L1$()
Recall "\Masm32\include\WinExtra.inc", L2$()
Open "O", #1, "WinBoth.inc"
Store #1, L1$()
Store #1, L2$()
Close
Delay 500 ; half a second for flushing etc
NanoTimer()
Recall "WinBoth.inc", L$()
push eax ; preserve the line count returned by Recall
Print Str$("That took %2f seconds", NanoTimer(s))
pop eax ; restore it for the Inkey message
Inkey Str$(" for %i lines", eax)
Exit
end start
JJ,
Timing the disk read is an unusual way to do it; the first try will always be slow, with the following passes reading the file from the cache. I don't like loop code timings for tasks of this type, so I tested it on a larger file, 64 meg of C header file with 1.9 million lines.
It keeps turning up at 125 ms for a 2-pass operation: first to count the LFs (10), then the tokenise pass. You could drop it by near half if you used an estimation method for the LF count based on the file length, but in this context I don't have the luxury of using that much extra memory. I do the first pass to get the count, then allocate the correct amount of memory to hold the pointer array.
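In C, that two-pass scheme boils down to something like the sketch below - my reading of the description, with made-up names; buf holds the file already loaded into memory:

#include <stdlib.h>

/* Pass 1: count the LFs so the pointer array can be allocated exactly.
   Pass 2: load the start address of each line into that array.
   Returns the line count. */
static long tokenise_2pass(char *buf, long size, char ***out)
{
    long count = 0, n = 0, i;
    for (i = 0; i < size; i++)
        if (buf[i] == '\n') count++;        /* pass 1: count the LF (10) */

    char **lines = malloc(count * sizeof(char *));
    char *start = buf;
    for (i = 0; i < size; i++)
        if (buf[i] == '\n') {               /* pass 2: record each line start */
            lines[n++] = start;
            start = buf + i + 1;
        }
    *out = lines;
    return count;
}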
Quote from: hutch-- on February 19, 2012, 11:23:10 PM
JJ,
Timing the disk read is an unusual way to do it; the first try will always be slow, with the following passes reading the file from the cache. I don't like loop code timings for tasks of this type, so I tested it on a larger file, 64 meg of C header file with 1.9 million lines.
It keeps turning up at 125 ms for a 2-pass operation: first to count the LFs (10), then the tokenise pass. You could drop it by near half if you used an estimation method for the LF count based on the file length, but in this context I don't have the luxury of using that much extra memory. I do the first pass to get the count, then allocate the correct amount of memory to hold the pointer array.
Hutch,
Which CPU?
WinBoth.inc is 2MB, 52357 lines, 2 millisecs for my Celeron (without the disk timing - I know, I know...)
The C header is 64 MB, 1.9 million lines - so roughly a factor of 35; 35 * 2 ms =>> 70 ms expected.
The difference is probably that MasmBasic Recall is SSE2, and that I use a single pass. There are limits to that technique, but those who are working with 2GB files surely have the right amount of RAM installed, so getting a temporarily too generous buffer shouldn't pose any practical problems. Two passes, on the other hand, means twice the work, and outside the data cache.
Can you upload the file somewhere, so that we can run a test?