Started by rags, February 18, 2012, 04:30:29 AM

  I was reading a post today at:
That's naive.  First, today's compilers can outrun the average assembly
language programmer every single time.  Only the real experts can do
hand-made assembler that outdoes a modern optimizing compiler.

Further, most of your operating system code isn't worth optimizing.  It's
simply not worth the effort to go to the trouble of hand-coding all of the
initialization and overhead code.  The only time it's worth the trouble is
when you have inner loops or interrupt handlers that get executed over and
over and over.

Almost all of Windows -- and even more of Linux -- is written in C and C++.
Tim Roberts,
Providenza & Boekelheide, Inc.

I guess he isn't aware of the programmers that belong to this forum. :U
One would assume that NVidia would use an optimizing compiler, right?

.text:000002C9                 mov     ecx, [ebp+arg_0]
.text:000002CC                 push    ecx
.text:000002CD                 call    dword ptr ds:__imp__GetProcAddress@8 ; GetProcAddress(x,x)
.text:000002D3                 mov     ds:?g_nvapi_lpNvAPI_pepQueryInterface@@3P6APAXK@ZA, eax ; void * (*g_nvapi_lpNvAPI_pepQueryInterface)(ulong)
.text:000002D8                 cmp     ds:?g_nvapi_lpNvAPI_pepQueryInterface@@3P6APAXK@ZA, 0 ; void * (*g_nvapi_lpNvAPI_pepQueryInterface)(ulong)
.text:000002DF                 jnz     short loc_2F3

That code is from nvapi.lib, maybe not too bad but there is a lot of it, and since it ends up in my exe it makes me look bad :( :bg
It must be a case of Tim getting a bit out of practice, as Tim is an expert in coding assembler in his own right. Looking at where the post was made, it probably suffers from a dose of "political correctness" as well.
Well, Tim Roberts has been around for a LONG time and his opinion has to be respected. He is right that most compilers today can outrun an average assembly language programmer; I doubt there's any argument about that. However, there are some excellent programmers around like Mark and Lingo who can easily outpace the average compiler. That said, I usually don't bother with any optimization techniques, since the benefit gained for the time spent is normally small. A compiler, however, can optimize to no end at little to no extra time, so it's obvious they will have the advantage.
I have to agree with Tim based on personal experience implementing fast code.

A HLL like C is more convenient because you can easily compile for other architectures very quickly... I like C despite its unpopularity among some of the new generation of programmers.
It allows you to add asm quite easily compared to other HLLs, which is why C is king champion for me ;)

Compilers really do a fantastic job optimizing high-level code, specifically C/C++. I can't speak about other languages, but Java seems to have good optimizers too.


I'm afraid when they go the 'PC' route.. they've lost the plot, and degrade with time.
A question... Who writes the compilers, and in what language ?
Kind of defeats the purpose if they're written in C.. ??


JWASM isn't written in assembly, does that mean it's no good?


Neither MASM nor JWASM is written in assembler, and it is to their advantage that they are not: these tools must operate across different platforms and need to be portable.

RE: The argument that good (anything) is better than bad assembler, substitute the word assembler with any other language and it's still true. Good FORTRAN is better than bad C++, good Pascal is better than bad COBOL, etc.

The assumption is that most don't bother to learn how to code in assembler, so their assembler code must be bad. Same old claptrap that's been around for the last 20 years: someone wants to promote some compiler, so they foulmouth other language forms.

There is another assumption that is even more damaging to the average skills of programmers, "Don't code your own logic", use pre-canned functionality, DOT.WOT, MFC and so on. This has accounted for much of the nonsense we keep hearing and is the reason why programmers keep coming back to true low level programming, once they have been bitten a few times they want to be able to fix problems that pre-canned technology does not handle.
I think there are two issues here. The first is the content of Tim's argument:

Tim Roberts
That's naive. First, today's compilers can outrun the average assembly language programmer every single time.

Now take this example:

include \masm32\MasmBasic\   ; download
  NanoTimer()      ; start the timer
  Recall "\Masm32\include\", WinInc$()   ; read in a fat text file, and translate it into a string array
  xchg eax, ebx
  Print Str$("Reading and tokenising the %i lines of took", ebx), Str$(" %i µsecs\n\n", NanoTimer(µs))
  Inkey Trim$(WinInc$(4)), CrLf$, "...", CrLf$, WinInc$(ebx-3)
end start

Reading and tokenising the 26902 lines of took 3619 µsecs

WINDOWS.INC for 32 bit MASM (Version 1.6 RELEASE January 2012)
echo WARNING Duplicate include file

3.6 millisecs on a trusty old Celeron, that won't be beaten anytime soon by a C compiler. But 1. it's not portable, 2. it is one single file I/O routine that has been optimised "by hand" wasting entire weekends. Here is another example, a simple memcopy algo from the Code sensitivity of timings thread:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
                       dest-al    psllq CeleronM  dest-al   src-al  library
Code size           ?       70      291      222      200      269       33
2048, d0s0-0      733      735      608      608      615      872      732
2048, d1s1-0     1100      821      649      649      643      653     4299
2048, d7s7-0      995      827      654      661      649      654     4324
2048, d7s8-1     1262     1339     1207      870      618      621     4319
2048, d7s9-2     1262     1341     1218      872      619      611     4340
2048, d8s7+1     1244     1333     1188     1213      620      916     1229
2048, d8s8-0      980      819      656      655      659      655      984
2048, d8s9-1     1228     1347     1210      870      613      621     1229
2048, d9s7+2     1584     1334     1176     1208      613      932     4029
2048, d9s8+1     1587     1333     1176     1209      618      929     4020
2048, d9s9-0     1101      821      660      659      659      661     4040
2048, d15s15      766      825      654      661      661      651     4031

Seven algos, a dozen alignment situations tested, and no clear winner. Plus, we got different results for each and every tested CPU. The attempt to optimise that in assembler is likely to fail. However, if you are writing the latest C compiler for a big software vendor, you may have a dozen people willing to throw in their expertise, and a lab with twenty different CPUs for testing, so in the end your C compiler might yield, on average, better results for

void *memmove(
   void *dest,
   const void *src,
   size_t count
);
than the Masm32 hobbyists. The software vendor might be willing to invest this effort, because (Roberts) "The only time it's worth the trouble is when you have inner loops or interrupt handlers that get executed over and over and over". Memcopy is among the common candidates for an innermost loop.
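
For anyone who wants to repeat the alignment matrix with a C compiler, here is a minimal sketch of how one row of the table could be set up. It assumes that a label like "d7s8" means destination offset 7 and source offset 8 relative to a 16-byte boundary; that reading of the labels is a guess, and the names `run_alignment_case`, `copy_fn`, `COPY_BYTES` and `MAX_OFFSET` are made up for the example. It only checks that the copy is correct. The timing itself would have to be wrapped around the call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { COPY_BYTES = 2048, MAX_OFFSET = 15 };

/* 16-byte aligned backing stores with room for the largest offset. */
static _Alignas(16) unsigned char src_buf[COPY_BYTES + 16];
static _Alignas(16) unsigned char dst_buf[COPY_BYTES + 16];

/* Any routine with the memcpy/memmove signature can be plugged in. */
typedef void *(*copy_fn)(void *dest, const void *src, size_t count);

/* Copies COPY_BYTES bytes from src_buf + src_off to dst_buf + dst_off
   with the given routine. Returns 1 if the destination matches the
   source afterwards, 0 otherwise. */
int run_alignment_case(copy_fn copy, size_t dst_off, size_t src_off)
{
    assert(copy != NULL);
    assert(dst_off <= (size_t)MAX_OFFSET && src_off <= (size_t)MAX_OFFSET);

    /* Fill the source with a simple pattern and clear the destination. */
    for (size_t i = 0; i < sizeof src_buf; i++)
        src_buf[i] = (unsigned char)(i * 31u + 7u);
    memset(dst_buf, 0, sizeof dst_buf);

    copy(dst_buf + dst_off, src_buf + src_off, COPY_BYTES);

    return memcmp(dst_buf + dst_off, src_buf + src_off, COPY_BYTES) == 0;
}
```

Calling `run_alignment_case(memcpy, 7, 8)` would correspond to the "d7s8" row under that assumption; the CRT column of the table is then just this call repeated inside a timing loop.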

Now we come to the second issue:

Tim Roberts
Only the real experts can do hand-made assembler that outdoes a modern optimizing compiler.

Yes we can :bg
Those who design and program the C compiler are real experts, and they code in assembler, of course. But so do we here, as hobbyists, and it seems that a handful of people who are fanatic enough to hang around in the only really active assembler forum on a Saturday night can be considered real experts, too.


i imagine many compilers and assemblers are written in combined languages
for example, primarily written in C, with some routines and generated code written in ASM
i think i remember Andreas saying that he wrote JwAsm in C for the sake of maintainability


Quote from: jj2007 on February 19, 2012, 03:16:06 AM
Reading and tokenising the 26902 lines of
3.6 millisecs on a trusty old Celeron, that won't be beaten anytime soon by a C compiler.

I'd like to test this with a C compiler. Tokenizing it how, exactly, and what sort of array are the tokens being stored in?


i am guessing dynamic string array - BSTR's



You must be doing a memory copy operation as well. I have a test piece with a bare tokeniser that does nothing more than count the line feeds and load the start address of each line into an array, which I cannot get a timing on joining and at about 50000 lines.
Quote from: MichaelW on February 19, 2012, 01:40:21 PM
Quote from: jj2007 on February 19, 2012, 03:16:06 AM
Reading and tokenising the 26902 lines of
3.6 millisecs on a trusty old Celeron, that won't be beaten anytime soon by a C compiler.

I'd like to test this with a C compiler. Tokenizing it how, exactly, and what sort of array are the tokens being stored in?

Recall does the following:
- Open the file
- guess how many lines there could be, and heapalloc 8*guessedlines (the guess is generous, and the heapalloc could cost some cycles):
  mov fSize, eax ; we need to guess the number of lines for MbArrayDim:
  lea ebx, [eax+4] ; worst case is a file with only CrLfs, so we would need max filesize/2 lines
  shr ebx, 1 ; lines=fsize/2+2
  mov esi, eax ; bytes to read
  add eax, 8+32+1 ; 8 for SSE2 correction, 32 for unrolled loop step, plus 1 for the 4095 bytes case
  push edi ; no flags (we may get garbage, but no problem for ReadFile)
  push eax ; byte count
  call MbAllocP

- heapalloc enough space for the whole file
- read the file into this buffer
- tokenise, i.e. fill the first buffer with a) DWORD start address of string and b) DWORD len of string
- close the file and return #of lines
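
For a C comparison, the tokenising step listed above could be sketched roughly as follows. This is not a translation of Recall, only an illustration under assumptions: the file is already in memory, the table holds a pointer and a length per line instead of two DWORDs, and the names `LineEntry`, `LineTable` and `tokenise_lines` are invented for the example. The initial capacity uses the same size/2+2 guess; because the sketch also accepts LF-only files, it grows the table if that guess turns out too small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    const char *start;   /* first character of the line */
    size_t      len;     /* length without the line terminator */
} LineEntry;

typedef struct {
    LineEntry *lines;     /* malloc'ed table, release with free() */
    size_t     count;     /* number of lines found */
    size_t     bad_lines; /* lines ended by LF without a preceding CR */
} LineTable;

/* Appends one entry, doubling the table when it is full. */
static int push_line(LineTable *t, size_t *cap, const char *start, size_t len)
{
    if (t->count == *cap) {
        size_t new_cap = *cap * 2;
        LineEntry *p = realloc(t->lines, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        t->lines = p;
        *cap = new_cap;
    }
    t->lines[t->count].start = start;
    t->lines[t->count].len = len;
    t->count++;
    return 0;
}

/* Splits buf[0..size) into lines without copying any text.
   Returns 0 on success, -1 if memory runs out (table is freed). */
int tokenise_lines(const char *buf, size_t size, LineTable *out)
{
    assert(buf != NULL && out != NULL);

    size_t cap = size / 2 + 2;   /* generous guess for CrLf files */
    size_t line_start = 0;

    out->count = 0;
    out->bad_lines = 0;
    out->lines = malloc(cap * sizeof *out->lines);
    if (out->lines == NULL)
        return -1;

    for (size_t i = 0; i < size; i++) {
        if (buf[i] != '\n')
            continue;
        size_t end = i;
        if (end > line_start && buf[end - 1] == '\r')
            end--;               /* drop the CR of a CrLf pair */
        else
            out->bad_lines++;    /* bare LF */
        if (push_line(out, &cap, buf + line_start, end - line_start) != 0)
            goto fail;
        line_start = i + 1;
    }
    /* A last line without a terminator still counts as a line here. */
    if (line_start < size &&
        push_line(out, &cap, buf + line_start, size - line_start) != 0)
        goto fail;
    return 0;

fail:
    free(out->lines);
    out->lines = NULL;
    out->count = 0;
    return -1;
}
```

Since nothing is copied, the entries are only valid as long as the file buffer stays allocated.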

The tokenising of runs at 1.1 milliseconds on my Celeron, the other 2ms are for heapallocs and readfile. Recall can read Unix format (LF only), but if Windows is specified, a LF without CR will be flagged in a special variable named BadLines.
I am curious to see what the C compiler comes up with. Thanks for testing this.


Quote from: hutch-- on February 19, 2012, 02:24:01 PM

You must be doing a memory copy operation as well. I have a test piece with a bare tokeniser that does nothing more than count the line feeds and load the start address of each line into an array, which I cannot get a timing on joining and at about 50000 lines.

See post above. How did you time your code, QPC or GetTickCount?
For the snippet below I consistently get "That took 0.0062 seconds for 52357 lines", which means around 2 ms for the tokeniser. Bear in mind it's a single core Celeron Yonah, not really the latest model :wink

include \masm32\MasmBasic\   ; download
   Recall "\Masm32\include\", L1$()
   Recall "\Masm32\include\", L2$()
   Open "O", #1, ""
   Store #1, L1$()
   Store #1, L2$()
   Delay 500  ; half a second for flushing etc
   Recall "", L$()
   push eax
   Print Str$("That took %2f seconds", NanoTimer(s))
   pop eax
   Inkey Str$(" for %i lines", eax)
end start
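
On the QPC versus GetTickCount question: GetTickCount typically advances in steps of 10 to 16 ms, so a run of a couple of milliseconds can easily read as zero. The usual way around a coarse timer is to repeat the job many times and divide. Below is a minimal C sketch of that idea. It uses C11 `timespec_get` as a portable stand-in rather than QPC itself, and the names `seconds_per_call` and `count_call` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Difference b - a in seconds, computed field by field so that the
   nanosecond part keeps its precision. */
static double elapsed_seconds(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec)
         + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Runs job(arg) reps times and returns the average seconds per call. */
double seconds_per_call(void (*job)(void *), void *arg, int reps)
{
    struct timespec t0, t1;

    assert(job != NULL && reps > 0);

    timespec_get(&t0, TIME_UTC);
    for (int i = 0; i < reps; i++)
        job(arg);
    timespec_get(&t1, TIME_UTC);

    return elapsed_seconds(t0, t1) / reps;
}

/* Example workload (hypothetical): counts how often it was called. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

A real test would pass the tokeniser as the job and pick `reps` large enough that the total run is well above the timer resolution.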