I was reading a post today at: http://social.msdn.microsoft.com/Forums/is/vcgeneral/thread/897fffdd-1fb0-4105-953d-f7934875e0ae
Quote
That's naive. First, today's compilers can outrun the average assembly
language programmer every single time. Only the real experts can do
hand-made assembler that outdoes a modern optimizing compiler.
Further, most of your operating system code isn't worth optimizing. It's
simply not worth the effort to go to the trouble of hand-coding all of the
initialization and overhead code. The only time it's worth the trouble is
when you have inner loops or interrupt handlers that get executed over and
over and over.
Almost all of Windows -- and even more of Linux -- is written in C and C++.
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
I guess he isn't aware of the programmers that belong to this forum. :U
One would assume that NVidia would use an optimizing compiler, right?
.text:000002C9 mov ecx, [ebp+arg_0]
.text:000002CC push ecx
.text:000002CD call dword ptr ds:__imp__GetProcAddress@8 ; GetProcAddress(x,x)
.text:000002D3 mov ds:?g_nvapi_lpNvAPI_pepQueryInterface@@3P6APAXK@ZA, eax ; void * (*g_nvapi_lpNvAPI_pepQueryInterface)(ulong)
.text:000002D8 cmp ds:?g_nvapi_lpNvAPI_pepQueryInterface@@3P6APAXK@ZA, 0 ; void * (*g_nvapi_lpNvAPI_pepQueryInterface)(ulong)
.text:000002DF jnz short loc_2F3
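; (note how the result in eax is stored to memory and then re-read for the cmp - a simple test eax, eax would have been the obvious choice)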
That code is from nvapi.lib - maybe not too bad, but there is a lot of it, and since it ends up in my exe it makes me look bad :( :bg
It must be a case of Tim getting a bit out of practice, as Tim is an expert in coding assembler in his own right. Looking at where the post was made, it probably suffers from a dose of "political correctness" as well.
Well, Tim Roberts has been around for a LONG time and his opinion has to be respected. He is right that most compilers today can outrun an average assembly language programmer; I doubt there's any argument about that. However, there are some excellent programmers around, like Mark and Lingo, who can easily outpace the average compiler. That said, I usually don't bother with optimization techniques, since the ratio of time spent to benefit gained is normally small; a compiler, on the other hand, can optimize to no end at little to no extra cost, so it's obvious it will have the advantage.
I have to agree with Tim based on personal experience implementing fast code.
An HLL like C is more convenient because you can easily compile for other architectures very quickly... I like C despite its unpopularity amongst some of the new generation of programmers.
It allows you to add asm quite easily compared to other HLLs, which is why C is king champion for me ;)
Compilers really do a fantastic job optimizing high-level code, specifically C/C++, I can't speak about other languages but Java seems to have good optimizers too.
I'm afraid when they go the 'PC' route.. they've lost the plot, and degrade with time.
A question... Who writes the compilers, and in what language ?
Kind of defeats the purpose if they're written in C.. ??
JWASM isn't written in assembly - does that mean it's no good?
Neither MASM nor JWASM is written in assembler, and that is to their advantage: these tools must operate across different platforms and need to be portable.
RE: the argument that good (anything) is better than bad assembler - substitute the word assembler with any other language and it's still true. Good FORTRAN is better than bad C++, good Pascal is better than bad COBOL, etc etc etc ....
The assumption is that most don't bother to learn how to code in assembler, so their assembler code must be bad - the same old claptrap that's been around for the last 20 years: someone wants to promote some compiler, so they bad-mouth other language forms.
There is another assumption that is even more damaging to the average skills of programmers: "Don't code your own logic", use pre-canned functionality, DOT.WOT, MFC and so on. This has accounted for much of the nonsense we keep hearing, and it is the reason why programmers keep coming back to true low-level programming: once they have been bitten a few times, they want to be able to fix problems that pre-canned technology does not handle.
I think there are two issues here. The first is the content of Tim's argument:
Tim Roberts (http://social.msdn.microsoft.com/Forums/is/vcgeneral/thread/897fffdd-1fb0-4105-953d-f7934875e0ae)
Quote
That's naive. First, today's compilers can outrun the average assembly language programmer every single time.
Now take this example:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://www.masm32.com/board/index.php?topic=12460)
Init
NanoTimer() ; start the timer
Recall "\Masm32\include\Windows.inc", WinInc$() ; read in a fat text file, and translate it into a string array
xchg eax, ebx ; Recall returns the number of lines in eax - keep it in ebx
Print Str$("Reading and tokenising the %i lines of Windows.inc took", ebx), Str$(" %i µsecs\n\n", NanoTimer(µs))
Inkey Trim$(WinInc$(4)), CrLf$, "...", CrLf$, WinInc$(ebx-3)
Exit
end start
Reading and tokenising the 26902 lines of Windows.inc took 3619 µsecs
WINDOWS.INC for 32 bit MASM (Version 1.6 RELEASE January 2012)
...
echo WARNING Duplicate include file windows.inc
3.6 millisecs on a trusty old Celeron; that won't be beaten anytime soon by a C compiler. But 1. it's not portable, and 2. it is one single file I/O routine that has been optimised "by hand", wasting entire weekends. Here is another example, a simple memcopy algo from the Code sensitivity of timings (http://www.masm32.com/board/index.php?topic=11454.0) thread:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

Algo          memcpy  MemCo1    MemCo2   MemCoC3   MemCoP4  MemCoC2  MemCoL
Description   CRT     rep movs  movdqa   lps+hps   movdqa   movdqa   Masm32
                                dest-al  CeleronM  dest-al  src-al   library
                                psllq
Code size     ?       70        291      222       200      269      33
---------------------------------------------------------------------------
2048, d0s0-0    733    735    608    608    615    872    732
2048, d1s1-0   1100    821    649    649    643    653   4299
2048, d7s7-0    995    827    654    661    649    654   4324
2048, d7s8-1   1262   1339   1207    870    618    621   4319
2048, d7s9-2   1262   1341   1218    872    619    611   4340
2048, d8s7+1   1244   1333   1188   1213    620    916   1229
2048, d8s8-0    980    819    656    655    659    655    984
2048, d8s9-1   1228   1347   1210    870    613    621   1229
2048, d9s7+2   1584   1334   1176   1208    613    932   4029
2048, d9s8+1   1587   1333   1176   1209    618    929   4020
2048, d9s9-0   1101    821    660    659    659    661   4040
2048, d15s15    766    825    654    661    661    651   4031

Seven algos, a dozen alignment situations tested, and no clear winner. Plus, we got different results for each and every tested CPU. The attempt to optimise that in assembler is likely to fail. However, if you are writing the latest C compiler for a big software vendor, you may have a dozen people willing to throw in their expertise, and a lab with twenty different CPUs for testing, so in the end your C compiler might yield, on average, better results for
void *memmove(
void *dest,
const void *src,
size_t count
);
than the Masm32 hobbyists. The software vendor might be willing to invest this effort, because (Roberts) "The only time it's worth the trouble is when you have inner loops or interrupt handlers that get executed over and over and over". Memcopy is among the common candidates for an innermost loop.
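For the curious: the dNsM labels give the destination and source misalignment in bytes (d7s8-1: dest 7 bytes off, source 8, net difference -1), presumably from a 16-byte boundary, SSE oblige. A C harness to reproduce such an alignment sweep could look roughly like the sketch below - hypothetical, of course: the names, the __rdtsc() timing and plain memcpy as the candidate algo are my assumptions, not the thread's own timing macros.

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

#define BLOCK 2048   /* same block size as in the table above */
#define LOOPS 1000

static unsigned char srcbuf[BLOCK + 64];  /* padded so we can misalign on purpose */
static unsigned char dstbuf[BLOCK + 64];

/* average cycles for one copy, dest misaligned by d bytes, source by s */
static uint64_t time_copy(int d, int s)
{
    unsigned char *dst = (unsigned char *)(((uintptr_t)dstbuf + 15) & ~(uintptr_t)15) + d;
    unsigned char *src = (unsigned char *)(((uintptr_t)srcbuf + 15) & ~(uintptr_t)15) + s;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < LOOPS; i++)
        memcpy(dst, src, BLOCK);          /* swap in the algo under test here */
    return (__rdtsc() - t0) / LOOPS;
}

int main(void)
{
    static const int offs[][2] = { {0,0}, {1,1}, {7,7}, {7,8}, {7,9}, {8,7},
                                   {8,8}, {8,9}, {9,7}, {9,8}, {9,9}, {15,15} };
    for (size_t i = 0; i < sizeof offs / sizeof offs[0]; i++)
        printf("%d, d%ds%d%+d  %llu cycles\n", BLOCK, offs[i][0], offs[i][1],
               offs[i][0] - offs[i][1],
               (unsigned long long)time_copy(offs[i][0], offs[i][1]));
    return 0;
}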
Now we come to the second issue:
Tim Roberts (http://social.msdn.microsoft.com/Forums/is/vcgeneral/thread/897fffdd-1fb0-4105-953d-f7934875e0ae)
Only the real experts can do hand-made assembler that outdoes a modern optimizing compiler.
Yes we can :bg
Those who design and program the C compiler are real experts, and they code in assembler, of course. But so do we here, as hobbyists, and it seems that a handful of people who are fanatic enough to hang around in the only really active assembler forum on a Saturday night can be considered real experts, too.
I imagine many compilers and assemblers are written in combined languages -
for example, primarily written in C, with some routines and generated code written in ASM.
I think I remember Andreas saying that he wrote JwAsm in C for the sake of maintainability.
Quote from: jj2007 on February 19, 2012, 03:16:06 AM
Reading and tokenising the 26902 lines of Windows.inc...
3.6 millisecs on a trusty old Celeron, that won't be beaten anytime soon by a C compiler.
I'd like to test this with a C compiler. Tokenizing it how, exactly, and what sort of array are the tokens being stored in?
I am guessing a dynamic string array - BSTRs.
JJ,
you must be doing a memory copy operation as well. I have a test piece with a bare tokeniser that does nothing more than count the line feeds and load the start address of each line into an array, and I cannot get a timing on it when joining windows.inc and winextra.inc at about 50000 lines.
Quote from: MichaelW on February 19, 2012, 01:40:21 PM
Quote from: jj2007 on February 19, 2012, 03:16:06 AM
Reading and tokenising the 26902 lines of Windows.inc...
3.6 millisecs on a trusty old Celeron, that won't be beaten anytime soon by a C compiler.
I'd like to test this with a C compiler. Tokenizing it how, exactly, and what sort of array are the tokens being stored in?
Recall does the following:
- Open the file
- guess how many lines there could be, and heapalloc 8*guessedlines (the guess is generous, and the heapalloc could cost some cycles):
mov fSize, eax ; we need to guess the number of lines for MbArrayDim:
lea ebx, [eax+4] ; worst case is a file with only CrLfs, so we would need max filesize/2 lines
shr ebx, 1 ; lines=fsize/2+2
mov esi, eax ; bytes to read
add eax, 8+32+1 ; 8 for SSE2 correction, 32 for unrolled loop step, plus 1 for the 4095 bytes case
push edi ; no flags (we may get garbage, but no problem for ReadFile)
push eax ; byte count
call MbAllocP
- heapalloc enough space for the whole file
- read the file into this buffer
- tokenise, i.e. fill the first buffer with a) DWORD start address of string and b) DWORD len of string
- close the file and return #of lines
The tokenising of Windows.inc runs at 1.1 milliseconds on my Celeron; the other 2 ms are for the heapallocs and ReadFile. Recall can read Unix format (LF only), but if Windows format is specified, a LF without CR will be flagged in a special variable named BadLines.
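For whoever wants to try the C side, that single-pass scheme translates to roughly the sketch below - a minimal, hypothetical version with plain CRT file I/O and none of the SSE2 (the names and the Line struct are mine, not MasmBasic's), so expect it to be slower:

#include <stdio.h>
#include <stdlib.h>

typedef struct { char *ptr; unsigned len; } Line;  /* start address + len, as in Recall */

/* Read a text file and tokenise it in a single pass.
   Returns the number of lines; *lines gets the (ptr, len) array,
   which points into *buf. The caller frees both. */
static int recall_c(const char *name, Line **lines, char **buf)
{
    FILE *f = fopen(name, "rb");
    if (!f) return -1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    *buf = malloc(size + 1);                         /* whole file in one buffer */
    *lines = malloc((size / 2 + 2) * sizeof(Line));  /* generous guess: all CrLfs */
    fread(*buf, 1, size, f);
    fclose(f);
    (*buf)[size] = '\0';

    int n = 0;
    char *p = *buf, *start = p, *end = *buf + size;
    for (; p < end; p++) {
        if (*p == '\n') {                            /* LF terminates the line */
            (*lines)[n].ptr = start;
            (*lines)[n].len = (unsigned)(p - start - (p > start && p[-1] == '\r'));
            n++;
            start = p + 1;
        }
    }
    if (start < end) {                               /* last line without a LF */
        (*lines)[n].ptr = start;
        (*lines)[n].len = (unsigned)(end - start);
        n++;
    }
    return n;
}

int main(void)
{
    Line *lines; char *buf;
    int n = recall_c("\\Masm32\\include\\Windows.inc", &lines, &buf);
    printf("%d lines\n", n);
    free(lines); free(buf);
    return 0;
}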
I am curious to see what the C compiler comes up with. Thanks for testing this.
Quote from: hutch-- on February 19, 2012, 02:24:01 PM
JJ,
you must be doing a memory copy operation as well. I have a test piece with a bare tokeniser that does nothing more than count the line feeds and load the start address of each line into an array, and I cannot get a timing on it when joining windows.inc and winextra.inc at about 50000 lines.
See post above. How did you time your code, QPC or GetTickCount?
For the snippet below, I consistently get
That took 0.0062 seconds for 52357 lines - which means around 2 ms for the tokeniser. Bear in mind it's a single-core Celeron Yonah, not really the latest model :wink
include \masm32\MasmBasic\MasmBasic.inc ; download (http://www.masm32.com/board/index.php?topic=12460)
.data
Init
Recall "\Masm32\include\Windows.inc", L1$()
Recall "\Masm32\include\WinExtra.inc", L2$()
Open "O", #1, "WinBoth.inc"
Store #1, L1$()
Store #1, L2$()
Close
Delay 500 ; half a second for flushing etc
NanoTimer()
Recall "WinBoth.inc", L$()
push eax ; preserve the line count returned by Recall
Print Str$("That took %2f seconds", NanoTimer(s))
pop eax ; restore it for the Inkey message
Inkey Str$(" for %i lines", eax)
Exit
end start
JJ,
Timing the disk read is an unusual way to do it; the first try will always be slow, with the following passes reading the file from the cache. I don't like loop code timings for tasks of this type, so I tested it on a larger file, 64 meg of C header file with 1.9 million lines.
It keeps turning up at 125 ms for a 2-pass operation: first to count the LFs (10), then the tokenise pass. You could drop it by near half if you used an estimation method for the LF count based on the file length, but in this context I don't have the luxury of using that much extra memory. I do the first pass to get the count, then allocate the correct amount of memory to hold the pointer array.
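In C, that two-pass scheme boils down to something like the sketch below - my reading of the description, with made-up names; buf holds the file already loaded into memory:

#include <stdlib.h>

/* Pass 1: count the LFs so the pointer array can be allocated exactly.
   Pass 2: load the start address of each line into that array.
   Returns the line count. */
static long tokenise_2pass(char *buf, long size, char ***out)
{
    long count = 0, n = 0, i;
    for (i = 0; i < size; i++)
        if (buf[i] == '\n') count++;        /* pass 1: count the LF (10) */

    char **lines = malloc(count * sizeof(char *));
    char *start = buf;
    for (i = 0; i < size; i++)
        if (buf[i] == '\n') {               /* pass 2: record each line start */
            lines[n++] = start;
            start = buf + i + 1;
        }
    *out = lines;
    return count;
}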
Quote from: hutch-- on February 19, 2012, 11:23:10 PM
JJ,
Timing the disk read is an unusual way to do it; the first try will always be slow, with the following passes reading the file from the cache. I don't like loop code timings for tasks of this type, so I tested it on a larger file, 64 meg of C header file with 1.9 million lines.
It keeps turning up at 125 ms for a 2-pass operation: first to count the LFs (10), then the tokenise pass. You could drop it by near half if you used an estimation method for the LF count based on the file length, but in this context I don't have the luxury of using that much extra memory. I do the first pass to get the count, then allocate the correct amount of memory to hold the pointer array.
Hutch,
Which CPU?
WinBoth.inc is 2MB, 52357 lines, 2 millisecs for my Celeron (without the disk timing - I know, I know...)
The C header is 64 MB, 1.9 million lines - so roughly a factor of 35; 35 * 2 ms =>> 70 ms expected.
The difference is probably that MasmBasic Recall is SSE2, and that I use a single pass. There are limits to that technique, but those who are working with 2GB files surely have the right amount of RAM installed, so getting a temporarily too generous buffer shouldn't pose any practical problems. Two passes, on the other hand, means twice the work, and outside the data cache.
Can you upload the file somewhere, so that we can run a test?