(I'm so furious right now that this might get moved to the Colosseum. My apologies.)
I'm porting some assembly code to C++ with compiler intrinsics, and I just about puked when I saw the disassembly of the release build with full optimization and forced inlining in Visual Studio 2008. It's got thousands of lines of:
movdqa xmm0,xmmword ptr [rsp+900h]
pcmpgtd xmm0,xmmword ptr [rsp+360h]
movdqa xmmword ptr [rsp+1550h],xmm0
movdqa xmm0,xmmword ptr [rsp+1550h]
movdqa xmmword ptr [rsp+370h],xmm0
movdqa xmm0,xmmword ptr [rsp+370h]
pand xmm0,xmmword ptr [13FB42390h]
movdqa xmmword ptr [rsp+1570h],xmm0
movdqa xmm0,xmmword ptr [rsp+1570h]
movdqa xmmword ptr [rsp+380h],xmm0
movaps xmm0,xmmword ptr [rsp+380h]
movaps xmmword ptr [rsp+950h],xmm0
cvtps2dq xmm0,xmmword ptr [rsp+900h]
movdqa xmmword ptr [rsp+1590h],xmm0
movdqa xmm0,xmmword ptr [rsp+1590h]
movdqa xmmword ptr [rsp+390h],xmm0
movdqa xmm0,xmmword ptr [rsp+390h]
paddd xmm0,xmmword ptr [13FB42390h]
movdqa xmmword ptr [rsp+15B0h],xmm0
movdqa xmm0,xmmword ptr [rsp+15B0h]
movdqa xmmword ptr [rsp+3A0h],xmm0
movaps xmm0,xmmword ptr [rsp+3A0h]
mulps xmm0,xmmword ptr [13FB423C0h]
movaps xmmword ptr [rsp+15D0h],xmm0
movaps xmm0,xmmword ptr [rsp+15D0h]
movaps xmmword ptr [rsp+8D0h],xmm0
rsqrtps xmm0,xmmword ptr [rsp+8D0h]
movaps xmmword ptr [rsp+15F0h],xmm0
movaps xmm0,xmmword ptr [rsp+15F0h]
movaps xmmword ptr [rsp+3B0h],xmm0
rsqrtps xmm0,xmmword ptr [rsp+3B0h]
movaps xmmword ptr [rsp+1610h],xmm0
movaps xmm0,xmmword ptr [rsp+1610h]
It takes a minute to realize just how terrible this is. It's not just doing 4 unnecessary moves to and from memory for every 1 real instruction. It's doing the equivalent of "x = xmm0; xmm0 = x;" over and over, using a new variable for each one. It was even worse before I forced inlining, since it had each one as a function call passing parameters on the stack. Here's the original assembly for the above (with comments removed & names changed, since we haven't released the source, at least not yet).
pcmpgtd xmm15,xmm0
pand xmm15,xmm9
cvtps2dq xmm1,xmm1
paddd xmm1,xmm9
mulps xmm1,xmmword ptr MyFactor
rsqrtps xmm1,xmm1
rsqrtps xmm1,xmm1
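For context, the C++ with intrinsics looks roughly like this (a sketch with made-up names, since we haven't released the source):

#include <emmintrin.h>  // SSE/SSE2 intrinsics

__m128 SketchStep(__m128i a, __m128i b, __m128i bias, __m128 x, __m128 factor)
{
    __m128i mask    = _mm_and_si128(_mm_cmpgt_epi32(a, b), bias);     // pcmpgtd + pand
    __m128i rounded = _mm_add_epi32(_mm_cvtps_epi32(x), bias);        // cvtps2dq + paddd
    __m128  scaled  = _mm_mul_ps(_mm_castsi128_ps(rounded), factor);  // mulps
    (void)mask;  // the compare result feeds code not shown here
    return _mm_rsqrt_ps(_mm_rsqrt_ps(scaled));                        // rsqrtps twice
}

Each intrinsic should map to exactly one instruction; instead, each one gets its own round trip through the stack.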
HOW can they call it an "optimizing" compiler? Is there some "Don't Suck" compiler option I'm not aware of? Do you guys have any other ideas on how to get C++-esque code that isn't this bad and works on 32- and 64-bit Windows, Linux, and Mac?
I'll be trying it with G++ tomorrow. I don't expect it to do the same ridiculous thing, but I don't expect it to be great either, since previous tests I've done give decidedly mixed results for SSE support between VC++ and G++. :(
Neo,
Have a look at the output with the optimisation option turned off; just occasionally it looks better than the optimised output.
Other than that, a 64 bit MASM object module starts looking good. :bg
You can turn off optimizing for some parts of the code. Don't ask me how, you'll have to search for that, but it should leave the inline code alone.
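Actually, for MSVC I think it's #pragma optimize - a minimal sketch from memory (function name made up):

// turn the optimizer off for the functions that follow...
#pragma optimize("", off)
void LeaveThisAlone()
{
    // intrinsics here should come out more or less as written
}
// ...and restore the command-line settings afterwards
#pragma optimize("", on)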
This is why it's disturbing when I hear even seasoned ASM developers make claims like "optimizing compilers" are somehow even remotely as good as well-written, hand-optimized ASM. And I've seen this mentioned various times on this very forum!
The thing that gets me is that there's not even an excuse for it. A compiler could very easily generate the original code when I give it the intrinsics explicitly like I am, it just doesn't. Someone made a conscious decision not to implement the simplest, most obvious optimizations, instead restricting themselves to idiocy like only using 1 SSE register. It's complete with things like this: when calculating the size of an array of 2*n doubles, it does n+n to save on multiplying by 2, but then it explicitly multiplies by 8 (i.e. sizeof(double)) using mul, plus a bunch of other instructions to check for overflow. It's honestly like they wrote the entire compiler on paper and never checked their work. ::)
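Here's the shape of the code in question (a sketch with invented names, not our actual source):

#include <cstddef>

double* AllocPairs(std::size_t n)
{
    // size needed is 2*n*sizeof(double): the compiler does n+n for the 2*n,
    // then an actual mul by 8 with overflow-check instructions,
    // where a single shift would do
    return new double[2 * n];
}

(To be fair, the overflow check on new[] sizes is presumably deliberate; it's using mul instead of a shift that baffles me.)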
For optimizations where the compiler really doesn't have sufficient info to easily assert something, like choosing to generate 4 times as many random numbers at once, I can certainly forgive it. But for this kind of thing, a compiler really should be able to either figure it out or not have the problem in the first place.
Can anyone recommend a journal or conference that'd still take papers on compiler optimization? This stuff really needs to be brought to light, especially since compiler optimization has widely (for no good reason) been considered a solved problem since the 70's.
the compiler just doesn't "think" as well as it should :bg
but, if you time that code, i bet it is pretty fast, actually
that doesn't mean it isn't sloppy, nor does it mean it couldn't be better
one of the reasons i have always preferred ASM is that i have seen disassemblies from different compilers
what you are seeing is typical of older compilers, although they did not use SSE, of course
the compiler is only as good as the group of people who wrote it wanted it to be - lol
if they chose to spend more time on certain areas, those areas are improved
i suspect they focus on functions - especially those that use no API's
this is the code they can speed up in order to make their compiler appear to be better than the competition
if the code is not part of an isolated function (or perhaps in a loop), they may figure it is not going to be used over and over
if the function has API calls, they may figure it isn't speed-critical
they probably try to improve the code that will make the end result faster for time-consuming repetitive tasks
i have simplified this a bit, but you get the idea
compilers don't really "think" very well at all
if you write a function that simply sets a variable value to some constant, the compiler will write just that
in ASM, we might choose a macro for this
depending on the compiler, there may be additional overhead, as well
SetVarConst proc
push ebp
mov ebp,esp
mov eax,6
mov SomeVar,eax
pop ebp
ret
SetVarConst endp
a macro like this might be more appropriate...
mov dword ptr SomeVar,6
the group that wrote the compiler aren't likely to see how simple the code is or how it is going to be used
if you play card games at all, you may have noticed that they tend to stack the deck to make it more challenging
that is because it is really not that easy to "teach" a program to play a specific game well
it is much easier to write the code to stack the deck :P
the programmer is likely to spend more time making the cards look nice - lol
i think we have all seen cases where a compiler actually generates some pretty fast code, too
sometimes, they can be a learning tool to writing better ASM
well - you found out where the programmers that wrote the compiler focused their attention
EDIT
here is a simple and fun little exercise :P
this appears to be some type of "assignment" code (guessing)
try writing the same assignments inside a loop
for the iteration count, use a variable
assign that variable a value of 1 (single loop pass)
then, disassemble it and see if they optimized it differently because it's inside a loop :U
you could even take the next step and put it in a function as well
this may change the way you write compiled code entirely - lol
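something like this, i mean (just a sketch - names made up, your intrinsic stuff goes in the body):

#include <emmintrin.h>

volatile int g_iterations = 1;   // a variable, so the compiler can't assume the count

void Experiment(__m128i* a, const __m128i* b)
{
    for (int i = 0; i < g_iterations; i++)
    {
        *a = _mm_cmpgt_epi32(*a, *b);   // the same "assignments" as before
    }
}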
maybe you could write a paper and get published !! (Forcing Compiler Optimization)
i will happily take 10% of any profits you might make :bg
The code is already in a loop that has no API calls and does up to about 6 trillion iterations over the course of the application, so it needs performance (it already takes days to run on an i7), but it also needs to run on 6 OS's, meaning that maintaining 6 versions in assembly is a pain in the butt.
I'm definitely writing up a paper showing the compiler output for different things I give it, plus the performance results. The 33 memory accesses in 33 instructions versus 1 memory access in 7 instructions probably makes a non-negligible difference in performance, but I'll know later today. I have to write it such that I'm not presupposing anything, so it'll be named something along the lines of "An Evaluation of Compiler Optimization Quality".
...and Dave, I'll give you 500% of the $0 profit. :bg
Neo,
It appears to be a substantial drop in quality. I have on and off played with the 32-bit output of VC, and often it's very good, nothing like the mess you have posted, but there I am mainly talking about high-level source code and the assembler output after the optimiser has finished. It's probably the case that the later 64-bit compiler, especially with intrinsics, has yet to be as well developed as the 32-bit versions. I have generally seen that very recent compiler design is going backwards from the techniques that were being used 6 or 7 years ago; before that, they tended to be out of date with technology changes in processor design.
I don't know if anyone has butchered it, but Wikipedia used to have a lot of very good technical data on compiler optimisation theory and practice. I doubt that's the problem, though; it's probably the case that the intrinsics were tacked on as an afterthought to address the average level of assembler literacy in high-level programmers.
The problem seems too dumb to be intentional. Could it be something that was fixed in a SP or in a later version of the compiler?
Quote from: MichaelW on May 12, 2010, 04:24:46 PM
The problem seems too dumb to be intentional. Could it be something that was fixed in a SP or in a later version of the compiler?
Well, it's VC++ 2008, and they've had "support" for SSE since VC++ 2003 (which outperformed G++ from 2006 in a few benchmark tests I did back then). I can try it on VC++ 2010, but I don't think they've done a thing to actually improve the compiler since VC++ 2008.
i think Hutch is on the right track
the 64-bit versions just haven't matured, yet
i bet their next release will have a lot of improvements
you could try testing the same type of code in a 32-bit compiler to see what the differences are
as for my idea - well - at least it was an idea - lol
dang - i was hoping you could get published :'(
Sure enough, Hutch is probably right. The 64-bit compiler's code is 10 times slower than the 32-bit compiler's code. That said, I also discovered that the 64-bit compiler is from Visual Studio 2005, whereas the 32-bit one is from 2008, so I'm installing the 2008 64-bit compiler. Apparently VS 2010 doesn't support 64-bit at all yet, so that one's not an option.
Quote from: Neo
Apparently VS 2010 doesn't support 64-bit at all yet, so that one's not an option.
It sure does support it, I've been using it.
Also, Visual Studio 2005 was from when they were just getting started with the x64 compiler, so I'm not surprised it's terrible.
Neo,
If you don't mind bashing out the SSE code in bare mnemonics, the "Intel" option in GAS, last time I saw it, could probably come very close to direct MASM mnemonic notation, so you may be able to write some common code if it's only pointed at x86-64 targets.
From memory, with GAS in Intel mode you have to use the fully specified QWORD PTR style, but you could probably craft both with the same mnemonics and just use different headers and footers for the different platforms.
Quote from: hutch-- on May 13, 2010, 03:07:14 AM
If you don't mind bashing out the SSE code in bare mnemonics, the "Intel" option in GAS, last time I saw it, could probably come very close to direct MASM mnemonic notation, so you may be able to write some common code if it's only pointed at x86-64 targets.
From memory, with GAS in Intel mode you have to use the fully specified QWORD PTR style, but you could probably craft both with the same mnemonics and just use different headers and footers for the different platforms.
That's what we've got at the moment, but it still means 6 different versions for the 6 platforms, since GAS for Mac and GAS for Linux have different, mutually incompatible syntaxes. Plus, 64-bit Mac uses RIP-relative addressing. They also only partly support Intel syntax, so I have to go through and change every number, register dereference, and variable declaration, manually lay out the stack, and work around a serious bug with encoding conditional jumps... backward? or maybe it was forward. One of the two is flagged as an error in the Linux version of GAS, so I have to switch to AT&T syntax for 1 instruction, then switch back to Intel. etc. etc.
Anyway, the points are that it's a giant pain in the butt for me to update 6 versions of the code when we need it to do new stuff, and I'm already the only one who can modify the code. :( The hope is that with a simple C++ wrapper, it'll be almost like writing scalar code in C++... which it is, when the compiler's not being crazy. :wink
Neil,
With that level of confusion, wouldn't it be an option to put the critical parts into an OS-specific DLL? Or into *.obj files that you assemble separately every now and then?
...or write 6 (well, 5) assemblers that all use the same syntax :lol
(you can use one that already exists as "master")
Quote from: dedndave on May 12, 2010, 10:11:16 AM
here is a simple and fun little exercise :P
this appears to be some type of "assignment" code (guessing)
try writing the same assignments inside a loop
for the iteration count, use a variable
assign that variable a value of 1 (single loop pass)
then, disassemble it and see if they optimized it differently because it's inside a loop :U
you could even take the next step and put it in a function as well
this may change the way you write compiled code entirely - lol
maybe you could write a paper and get published !! (Forcing Compiler Optimization)
i will happily take 10% of any profits you might make :bg
that is harder. if what you mean is doing something like:
for( int i = 0; i < 5; i++ )
x = i;
and you want to make it just do x = 4 instead. well, that is tricky for a compiler to do. however, having something like this:
for( int i = 0; i < 5; i++ ) {
...................
x = 5;
}
would be entirely different and easily optimised. it is known as loop-invariant code motion: you hoist out all invariants. this is really easy for a compiler to do. it can just check live-ins/live-outs at a given point. i wrote a compiler a few months ago and i wanted to implement the first case too, and sort of decided it was too much work :-P
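i.e. after hoisting, the compiler can treat it as:

x = 5;                          // the invariant store, hoisted out
for( int i = 0; i < 5; i++ ) {
    // ... the rest of the loop body stays here ...
}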
writing a compiler - big undertaking, Mike - way over my head, i am sure - lol
but, i think you highlight a point i was trying to make...
Quote
... and i wanted to implement the first case too. and sort of decided it was too much work
and, it probably wouldn't have helped make a better executable very often
guys that write these things (including you :bg ) tend to focus their efforts toward things that make better EXE's, and more often
i am still trying to think of an easy way to help Neil get through 6 versions of code...
... maybe write a program that generates the alternate modified source files from a single input file ???
Dave,
Quote
i am still trying to think of an easy way to help Neil get through 6 versions of code...
... maybe write a program that generates the alternate modified source files from a single input file ???
Macros come to mind.
Dave.
Quote from: KeepingRealBusy on May 13, 2010, 10:01:47 PM
Quote
i am still trying to think of an easy way to help Neil get through 6 versions of code...
... maybe write a program that generates the alternate modified source files from a single input file ???
Macros come to mind.
Dave.
Right, especially if combined with conditional assembly
if OS eq Win32
...
elseif OS eq Linux
...
elseif OS eq Mac
...
else
.err <no such OS, sorry>
endif
Macros alone would work, except that you'd need the assembler to generate object files in ELF (for Linux) or Mach-O (for Mac) formats, whereas Windows OBJ files are COFF, and the Linux and Mac assemblers have a completely different syntax for conditionally including/excluding pieces of code. That rules out the two ideas along those lines I'd looked into a few months ago. :(
Solar Assembler works on Windows, Linux/Unix and MacOSX and it has the same syntax for all platforms including #ifdef #endif.
The idea is that you should use an assembler that works and has the same syntax on all target platforms. If you do not like my assembler then you could try NASM or FASM.
Warning: Solar Assembler still has some bugs and is in alpha stage, but I usually fix those fast if asked nicely :))
About your compiler optimization issue: we cannot decide unless we see the source code. You might be using #pragma's or intrinsics in such a way as to effectively disable compiler optimizations.
Compilers are dumb conceptually because they have no clue about algorithms ... BUT they are not that dumb at all when it comes to small tricks and instruction level optimizations. Hence IMHO you must be doing something wrong in your C code :D :D :D
I'm agreeing with Bogdan on the last part. These small things should almost certainly be handled in the peephole optimization part of the compiler.
I have learnt this much: some algorithms cannot be optimised beyond their compiled output, for a very simple reason: if you hit the memory read/write boundary, nothing will make it faster. I had a test piece I used a couple of years ago, a Sedgewick radix/quicksort hybrid that was the test piece for a tool I was designing, and while I thought the VC output code was scruffy, multiple complete rewrites would not go faster than the compiled output. With practice you could get it as fast using fewer registers and lower call overhead to its helper functions; individually the helper functions were measurably faster, but the overall algo hit the wall and just would not go faster.
I think this is the case with at least some algos: with code that does not hit anything hard in any particular place, compiler optimisation theory does do the job, but it hangs together a lot differently from a manually written algorithm. Much of modern compiler optimisation theory is simple stuff that makes sense and that you normally do anyway when you lay out an algorithm: keep unnecessary variables out of loops, plan the register use to reduce memory reads/writes, etc. But it is here that manually written code is simply more flexible, and with code that can be made faster, you pick up the benefit in the areas where it matters.
I have long held the view that for one-off usage, optimisation rarely ever matters, but as code reuse increases, to where an algorithm may be run trillions of times in its lifetime, it starts to pay off to do the work to make it faster or smaller or both, as the unit cost of creating the algorithm diminishes with increasing usage.