VC++ 6.0 just generated this snippet, which serves as a perfect example of how compilers can do some obviously sub-optimal things.
C++ statement:
if (++data[0] != 0)
Generated assembly:
mov eax,dword ptr [esi+8]
mov edx,dword ptr [eax]
inc edx
mov dword ptr [eax],edx
mov eax,dword ptr [esi+8]
cmp dword ptr [eax],0
jne label1
The redundant instruction is rather glaring.
Try turning on optimization :wink.
That was with full optimization turned on.
#include <cstdio>
void foo(int *data)
{
if (++data[0] != 0) std::printf("!");
}
Compiled without optimization (VC7, I didn't have VC6 installed) produced
push ebp
mov ebp, esp
mov eax, DWORD PTR _data$[ebp]
mov ecx, DWORD PTR [eax]
add ecx, 1
mov edx, DWORD PTR _data$[ebp]
mov DWORD PTR [edx], ecx
mov eax, DWORD PTR _data$[ebp]
cmp DWORD PTR [eax], 0
je SHORT $L665
push OFFSET FLAT:$SG667
call _printf
add esp, 4
$L665:
pop ebp
ret 0
which looks quite a lot like your result, but compiled with /O2 it produced
mov eax, DWORD PTR _data$[esp-4]
inc DWORD PTR [eax]
je SHORT $L666
mov DWORD PTR _data$[esp-4], OFFSET FLAT:??_C@_01DCLJPIOD@?$CB?$AA@
jmp _printf
$L666:
ret 0
Did you perchance have any inline asm in the same source file?
Here is the result with VC6:
cl /Zl /c /Ogtyb2 /Gs /G6 /Gz /Zp1 /FAs /Fasample.asm sample.cpp
mov eax, DWORD PTR _data$[esp-4]
inc DWORD PTR [eax]
je SHORT $L578
push OFFSET FLAT:$SG579
call _printf
pop ecx
Compiler inefficiences is one of my favorite topics :) Jibz always strongly supports the C compiler. :) He's really good at optimizing code in C. However with VC++ 6.0 and 7.0 I have issues with how they generate optimized code. I actually wrote an article not too long ago that covered that and a few other things.
http://www.masmforum.com/simple/index.php?topic=1140.0
Also the guy who writes Virtualdub is irritated by VC++ optimizations. A lot of times you can see him post code snippets of assembler code the VC++ 7.0 generated and griping about it. Most of his project is in C, but large chunks are in assembler. If you go back through his archives of comments that's where he shows some C code and shows the corresponding assembler code usually for both VC++ 6.0 and 7.0
http://www.virtualdub.org/
http://www.virtualdub.org/docs_compiling
On the plus side, I have been using GCC a lot at work ( 3.3.3), and it produces better assembler code than VC++. Also as Vortex and Jibz pointed out, using the right compiler flags is key to at least getting good performance out of VC++ 6.0. For serious speed I always use a makefile that VC++ generates, and I tweak it with different parameters until I find the optimum one for me.
Posit,
I have doubts that Microsoft would have missed something this obvious. Are you sure that the redundant instruction is not an optimization?
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.586 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
include timers.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
junk1 dd offset junk2
junk2 dd 0
.code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
LOOP_COUNT EQU 10000000
REPEAT_COUNT EQU 100
mov esi,offset junk1-8
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
mov eax,dword ptr [esi+8]
mov edx,dword ptr [eax]
inc edx
mov dword ptr [eax],edx
mov eax,dword ptr [esi+8]
cmp dword ptr [eax],0
jne $+2
ENDM
counter_end
print ustr$(eax)
print chr$(" cycles",13,10)
mov junk2,0
counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
mov eax,dword ptr [esi+8]
mov edx,dword ptr [eax]
inc edx
mov dword ptr [eax],edx
;mov eax,dword ptr [esi+8]
cmp dword ptr [eax],0
jne $+2
ENDM
counter_end
print ustr$(eax)
print chr$(" cycles",13,10)
mov eax,input(13,10,"Press enter to exit...")
exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Results on my P3:
540 cycles
698 cycles
EDIT: added "mov junk2,0" so both loops start at the same state, no change in timings.
Quote from: Jibz on May 18, 2005, 06:55:57 PM
Did you perchance have any inline asm in the same source file?
Yes, quite a bit, although none in the particular function that generated that output.
I compiled it with the options:
/nologo /Zp16 /ML /W3 /GX /Zi /O2 /Ob2 /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /Fp"Release/ArbInt.pch" /YX
Quote from: MichaelW on May 18, 2005, 07:47:42 PM
I have doubts that Microsoft would have missed something this obvious. Are you sure that the redundant instruction is not an optimization?
On my 2.4 GHz Celeron the loop with the redundant instruction runs slightly slower than if I comment out the instruction, although I'm timing it within the C++ program.
Mark_Larson,
Well, I don't know if I always strongly support C compilers .. I generally recommend using the right tool for the job :U.
I do however think that C compilers of today generate more than fast enough code for most uses. Programs like VirtualDub, where the speed of inner loops are critical are a natural exception, and a compiler can never compete with a human there.
Posit,
Inline asm has a tendency to affect the optimizer. Also I can see you have debugging turned on, I don't know if that might do something to the output of VC6.
Either way, it looks a bit like the optimizer was not on when the code you initially posted was produced.
Quote from: Jibz on May 18, 2005, 08:09:20 PM
Mark_Larson,
Well, I don't know if I always strongly support C compilers .. I generally recommend using the right tool for the job :U.
I do however think that C compilers of today generate more than fast enough code for most uses. Programs like VirtualDub, where the speed of inner loops are critical are a natural exception, and a compiler can never compete with a human there.
Yep, for large projects C is the way to go, because of code development time. And then if you are going to convert some of the C routines to use assembler, you need to profile your code and find the slow routines before just randomly converting stuff to assembler.
Posit, Jibz is referring to the VC++ code getting tripped up by the assembler it generates if you use inline assembly language. GCC handles it better because for inline assembler it lets you specify a lot of info so that the C compiler doesn't get messed up in the asm code it is generating. Think of it this way, you do some inline assembly that uses ESI, and then VC++ was planning on using ESI for some of the C code you wrote, and now it has to produce less than optimal code and not use ESI. That's why I also recommend if you are going to write some of your code in assembler, don't just do parts of a routine in inline assembler. Either convert the whole thing to assembler or none of it. A lot of times you won't get the performance you had wanted if you only convert part of a routine, because you and VC++ will be tripping over each other.
Quote from: Mark_Larson on May 18, 2005, 07:44:28 PM
Compiler inefficiences is one of my favorite topics :)
Mine too. In fact, I'm currently writing a book on it. :)
Quote
On the plus side, I have been using GCC a lot at work ( 3.3.3), and it produces better assembler code than VC++.
I don't see this that often. Some things GCC does a whole lot better than VC++ (6), other things, it does a whole lot worse. I don't have the web site handy, but there is a site that benchmarks a bunch of different C compilers. Those benchmarks suggest that VCC is quite a bit better than GCC and only a little bit inferior to the Intel compiler. Of course, those results are based on the particular benchmarks that guy used.
Quote
Also as Vortex and Jibz pointed out, using the right compiler flags is key to at least getting good performance out of VC++ 6.0. For serious speed I always use a makefile that VC++ generates, and I tweak it with different parameters until I find the optimum one for me.
It also helps to write your C code in such a way that helps the compiler generate better machine code.
Cheers,
Randy Hyde
The ms compiler is really good for producing small code, but for speed considerations you gentlemen should get your hands on the intel's one.
To avoid interfering asm inlines use intrinsics, this will let the compiler to allocate the registers, etc.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vcrefcompilerintrinsics.asp
http://www.intel.com/cd/ids/developer/asmo-na/eng/59644.htm
Have a nice day.
Quote from: Jibz on May 18, 2005, 08:09:20 PM
Posit,
Inline asm has a tendency to affect the optimizer. Also I can see you have debugging turned on, I don't know if that might do something to the output of VC6.
Either way, it looks a bit like the optimizer was not on when the code you initially posted was produced.
The disassembly I posted was produced with the exact compiler options listed. I just tried it again with the same result. The debugger has to be on to view disassembly information in VC++ 6.0.
Quote from: Mark_Larson on May 18, 2005, 08:24:31 PM
Posit, Jibz is referring to the VC++ code getting tripped up by the assembler it generates if you use inline assembly language. GCC handles it better because for inline assembler it lets you specify a lot of info so that the C compiler doesn't get messed up in the asm code it is generating. Think of it this way, you do some inline assembly that uses ESI, and then VC++ was planning on using ESI for some of the C code you wrote, and now it has to produce less than optimal code and not use ESI. That's why I also recommend if you are going to write some of your code in assembler, don't just do parts of a routine in inline assembler. Either convert the whole thing to assembler or none of it. A lot of times you won't get the performance you had wanted if you only convert part of a routine, because you and VC++ will be tripping over each other.
Yeah, I already discovered that the doing small _asm blocks within a function tends to put you at odds with the compiler--at the very least the compiler pads the _asm block with protective pushes and pops. So I've been converting entire functions to inline assembly while keeping either the __cdecl interface or sometimes changing it to __fastcall.
Here is the entirety of the function that generated the initial snippet:
//increments an ArbInt
void ArbInt::operator++()
{
if (++data[0] != 0) //no carry, so return
return;
unsigned long i = 1; //start at 2nd digit of ArbInt
while (size > i) //go until end of array
{
if (++data[i] != 0) //return if next digit doesn't carry
return;
++i;
}
//we've reached end of array and there's still a carry,
//so add a leading one
if (size == capacity) //allocate more memory if necessary
{
++capacity;
data = (unsigned long*)realloc(data, capacity<<2);
}
data[size] = 1;
++size; //update the size
}
As you can see, it's straight C++, with the only child function a call to the C standard realloc(). And here's the disassembly of the function:
1145: //increments an ArbInt
1146: void ArbInt::operator++()
1147: {
00402B90 push esi
00402B91 mov esi,ecx
1148: if (++data[0] != 0) //no carry, so return
00402B93 mov eax,dword ptr [esi+8]
00402B96 mov edx,dword ptr [eax]
00402B98 inc edx
00402B99 mov dword ptr [eax],edx
00402B9B mov eax,dword ptr [esi+8]
00402B9E cmp dword ptr [eax],0
00402BA1 jne ArbInt::operator+++66h (00402bf6)
1149: return;
1150:
1151: unsigned long i = 1; //start at 2nd digit of ArbInt
1152: while (size > i) //go until end of array
00402BA3 mov edx,dword ptr [esi]
00402BA5 mov ecx,1
00402BAA cmp edx,ecx
00402BAC jbe ArbInt::operator+++35h (00402bc5)
1153: {
1154: if (++data[i] != 0) //return if next digit doesn't carry
00402BAE mov edx,dword ptr [eax+ecx*4]
00402BB1 inc edx
00402BB2 mov dword ptr [eax+ecx*4],edx
00402BB5 mov eax,dword ptr [esi+8]
00402BB8 cmp dword ptr [eax+ecx*4],0
00402BBC jne ArbInt::operator+++66h (00402bf6)
00402BBE mov edx,dword ptr [esi]
1155: return;
1156: ++i;
00402BC0 inc ecx
00402BC1 cmp edx,ecx
00402BC3 ja ArbInt::operator+++1Eh (00402bae)
1157: }
1158:
1159: //we've reached end of array and there's still a carry,
1160: //so add a leading one
1161: if (size == capacity) //allocate more memory if necessary
00402BC5 mov eax,dword ptr [esi+4]
00402BC8 mov ecx,dword ptr [esi]
00402BCA cmp ecx,eax
00402BCC jne ArbInt::operator+++55h (00402be5)
1162: {
1163: ++capacity;
1164: data = (unsigned long*)realloc(data, capacity<<2);
00402BCE mov ecx,dword ptr [esi+8]
00402BD1 inc eax
00402BD2 mov dword ptr [esi+4],eax
00402BD5 shl eax,2
00402BD8 push eax
00402BD9 push ecx
00402BDA call _realloc (00427582)
00402BDF add esp,8
00402BE2 mov dword ptr [esi+8],eax
1165: }
1166:
1167: data[size] = 1;
00402BE5 mov edx,dword ptr [esi]
00402BE7 mov eax,dword ptr [esi+8]
00402BEA mov dword ptr [eax+edx*4],1
1168: ++size; //update the size
00402BF1 mov eax,dword ptr [esi]
00402BF3 inc eax
00402BF4 mov dword ptr [esi],eax
00402BF6 pop esi
1169: }
00402BF7 ret
Something odd that I just noticed while posting this is that the exact same statement later in the function does not generate a redundant instruction like the first occurrence of the statement.
Quote from: hitchhikr on May 18, 2005, 08:58:31 PM
The ms compiler is really good for producing small code, but for speed considerations you gentlemen should get your hands on the intel's one.
I tried out the Intel compiler a few days ago, used the suggested optimization switches on Intel's website (/O3 and a couple others), and this particular project ran about 5% slower than after compilation with VC++ 6.0.
There is some conventional wisdom in NOT MIXING C and assembler code within the same module. Most modern C compilers when set with the correct flags perform a number of optimisations (usually based on register theory and redundancy removal) and when you plonk your own inline assembler in the middle of it, you mess up the internal optimisation.
The trick has always been to use each tool to its advantage and this is achieved by the compatible object module format so you can optimise in BOTH languages and combine the results.
These days I am only familiar with VC6 and the VCTOOLKIT C compilers as far as output goes and if the C code is well written you get very good results in most cases. Where the code design requires complexity or algorithm layout that they don't or can't do well, you have MASM to target the task.
To test whether inline assembly is having an effect, I tried the original versions of the project source files. The compiler generated the exact same assembly instructions for the function in question, with no inline assembly anywhere in the project.
I then tried turning various compiler options on and off. The only one that made a difference was adding /G6, which generates the following slightly shorter code. The redundant instruction is even more noticable.
173: if (++data[0] != 0) //no carry, so return
004014B3 mov eax,dword ptr [esi+8]
004014B6 inc dword ptr [eax]
004014B8 mov eax,dword ptr [esi+8]
004014BB cmp dword ptr [eax],0
004014BE jne ArbInt::operator+++5Fh (0040150f)
Either you don't know how to use a compiler or your own a crippled version of vc.
Result with the Digital Mars compiler:
_TEXT segment
assume CS:_TEXT
_foo:
mov EAX,4[ESP]
inc dword ptr [EAX]
je L15
push offset FLAT:_DATA
call near ptr _printf
add ESP,4
L15: ret
_TEXT ends
Quote from: hitchhikr on May 19, 2005, 08:22:01 AM
Either you don't know how to use a compiler or your own a crippled version of vc.
I'm willing to entertain both possibilities.
Posit,
I'm getting roughly the same result as you with your code and VC7. Quite strange.
I wonder if it's really due to some optimization for Intel processors as MichaelW suggested .. what is your CPU, Intel or AMD?
GCC (3.4.2) -O2 produced
mov edx, DWORD PTR [ebx+8]
mov eax, DWORD PTR [edx]
inc eax
mov DWORD PTR [edx], eax
test eax, eax
jne L1
which doesn't have the extra read, but the test seems superfluous.
If you are using Visual C++ Standard Edition, it does not include the optimizing compiler. You can get an optimizing compiler from the VC++ Toolkit 2003.
Related reading material:
http://www.sawtoothdistortion.com/Articles/FreeOptimizingCompiler.html
The CPU is a Celeron 2.4 GHz.
I split off the new posts and moved them here:
http://www.masm32.com/board/index.php?topic=14656.0