Print Page - Compiler Inefficiencies

Title: Compiler Inefficiencies
Post by: Posit on May 18, 2005, 05:36:32 PM

VC++ 6.0 just generated this snippet, which serves as a perfect example of how compilers can do some obviously sub-optimal things.

Code Select


C++ statement:

		if (++data[0] != 0)

Generated assembly:

		mov         eax,dword ptr [esi+8]
		mov         edx,dword ptr [eax]
		inc         edx
		mov         dword ptr [eax],edx
		mov         eax,dword ptr [esi+8]
		cmp         dword ptr [eax],0
		jne         label1

The redundant instruction is rather glaring.

Title: Re: Compiler Inefficiencies
Post by: Jibz on May 18, 2005, 06:46:35 PM

Try turning on optimization :wink.

Title: Re: Compiler Inefficiencies
Post by: Posit on May 18, 2005, 06:49:50 PM

That was with full optimization turned on.

Title: Re: Compiler Inefficiencies
Post by: Jibz on May 18, 2005, 06:55:57 PM

Code Select

#include <cstdio>

void foo(int *data)
{
    if (++data[0] != 0) std::printf("!");
}

Compiled without optimization (VC7, I didn't have VC6 installed) produced

Code Select

    push    ebp
    mov     ebp, esp
    mov     eax, DWORD PTR _data$[ebp]
    mov     ecx, DWORD PTR [eax]
    add     ecx, 1
    mov     edx, DWORD PTR _data$[ebp]
    mov     DWORD PTR [edx], ecx
    mov     eax, DWORD PTR _data$[ebp]
    cmp     DWORD PTR [eax], 0
    je      SHORT $L665
    push    OFFSET FLAT:$SG667
    call    _printf
    add     esp, 4
$L665:
    pop     ebp
    ret     0

which looks quite a lot like your result, but compiled with /O2 it produced

Code Select

    mov     eax, DWORD PTR _data$[esp-4]
    inc     DWORD PTR [eax]
    je      SHORT $L666
    mov     DWORD PTR _data$[esp-4], OFFSET FLAT:??_C@_01DCLJPIOD@?$CB?$AA@
    jmp     _printf
$L666:
    ret     0

Did you perchance have any inline asm in the same source file?

Title: Re: Compiler Inefficiencies
Post by: Vortex on May 18, 2005, 07:37:35 PM

Here is the result with VC6:

Code Select


cl /Zl /c /Ogtyb2 /Gs /G6 /Gz /Zp1 /FAs /Fasample.asm sample.cpp

Code Select


mov   eax, DWORD PTR _data$[esp-4]
inc   DWORD PTR [eax]
je    SHORT $L578
push  OFFSET FLAT:$SG579
call   _printf
pop   ecx

Title: Re: Compiler Inefficiencies
Post by: Mark_Larson on May 18, 2005, 07:44:28 PM

Compiler inefficiences is one of my favorite topics :) Jibz always strongly supports the C compiler. :) He's really good at optimizing code in C. However with VC++ 6.0 and 7.0 I have issues with how they generate optimized code. I actually wrote an article not too long ago that covered that and a few other things.

http://www.masmforum.com/simple/index.php?topic=1140.0

Also the guy who writes Virtualdub is irritated by VC++ optimizations. A lot of times you can see him post code snippets of assembler code the VC++ 7.0 generated and griping about it. Most of his project is in C, but large chunks are in assembler. If you go back through his archives of comments that's where he shows some C code and shows the corresponding assembler code usually for both VC++ 6.0 and 7.0

http://www.virtualdub.org/
http://www.virtualdub.org/docs_compiling

On the plus side, I have been using GCC a lot at work ( 3.3.3), and it produces better assembler code than VC++. Also as Vortex and Jibz pointed out, using the right compiler flags is key to at least getting good performance out of VC++ 6.0. For serious speed I always use a makefile that VC++ generates, and I tweak it with different parameters until I find the optimum one for me.

Title: Re: Compiler Inefficiencies
Post by: MichaelW on May 18, 2005, 07:47:42 PM

Posit,

I have doubts that Microsoft would have missed something this obvious. Are you sure that the redundant instruction is not an optimization?

Code Select


; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive 

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    include \masm32\macros\macros.asm
    include timers.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      junk1 dd offset junk2
      junk2 dd 0
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start: 
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT EQU 10000000
    REPEAT_COUNT EQU 100

    mov   esi,offset junk1-8

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov   eax,dword ptr [esi+8]
        mov   edx,dword ptr [eax]
        inc   edx
        mov   dword ptr [eax],edx
        mov   eax,dword ptr [esi+8]
        cmp   dword ptr [eax],0
        jne   $+2
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    mov   junk2,0
    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov   eax,dword ptr [esi+8]
        mov   edx,dword ptr [eax]
        inc   edx
        mov   dword ptr [eax],edx
        ;mov   eax,dword ptr [esi+8]
        cmp   dword ptr [eax],0
        jne   $+2
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    mov   eax,input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Results on my P3:

Code Select


540 cycles
698 cycles

EDIT: added "mov junk2,0" so both loops start at the same state, no change in timings.

Title: Re: Compiler Inefficiencies
Post by: Posit on May 18, 2005, 07:54:34 PM

Quote from: Jibz on May 18, 2005, 06:55:57 PM
Did you perchance have any inline asm in the same source file?

Yes, quite a bit, although none in the particular function that generated that output.

I compiled it with the options:

Code Select


/nologo /Zp16 /ML /W3 /GX /Zi /O2 /Ob2 /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /Fp"Release/ArbInt.pch" /YX

Title: Re: Compiler Inefficiencies
Post by: Posit on May 18, 2005, 08:00:27 PM

Quote from: MichaelW on May 18, 2005, 07:47:42 PM
I have doubts that Microsoft would have missed something this obvious. Are you sure that the redundant instruction is not an optimization?

On my 2.4 GHz Celeron the loop with the redundant instruction runs slightly slower than if I comment out the instruction, although I'm timing it within the C++ program.

Title: Re: Compiler Inefficiencies
Post by: Jibz on May 18, 2005, 08:09:20 PM

Mark_Larson,

Well, I don't know if I always strongly support C compilers .. I generally recommend using the right tool for the job :U.

I do however think that C compilers of today generate more than fast enough code for most uses. Programs like VirtualDub, where the speed of inner loops are critical are a natural exception, and a compiler can never compete with a human there.

Posit,

Inline asm has a tendency to affect the optimizer. Also I can see you have debugging turned on, I don't know if that might do something to the output of VC6.

Either way, it looks a bit like the optimizer was not on when the code you initially posted was produced.

Title: Re: Compiler Inefficiencies
Post by: Mark_Larson on May 18, 2005, 08:24:31 PM

Quote from: Jibz on May 18, 2005, 08:09:20 PM
Mark_Larson,

Well, I don't know if I always strongly support C compilers .. I generally recommend using the right tool for the job :U.

I do however think that C compilers of today generate more than fast enough code for most uses. Programs like VirtualDub, where the speed of inner loops are critical are a natural exception, and a compiler can never compete with a human there.

Yep, for large projects C is the way to go, because of code development time. And then if you are going to convert some of the C routines to use assembler, you need to profile your code and find the slow routines before just randomly converting stuff to assembler.

Posit, Jibz is referring to the VC++ code getting tripped up by the assembler it generates if you use inline assembly language. GCC handles it better because for inline assembler it lets you specify a lot of info so that the C compiler doesn't get messed up in the asm code it is generating. Think of it this way, you do some inline assembly that uses ESI, and then VC++ was planning on using ESI for some of the C code you wrote, and now it has to produce less than optimal code and not use ESI. That's why I also recommend if you are going to write some of your code in assembler, don't just do parts of a routine in inline assembler. Either convert the whole thing to assembler or none of it. A lot of times you won't get the performance you had wanted if you only convert part of a routine, because you and VC++ will be tripping over each other.

Title: Re: Compiler Inefficiencies
Post by: Randall Hyde on May 18, 2005, 08:54:13 PM

Quote from: Mark_Larson on May 18, 2005, 07:44:28 PM
Compiler inefficiences is one of my favorite topics :)

Mine too. In fact, I'm currently writing a book on it. :)

Quote
On the plus side, I have been using GCC a lot at work ( 3.3.3), and it produces better assembler code than VC++.

I don't see this that often. Some things GCC does a whole lot better than VC++ (6), other things, it does a whole lot worse. I don't have the web site handy, but there is a site that benchmarks a bunch of different C compilers. Those benchmarks suggest that VCC is quite a bit better than GCC and only a little bit inferior to the Intel compiler. Of course, those results are based on the particular benchmarks that guy used.

Quote
Also as Vortex and Jibz pointed out, using the right compiler flags is key to at least getting good performance out of VC++ 6.0. For serious speed I always use a makefile that VC++ generates, and I tweak it with different parameters until I find the optimum one for me.

It also helps to write your C code in such a way that helps the compiler generate better machine code.
Cheers,
Randy Hyde

Title: Re: Compiler Inefficiencies
Post by: hitchhikr on May 18, 2005, 08:58:31 PM

The ms compiler is really good for producing small code, but for speed considerations you gentlemen should get your hands on the intel's one.

To avoid interfering asm inlines use intrinsics, this will let the compiler to allocate the registers, etc.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vcrefcompilerintrinsics.asp
http://www.intel.com/cd/ids/developer/asmo-na/eng/59644.htm

Have a nice day.

Title: Re: Compiler Inefficiencies
Post by: Posit on May 18, 2005, 10:17:50 PM

Quote from: Jibz on May 18, 2005, 08:09:20 PM
Posit,

Inline asm has a tendency to affect the optimizer. Also I can see you have debugging turned on, I don't know if that might do something to the output of VC6.

Either way, it looks a bit like the optimizer was not on when the code you initially posted was produced.

The disassembly I posted was produced with the exact compiler options listed. I just tried it again with the same result. The debugger has to be on to view disassembly information in VC++ 6.0.

Quote from: Mark_Larson on May 18, 2005, 08:24:31 PM
Posit, Jibz is referring to the VC++ code getting tripped up by the assembler it generates if you use inline assembly language. GCC handles it better because for inline assembler it lets you specify a lot of info so that the C compiler doesn't get messed up in the asm code it is generating. Think of it this way, you do some inline assembly that uses ESI, and then VC++ was planning on using ESI for some of the C code you wrote, and now it has to produce less than optimal code and not use ESI. That's why I also recommend if you are going to write some of your code in assembler, don't just do parts of a routine in inline assembler. Either convert the whole thing to assembler or none of it. A lot of times you won't get the performance you had wanted if you only convert part of a routine, because you and VC++ will be tripping over each other.

Yeah, I already discovered that the doing small _asm blocks within a function tends to put you at odds with the compiler--at the very least the compiler pads the _asm block with protective pushes and pops. So I've been converting entire functions to inline assembly while keeping either the __cdecl interface or sometimes changing it to __fastcall.

Here is the entirety of the function that generated the initial snippet:

Code Select


//increments an ArbInt
void ArbInt::operator++()
{
	if (++data[0] != 0)     //no carry, so return
		return;

	unsigned long i = 1;    //start at 2nd digit of ArbInt
	while (size > i)           //go until end of array			
	{
		if (++data[i] != 0)       //return if next digit doesn't carry
			return;
		++i;
	}

	//we've reached end of array and there's still a carry,
	//so add a leading one
	if (size == capacity)          //allocate more memory if necessary
	{
		++capacity;
		data = (unsigned long*)realloc(data, capacity<<2);
	}

	data[size] = 1;
	++size;                        //update the size
}

As you can see, it's straight C++, with the only child function a call to the C standard realloc(). And here's the disassembly of the function:

Code Select



1145: //increments an ArbInt
1146: void ArbInt::operator++()
1147: {
00402B90   push        esi
00402B91   mov         esi,ecx
1148:     if (++data[0] != 0)         //no carry, so return
00402B93   mov         eax,dword ptr [esi+8]
00402B96   mov         edx,dword ptr [eax]
00402B98   inc         edx
00402B99   mov         dword ptr [eax],edx
00402B9B   mov         eax,dword ptr [esi+8]
00402B9E   cmp         dword ptr [eax],0
00402BA1   jne         ArbInt::operator+++66h (00402bf6)
1149:         return;
1150:
1151:     unsigned long i = 1;        //start at 2nd digit of ArbInt
1152:     while (size > i)            //go until end of array
00402BA3   mov         edx,dword ptr [esi]
00402BA5   mov         ecx,1
00402BAA   cmp         edx,ecx
00402BAC   jbe         ArbInt::operator+++35h (00402bc5)
1153:     {
1154:         if (++data[i] != 0)     //return if next digit doesn't carry
00402BAE   mov         edx,dword ptr [eax+ecx*4]
00402BB1   inc         edx
00402BB2   mov         dword ptr [eax+ecx*4],edx
00402BB5   mov         eax,dword ptr [esi+8]
00402BB8   cmp         dword ptr [eax+ecx*4],0
00402BBC   jne         ArbInt::operator+++66h (00402bf6)
00402BBE   mov         edx,dword ptr [esi]
1155:             return;
1156:         ++i;
00402BC0   inc         ecx
00402BC1   cmp         edx,ecx
00402BC3   ja          ArbInt::operator+++1Eh (00402bae)
1157:     }
1158:
1159:     //we've reached end of array and there's still a carry,
1160:     //so add a leading one
1161:     if (size == capacity)   //allocate more memory if necessary
00402BC5   mov         eax,dword ptr [esi+4]
00402BC8   mov         ecx,dword ptr [esi]
00402BCA   cmp         ecx,eax
00402BCC   jne         ArbInt::operator+++55h (00402be5)
1162:     {
1163:         ++capacity;
1164:         data = (unsigned long*)realloc(data, capacity<<2);
00402BCE   mov         ecx,dword ptr [esi+8]
00402BD1   inc         eax
00402BD2   mov         dword ptr [esi+4],eax
00402BD5   shl         eax,2
00402BD8   push        eax
00402BD9   push        ecx
00402BDA   call        _realloc (00427582)
00402BDF   add         esp,8
00402BE2   mov         dword ptr [esi+8],eax
1165:     }
1166:
1167:     data[size] = 1;
00402BE5   mov         edx,dword ptr [esi]
00402BE7   mov         eax,dword ptr [esi+8]
00402BEA   mov         dword ptr [eax+edx*4],1
1168:     ++size;             //update the size
00402BF1   mov         eax,dword ptr [esi]
00402BF3   inc         eax
00402BF4   mov         dword ptr [esi],eax
00402BF6   pop         esi
1169: }
00402BF7   ret

Something odd that I just noticed while posting this is that the exact same statement later in the function does not generate a redundant instruction like the first occurrence of the statement.

Quote from: hitchhikr on May 18, 2005, 08:58:31 PM
The ms compiler is really good for producing small code, but for speed considerations you gentlemen should get your hands on the intel's one.

I tried out the Intel compiler a few days ago, used the suggested optimization switches on Intel's website (/O3 and a couple others), and this particular project ran about 5% slower than after compilation with VC++ 6.0.

Title: Re: Compiler Inefficiencies
Post by: hutch-- on May 19, 2005, 12:51:35 AM

There is some conventional wisdom in NOT MIXING C and assembler code within the same module. Most modern C compilers when set with the correct flags perform a number of optimisations (usually based on register theory and redundancy removal) and when you plonk your own inline assembler in the middle of it, you mess up the internal optimisation.

The trick has always been to use each tool to its advantage and this is achieved by the compatible object module format so you can optimise in BOTH languages and combine the results.

These days I am only familiar with VC6 and the VCTOOLKIT C compilers as far as output goes and if the C code is well written you get very good results in most cases. Where the code design requires complexity or algorithm layout that they don't or can't do well, you have MASM to target the task.

Title: Re: Compiler Inefficiencies
Post by: Posit on May 19, 2005, 05:32:07 AM

To test whether inline assembly is having an effect, I tried the original versions of the project source files. The compiler generated the exact same assembly instructions for the function in question, with no inline assembly anywhere in the project.

I then tried turning various compiler options on and off. The only one that made a difference was adding /G6, which generates the following slightly shorter code. The redundant instruction is even more noticable.

Code Select


173:      if (++data[0] != 0)         //no carry, so return
004014B3   mov         eax,dword ptr [esi+8]
004014B6   inc         dword ptr [eax]
004014B8   mov         eax,dword ptr [esi+8]
004014BB   cmp         dword ptr [eax],0
004014BE   jne         ArbInt::operator+++5Fh (0040150f)

Title: Re: Compiler Inefficiencies
Post by: hitchhikr on May 19, 2005, 08:22:01 AM

Either you don't know how to use a compiler or your own a crippled version of vc.

Title: Re: Compiler Inefficiencies
Post by: Vortex on May 19, 2005, 09:07:52 AM

Result with the Digital Mars compiler:

Code Select


_TEXT segment
 assume CS:_TEXT
_foo:
  mov EAX,4[ESP]
  inc dword ptr [EAX]
  je L15
  push offset FLAT:_DATA
  call near ptr _printf
  add ESP,4
L15:  ret
_TEXT ends

Title: Re: Compiler Inefficiencies
Post by: Posit on May 19, 2005, 09:59:10 AM

Quote from: hitchhikr on May 19, 2005, 08:22:01 AM
Either you don't know how to use a compiler or your own a crippled version of vc.

I'm willing to entertain both possibilities.

Title: Re: Compiler Inefficiencies
Post by: Jibz on May 19, 2005, 12:03:53 PM

Posit,

I'm getting roughly the same result as you with your code and VC7. Quite strange.

I wonder if it's really due to some optimization for Intel processors as MichaelW suggested .. what is your CPU, Intel or AMD?

GCC (3.4.2) -O2 produced

Code Select

    mov     edx, DWORD PTR [ebx+8]
    mov     eax, DWORD PTR [edx]
    inc     eax
    mov     DWORD PTR [edx], eax
    test    eax, eax
    jne     L1

which doesn't have the extra read, but the test seems superfluous.

Title: Re: Compiler Inefficiencies
Post by: GregL on May 19, 2005, 05:56:22 PM

If you are using Visual C++ Standard Edition, it does not include the optimizing compiler. You can get an optimizing compiler from the VC++ Toolkit 2003.

Related reading material:
http://www.sawtoothdistortion.com/Articles/FreeOptimizingCompiler.html

Title: Re: Compiler Inefficiencies
Post by: Posit on May 19, 2005, 06:55:33 PM

The CPU is a Celeron 2.4 GHz.

Title: Re: Compiler Inefficiencies
Post by: MichaelW on August 19, 2010, 07:23:36 AM

I split off the new posts and moved them here:

http://www.masm32.com/board/index.php?topic=14656.0

The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: Posit on May 18, 2005, 05:36:32 PM