Compiler Inefficiencies

Started by Posit, May 18, 2005, 05:36:32 PM


Posit

VC++ 6.0 just generated this snippet, which serves as a perfect example of how compilers can do some obviously sub-optimal things.


C++ statement:

if (++data[0] != 0)

Generated assembly:

mov         eax,dword ptr [esi+8]
mov         edx,dword ptr [eax]
inc         edx
mov         dword ptr [eax],edx
mov         eax,dword ptr [esi+8]
cmp         dword ptr [eax],0
jne         label1


The redundant instruction is rather glaring: the second mov eax,dword ptr [esi+8] reloads a value that is already sitting in EAX.

Jibz

Try turning on optimization :wink.

Posit

That was with full optimization turned on.

Jibz

#include <cstdio>

void foo(int *data)
{
    if (++data[0] != 0) std::printf("!");
}


Compiled without optimization (VC7, I didn't have VC6 installed) produced

    push    ebp
    mov     ebp, esp
    mov     eax, DWORD PTR _data$[ebp]
    mov     ecx, DWORD PTR [eax]
    add     ecx, 1
    mov     edx, DWORD PTR _data$[ebp]
    mov     DWORD PTR [edx], ecx
    mov     eax, DWORD PTR _data$[ebp]
    cmp     DWORD PTR [eax], 0
    je      SHORT $L665
    push    OFFSET FLAT:$SG667
    call    _printf
    add     esp, 4
$L665:
    pop     ebp
    ret     0


which looks quite a lot like your result, but compiled with /O2 it produced

    mov     eax, DWORD PTR _data$[esp-4]
    inc     DWORD PTR [eax]
    je      SHORT $L666
    mov     DWORD PTR _data$[esp-4], OFFSET FLAT:??_C@_01DCLJPIOD@?$CB?$AA@
    jmp     _printf
$L666:
    ret     0


Did you perchance have any inline asm in the same source file?

Vortex

Here is the result with VC6:


cl /Zl /c /Ogtyb2 /Gs /G6 /Gz /Zp1 /FAs /Fasample.asm sample.cpp



mov   eax, DWORD PTR _data$[esp-4]
inc   DWORD PTR [eax]
je    SHORT $L578
push  OFFSET FLAT:$SG579
call   _printf
pop   ecx




Mark_Larson

  Compiler inefficiencies is one of my favorite topics :)  Jibz always strongly supports the C compiler.  :)  He's really good at optimizing code in C.  However, with VC++ 6.0 and 7.0 I have issues with how they generate optimized code.  I actually wrote an article not too long ago that covered that and a few other things.

http://www.masmforum.com/simple/index.php?topic=1140.0

  Also, the guy who writes VirtualDub is irritated by VC++ optimizations.  A lot of the time you can see him post snippets of the assembler code VC++ 7.0 generated and gripe about it.  Most of his project is in C, but large chunks are in assembler.  If you go back through his archives of comments, that's where he shows some C code alongside the corresponding assembler code, usually for both VC++ 6.0 and 7.0.

http://www.virtualdub.org/
http://www.virtualdub.org/docs_compiling

  On the plus side, I have been using GCC a lot at work (3.3.3), and it produces better assembler code than VC++.  Also, as Vortex and Jibz pointed out, using the right compiler flags is key to at least getting good performance out of VC++ 6.0.  For serious speed I always use a makefile that VC++ generates, and I tweak it with different parameters until I find the optimum one for me.
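
As a purely hypothetical illustration of the kind of line that gets tweaked (these switches are only examples, not the makefile VC++ actually emits):

cl /nologo /ML /W3 /O2 /Ot /Og /Oi /Ob2 /Gy /G6 /c sample.cpp

Swapping individual switches (for example /Ot against /Os, or /G6 against /G5) and re-timing is the general idea.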

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

MichaelW

Posit,

I have doubts that Microsoft would have missed something this obvious. Are you sure that the redundant instruction is not an optimization?

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    include \masm32\macros\macros.asm
    include timers.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      junk1 dd offset junk2
      junk2 dd 0
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT EQU 10000000
    REPEAT_COUNT EQU 100

    mov   esi,offset junk1-8

    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov   eax,dword ptr [esi+8]
        mov   edx,dword ptr [eax]
        inc   edx
        mov   dword ptr [eax],edx
        mov   eax,dword ptr [esi+8]
        cmp   dword ptr [eax],0
        jne   $+2
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    mov   junk2,0
    counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov   eax,dword ptr [esi+8]
        mov   edx,dword ptr [eax]
        inc   edx
        mov   dword ptr [eax],edx
        ;mov   eax,dword ptr [esi+8]
        cmp   dword ptr [eax],0
        jne   $+2
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    mov   eax,input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Results on my P3:

540 cycles
698 cycles


EDIT: added "mov   junk2,0" so both loops start in the same state; no change in timings.

eschew obfuscation

Posit

Quote from: Jibz on May 18, 2005, 06:55:57 PM
Did you perchance have any inline asm in the same source file?

Yes, quite a bit, although none in the particular function that generated that output.

I compiled it with the options:

/nologo /Zp16 /ML /W3 /GX /Zi /O2 /Ob2 /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /Fp"Release/ArbInt.pch" /YX





Posit

Quote from: MichaelW on May 18, 2005, 07:47:42 PM
I have doubts that Microsoft would have missed something this obvious. Are you sure that the redundant instruction is not an optimization?

On my 2.4 GHz Celeron the loop with the redundant instruction runs slightly slower than if I comment out the instruction, although I'm timing it within the C++ program.
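
A minimal sketch of that sort of in-program timing (hypothetical code using QueryPerformanceCounter and an __asm copy of the snippet, not the actual test program):

#include <windows.h>
#include <cstdio>

int main()
{
    static unsigned long value = 0;
    static unsigned long *table[3] = { 0, 0, &value };   // [esi+8] -> &value

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    for (int i = 0; i < 10000000; ++i)
    {
        __asm
        {
            lea   esi, table             ; esi+8 holds the data pointer
            mov   eax, dword ptr [esi+8]
            mov   edx, dword ptr [eax]
            inc   edx
            mov   dword ptr [eax], edx
            mov   eax, dword ptr [esi+8] ; the instruction under suspicion
            cmp   dword ptr [eax], 0     ; branch left out for timing
        }
    }

    QueryPerformanceCounter(&t1);
    std::printf("%.0f us\n",
        (t1.QuadPart - t0.QuadPart) * 1000000.0 / (double)freq.QuadPart);
    return 0;
}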

Jibz

Mark_Larson,

Well, I don't know if I always strongly support C compilers .. I generally recommend using the right tool for the job :U.

I do however think that C compilers of today generate more than fast enough code for most uses. Programs like VirtualDub, where the speed of inner loops is critical, are a natural exception, and a compiler can never compete with a human there.

Posit,

Inline asm has a tendency to affect the optimizer. Also, I can see you have debugging turned on; I don't know if that might do something to the output of VC6.

Either way, it looks a bit like the optimizer was not on when the code you initially posted was produced.

Mark_Larson

Quote from: Jibz on May 18, 2005, 08:09:20 PM
Mark_Larson,

Well, I don't know if I always strongly support C compilers .. I generally recommend using the right tool for the job :U.

I do however think that C compilers of today generate more than fast enough code for most uses. Programs like VirtualDub, where the speed of inner loops is critical, are a natural exception, and a compiler can never compete with a human there.

Yep, for large projects C is the way to go, because of code development time.  And then if you are going to convert some of the C routines to use assembler, you need to profile your code and find the slow routines before just randomly converting stuff to assembler.

  Posit, Jibz is referring to the code VC++ generates getting tripped up when you use inline assembly language.  GCC handles it better because its inline assembler lets you specify a lot of information, so the compiler doesn't get messed up in the code it is generating around the asm.  Think of it this way: you write some inline assembly that uses ESI, and VC++ was planning on using ESI for some of the C code you wrote, so now it has to produce less than optimal code and avoid ESI.  That's why I also recommend that if you are going to write some of your code in assembler, don't just do parts of a routine in inline assembler.  Either convert the whole thing to assembler or none of it.  A lot of the time you won't get the performance you wanted if you only convert part of a routine, because you and VC++ will be tripping over each other.
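
A minimal sketch of what that looks like in GCC (hypothetical helper; the constraint strings and clobber list are the part that matters):

static inline unsigned long inc_and_test(unsigned long *p)
{
    unsigned long result;
    __asm__ __volatile__(
        "movl (%1), %0\n\t"     /* load the cell                        */
        "incl %0\n\t"           /* increment it                         */
        "movl %0, (%1)"         /* store it back                        */
        : "=&r" (result)        /* output: any register, early-clobber  */
        : "r" (p)               /* input: pointer in any register       */
        : "memory");            /* tells GCC the block touches memory   */
    return result;
}

Because the constraints spell out exactly which registers and memory the block uses, the compiler can plan its own register allocation around it.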
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Randall Hyde

Quote from: Mark_Larson on May 18, 2005, 07:44:28 PM
  Compiler inefficiencies is one of my favorite topics :) 
Mine too. In fact, I'm currently writing a book on it. :)

Quote
  On the plus side, I have been using GCC a lot at work (3.3.3), and it produces better assembler code than VC++. 
I don't see this that often. Some things GCC does a whole lot better than VC++ (6), other things it does a whole lot worse. I don't have the web site handy, but there is a site that benchmarks a bunch of different C compilers. Those benchmarks suggest that VC++ is quite a bit better than GCC and only a little bit inferior to the Intel compiler. Of course, those results are based on the particular benchmarks that guy used.

Quote
Also, as Vortex and Jibz pointed out, using the right compiler flags is key to at least getting good performance out of VC++ 6.0.  For serious speed I always use a makefile that VC++ generates, and I tweak it with different parameters until I find the optimum one for me.
It also helps to write your C code in such a way that helps the compiler generate better machine code.
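
One hedged illustration in the spirit of this thread (a hypothetical helper, not an example from the book): keeping the incremented value in a local lets the compiler test a register instead of re-reading memory, so the question of reloading the pointer never comes up.

bool inc_no_carry(unsigned long *data)
{
    unsigned long v = data[0] + 1;   // compute the new value once
    data[0] = v;                     // single store back to memory
    return v != 0;                   // test the local copy, not memory
}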
Cheers,
Randy Hyde

hitchhikr

The MS compiler is really good at producing small code, but for speed considerations you gentlemen should get your hands on the Intel one.

To avoid interfering asm inlines, use intrinsics; this lets the compiler allocate the registers, etc.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/vcrefcompilerintrinsics.asp
http://www.intel.com/cd/ids/developer/asmo-na/eng/59644.htm
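
A minimal sketch of the idea (hypothetical example using the SSE intrinsics from xmmintrin.h, which need VC7 or, I believe, VC6 with the Processor Pack):

#include <xmmintrin.h>

// Add four floats at a time; the compiler chooses the XMM registers,
// so there is no hand-written asm for the optimizer to work around.
void add4(float *dst, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);              // unaligned loads
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(dst, _mm_add_ps(va, vb));   // packed add and store
}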

Have a nice day.

Posit

Quote from: Jibz on May 18, 2005, 08:09:20 PM
Posit,

Inline asm has a tendency to affect the optimizer. Also, I can see you have debugging turned on; I don't know if that might do something to the output of VC6.

Either way, it looks a bit like the optimizer was not on when the code you initially posted was produced.

The disassembly I posted was produced with the exact compiler options listed. I just tried it again with the same result. The debugger has to be on to view disassembly information in VC++ 6.0.

Quote from: Mark_Larson on May 18, 2005, 08:24:31 PM
  Posit, Jibz is referring to the code VC++ generates getting tripped up when you use inline assembly language.  GCC handles it better because its inline assembler lets you specify a lot of information, so the compiler doesn't get messed up in the code it is generating around the asm.  Think of it this way: you write some inline assembly that uses ESI, and VC++ was planning on using ESI for some of the C code you wrote, so now it has to produce less than optimal code and avoid ESI.  That's why I also recommend that if you are going to write some of your code in assembler, don't just do parts of a routine in inline assembler.  Either convert the whole thing to assembler or none of it.  A lot of the time you won't get the performance you wanted if you only convert part of a routine, because you and VC++ will be tripping over each other.

Yeah, I already discovered that doing small _asm blocks within a function tends to put you at odds with the compiler--at the very least the compiler pads the _asm block with protective pushes and pops. So I've been converting entire functions to inline assembly while keeping the __cdecl interface or sometimes changing it to __fastcall.
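
A hypothetical sketch of that whole-routine approach (not one of the actual ArbInt functions): a __fastcall helper whose body is a single __asm block, so the compiler has no code of its own inside the function to protect.

unsigned long __fastcall inc_cell(unsigned long *p)
{
    __asm
    {
        mov   eax, [ecx]    ; __fastcall passes p in ECX
        inc   eax
        mov   [ecx], eax    ; new value stays in EAX as the return value
    }
    // no C++ return statement: EAX already holds the result (warning C4035)
}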

Here is the entirety of the function that generated the initial snippet:


//increments an ArbInt
void ArbInt::operator++()
{
    if (++data[0] != 0)         //no carry, so return
        return;

    unsigned long i = 1;        //start at 2nd digit of ArbInt
    while (size > i)            //go until end of array
    {
        if (++data[i] != 0)     //return if next digit doesn't carry
            return;
        ++i;
    }

    //we've reached end of array and there's still a carry,
    //so add a leading one
    if (size == capacity)       //allocate more memory if necessary
    {
        ++capacity;
        data = (unsigned long*)realloc(data, capacity<<2);
    }

    data[size] = 1;
    ++size;                     //update the size
}



As you can see, it's straight C++; the only function it calls is the C standard realloc(). And here's the disassembly of the function:



1145: //increments an ArbInt
1146: void ArbInt::operator++()
1147: {
00402B90   push        esi
00402B91   mov         esi,ecx
1148:     if (++data[0] != 0)         //no carry, so return
00402B93   mov         eax,dword ptr [esi+8]
00402B96   mov         edx,dword ptr [eax]
00402B98   inc         edx
00402B99   mov         dword ptr [eax],edx
00402B9B   mov         eax,dword ptr [esi+8]
00402B9E   cmp         dword ptr [eax],0
00402BA1   jne         ArbInt::operator+++66h (00402bf6)
1149:         return;
1150:
1151:     unsigned long i = 1;        //start at 2nd digit of ArbInt
1152:     while (size > i)            //go until end of array
00402BA3   mov         edx,dword ptr [esi]
00402BA5   mov         ecx,1
00402BAA   cmp         edx,ecx
00402BAC   jbe         ArbInt::operator+++35h (00402bc5)
1153:     {
1154:         if (++data[i] != 0)     //return if next digit doesn't carry
00402BAE   mov         edx,dword ptr [eax+ecx*4]
00402BB1   inc         edx
00402BB2   mov         dword ptr [eax+ecx*4],edx
00402BB5   mov         eax,dword ptr [esi+8]
00402BB8   cmp         dword ptr [eax+ecx*4],0
00402BBC   jne         ArbInt::operator+++66h (00402bf6)
00402BBE   mov         edx,dword ptr [esi]
1155:             return;
1156:         ++i;
00402BC0   inc         ecx
00402BC1   cmp         edx,ecx
00402BC3   ja          ArbInt::operator+++1Eh (00402bae)
1157:     }
1158:
1159:     //we've reached end of array and there's still a carry,
1160:     //so add a leading one
1161:     if (size == capacity)   //allocate more memory if necessary
00402BC5   mov         eax,dword ptr [esi+4]
00402BC8   mov         ecx,dword ptr [esi]
00402BCA   cmp         ecx,eax
00402BCC   jne         ArbInt::operator+++55h (00402be5)
1162:     {
1163:         ++capacity;
1164:         data = (unsigned long*)realloc(data, capacity<<2);
00402BCE   mov         ecx,dword ptr [esi+8]
00402BD1   inc         eax
00402BD2   mov         dword ptr [esi+4],eax
00402BD5   shl         eax,2
00402BD8   push        eax
00402BD9   push        ecx
00402BDA   call        _realloc (00427582)
00402BDF   add         esp,8
00402BE2   mov         dword ptr [esi+8],eax
1165:     }
1166:
1167:     data[size] = 1;
00402BE5   mov         edx,dword ptr [esi]
00402BE7   mov         eax,dword ptr [esi+8]
00402BEA   mov         dword ptr [eax+edx*4],1
1168:     ++size;             //update the size
00402BF1   mov         eax,dword ptr [esi]
00402BF3   inc         eax
00402BF4   mov         dword ptr [esi],eax
00402BF6   pop         esi
1169: }
00402BF7   ret



Something odd I just noticed while posting this is that the exact same statement later in the function does not generate a redundant instruction the way the first occurrence does.

Quote from: hitchhikr on May 18, 2005, 08:58:31 PM
The ms compiler is really good for producing small code, but for speed considerations you gentlemen should get your hands on the intel's one.

I tried out the Intel compiler a few days ago and used the optimization switches suggested on Intel's website (/O3 and a couple of others), and this particular project ran about 5% slower than it did after compilation with VC++ 6.0.

hutch--

There is some conventional wisdom in NOT MIXING C and assembler code within the same module. Most modern C compilers, when set with the correct flags, perform a number of optimisations (usually based on register theory and redundancy removal), and when you plonk your own inline assembler in the middle of it, you mess up the internal optimisation.

The trick has always been to use each tool to its advantage, and this is achieved through the compatible object module format, so you can optimise in BOTH languages and combine the results.

These days I am only familiar with VC6 and the VCTOOLKIT C compilers as far as output goes, and if the C code is well written you get very good results in most cases. Where the code design requires complexity or an algorithm layout that they don't or can't do well, you have MASM to target the task.
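
A minimal sketch of that separation (hypothetical routine names): the hot routine lives in its own .asm module assembled with ML, and the C++ side only sees an ordinary prototype, so neither tool disturbs the other's output.

// C++ side: prototype for a routine assembled separately with ML
extern "C" unsigned long __stdcall IncCell(unsigned long *p);

void bump(unsigned long *counter)
{
    IncCell(counter);   // resolved at link time from the MASM-built .obj
}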
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php