How come C# is so much faster than my assembly code?

Started by houyunqing, October 17, 2008, 11:51:09 PM


houyunqing

I'm just doing a simple multiplication 100000000 times, and it turned out that the C#.NET code was about FIFTY TIMES FASTER!! What's wrong with the assembly code? Isn't MASM code supposed to be faster? :dazzled:
Is it a problem with parallelism, or am I using the wrong instruction?

Here are the MASM code segments I used:

.486
.model flat, stdcall

    myloop:
    fld realnum2; real4 0.2            This loading instruction is also present in the loop of the C# code, so should not be moved out of the loop
    fld realnum;   real4 0.1
    fmul st, st(1)
    fst realnum   ;  as you can see later in the MSIL it's also got the load
    fst realnum
    dec ecx
    jnz myloop

here's the C# code:
        private void button3_Click(object sender, EventArgs e)
        {
            float s1 = 0.1f;
            float s2 = 0.2f;
            float s3 = 0.3f;
            for (uint i = 100000000; i != 0; i--)
            {
                s3 = s1 * s2;
            }
            MessageBox.Show("Done");
        }
Here's the MSIL that the above method compiles to (I don't know how it gets translated into real opcodes; it's just here for you to have a look):
.method private hidebysig instance void  button3_Click(object sender,
                                                       class [mscorlib]System.EventArgs e) cil managed
{
  // Code size       62 (0x3e)
  .maxstack  2
  .locals init ([0] float32 s1,
           [1] float32 s2,
           [2] float32 s3,
           [3] uint32 i,
           [4] bool CS$4$0000)
  IL_0000:  nop
  IL_0001:  ldc.r4     0.1
  IL_0006:  stloc.0
  IL_0007:  ldc.r4     0.2
  IL_000c:  stloc.1
  IL_000d:  ldc.r4     0.30000001
  IL_0012:  stloc.2
  IL_0013:  ldc.i4     0x5f5e100
  IL_0018:  stloc.3
  IL_0019:  br.s       IL_0025
  IL_001b:  nop
  IL_001c:  ldloc.0
  IL_001d:  ldloc.1
  IL_001e:  mul
  IL_001f:  stloc.2
  IL_0020:  nop
  IL_0021:  ldloc.3
  IL_0022:  ldc.i4.1
  IL_0023:  sub
  IL_0024:  stloc.3
  IL_0025:  ldloc.3
  IL_0026:  ldc.i4.0
  IL_0027:  ceq
  IL_0029:  ldc.i4.0
  IL_002a:  ceq
  IL_002c:  stloc.s    CS$4$0000
  IL_002e:  ldloc.s    CS$4$0000
  IL_0030:  brtrue.s   IL_001b
  IL_0032:  ldstr      "Done"
  IL_0037:  call       valuetype [System.Windows.Forms]System.Windows.Forms.DialogResult [System.Windows.Forms]System.Windows.Forms.MessageBox::Show(string)
  IL_003c:  pop
  IL_003d:  ret
} // end of method Form1::button3_Click


jj2007

Can you zip your full Masm code and the two exe's and post them, please?

houyunqing

Hey, I have attached the files.
Inside test_realnumber is the masm32 code and exe.
Inside csharp_speedtest is the C#.NET code and exe.
Don't try disassembling the C# exe file directly, because you'll get a total mess. Disassemble it with the .NET Framework's MSIL Disassembler.

The time taken to run that C# code is almost the same as doing mul (not fmul) 1000000000 times. Strange, isn't it?

[attachment deleted by admin]

GregL

houyunqing,

Your asm code has some problems. You aren't popping the FPU stack properly. You are also writing back to realnum (twice), instead of using a third variable and writing to it once. Change the asm code as follows and it's a lot faster and more closely matches the C# code.


    mov ecx, 1000000000
  myloop:
    fld realnum
    fld realnum2
    fmul
    fstp realnum3
    dec ecx
    jnz myloop
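
To see why the missing pops matter, here is a toy model in C that tracks only how many of the eight x87 stack registers are occupied. It is a sketch, not an emulator: the type and function names are made up for illustration, and it assumes the FPU stack is empty when the loop starts. The original loop body pushes two values per pass (fld, fld) and never pops (fmul st, st(1) and fst leave the stack depth unchanged), so the stack fills up after four passes. The corrected loop pops twice per pass (MASM assembles a bare fmul as fmulp st(1), st, and fstp stores and pops), so it stays balanced.

```c
#include <stdbool.h>

#define X87_REGS 8  /* the x87 FPU has eight stack registers, st(0)..st(7) */

/* Hypothetical toy model: only the number of occupied registers is tracked. */
typedef struct {
    int  depth;       /* how many stack registers currently hold a value */
    bool overflowed;  /* set once a push finds no free register */
} X87Model;

/* fld pushes one value; pushing onto a full stack is a stack overflow. */
static void model_push(X87Model *m) {
    if (m->depth == X87_REGS) {
        m->overflowed = true;
    } else {
        m->depth++;
    }
}

/* fstp and the popping form of fmul each free one register. */
static void model_pop(X87Model *m) {
    if (m->depth > 0) {
        m->depth--;
    }
}

/* Original loop body: fld, fld, fmul st,st(1), fst, fst (two pushes, no pops).
   Returns the 1-based pass in which the first overflow happens, or -1. */
int iterations_until_overflow_original(int max_iter) {
    X87Model m = {0, false};
    for (int i = 1; i <= max_iter; i++) {
        model_push(&m);  /* fld realnum2 */
        model_push(&m);  /* fld realnum  */
        /* fmul st, st(1) and both fst instructions leave the depth unchanged */
        if (m.overflowed) {
            return i;
        }
    }
    return -1;
}

/* Corrected loop body: fld, fld, fmul (pops), fstp (pops).
   Returns the 1-based pass in which the first overflow happens, or -1. */
int iterations_until_overflow_fixed(int max_iter) {
    X87Model m = {0, false};
    for (int i = 1; i <= max_iter; i++) {
        model_push(&m);  /* fld realnum   */
        model_push(&m);  /* fld realnum2  */
        model_pop(&m);   /* fmul, assembled as fmulp st(1), st */
        model_pop(&m);   /* fstp realnum3 */
        if (m.overflowed) {
            return i;
        }
    }
    return -1;
}
```

Under these assumptions the original body overflows on its fifth pass, while the corrected body never does. On real hardware, with the invalid-operation exception masked, a load onto a full stack produces a NaN instead of the value, and arithmetic on NaNs can be much slower than on ordinary numbers on many processors, which would be consistent with the slowdown reported above.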


MichaelW

And if after correcting the asm code the C# executable is still much faster, then the likely explanation would be that optimizations are eliminating the redundant calculations (since each one produces the same result), or eliminating the calculations altogether (since the independent variables are both constants).
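
If the goal is to time the multiplication itself, one common way to keep a C compiler from removing the work is to mark the variables volatile, which forces every read and write to actually happen. The following is only a sketch of that idea; the function name is made up for illustration, and it says nothing about what the C# JIT does.

```c
/* Hypothetical benchmark body: volatile forces the compiler to load s1 and
   s2 and store s3 on every pass, so the loop cannot be folded away. */
float multiply_in_loop(unsigned int iterations) {
    volatile float s1 = 0.1f;
    volatile float s2 = 0.2f;
    volatile float s3 = 0.3f;
    for (unsigned int i = iterations; i != 0; i--) {
        s3 = s1 * s2;
    }
    return s3;
}
```

Calling multiply_in_loop(100000000u) from a timed section then measures a loop that really performs the loads, the multiply and the store each time.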

hutch--

If time actually matters with the calculation and the precision is only 32 bit FP, try using SSE instructions instead of FP as you may get a speed gain.
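
As a rough illustration of the SSE route, here is a sketch in C using the compiler intrinsics from xmmintrin.h rather than hand-written assembly. It assumes an x86 or x86-64 target with SSE support; the two function names are made up for this example. The scalar version corresponds to the mulss instruction and the packed version to mulps, which multiplies four single-precision values at once.

```c
#include <xmmintrin.h>  /* SSE intrinsics; x86 and x86-64 targets only */

/* Multiply two single-precision values using the scalar SSE multiply. */
float sse_mul_scalar(float a, float b) {
    __m128 va = _mm_set_ss(a);  /* a in the low lane, upper lanes zero */
    __m128 vb = _mm_set_ss(b);
    return _mm_cvtss_f32(_mm_mul_ss(va, vb));
}

/* Multiply four pairs of single-precision values in one packed operation. */
void sse_mul_packed(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);  /* unaligned loads, so any float[4] works */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));
}
```

Whether this is actually faster than the FPU version depends on the processor and on how the data is laid out, so it needs to be timed rather than assumed.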

houyunqing

Quote from: Greg on October 18, 2008, 02:16:42 AM
houyunqing,

Your asm code has some problems. You aren't popping the FPU stack properly. You are also writing back to realnum (twice), instead of using a third variable and writing to it once. Change the asm code as follows and it's a lot faster and more closely matches the C# code.


    mov ecx, 1000000000
  myloop:
    fld realnum
    fld realnum2
    fmul
    fstp realnum3
    dec ecx
    jnz myloop



Yes Yes!
After I made the correction, it is now faster than the C# code! I had a lot of misunderstandings about these instructions... haha, thanks!!

Quote from: MichaelW on October 18, 2008, 03:27:05 AM
And if after correcting the asm code the C# executable is still much faster, then the likely explanation would be that optimizations are eliminating the redundant calculations (since each one produces the same result), or eliminating the calculations altogether (since the independent variables are both constants).
Microsoft compilers don't provide any optimization, do they?

MichaelW

The Microsoft C/C++ compilers are optimizing compilers. C# has an optimize option, but I have no idea how it works, or how well. Since I don't have C#, I tested the VC++ Toolkit 2003 compiler using this source:

#include <windows.h>
int main(void)
{
    float s1 = 0.1f;
    float s2 = 0.2f;
    float s3 = 0.3f;
    int i;
    for (i = 100000000; i != 0; i--)
    {
        s3 = s1 * s2;
    }
    printf("%f\n",s3);
    getch();
    return 0;
}


Using the command line "cl /FA floattest.c", with no optimization options specified, the relevant parts of the asm output are more or less what you would expect:

_TEXT SEGMENT
_s1$ = -16 ; size = 4
_s2$ = -12 ; size = 4
_i$ = -8 ; size = 4
_s3$ = -4 ; size = 4
_main PROC NEAR
; File c:\program files\microsoft visual c++ toolkit 2003\my\floattest.c
; Line 3
push ebp
mov ebp, esp
sub esp, 16 ; 00000010H
; Line 4
mov DWORD PTR _s1$[ebp], 1036831949 ; 3dcccccdH
; Line 5
mov DWORD PTR _s2$[ebp], 1045220557 ; 3e4ccccdH
; Line 6
mov DWORD PTR _s3$[ebp], 1050253722 ; 3e99999aH
; Line 8
mov DWORD PTR _i$[ebp], 100000000 ; 05f5e100H
jmp SHORT $L73999
$L74000:
mov eax, DWORD PTR _i$[ebp]
sub eax, 1
mov DWORD PTR _i$[ebp], eax
$L73999:
cmp DWORD PTR _i$[ebp], 0
je SHORT $L74001
; Line 10
fld DWORD PTR _s1$[ebp]
fmul DWORD PTR _s2$[ebp]
fstp DWORD PTR _s3$[ebp]
; Line 11
jmp SHORT $L74000
$L74001:
; Line 12
fld DWORD PTR _s3$[ebp]
sub esp, 8
fstp QWORD PTR [esp]
push OFFSET FLAT:$SG74003
call _printf
add esp, 12 ; 0000000cH
; Line 13
call _getch
; Line 14
xor eax, eax
; Line 15
mov esp, ebp
pop ebp
ret 0
_main ENDP
_TEXT ENDS


But using the command line "cl /O2 /FA floattest.c", specifying the /O2 (maximize speed) option, the relevant parts of the asm output are:

_TEXT SEGMENT
_main PROC NEAR ; COMDAT
; Line 12
fld QWORD PTR __real@3f947ae151eb8520
sub esp, 8
fstp QWORD PTR [esp]
push OFFSET FLAT:??_C@_03PPOCCAPH@?$CFf?6?$AA@
call _printf
add esp, 12 ; 0000000cH
; Line 13
call _getch
; Line 14
xor eax, eax
; Line 15
ret 0
_main ENDP
_TEXT ENDS
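
The name of the constant in the optimized listing, __real@3f947ae151eb8520, carries the bit pattern of the value that gets loaded, so it can be checked by hand. The sketch below reinterprets those 64 bits as a double and compares them with 0.1f * 0.2f multiplied at run time in double precision. It assumes that double is an IEEE-754 64-bit type, which is the case on x86; the function names are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double (assumes IEEE-754 binary64). */
double double_from_bits(uint64_t bits) {
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* The constant named in the /O2 listing. */
double folded_constant(void) {
    return double_from_bits(0x3f947ae151eb8520ULL);
}

/* The same product computed at run time. Both operands are widened to
   double first, so the multiplication is exact; volatile keeps the
   compiler from folding this one as well. */
double product_as_double(void) {
    volatile float s1 = 0.1f;
    volatile float s2 = 0.2f;
    return (double)s1 * (double)s2;
}
```

The two values compare equal and are approximately 0.02, which is what you would expect if the compiler evaluated s1 * s2 once at compile time and dropped the loop entirely.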


houyunqing