How come C# is so much faster than my assembly code?

houyunqing · October 17, 2008, 11:51:09 PM

I'm just doing a simple multiplication 100000000 times, and it turned out that the C#.NET code was about FIFTY TIMES FASTER!! What's wrong with the assembly code??? Isn't the MASM code supposed to be faster??? :dazzled:
Is it because of a problem with parallelism? Or am I using the wrong instruction?

Here are the MASM code segments I used:

.486
.model flat, stdcall

myloop:
fld realnum2; real4 0.2 This loading instruction is also present in the loop of the C# code, so should not be moved out of the loop
fld realnum; real4 0.1
fmul st, st(1)
fst realnum ; as you can see later in the MSIL it's also got the load
fst realnum
dec ecx
jnz myloop

here's the C# code:
private void button3_Click(object sender, EventArgs e)
{
float s1 = 0.1f;
float s2 = 0.2f;
float s3 = 0.3f;
for (uint i = 100000000; i != 0; i--)
{
s3 = s1 * s2;
}
MessageBox.Show("Done");
}
Here's the .NET MSIL assembly code the above method produced (I don't know how it is translated into real opcodes, I'm just putting it here for you to have a look):
.method private hidebysig instance void button3_Click(object sender,
class [mscorlib]System.EventArgs e) cil managed
{
// Code size 62 (0x3e)
.maxstack 2
.locals init ([0] float32 s1,
[1] float32 s2,
[2] float32 s3,
[3] uint32 i,
[4] bool CS$4$0000)
IL_0000: nop
IL_0001: ldc.r4 0.1
IL_0006: stloc.0
IL_0007: ldc.r4 0.2
IL_000c: stloc.1
IL_000d: ldc.r4 0.30000001
IL_0012: stloc.2
IL_0013: ldc.i4 0x5f5e100
IL_0018: stloc.3
IL_0019: br.s IL_0025
IL_001b: nop
IL_001c: ldloc.0
IL_001d: ldloc.1
IL_001e: mul
IL_001f: stloc.2
IL_0020: nop
IL_0021: ldloc.3
IL_0022: ldc.i4.1
IL_0023: sub
IL_0024: stloc.3
IL_0025: ldloc.3
IL_0026: ldc.i4.0
IL_0027: ceq
IL_0029: ldc.i4.0
IL_002a: ceq
IL_002c: stloc.s CS$4$0000
IL_002e: ldloc.s CS$4$0000
IL_0030: brtrue.s IL_001b
IL_0032: ldstr "Done"
IL_0037: call valuetype [System.Windows.Forms]System.Windows.Forms.DialogResult [System.Windows.Forms]System.Windows.Forms.MessageBox::Show(string)
IL_003c: pop
IL_003d: ret
} // end of method Form1::button3_Click

jj2007 · October 18, 2008, 12:14:48 AM

Can you zip your full Masm code and the two exe's and post them, please?

houyunqing · October 18, 2008, 12:36:06 AM

Hey, I have attached the files.
Inside test_realnumber are the masm32 code and exe.
Inside csharp_speedtest are the C#.NET code and exe.
Don't try disassembling the C# exe file directly, because you'll get a total mess. Disassemble it with the .NET Framework's MSIL Disassembler.

The time taken to run that C# code is almost the same as doing mul (not fmul) 1000000000 times... strange, isn't it?

[attachment deleted by admin]

GregL · October 18, 2008, 02:16:42 AM

houyunqing,

Your asm code has some problems. You aren't popping the FPU stack properly. You are also writing back to realnum (twice), instead of using a third variable and writing to it once. Change the asm code as follows and it's a lot faster and more closely matches the C# code.

    mov ecx, 1000000000
  myloop:
    fld realnum
    fld realnum2
    fmul
    fstp realnum3
    dec ecx
    jnz myloop
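
To illustrate the point about popping the FPU stack, here is a rough sketch in C that only counts how many of the eight x87 registers are in use. The `FpuModel` type and the `op_*` helpers are hypothetical names made up for this illustration; they are not real instructions or library calls. The model assumes that `fld` pushes one value, that `fst` and `fmul st, st(1)` leave the depth unchanged, and that `fstp` and the no-operand `fmul` (which MASM assembles as a popping multiply) each pop one value.

```c
#define X87_REGS 8  /* the x87 register stack holds eight values */

/* Hypothetical model: tracks only how many x87 registers are occupied. */
typedef struct {
    int depth;      /* registers currently in use, 0..8 */
    int overflows;  /* loads attempted while the stack was already full */
} FpuModel;

/* fld pushes one value; on a full stack it overflows instead. */
static void op_fld(FpuModel *f)
{
    if (f->depth == X87_REGS)
        f->overflows++;
    else
        f->depth++;
}

/* fstp, and fmul with no operands, each pop one value. */
static void op_pop(FpuModel *f)
{
    if (f->depth > 0)
        f->depth--;
}

/* Original loop body: fld, fld, fmul st, st(1), fst, fst.
   Two pushes per iteration and no pops. */
int overflows_original(int iterations)
{
    FpuModel f = {0, 0};
    int i;
    for (i = 0; i < iterations; i++) {
        op_fld(&f);
        op_fld(&f);
        /* fmul st, st(1) and both fst leave the depth unchanged */
    }
    return f.overflows;
}

/* Corrected loop body: fld, fld, fmul, fstp.
   Two pushes and two pops per iteration. */
int overflows_corrected(int iterations)
{
    FpuModel f = {0, 0};
    int i;
    for (i = 0; i < iterations; i++) {
        op_fld(&f);
        op_fld(&f);
        op_pop(&f);  /* fmul (popping form) */
        op_pop(&f);  /* fstp realnum3 */
    }
    return f.overflows;
}
```

In this model the original loop fills all eight registers after four iterations, so every later `fld` overflows, while the corrected loop stays balanced. That overflow is a plausible cause of the slowdown, though nothing here measures it.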

MichaelW · October 18, 2008, 03:27:05 AM

And if after correcting the asm code the C# executable is still much faster, then the likely explanation would be that optimizations are eliminating the redundant calculations (since each one produces the same result), or eliminating the calculations altogether (since the independent variables are both constants).
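
A minimal C sketch of that idea, with function names invented for the illustration: in the first function every iteration stores the same product, so an optimizing compiler is free to do the multiply once or fold it into a constant. In the second, the `volatile` qualifier forces the loads, the multiply and the store to happen on every pass, which is one common way to keep a benchmark loop from being optimized away. Whether a given compiler actually removes the first loop depends on the compiler and its options.

```c
/* Plain locals: the loop body has the same result every time, so an
   optimizer may reduce the whole loop to one multiply or to a constant. */
float multiply_plain(unsigned iterations)
{
    float s1 = 0.1f;
    float s2 = 0.2f;
    float s3 = 0.3f;
    unsigned i;
    for (i = iterations; i != 0; i--) {
        s3 = s1 * s2;
    }
    return s3;
}

/* volatile operands: each access must really be performed, so the
   compiler has to keep the load, multiply and store in every iteration. */
float multiply_volatile(unsigned iterations)
{
    volatile float s1 = 0.1f;
    volatile float s2 = 0.2f;
    volatile float s3 = 0.3f;
    unsigned i;
    for (i = iterations; i != 0; i--) {
        s3 = s1 * s2;
    }
    return s3;
}
```

Both functions return the same value for any non-zero iteration count; only the amount of work the compiler is allowed to skip differs.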

hutch-- · October 18, 2008, 05:15:21 AM

If time actually matters with the calculation and the precision is only 32 bit FP, try using SSE instructions instead of FP as you may get a speed gain.
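
As a rough sketch of what that could look like from C, the functions below use the standard SSE intrinsics from `<xmmintrin.h>`: `_mm_mul_ss` multiplies one pair of 32-bit floats and `_mm_mul_ps` multiplies four pairs at once. The function names and the `HAVE_SSE` macro are made up for this example, and the fallback branch is only there so the code still compiles on targets without SSE. No speed claim is implied; any gain would have to be measured.

```c
/* Use the SSE intrinsics only where the compiler says SSE is available. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#define HAVE_SSE 1
#include <xmmintrin.h>
#else
#define HAVE_SSE 0
#endif

/* Multiply two single-precision values with a scalar SSE multiply. */
float mul_scalar_sse(float a, float b)
{
#if HAVE_SSE
    __m128 va = _mm_set_ss(a);               /* a in the low lane */
    __m128 vb = _mm_set_ss(b);               /* b in the low lane */
    return _mm_cvtss_f32(_mm_mul_ss(va, vb)); /* low lane of a * b */
#else
    return a * b;                            /* fallback without SSE */
#endif
}

/* Multiply four pairs of single-precision values in one packed multiply. */
void mul_packed_sse(const float a[4], const float b[4], float out[4])
{
#if HAVE_SSE
    __m128 va = _mm_loadu_ps(a);             /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));  /* 4 products at once */
#else
    int i;
    for (i = 0; i < 4; i++)
        out[i] = a[i] * b[i];
#endif
}
```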

houyunqing · October 20, 2008, 10:02:10 AM

Quote from: Greg on October 18, 2008, 02:16:42 AM
houyunqing,

Your asm code has some problems. You aren't popping the FPU stack properly. You are also writing back to realnum (twice), instead of using a third variable and writing to it once. Change the asm code as follows and it's a lot faster and more closely matches the C# code.

    mov ecx, 1000000000
  myloop:
    fld realnum
    fld realnum2
    fmul
    fstp realnum3
    dec ecx
    jnz myloop

Yes, yes!
After making the correction it is now faster than the C# code! I had a lot of misunderstandings about these instructions... haha, thanks!!

Quote from: MichaelW on October 18, 2008, 03:27:05 AM
And if after correcting the asm code the C# executable is still much faster, then the likely explanation would be that optimizations are eliminating the redundant calculations (since each one produces the same result), or eliminating the calculations altogether (since the independent variables are both constants).

Microsoft compilers don't provide any optimization, do they?

MichaelW · October 20, 2008, 04:40:06 PM

The Microsoft C/C++ compilers are optimizing compilers. C# has an optimize option, but I have no idea how, or how well, it works. Since I don't have C#, I tested the VC++ Toolkit 2003 compiler using this source:

#include <windows.h>
int main(void)
{
    float s1 = 0.1f;
    float s2 = 0.2f;
    float s3 = 0.3f;
    int i;
    for (i = 100000000; i != 0; i--)
    {
        s3 = s1 * s2;
    }
    printf("%f\n",s3);
    getch();
    return 0;
}

Using the command line "cl /FA floattest.c", with no optimization options specified, the relevant parts of the asm output are more or less what you would expect:

_TEXT	SEGMENT
_s1$ = -16						; size = 4
_s2$ = -12						; size = 4
_i$ = -8						; size = 4
_s3$ = -4						; size = 4
_main	PROC NEAR
; File c:\program files\microsoft visual c++ toolkit 2003\my\floattest.c
; Line 3
	push	ebp
	mov	ebp, esp
	sub	esp, 16					; 00000010H
; Line 4
	mov	DWORD PTR _s1$[ebp], 1036831949		; 3dcccccdH
; Line 5
	mov	DWORD PTR _s2$[ebp], 1045220557		; 3e4ccccdH
; Line 6
	mov	DWORD PTR _s3$[ebp], 1050253722		; 3e99999aH
; Line 8
	mov	DWORD PTR _i$[ebp], 100000000		; 05f5e100H
	jmp	SHORT $L73999
$L74000:
	mov	eax, DWORD PTR _i$[ebp]
	sub	eax, 1
	mov	DWORD PTR _i$[ebp], eax
$L73999:
	cmp	DWORD PTR _i$[ebp], 0
	je	SHORT $L74001
; Line 10
	fld	DWORD PTR _s1$[ebp]
	fmul	DWORD PTR _s2$[ebp]
	fstp	DWORD PTR _s3$[ebp]
; Line 11
	jmp	SHORT $L74000
$L74001:
; Line 12
	fld	DWORD PTR _s3$[ebp]
	sub	esp, 8
	fstp	QWORD PTR [esp]
	push	OFFSET FLAT:$SG74003
	call	_printf
	add	esp, 12					; 0000000cH
; Line 13
	call	_getch
; Line 14
	xor	eax, eax
; Line 15
	mov	esp, ebp
	pop	ebp
	ret	0
_main	ENDP
_TEXT	ENDS

But using the command line "cl /O2 /FA floattest.c", specifying the /O2 (maximize speed) option, the relevant parts of the asm output are:

_TEXT	SEGMENT
_main	PROC NEAR					; COMDAT
; Line 12
	fld	QWORD PTR __real@3f947ae151eb8520
	sub	esp, 8
	fstp	QWORD PTR [esp]
	push	OFFSET FLAT:??_C@_03PPOCCAPH@?$CFf?6?$AA@
	call	_printf
	add	esp, 12					; 0000000cH
; Line 13
	call	_getch
; Line 14
	xor	eax, eax
; Line 15
	ret	0
_main	ENDP
_TEXT	ENDS
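
In the optimized listing the loop is gone and only one constant, `__real@3f947ae151eb8520`, is loaded for `printf`. The sketch below, with helper names invented for the illustration, decodes that bit pattern as an IEEE 754 double and compares it with the product of the two `float` values carried out in double precision. It assumes IEEE 754 floats and doubles, which is what the hex values in the unoptimized listing (`3dcccccdH`, `3e4ccccdH`) suggest. The constant is a QWORD because a `float` passed to `printf` is promoted to `double`.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as an IEEE 754 double. */
double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* The constant named in the /O2 listing: __real@3f947ae151eb8520. */
double folded_constant(void)
{
    return double_from_bits(UINT64_C(0x3f947ae151eb8520));
}

/* s1 * s2 from the C source, with both float values widened to double
   before multiplying, so the product is not rounded back to float. */
double product_in_double(void)
{
    float s1 = 0.1f;
    float s2 = 0.2f;
    return (double)s1 * (double)s2;
}
```

The two functions return the same value, roughly 0.02, which suggests the compiler evaluated `s1 * s2` at compile time and kept only the result.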

houyunqing · October 21, 2008, 05:54:56 AM

Wow, nice!
thanks for the information
