My code is ..... 4X slower ?!?!?

robione · January 03, 2010, 02:34:33 PM

This might be a good example of debugging code (in this case, timing code) that itself needs debugging. I'm not sure. I don't think so, but it wouldn't be the first time :/ ... (much time passes) ... It turns out I have no clue how to time anything. Brief background:

I've been updating an old class I wrote (3D vectors) to use SSE. I figure I'll expand it into a "general" vector class once I've squeezed out the speed here and can carry it over. I was writing most of this with intrinsics, because for about a day I didn't know how to access class members from inline assembly... until I got to the cross product. It turns out MSVC6 SP5 doesn't have intrinsics for a bunch of SSE2 instructions, so now I really didn't have a choice.

I figured it out... now my cross product is faster :) Then I decided to see if I could get more speed out of my Length() method. Here's the before (intrinsics) and after (inline SSE), kinda rolled into one:

	float Length(void)	{
//		float f;
		__asm {
			mov esi,0xffffffff
			mov [esi],0
/*			mov ecx,[this].m_data

			movaps xmm0,[ecx]
			mulps xmm0,[ecx]
			movaps xmm2,xmm0
			movhlps xmm1,xmm0
			psrldq xmm2,4
			addps xmm0,xmm1
			addss xmm0,xmm2
			sqrtss xmm0,xmm0

			movss f,xmm0
*/		}
//		return f;
/**/		__m128 a;
		
		a = _mm_mul_ps(m_data,m_data);
		*((float *)&a) += *(((float *)&a)+1) + *(((float *)&a)+2);
		a = _mm_sqrt_ss(a);

		return *((float *)&a);
/**/	}

The esi stuff is just there to force an exception so I can view the release-build code when I "debug" the process. Both listings are below, and there's a bunch of stuff in them I don't get. First is the intrinsics release build, then my inline SSE.

00401174   push        ebp
00401175   mov         ebp,esp
00401177   and         esp,0FFFFFFF0h
0040117A   sub         esp,1Ch
0040117D   push        esi
0040117E   mov         esi,0FFFFFFFFh
00401183   mov         byte ptr [esi],0
00401186   movaps      xmm0,xmmword ptr [ecx]
00401189   movaps      xmm1,xmm0
0040118C   mulps       xmm1,xmm0
0040118F   movaps      xmmword ptr [esp+10h],xmm1
00401194   fld         dword ptr [esp+10h]
00401198   fadd        dword ptr [esp+18h]
0040119C   pop         esi
0040119D   fadd        dword ptr [esp+10h]
004011A1   fstp        dword ptr [esp+0Ch]
004011A5   movaps      xmm0,xmmword ptr [esp+0Ch]
004011AA   movaps      xmmword ptr [esp+0Ch],xmm0
004011AF   sqrtss      xmm0,xmm0
004011B3   movss       dword ptr [esp+0Ch],xmm0
004011B9   movaps      xmm0,xmmword ptr [esp+0Ch]
004011BE   movaps      xmmword ptr [esp+0Ch],xmm0
004011C3   fld         dword ptr [esp+0Ch]
004011C7   mov         esp,ebp
004011C9   pop         ebp
004011CA   ret

00401119   mov         eax,2625A0h
0040111E   mov         esi,0FFFFFFFFh
00401123   mov         byte ptr [esi],0
00401126   mov         ecx,dword ptr [esp+14h]
0040112A   movaps      xmm0,xmmword ptr [ecx]
0040112D   mulps       xmm0,xmmword ptr [ecx]
00401130   movaps      xmm2,xmm0
00401133   movhlps     xmm1,xmm0
00401136   psrldq      xmm2,4
0040113B   addps       xmm0,xmm1
0040113E   addss       xmm0,xmm2
00401142   sqrtss      xmm0,xmm0
00401146   movss       dword ptr [esp+1Ch],xmm0
0040114C   dec         eax
0040114D   jne         0040111E

So where did all the ebp/esp stuff go? I guess the compiler can't inline the intrinsics version, maybe. The main program is pretty simple:

	int i,ii,c1,c2;
	float f1;
	C_fVector3D v;

	::timeBeginPeriod(1);

	v = -1.f;
	
	for(ii=0;ii<25;ii++) {
		c1 = ::timeGetTime();
		for(i=0;i<2500000;) {
			f1 = v.Length();
			v[0] += 0.00001f;
			i++;
		}
		c2 = ::timeGetTime() - c1;
		printf("\n%u",c2);
	}

	::timeEndPeriod(1);
	}

One thing I don't get about the compiler's code is the lines at 004011AA, 004011B9 and 004011BE. They kinda seem like a waste. My code is 10 instructions, not counting my exception stuff; the compiler's is 24. Granted, the compiler uses a bunch of simpler instructions, but I really don't see how my 10 instructions are 4x slower than its 24. I think SSE instructions take longer to decode and execute, but I mean... did I miss something somewhere?
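On those apparently wasted instructions: one plausible reading is that the `*((float *)&a)` casts in the intrinsics version force `a` to live in memory, so the value bounces between xmm0, the stack and the x87 unit. The following is a minimal sketch, not the original class's code, of the same sum done entirely in registers with SSE1 intrinsics. The name `length3` and its pointer parameter are made up for illustration, and `_mm_cvtss_f32` assumes a compiler newer than MSVC6.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE1 intrinsics only

// Hypothetical helper: length of the x, y, z lanes of a 16-byte-aligned float[4].
// The fourth lane (w) is ignored.
inline float length3(const float* p) {
    // _mm_load_ps (movaps) requires 16-byte alignment; check it in debug builds.
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);

    __m128 v  = _mm_load_ps(p);                                   // [x, y, z, w]
    __m128 sq = _mm_mul_ps(v, v);                                 // [xx, yy, zz, ww]
    __m128 yy = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(1, 1, 1, 1));  // yy in lane 0
    __m128 zz = _mm_movehl_ps(sq, sq);                            // zz in lane 0
    __m128 sum = _mm_add_ss(_mm_add_ss(sq, yy), zz);              // lane 0 = xx + yy + zz
    return _mm_cvtss_f32(_mm_sqrt_ss(sum));                       // scalar sqrt of lane 0
}
```

Only lane 0 is ever added, so whatever sits in the upper lanes never enters the arithmetic.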

Now I took what the compiler did and turned it into:

float C_fVector3D::Length(void)	{
float f;
__asm {
   mov ecx,[this].m_data
   movaps      xmm0,[ecx]
   movaps      xmm1,xmm0
   mulps       xmm1,xmm0
   mov		   eax,esp
   and		   eax,0xfffffff0
   movaps      [eax+32],xmm1
   fld         dword ptr [eax+32]
   fadd        dword ptr [eax+36]
   fadd        dword ptr [eax+40]
   fstp        f
   movss      xmm0,f
   sqrtss      xmm0,xmm0
   movss       f,xmm0
   }
	return f;
}

....and it's 3x slower than my SSE (although now I can't seem to reproduce that). Ack. Is the sky green and the grass blue? I feel lost. Doing 10x more iterations yields ~10x the original time in all cases. So then I tried using RDTSC around just the code inside the function to get a guesstimate... and the SSE is "faster"?!?!? What is going on here?
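Two things in the test loop can skew a comparison like this. `f1` is never read, so an optimizing compiler is free to reshape or drop work. And the listings suggest the inline-asm version was merged into the loop (note the `dec eax` / `jne`) while the intrinsics version kept its own prologue and `ret`, so the two runs may not be measuring the same thing. Here is a rough sketch of a harness that keeps the result alive and reports the best of several runs. It swaps in `std::chrono::steady_clock` for `timeGetTime`/RDTSC purely for portability, and `length3_scalar` and `best_time_ns` are made-up names standing in for the real method and test code.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for C_fVector3D::Length(); swap in the function under test.
inline float length3_scalar(const float* v) {
    return std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
}

// Returns the fastest of `runs` timings, in nanoseconds, for `iters` calls each.
inline std::int64_t best_time_ns(int runs, int iters) {
    using clock = std::chrono::steady_clock;
    assert(runs > 0 && iters > 0);

    volatile float sink = 0.0f;  // volatile store: the result counts as "used"
    std::int64_t best = std::numeric_limits<std::int64_t>::max();

    for (int r = 0; r < runs; ++r) {
        float v[3] = {-1.0f, -1.0f, -1.0f};
        auto t0 = clock::now();
        for (int i = 0; i < iters; ++i) {
            sink = length3_scalar(v);
            v[0] += 0.00001f;    // same per-iteration nudge as the original loop
        }
        auto t1 = clock::now();
        std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (ns < best) best = ns;  // the minimum is least disturbed by other processes
    }
    return best;
}
```

Calling `best_time_ns(25, 2500000)` mirrors the run and iteration counts in the original loop. It is still worth checking the disassembly to confirm both variants are called the same way.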

dedndave · January 03, 2010, 04:40:17 PM

if you look in the first post of the first thread in the Laboratory sub-forum,
you will find a d/l for MichaelW's timers.asm timing macros
just place the un-zipped file in your \masm32\macros folder
here is a simple example of how to use the macros
if the code takes a considerable amount of time, you may want to use HIGH_PRIORITY_CLASS instead of REALTIME_PRIORITY_CLASS

INCLUDE \masm32\include\masm32rt.inc
.586
INCLUDE \masm32\macros\timers.asm

LOOP_COUNT EQU 400000000

.code

_main PROC

;select core 0

INVOKE GetCurrentProcess
INVOKE SetProcessAffinityMask,eax,1
INVOKE Sleep,500

mov ecx,3

test00: push ecx

;----------------------------

counter_begin LOOP_COUNT,REALTIME_PRIORITY_CLASS

;----------------------------
;code to be timed

mov eax,1
; bsr ecx,eax
xor ecx,ecx
bsr ecx,eax

;----------------------------

counter_end

;----------------------------

print str$(eax),9,"clock cycles",13,10
pop ecx
dec ecx
jnz test00

; inkey
exit

_main ENDP

END _main

the order of the first 3 lines is somewhat important
the masm32rt.inc file sets the processor to .486
the timers.asm macros require it to be .586 or higher
you can get fairly good results if you adjust the LOOP_COUNT EQUate so each test takes about 500 ms or more
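
For the C++ test program earlier in the thread, the same "make each test last about 500 ms" advice could be automated along these lines. This is only a sketch and is not part of timers.asm; `calibrate_iterations`, `sample_work` and `g_sink` are made-up names.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical workload; the volatile access stops the compiler from deleting the loop.
volatile std::uint32_t g_sink = 0;
inline void sample_work() { g_sink = g_sink + 1; }

// Doubles the iteration count until one timed pass of `work` lasts at least `target`.
template <class Work>
std::uint64_t calibrate_iterations(Work work, std::chrono::milliseconds target) {
    using clock = std::chrono::steady_clock;
    assert(target.count() > 0);

    std::uint64_t iters = 1;
    for (;;) {
        auto t0 = clock::now();
        for (std::uint64_t i = 0; i < iters; ++i) work();
        if (clock::now() - t0 >= target) return iters;  // long enough, stop here
        if (iters > UINT64_MAX / 2) return iters;       // guard against overflow
        iters *= 2;
    }
}
```

Something like `calibrate_iterations(sample_work, std::chrono::milliseconds(500))` would then return a loop count to use for the real measurement.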

Farabi · January 04, 2010, 05:21:30 PM

Well, if you use floating-point operations, the calculation takes longer. We are not the only ones who use assembler; most compiler writers understand it too, and they are able to generate optimized code. If you can't beat the standard C library, there is no point in writing your own.
