This might be a good example of debugging code (in this case, timing code) that itself needs debugging. I'm not sure. I don't think so, but it wouldn't be the first time :/ ... (much time passes) ... It turns out I have no clue how to time anything. Brief background:
I've been updating an old class I wrote (3D vectors) to use SSE. I figure I'll expand it into a "general" vector class once I've squeezed out the speed here and can carry it over. I was writing most of this with intrinsics, because for about a day I didn't know how to access class members from inline assembly... until I got to updating the cross product. It turns out MSVC6 SP5 doesn't have intrinsics for a bunch of SSE2 instructions, so now I really didn't have a choice.
I figured it out, and now my cross product is faster :) Then I decided to see if I could get more speed out of my Length() method. Here's the before (intrinsics) and after (inline SSE), kind of rolled into one:
float Length(void) {
//  float f;
    __asm {
        mov esi,0xffffffff
        mov [esi],0
/*      mov ecx,[this].m_data
        movaps xmm0,[ecx]
        mulps xmm0,[ecx]
        movaps xmm2,xmm0
        movhlps xmm1,xmm0
        psrldq xmm2,4
        addps xmm0,xmm1
        addss xmm0,xmm2
        sqrtss xmm0,xmm0
        movss f,xmm0
*/  }
//  return f;
/**/ __m128 a;
    a = _mm_mul_ps(m_data,m_data);
    *((float *)&a) += *(((float *)&a)+1) + *(((float *)&a)+2);
    a = _mm_sqrt_ss(a);
    return *((float *)&a);
/**/ }
The esi stuff is just there so that I get an exception and can view the release code when I "debug" the process; that code is below. There's a bunch of stuff in it I don't get. First is the intrinsic release build, and then my SSE.
00401174 push ebp
00401175 mov ebp,esp
00401177 and esp,0FFFFFFF0h
0040117A sub esp,1Ch
0040117D push esi
0040117E mov esi,0FFFFFFFFh
00401183 mov byte ptr [esi],0
00401186 movaps xmm0,xmmword ptr [ecx]
00401189 movaps xmm1,xmm0
0040118C mulps xmm1,xmm0
0040118F movaps xmmword ptr [esp+10h],xmm1
00401194 fld dword ptr [esp+10h]
00401198 fadd dword ptr [esp+18h]
0040119C pop esi
0040119D fadd dword ptr [esp+10h]
004011A1 fstp dword ptr [esp+0Ch]
004011A5 movaps xmm0,xmmword ptr [esp+0Ch]
004011AA movaps xmmword ptr [esp+0Ch],xmm0
004011AF sqrtss xmm0,xmm0
004011B3 movss dword ptr [esp+0Ch],xmm0
004011B9 movaps xmm0,xmmword ptr [esp+0Ch]
004011BE movaps xmmword ptr [esp+0Ch],xmm0
004011C3 fld dword ptr [esp+0Ch]
004011C7 mov esp,ebp
004011C9 pop ebp
004011CA ret
00401119 mov eax,2625A0h
0040111E mov esi,0FFFFFFFFh
00401123 mov byte ptr [esi],0
00401126 mov ecx,dword ptr [esp+14h]
0040112A movaps xmm0,xmmword ptr [ecx]
0040112D mulps xmm0,xmmword ptr [ecx]
00401130 movaps xmm2,xmm0
00401133 movhlps xmm1,xmm0
00401136 psrldq xmm2,4
0040113B addps xmm0,xmm1
0040113E addss xmm0,xmm2
00401142 sqrtss xmm0,xmm0
00401146 movss dword ptr [esp+1Ch],xmm0
0040114C dec eax
0040114D jne 0040111E
So where did all the ebp/esp stuff go? I guess maybe the compiler can't inline the intrinsics. The main program is pretty simple:
int i,ii,c1,c2;
float f1;
C_fVector3D v;
::timeBeginPeriod(1);
v = -1.f;
for(ii=0;ii<25;ii++) {
c1 = ::timeGetTime();
for(i=0;i<2500000;) {
f1 = v.Length();
v[0] += 0.00001f;
i++;
}
c2 = ::timeGetTime() - c1;
printf("\n%u",c2);
}
::timeEndPeriod(1);
}
One thing I don't get about the compiler's code is the lines at 004011AA, 004011B9 and 004011BE. They seem like a waste. My code is 10 lines, not counting my error stuff; the compiler's is 24. Granted, the compiler's has a bunch more simple instructions. I really don't see how my 10 lines of code are 4x slower than its 24. I think SSE instructions take longer to execute and decode, but I mean... did I miss something somewhere?
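To make it easier to check what the ten inline-SSE instructions actually compute, here is a minimal scalar sketch that models each XMM register as four float lanes. The names `Lanes` and `length_model` are made up for illustration, and the `xmm1_before` parameter is an assumption standing in for whatever xmm1 happened to hold on entry (movhlps only overwrites its low two lanes). It says nothing about speed; it only shows that lane 0 ends up holding sqrt(x*x + y*y + z*z) regardless of the fourth lane or the stale upper half of xmm1.

```cpp
#include <array>
#include <cmath>

// Four float lanes standing in for one XMM register (index 0 is the low lane).
using Lanes = std::array<float, 4>;

// Scalar model of the inline-asm Length() sequence. v is {x, y, z, w}.
// xmm1_before is a hypothetical stand-in for xmm1's previous contents.
float length_model(const Lanes& v, const Lanes& xmm1_before = Lanes{}) {
    Lanes xmm0{};
    Lanes xmm1 = xmm1_before;

    // movaps xmm0,[ecx] / mulps xmm0,[ecx]: square every lane
    for (int i = 0; i < 4; ++i) xmm0[i] = v[i] * v[i];

    // movaps xmm2,xmm0: copy the squares
    Lanes xmm2 = xmm0;

    // movhlps xmm1,xmm0: high two lanes of xmm0 land in the low two lanes
    // of xmm1; the upper two lanes of xmm1 keep their old values
    xmm1[0] = xmm0[2];
    xmm1[1] = xmm0[3];

    // psrldq xmm2,4: shift the whole register right by one lane, zero-filling
    Lanes shifted = {xmm2[1], xmm2[2], xmm2[3], 0.0f};
    xmm2 = shifted;

    // addps xmm0,xmm1: lane 0 becomes x*x + z*z
    for (int i = 0; i < 4; ++i) xmm0[i] += xmm1[i];

    // addss xmm0,xmm2: lane 0 becomes x*x + z*z + y*y
    xmm0[0] += xmm2[0];

    // sqrtss xmm0,xmm0 / movss f,xmm0: square root of lane 0 only
    return std::sqrt(xmm0[0]);
}
```

Note that the sum is formed as (x*x + z*z) + y*y, while the intrinsics version adds x*x + (y*y + z*z), so the two can differ in the last bit for some inputs.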
Now I took what the compiler did and turned it into:
float C_fVector3D::Length(void) {
    float f;
    __asm {
        mov ecx,[this].m_data
        movaps xmm0,[ecx]
        movaps xmm1,xmm0
        mulps xmm1,xmm0
        mov eax,esp
        and eax,0xfffffff0
        movaps [eax+32],xmm1
        fld dword ptr [eax+32]
        fadd dword ptr [eax+36]
        fadd dword ptr [eax+40]
        fstp f
        movss xmm0,f
        sqrtss xmm0,xmm0
        movss f,xmm0
    }
    return f;
}
...and it's 3x slower than my SSE (although now it seems I can't reproduce that). Ack. Is the sky green and the grass blue? I feel lost. Doing 10x more iterations yields ~10x the time I got initially, in all cases. So now I tried using RDTSC around just the code inside the function to get a guesstimate... and the SSE is "faster"?!?!? What is going on here?
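One thing the two listings suggest (this is a reading of the disassembly, not something confirmed): the inline-SSE listing has the loop counter (mov eax,2625A0h ... dec eax / jne) wrapped around it, so it looks as if that version was inlined into the timing loop, while the intrinsic version is a standalone function with its own prologue and ret. If so, the two measurements include different amounts of call overhead. A minimal sketch of a harness that times whole runs, repeats them, and keeps the best one is below. `Vec3`, `time_runs`, `length_loop` and `best_of` are hypothetical names, and `Vec3` is a plain scalar stand-in for C_fVector3D, not the original class.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical scalar stand-in for C_fVector3D.
struct Vec3 {
    float x, y, z;
    float Length() const { return std::sqrt(x * x + y * y + z * z); }
};

// Calls body(iterations) `repeats` times; returns each run's time in microseconds.
template <typename Body>
std::vector<std::int64_t> time_runs(Body body, int repeats, int iterations) {
    assert(repeats > 0 && iterations > 0);
    using clock = std::chrono::steady_clock;
    std::vector<std::int64_t> us;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = clock::now();
        body(iterations);
        const auto t1 = clock::now();
        us.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    }
    return us;
}

// Same loop shape as the test program: call Length(), nudge one component.
// The volatile sink keeps the optimizer from discarding the result.
inline void length_loop(int iterations) {
    volatile float sink = 0.0f;
    Vec3 v{-1.0f, -1.0f, -1.0f};
    for (int i = 0; i < iterations; ++i) {
        sink = v.Length();
        v.x += 0.00001f;
    }
    (void)sink;
}

// The fastest run is usually the one least disturbed by other processes.
inline std::int64_t best_of(const std::vector<std::int64_t>& runs) {
    assert(!runs.empty());
    return *std::min_element(runs.begin(), runs.end());
}
```

A call such as `best_of(time_runs(length_loop, 25, 2500000))` mirrors the 25 x 2,500,000 loop in the test program. Comparing the best run rather than a single run makes the result less sensitive to whatever else the machine was doing.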
if you look in the first post of the first thread in the Laboratory sub-forum,
you will find a download for MichaelW's timers.asm timing macros
just place the unzipped file in your \masm32\macros folder
here is a simple example of how to use the macros
if the code takes a considerable amount of time, you may want to use HIGH_PRIORITY_CLASS instead of REALTIME_PRIORITY_CLASS
INCLUDE \masm32\include\masm32rt.inc
.586
INCLUDE \masm32\macros\timers.asm
LOOP_COUNT EQU 400000000
.code
_main PROC
;select core 0
INVOKE GetCurrentProcess
INVOKE SetProcessAffinityMask,eax,1
INVOKE Sleep,500
mov ecx,3
test00: push ecx
;----------------------------
counter_begin LOOP_COUNT,REALTIME_PRIORITY_CLASS
;----------------------------
;code to be timed
mov eax,1
; bsr ecx,eax
xor ecx,ecx
bsr ecx,eax
;----------------------------
counter_end
;----------------------------
print str$(eax),9,"clock cycles",13,10
pop ecx
dec ecx
jnz test00
; inkey
exit
_main ENDP
END _main
the order of the first 3 lines is somewhat important
the masm32rt.inc file sets the processor to .486
the timers.asm macros require it to be .586 or higher
you can get fairly good results if you adjust the LOOP_COUNT EQUate so each test takes about 500 ms or more
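The advice about sizing LOOP_COUNT so a test runs for roughly 500 ms can also be automated. The sketch below is not part of timers.asm; it is a hypothetical C++ helper (`calibrate_iterations`, with `spin` as a dummy workload) that doubles the iteration count until a single run lasts at least the requested time.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Doubles n until one call of body(n) takes at least `target`.
// Returns the iteration count that reached the target.
template <typename Body>
std::int64_t calibrate_iterations(Body body, std::chrono::milliseconds target,
                                  std::int64_t start = 1000) {
    assert(start > 0);
    using clock = std::chrono::steady_clock;
    std::int64_t n = start;
    for (;;) {
        const auto t0 = clock::now();
        body(n);
        const auto elapsed = clock::now() - t0;
        // Stop when the run is long enough, or before n would overflow.
        if (elapsed >= target || n > std::numeric_limits<std::int64_t>::max() / 2)
            return n;
        n *= 2;
    }
}

// Hypothetical dummy workload; replace with the code under test.
inline void spin(std::int64_t n) {
    volatile std::int64_t sink = 0;
    for (std::int64_t i = 0; i < n; ++i) sink = sink + i;
}
```

For example, `calibrate_iterations(spin, std::chrono::milliseconds(500))` returns a count that can then be used for the real timed runs.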
Well, if you use floating-point operations, the calculation takes longer. We are not the only assembler users; there are many. Most compiler writers understand assembler too, and they are able to generate optimized code. If you can't beat the standard C library, there is no point in writing your own.