My code is ..... 4X slower ?!?!?

Started by robione, January 03, 2010, 02:34:33 PM

robione

This might be a good example of debugging code (in this case timing code) that itself needs to be debugged. I'm not sure. I don't think so, but it wouldn't be the 1st time :/ ....... (much time passes) It turns out I have no clue how to time anything. Brief background:

I've been updating an old class I wrote (3D-Vectors) to use SSE. I figure I'll expand that into a "general" vector class after I squeeze out the speed here and can carry it over. For about a day I wrote most of this using intrinsics, because I didn't know how to access class members in assembly... until I got to updating the cross product. Come to find out, MSVC6 SP5 doesn't have intrinsics for a bunch of SSE2 instructions.... now I really didn't have a choice.

I figured it out... now my cross product is faster :).... then I decided to see if I could get more speed out of my Length() method. Here's the before (intrinsics) and after (inline SSE) kinda rolled into one:


float Length(void) {
// float f;
__asm {
mov esi,0xffffffff
mov [esi],0
/* mov ecx,[this].m_data

movaps xmm0,[ecx]
mulps xmm0,[ecx]
movaps xmm2,xmm0
movhlps xmm1,xmm0
psrldq xmm2,4
addps xmm0,xmm1
addss xmm0,xmm2
sqrtss xmm0,xmm0

movss f,xmm0
*/ }
// return f;
/**/ __m128 a;

a = _mm_mul_ps(m_data,m_data);
*((float *)&a) += *(((float *)&a)+1) + *(((float *)&a)+2);
a = _mm_sqrt_ss(a);

return *((float *)&a);
/**/ }


The esi stuff is just there so I get an exception and can view the release code when I "debug" the process; the listings are below. I don't get a bunch of stuff. First is the intrinsic release build and then my SSE.

00401174   push        ebp
00401175   mov         ebp,esp
00401177   and         esp,0FFFFFFF0h
0040117A   sub         esp,1Ch
0040117D   push        esi
0040117E   mov         esi,0FFFFFFFFh
00401183   mov         byte ptr [esi],0
00401186   movaps      xmm0,xmmword ptr [ecx]
00401189   movaps      xmm1,xmm0
0040118C   mulps       xmm1,xmm0
0040118F   movaps      xmmword ptr [esp+10h],xmm1
00401194   fld         dword ptr [esp+10h]
00401198   fadd        dword ptr [esp+18h]
0040119C   pop         esi
0040119D   fadd        dword ptr [esp+10h]
004011A1   fstp        dword ptr [esp+0Ch]
004011A5   movaps      xmm0,xmmword ptr [esp+0Ch]
004011AA   movaps      xmmword ptr [esp+0Ch],xmm0
004011AF   sqrtss      xmm0,xmm0
004011B3   movss       dword ptr [esp+0Ch],xmm0
004011B9   movaps      xmm0,xmmword ptr [esp+0Ch]
004011BE   movaps      xmmword ptr [esp+0Ch],xmm0
004011C3   fld         dword ptr [esp+0Ch]
004011C7   mov         esp,ebp
004011C9   pop         ebp
004011CA   ret


00401119   mov         eax,2625A0h
0040111E   mov         esi,0FFFFFFFFh
00401123   mov         byte ptr [esi],0
00401126   mov         ecx,dword ptr [esp+14h]
0040112A   movaps      xmm0,xmmword ptr [ecx]
0040112D   mulps       xmm0,xmmword ptr [ecx]
00401130   movaps      xmm2,xmm0
00401133   movhlps     xmm1,xmm0
00401136   psrldq      xmm2,4
0040113B   addps       xmm0,xmm1
0040113E   addss       xmm0,xmm2
00401142   sqrtss      xmm0,xmm0
00401146   movss       dword ptr [esp+1Ch],xmm0
0040114C   dec         eax
0040114D   jne         0040111E


So where did all the ebp/esp stuff go? I guess maybe the compiler can't inline the intrinsics. The main program is pretty simple:
int main(void) {
int i,ii,c1,c2;
float f1;
C_fVector3D v;

::timeBeginPeriod(1);

v = -1.f;

for(ii=0;ii<25;ii++) {
c1 = ::timeGetTime();
for(i=0;i<2500000;) {
f1 = v.Length();
v[0] += 0.00001f;
i++;
}
c2 = ::timeGetTime() - c1;
printf("\n%d",c2);
}

::timeEndPeriod(1);
}


One thing I don't get about the compiler's code is lines 004011AA, 004011B9, 004011BE. That kinda seems like a waste. My code is 10 lines minus my error stuff.... the compiler's is 24. Granted, the compiler uses a bunch more simple instructions. I really don't see how my 10 lines of code are 4x slower than 24. I think SSE instructions take longer to execute and decode, but I mean.... did I miss something somewhere?
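For comparison, here is a minimal sketch of the same horizontal add written with SSE1 intrinsics only, so it does not need the SSE2 byte shift (psrldq) and never leaves the XMM registers. It assumes a modern C++ compiler rather than MSVC6, and Vec3Sketch, length_sse and length_scalar are hypothetical names standing in for the class in this post, not the actual C_fVector3D code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <xmmintrin.h>  // SSE1 intrinsics only

// Hypothetical stand-in for the vector class: four floats, 16-byte aligned,
// with the fourth lane unused by the length calculation.
struct alignas(16) Vec3Sketch {
    float data[4];
};

// Length using SSE1 shuffles; the partial sums stay in XMM registers.
inline float length_sse(const Vec3Sketch& v) {
    // movaps-style loads require 16-byte alignment.
    assert(reinterpret_cast<std::uintptr_t>(v.data) % 16 == 0);

    __m128 a  = _mm_load_ps(v.data);      // (x, y, z, w)
    __m128 sq = _mm_mul_ps(a, a);         // (x*x, y*y, z*z, w*w)
    __m128 hi = _mm_movehl_ps(sq, sq);    // lane 0 now holds z*z
    __m128 yy = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(1, 1, 1, 1)); // lane 0 = y*y
    __m128 s  = _mm_add_ss(_mm_add_ss(sq, hi), yy);  // lane 0 = x*x + z*z + y*y
    return _mm_cvtss_f32(_mm_sqrt_ss(s)); // scalar sqrt of lane 0
}

// Plain scalar reference for checking the result.
inline float length_scalar(const Vec3Sketch& v) {
    return std::sqrt(v.data[0] * v.data[0] +
                     v.data[1] * v.data[1] +
                     v.data[2] * v.data[2]);
}
```

For a vector (3, 4, 12, anything) both functions should return 13, since the fourth lane never enters the sum. Whether this is faster than either listing above is something only a measurement can tell.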

Now I took what the compiler did and turned it into:

float C_fVector3D::Length(void) {
float f;
__asm {
   mov ecx,[this].m_data
   movaps      xmm0,[ecx]
   movaps      xmm1,xmm0
   mulps       xmm1,xmm0
   mov    eax,esp
   and    eax,0xfffffff0
   movaps      [eax+32],xmm1
   fld         dword ptr [eax+32]
   fadd        dword ptr [eax+36]
   fadd        dword ptr [eax+40]
   fstp        f
   movss      xmm0,f
   sqrtss      xmm0,xmm0
   movss       f,xmm0
   }
return f;
}
....and it's 3x slower than my SSE. (Although now it seems I can't reproduce that.) Ack. Is the sky green and the grass blue? I feel lost. Doing 10x more iterations yields .... ~10x the time I'm given initially in all cases. So now I tried using RDTSC just for the code inside the function to get a guesstimate..... and the SSE is "faster"?!?!? What is going on here?
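Two things that can make a loop like the one above hard to trust are a coarse timer and an optimizer that inlines or drops work whose result is never used. A minimal sketch of a harness that guards against both, assuming a modern C++ compiler: best_time_ms is a hypothetical helper, not part of any library, and it uses std::chrono::steady_clock instead of timeGetTime.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <limits>

// Hypothetical helper: calls work(i) `iterations` times per run, repeats the
// run `runs` times, and returns the fastest run in milliseconds.
template <typename Work>
double best_time_ms(Work work, int iterations, int runs) {
    assert(iterations > 0 && runs > 0);
    using clock = std::chrono::steady_clock;

    double best = std::numeric_limits<double>::max();
    volatile float sink = 0.0f;  // every result is stored, so the calls cannot be dropped

    for (int r = 0; r < runs; ++r) {
        auto t0 = clock::now();
        for (int i = 0; i < iterations; ++i) {
            sink = work(i);
        }
        auto t1 = clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < best) {
            best = ms;  // the minimum is least disturbed by other processes
        }
    }
    (void)sink;
    return best;
}
```

A call such as best_time_ms([](int i) { return std::sqrt(static_cast<float>(i)); }, 2500000, 25) would mirror the 25 x 2500000 loop in the main program. Taking the minimum rather than the last or average run is a design choice, not the only reasonable one.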

dedndave

if you look in the first post of the first thread in the Laboratory sub-forum,
you will find a d/l for MichaelW's timers.asm timing macros
just place the un-zipped file in your \masm32\macros folder
here is a simple example of how to use the macros
if the code takes a considerable amount of time, you may want to use HIGH_PRIORITY_CLASS instead of REALTIME

        INCLUDE \masm32\include\masm32rt.inc
        .586
        INCLUDE \masm32\macros\timers.asm

LOOP_COUNT EQU  400000000

        .code

_main   PROC

;select core 0

        INVOKE  GetCurrentProcess
        INVOKE  SetProcessAffinityMask,eax,1
        INVOKE  Sleep,500

        mov     ecx,3

test00: push    ecx

;----------------------------

        counter_begin LOOP_COUNT,REALTIME_PRIORITY_CLASS

;----------------------------
;code to be timed

        mov     eax,1
;       bsr     ecx,eax
        xor     ecx,ecx
        bsr     ecx,eax

;----------------------------

        counter_end

;----------------------------

        print   str$(eax),9,"clock cycles",13,10
        pop     ecx
        dec     ecx
        jnz     test00

;        inkey
        exit

_main   ENDP

        END     _main

the order of the first 3 lines is somewhat important
the masm32rt.inc file sets the processor to .486
the timers.asm macros require it to be .586 or higher
you can get fairly good results if you adjust the LOOP_COUNT EQUate so each test takes about 500 ms or more
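the same idea of sizing the loop so one test lasts long enough can be sketched outside of MASM. this is only an illustration in C++, not part of timers.asm; calibrate_iterations is a hypothetical helper that doubles the loop count until a single run takes at least the target time.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: doubles the loop count until one pass over
// work(i) takes at least target_ms milliseconds, then returns that count.
template <typename Work>
long long calibrate_iterations(Work work, double target_ms) {
    assert(target_ms > 0.0);
    using clock = std::chrono::steady_clock;

    volatile unsigned sink = 0;  // keeps the compiler from removing the loop
    long long n = 1;
    for (;;) {
        auto t0 = clock::now();
        for (long long i = 0; i < n; ++i) {
            sink = work(static_cast<unsigned>(i));
        }
        double ms = std::chrono::duration<double, std::milli>(clock::now() - t0).count();
        // Stop once the run is long enough; the size cap is only a safety net.
        if (ms >= target_ms || n > (1LL << 40)) {
            (void)sink;
            return n;
        }
        n *= 2;
    }
}
```

calling it with a target of 500.0 would give a count in the spirit of the LOOP_COUNT advice above. the returned value is always a power of two because the count only ever doubles.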

Farabi

Well, if you use floating-point operations, it means a longer time to calculate. We are not the only assembler users; there are many. Most compiler makers understand assembler too, and they are able to generate optimized code. If you can't beat the standard C library, it means there is nothing you should create.
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"