VC++ vs assembler code and how to write floating point code using scalar SSE

Started by Mark_Larson, March 21, 2005, 10:29:18 PM

Previous topic - Next topic

Mark_Larson

  I have always loved real time raytracing.  Realtime raytracing does raytracing with framerates that you can use in games.  So you have to be able to draw the screen in fractions of a second.  I still visit a lot of websites that have realtime raytracing information to see if I can find better algorithms than what I currently am using in my raytracer.  I also like to find raytracing code on the internet and try and make it run faster.  I am going to talk about a particular procedure in some C++ code I found on the internet.  The only changes I made to the code is I converted it to C code.  The code can be found here http://www.icarusindie.com/DoItYourSelf/rtsr/realtime3d/lesson28.php


  I want to cover three things.  First, how the Visual C++ compiler converts the C++ code to assembly code.  Second, I want to cover some basic optimization tricks you do when you first convert some C code to assembler.  And third, the C++ code uses a lot of single precision floating point code.  I want to go over how to use the scalar version of the SSE instructions.  The scalar version is no different than standard floating point code.  They can do most of the same things that you can do with the standard floating point instructions.  The other big plus is NO FP REGISTER STACK!!!  Sorry that implementation of the floating point registers drives me nuts.  Intel should have their head removed for that one.  Here is the routine we are going to be converting to scalar SSE code.  It basically checks to see if a particular ray that is being cast hits a particular sphere.  Their is an array of spheres, parameter "int j" tells the routine which sphere is being checked against.  Parameters x1, y1, and z1 are the location of the camera.  And parameters x2, y2, and z2 are the location of the pixel.  Parameter "ret" is used to tell whether the ray hit the sphere or not.


my_point ray_sphere(int j, float x1,float y1,float z1,float x2,float y2,float z2, int &ret)
{
   my_point V, EO;
   float v, disc;

   //assume the intersection failed
   ret=-1;

   V.x = x2 - x1;
   V.y = y2 - y1;
   V.z = z2 - z1;

   EO.x = the_sphere[j].x - x1;
   EO.y = the_sphere[j].y - y1;
   EO.z = the_sphere[j].z - z1;

     if(EO.x*V.x + EO.y*V.y + EO.z*V.z<=0)
       return the_sphere[j].P; //intersection is behind the camera, so don't draw the point

   v = EO.x * V.x + EO.y * V.y + EO.z * V.z;

   disc = the_sphere[j].radius * the_sphere[j].radius -
           (EO.x * EO.x +
            EO.y * EO.y +
            EO.z * EO.z - v*v);

   if(disc>=0)
   {
       disc = sqrtf(disc);
       the_sphere[j].P.x = x1 + (v-disc)*V.x;
       the_sphere[j].P.y = y1 + (v-disc)*V.y;
       the_sphere[j].P.z = z1 + (v-disc)*V.z;
       ret=1;
   }
   return the_sphere[j].P;
}



  I am going to first talk about using scalar SSE.  So let's look at the different floating point arithmetic the author is doing in the code.  They do an add, subtract, multiply, square root, and a "move".  The corresponding scalar SSE instsructions are ( I went ahead and added divsion in there).


fsub = subss
fadd = addss
      fmul = mulss
      fsqrt = sqrtss
fdiv = divss
fld = movss



  So as you can see it's very easy to convert some floating point code to use scalar SSE.  So what if you want to do a comparison?  That's easy.  They have an instruction that's even better than the standard floating point compare.


movss xmm7,[five] ;register has a floating point 5.0f in it
movss xmm0,[zero] ;register has a floating point 0.0f in it
comiss xmm7,xmm0 ;set PF, CF, and ZF based upon comparison
ja xmm7_greater_than_xmm0

;it is less than or equal if it gets here

xmm7_greater_than_xmm0:
;xmm7 is greater than xmm0




  The instruction sets the PF, CF, and ZF flags in the EFLAGS register based on the comparison between the two operands.  The OF, SF, and AF flags are set to 0.  Pretty cool huh?  It's a lot harder to do the same code with floating point registers.  You can't do all the combinations of conditional jumps but you can do most of them.  Check the "Jcc" instruction ( jump conditional) in Book 2 of the P4 processor ( you can use an earlier Intel processor manual as well).  The ones that need OF, AF or SF won't work.  Here is the list that work from my P4 instruction set manual:


ja, jae, jb, jbe, jc, jcx, jecx, je, jna, jnae, jnb, jnbe, jnc, jne, jnp, jnz, jp, jpe, jpo, jz, ja, jae, jb, jbe, jc, je, jz



  So what else are we missing?  Trigonometric support for SSE.  Intel actually wrote a set of routines that use scalar SSE, packed SSE, scalar SSE2, and packed SSE2 to do trigonometric functions.  They are faster than a floating point trig instruction, faster than a table lookup, but have less accuracy.  It is called the Approximate Math Library.  This page has a link to where you can download it.  Just look for "Approximate Math Library" on the page.

http://developer.intel.com/design/Pentium4/devtools/ 

  So now let's try a straight conversion of the above C code to scalar SSE.  I changed one thing.  I pass in a pointer to the proper sphere in the array of spheres we are accessing.  I did this all in inline assembler.  As a general rule if you are trying to optimize some code and only change a few lines of a procedure in C to assembler, you don't get as much of a speed up.  You generally have to convert the whole procedure or most of it.  Just letting you know for other people who want to use inline assembler like I did.  For real speed I'd convert this to an external assembler file after getting it to work with inline assembler.  I wrote this code without first looking at what assembly code VC++ produces for this procedure. I find that doing it that way always helps me write much faster assembler code than what VC++ can generate.  I don't want the way VC++ does it to taint my thought processes, when I am converting.  So here is my scalar SSE code.


my_point my_ray_sphere(my_sphere *the_sphere, float x1,float y1,float z1,float x2,float y2,float z2, int &ret)
{
   //assume the intersection failed
   ret=-1;

_asm {
mov esi,[the_sphere]
movss xmm0,[x2] //x2
movss xmm1,[y2] //y2
movss xmm2,[z2] //z2

// the_sphere->V.x = x2 - x1;
// the_sphere->V.y = y2 - y1;
// the_sphere->V.z = z2 - z1;

movss xmm3,[x1] //x1
movss xmm4,[y1] //y1
movss xmm5,[z1] //z1
subss xmm0,xmm3 //V.x
subss xmm1,xmm4 //V.y
subss xmm2,xmm5 //V.z

// the_sphere->EO.x = the_sphere->x - x1;
// the_sphere->EO.y = the_sphere->y - y1;
// the_sphere->EO.z = the_sphere->z - z1;

movss xmm6,[esi] //x
movss xmm7,[esi+4] //y
subss xmm6,xmm3 //E0.x
movss xmm3,[esi+8] //z
subss xmm7,xmm4 //E0.y
subss xmm3,xmm5 //E0.z

//the_sphere->v = the_sphere->EO.x * the_sphere->V.x + the_sphere->EO.y * the_sphere->V.y + the_sphere->EO.z * the_sphere->V.z;
//xmm4 and xmm5 free

movss xmm4,xmm6 //E0.x
movss xmm5,xmm7 //E0.y
mulss xmm4,xmm0 //E0.x * V.x
mulss xmm5,xmm1 //E0.y * V.y
addss xmm4,xmm5 //E0.x * V.x + E0.y * V.y
movss xmm5,xmm3 //E0.z
mulss xmm5,xmm2 //E0.z * V.z
addss xmm4,xmm5 //E0.x * V.x + E0.y * V.y + E0.z * V.z
pxor xmm5,xmm5 //xmm5 = 0.0f

// if ( the_sphere->v <= 0)
// return the_sphere->P; //intersection is behind the camera, so don't draw the point

comiss xmm4,xmm5 //sets PF, ZF and CF in EFLAGS based on result of compare.
//ja, jae, jb, jbe, jc, jcx, jecx, je, jna, jnae, jnb, jnbe, jnc, jne, jnp, jnz, jp, jpe, jpo, jz, ja, jae, jb, jbe, jc, je, jz
jbe exit_routine

//v = xmm4
//V.x = xmm0
//V.y = xmm1
//V.z = xmm2
//E0.x = xmm6
//E0.y = xmm7
//E0.z = xmm3

//the_sphere->disc = the_sphere->radius_squared -
// (the_sphere->EO.x * the_sphere->EO.x +
// the_sphere->EO.y * the_sphere->EO.y +
// the_sphere->EO.z * the_sphere->EO.z - the_sphere->v*the_sphere->v);

mulss xmm6,xmm6 //E0.x * E0.x
movss xmm5,xmm4 //v
mulss xmm7,xmm7 //E0.y * E0.y
mulss xmm5,xmm5 //v*v
mulss xmm3,xmm3 //E0.z * E0.z
addss xmm6,xmm7 //E0.x * E0.x + E0.y * E0.y
addss xmm5,xmm3 //E0.z * E0.z + v*v
pxor xmm7,xmm7 //xmm7 = 0.0f
movss xmm3,[esi+16] //the_sphere->radius_squared
addss xmm6,xmm5 //E0.x * E0.x + E0.y * E0.y + E0.z * E0.z + v*v
subss xmm3,xmm6 //disc = radius_squared - E0.x * E0.x + E0.y * E0.y + E0.z * E0.z + v*v

//if(the_sphere->disc>=0)
//{
// the_sphere->disc = sqrtf(the_sphere->disc);
// the_sphere->P.x = x1 + (the_sphere->v-the_sphere->disc)*the_sphere->V.x;
// the_sphere->P.y = y1 + (the_sphere->v-the_sphere->disc)*the_sphere->V.y;
// the_sphere->P.z = z1 + (the_sphere->v-the_sphere->disc)*the_sphere->V.z;
// ret=1;
//}

comiss xmm3,xmm7 //Set PF, CF and ZF in Eflags register based on scalar compare
jb exit_routine
//ja, jae, jb, jbe, jc, jcx, jecx, je, jna, jnae, jnb, jnbe, jnc, jne, jnp, jnz, jp, jpe, jpo, jz, ja, jae, jb, jbe, jc, je, jz

//v = xmm4
//V.x = xmm0
//V.y = xmm1
//V.z = xmm2
//disc = xmm6

// xmm3 free, xmm5 free, xmm7 free

sqrtss xmm6,xmm6 //disc = sqrtf(disc)
movss xmm3,[x1] //x1
movss xmm5,[y1] //y1
subss xmm4,xmm6 //v-disc

//xmm6 free now

movss xmm6,[z1] //z1
mulss xmm0,xmm4 //(v-disc)*V.x
mulss xmm1,xmm4 //(v-disc)*V.y
mulss xmm2,xmm4 //(v-disc)*V.z
addss xmm0,xmm3 //x1 + (v-disc)*V.x
addss xmm1,xmm5 //y1 + (v-disc)*V.y
addss xmm2,xmm6 //z1 + (v-disc)*V.z

movss [esi+20],xmm0 //P.x
movss [esi+24],xmm1 //P.y
movss [esi+28],xmm2 //P.z

}

ret = 1; //set return equal to a 1 meaning the ray hit the sphere

_asm {

exit_routine:

}

return the_sphere->P; //return the point of intersection.
}



   I am generally really good about knowing the best way to convert some C code to assembler to make it run really fast.  What's the timing difference between the two routines?  Well the original floating point code runs in 171 cycles on my P4.  My converted scalar SSE routine runs in 79 cycles.  That is more than twice as fast.  That's a straight conversion with no big attempts at trying to speed it up.  The only things I did was read variables into registers as early as possible ( and keep them there) to help break dependencies later on.  ( Bad Microsoft for not doing the same).  And when I had to use the result of a calculation on the next line, I tried to calculate it earlier ( to help break up dependencies).  I couldn't always do that, but it makes a big difference.  Oh, and before I forget.  I meant to address one other thing.  Converting the code from scalar to non-scalar is pretty easy.  All the instructions get the trailing "ss" changed to a "ps".  The second to last "s" in "ss" stands for "scalar".  The "p" in "ps" stands for "packed".  One additional change.  The MOVE instruction needs an "a" in the middle of it ( stands for "aligned").  And you need to make sure all accessed memory locations are on a 16 byte boundary.  The packed SSE code processes 4 rays at a time and runs in 25 cycles on my P4.  Here are the non-scalar SSE instructions.


move = movaps
sub = subps
add = addps
      mul = mulps
      sqrt = sqrtps
div = divps



  So what's left?  Let's look at the code that the VC++ 6.0 sp5 compiler was generating and see why my assembler code was so much better.  And this is exactly why I can always write faster code in assembler than what the C compiler can generate.  I am going to break it up into sections to make it easier to follow.  Oh and one quick thing.  Total memory accesses in my code for the whole procedure is 17.  The total memory accesses VC++ does in it's assembler code is 72!!!  Ewwwwww. That relates to the one of two optimziations I did in the conversion.  I read variables early into registers and keep them there.  And here is the very important part, how I decide which variables go into registers.  I look at the routine and find which variables are used the most in the C code, and then keep those variables in registers instead of memory.  Almost every line of the assembly code generated by VC++ has a memory access.  I also included pushes/pops in that count.  One other slight optimization that I do when I convert code from C to assembler.  I read values into registers first, and then do operations on two registers.  I rarely do operations between a register and memory.  Since the Pentium processor came out, doing it that way has been faster, because it's RISC like.

My comments are in the code block.


; 50   :    my_point V, EO;
; 51   :    float v, disc;
; 52   :
; 53   :    //assume the intersection failed
; 54   :    ret=-1;
; 55   :
; 56   :    V.x = x2 - x1;
; 57   :    V.y = y2 - y1;
; 58   :    V.z = z2 - z1;

fld DWORD PTR _x2$[esp+28]
fsub DWORD PTR _x1$[esp+28]


mov eax, DWORD PTR _j$[esp+28]
mov edx, DWORD PTR _ret$[esp+28]
push esi
lea ecx, DWORD PTR [eax+eax*8]
mov DWORD PTR [edx], -1
fstp DWORD PTR _V$[esp+36]
fld DWORD PTR _y2$[esp+32]
fsub DWORD PTR _y1$[esp+32]
shl ecx, 2
fstp DWORD PTR _V$[esp+40]
fld DWORD PTR _z2$[esp+32]
fsub DWORD PTR _z1$[esp+32]
fstp DWORD PTR _V$[esp+44]




//and here is my code that does the same exact thing.  See how much simpler my code is?  In addition now that I have read x1, y1, and z1 into
// registers I won't re-read them.  They will stay in the registers.  The next step also uses x1, y1, and z1.

movss xmm0,[x2] //x2
movss xmm1,[y2] //y2
movss xmm2,[z2] //z2
movss xmm3,[x1] //x1
movss xmm4,[y1] //y1
movss xmm5,[z1] //z1
subss xmm0,xmm3 //V.x
subss xmm1,xmm4 //V.y
subss xmm2,xmm5 //V.z





   My comments are in the code block.


; 59   :
; 60   :    EO.x = the_sphere[j].x - x1;
fld DWORD PTR ?the_sphere@@3PAUmy_sphere@@A[ecx]
fsub DWORD PTR _x1$[esp+32]
fstp DWORD PTR _EO$[esp+36]
; 61   :    EO.y = the_sphere[j].y - y1;

fld DWORD PTR ?the_sphere@@3PAUmy_sphere@@A[ecx+4]
fsub DWORD PTR _y1$[esp+32]
fstp DWORD PTR _EO$[esp+40]

; 62   :    EO.z = the_sphere[j].z - z1;

fld DWORD PTR ?the_sphere@@3PAUmy_sphere@@A[ecx+8]
fsub DWORD PTR _z1$[esp+32]
fst DWORD PTR _EO$[esp+44]


//and here is my code that does the same exact thing.  xmm3, xmm4 and xmm5 already have x1, y1, and z1 in them from the previous block of code
// I left them in there so now I don't have to read them from memory again.  If you look at the VC++ code it re-reads those variables from memory.
// x1, y1, and z1 were used in both of the first two steps.


movss xmm6,[esi] //x
movss xmm7,[esi+4] //y
subss xmm6,xmm3 //E0.x
movss xmm3,[esi+8] //z
subss xmm7,xmm4 //E0.y
subss xmm3,xmm5 //E0.z




   My comments are in the code block.


; 63   :
; 64   :      if(EO.x*V.x + EO.y*V.y + EO.z*V.z<=0)

fmul DWORD PTR _V$[esp+44]
fld DWORD PTR _EO$[esp+40]
fmul DWORD PTR _V$[esp+40]
faddp ST(1), ST(0)
fld DWORD PTR _EO$[esp+36]
fmul DWORD PTR _V$[esp+36]
faddp ST(1), ST(0)
fcom DWORD PTR __real@00000000
fnstsw ax
test ah, 65 ; 00000041H
jp SHORT $L43300

; 65   :        return the_sphere[j].P; //intersection is behind the camera, so don't draw the point

lea eax, DWORD PTR ?the_sphere@@3PAUmy_sphere@@A[ecx+20]
mov ecx, eax
mov eax, DWORD PTR $T43368[esp+32]
mov edx, eax
mov esi, DWORD PTR [ecx]
mov DWORD PTR [edx], esi
mov esi, DWORD PTR [ecx+4]
mov DWORD PTR [edx+4], esi
mov esi, DWORD PTR [ecx+8]
mov DWORD PTR [edx+8], esi
pop esi
mov ecx, DWORD PTR [ecx+12]
fstp ST(0)
mov DWORD PTR [edx+12], ecx

; 83   : }

add esp, 32 ; 00000020H
ret 0
$L43300:


//and my code that does the same thing.  You'll notice that I kept EO.x, EO.y and EO.z in registers.  VC++ saved them to memory and read them back
// They are in xmm6, xmm7, and xmm3

movss xmm4,xmm6 //E0.x
movss xmm5,xmm7 //E0.y
mulss xmm4,xmm0 //E0.x * V.x
mulss xmm5,xmm1 //E0.y * V.y
addss xmm4,xmm5 //E0.x * V.x + E0.y * V.y
movss xmm5,xmm3 //E0.z
mulss xmm5,xmm2 //E0.z * V.z
addss xmm4,xmm5 //E0.x * V.x + E0.y * V.y + E0.z * V.z
pxor xmm5,xmm5 //xmm5 = 0.0f

comiss xmm4,xmm5 //sets PF, ZF and CF in EFLAGS based on result of compare.
jbe exit_routine




   My comments are in the code block.


; 66   :
; 67   :    v = EO.x * V.x + EO.y * V.y + EO.z * V.z;
; 68   :
; 69   :    disc = the_sphere[j].radius * the_sphere[j].radius -
; 70   :            (EO.x * EO.x +
; 71   :             EO.y * EO.y +
; 72   :             EO.z * EO.z - v*v);

fld DWORD PTR ?the_sphere@@3PAUmy_sphere@@A[ecx+12]
fld ST(0)
fmulp ST(1), ST(0)
fld DWORD PTR _EO$[esp+44]
fmul DWORD PTR _EO$[esp+44]
fld DWORD PTR _EO$[esp+40]
fmul DWORD PTR _EO$[esp+40]
faddp ST(1), ST(0)
fld DWORD PTR _EO$[esp+36]
fmul DWORD PTR _EO$[esp+36]
faddp ST(1), ST(0)
fld ST(2)
fmul ST(0), ST(3)
fsubp ST(1), ST(0)
fsubp ST(1), ST(0)


//Here is my code.  The previous code block calculated "v" already and compared it against 0.0f.  I saved that result in register xmm4.
// So I will calculate the value for variable "disc".  VC++ also saves the result of the previous comparison.
// you'll notice I read radius_squared into a register first and then do an operation on the register to be more RISC-like.

mulss xmm6,xmm6 //E0.x * E0.x
movss xmm5,xmm4 //v
mulss xmm7,xmm7 //E0.y * E0.y
mulss xmm5,xmm5 //v*v
mulss xmm3,xmm3 //E0.z * E0.z
addss xmm6,xmm7 //E0.x * E0.x + E0.y * E0.y
addss xmm5,xmm3 //E0.z * E0.z + v*v
pxor xmm7,xmm7 //xmm7 = 0.0f
movss xmm3,[esi+16] //the_sphere->radius_squared
addss xmm6,xmm5 //E0.x * E0.x + E0.y * E0.y + E0.z * E0.z + v*v
subss xmm3,xmm6 //disc = radius_squared - E0.x * E0.x + E0.y * E0.y + E0.z * E0.z + v*v




   My comments are in the code block.


; 73   :
; 74   :    if(disc>=0)

fcom DWORD PTR __real@00000000
fnstsw ax
and eax, 256 ; 00000100H
jne SHORT $L43376

; 75   :    {
; 76   :        disc = sqrtf(disc);

fsqrt

; 77   :        the_sphere[j].P.x = x1 + (v-disc)*V.x;

fsubp ST(1), ST(0)
fld DWORD PTR _V$[esp+36]
fmul ST(0), ST(1)
fadd DWORD PTR _x1$[esp+32]
fstp DWORD PTR ?the_sphere@@3PAUmy_sphere@@A[ecx+20]

; 78   :        the_sphere[j].P.y = y1 + (v-disc)*V.y;

fld DWORD PTR _V$[esp+40]
fmul ST(0), ST(1)
fadd DWORD PTR _y1$[esp+32]
fstp DWORD PTR ?the_sphere@@3PAUmy_sphere@@A[ecx+24]

; 79   :        the_sphere[j].P.z = z1 + (v-disc)*V.z;

fmul DWORD PTR _V$[esp+44]
fadd DWORD PTR _z1$[esp+32]
fstp DWORD PTR ?the_sphere@@3PAUmy_sphere@@A[ecx+28]

; 80   :        ret=1;

mov DWORD PTR [edx], 1
jmp SHORT $L43301
$L43376:

; 73   :
; 74   :    if(disc>=0)

fstp ST(0)

; 80   :        ret=1;

fstp ST(0)
$L43301:

; 81   :    }



//and here is my code.  Again I do fewer memory accesses, because my code has most of the variables already in registers.
//The code is checking if "disc" ( the discriminate) is bigger than or equal to 0.  If it is, then we the ray hits the
// sphere and we want to compute the point of intersection.  P.x, P.y and P.z are the points that correspond to the point of intersection.



comiss xmm3,xmm7 //Set PF, CF and ZF in Eflags register based on scalar compare
jb exit_routine
//ja, jae, jb, jbe, jc, jcx, jecx, je, jna, jnae, jnb, jnbe, jnc, jne, jnp, jnz, jp, jpe, jpo, jz, ja, jae, jb, jbe, jc, je, jz

//v = xmm4
//V.x = xmm0
//V.y = xmm1
//V.z = xmm2
//disc = xmm6

// xmm3 free, xmm5 free, xmm7 free

sqrtss xmm6,xmm6 //disc = sqrtf(disc)
movss xmm3,[x1] //x1
movss xmm5,[y1] //y1
subss xmm4,xmm6 //v-disc

//xmm6 free now

movss xmm6,[z1] //z1
mulss xmm0,xmm4 //(v-disc)*V.x
mulss xmm1,xmm4 //(v-disc)*V.y
mulss xmm2,xmm4 //(v-disc)*V.z
addss xmm0,xmm3 //x1 + (v-disc)*V.x
addss xmm1,xmm5 //y1 + (v-disc)*V.y
addss xmm2,xmm6 //z1 + (v-disc)*V.z

movss [esi+20],xmm0 //P.x
movss [esi+24],xmm1 //P.y
movss [esi+28],xmm2 //P.z

}



  Bonus points if you made it all the way to the end!!!  If you are really serious about being a first rate optimizer, then learning to optimize code from what the VC++ compiler generates isn't the best way to learn how to optimize code.  Buy some books on optimization or visit some websites.  For the bulk of the people out there the assembler code that VC++ generates is fast enough.  However some people need things to be blazing fast for their program and for that you can't beat highly tuned assembler code.

Mark Larson
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

GregL

Mark,

Wow, good article! I'm going to print it out and read it tonight.

I had been wondering about how to convert FPU & MMX code to SSE/SSE2/SSE3, because Win64 is not going to allow FPU & MMX code in 64-bit mode. Plus, it sounds like it's a good performance boost for 32-bit code. It's time to start digging into it.

Thanks,