Help with this function

Started by adam23, December 06, 2006, 06:19:38 PM

adam23

Okay, I have a class called Vector3Asm with three int members (x, y, z) that I am trying to implement almost entirely in assembly language.

This code runs and produces correct results, but over 1 million iterations it is about 266 ms slower than my C++ code.


//====================================================
//operator / (int)
//This will probably be used more than the previous
//one. The function checks that the value is not equal
//to zero.
//=====================================================
Vector3Asm Vector3Asm::operator /(int rhs)
{
Vector3Asm temp;
_asm
{
mov ebx, rhs //ebx = rhs
cmp ebx, 0
je exit //Check for divide by zero

mov ecx, this //save pointer to current object
lea esi, temp //get pointer to temp object
mov eax, ([ecx]).x  //move x to eax
cdq //sign-extend eax into edx:eax (idiv divides edx:eax)
idiv ebx //x / rhs
mov ([esi]).x, eax  //save value in temp.x
mov eax, ([ecx]).y  //move y to eax
cdq
idiv ebx //y / rhs
mov ([esi]).y, eax  //save value in temp.y
mov eax, ([ecx]).z  //move z to eax
cdq
idiv ebx //z / rhs
mov ([esi]).z, eax  //save value in temp.z
exit:
}
return temp;
}
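
A note on the idiv steps in the listing above: 32-bit idiv divides the 64-bit value held in edx:eax, so whenever a component can be negative, edx needs to hold the sign extension of eax (which is what cdq produces) rather than plain zero. The following is only a rough C++ model of that behaviour; model_idiv and model_cdq are hypothetical helper names made up for this sketch, not real APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the 32-bit idiv instruction: the dividend is the
// 64-bit value edx:eax. Returns false where the CPU would raise a divide
// error (#DE): the divisor is zero or the quotient does not fit in 32 bits.
bool model_idiv(uint32_t edx, uint32_t eax, int32_t divisor, int32_t& quotient) {
    if (divisor == 0) return false;
    const uint64_t bits = (static_cast<uint64_t>(edx) << 32) | eax;
    const int64_t dividend = static_cast<int64_t>(bits);
    if (dividend == INT64_MIN && divisor == -1) return false;  // quotient cannot fit
    const int64_t q = dividend / divisor;  // truncates toward zero, like idiv
    if (q < INT32_MIN || q > INT32_MAX) return false;
    quotient = static_cast<int32_t>(q);
    return true;
}

// Hypothetical model of cdq: edx becomes all ones if eax is negative, else 0.
uint32_t model_cdq(int32_t eax) {
    return eax < 0 ? 0xFFFFFFFFu : 0u;
}
```

In this model, -10 / 5 with edx zeroed comes out as 858993457 instead of -2, and -10 / 1 with edx zeroed reports a divide error, while the cdq variant gives the expected results.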

Here is my C++ code

//===========================================
//operator / (int rhs)
//===========================================
Vector3 Vector3::operator /(int rhs)
{
Vector3 temp;
if(rhs != 0)
{
temp.x = x/rhs;
temp.y = y/rhs;
temp.z = z/rhs;
}
return temp;
}


I created a temporary object in both functions, whereas normally in C++ I would just write
return Vector3(x / rhs, y / rhs, z / rhs);

I wrote it this way to keep the two versions as consistent as possible, so I would have an easier time beating it in assembly, if that makes sense  :eek

So what can I do to improve my assembly version? Maybe I am missing something. I think I commented it well enough to follow.

Thanks in advance
Adam


u

The C++ code can easily be inlined by the compiler. With some compilers, an inline asm block kills the optimization of the whole enclosing function.

Even if the compiler gives up on optimizing, you can speed this code up a bit by using the FPU to multiply by the reciprocal of rhs instead of dividing three times:

Vec3 Vec3::operator / (int rhs){
Vec3 temp;
_asm{
cmp rhs,0
je _exit
fld1
fidiv rhs
mov ecx,this
fild dword ptr [ecx]
fild dword ptr [ecx+4]
fild dword ptr [ecx+8]
fmul ST,ST(3)
fistp dword ptr temp.z
fmul ST,ST(2)
fistp temp.y
fmul
fistp temp.x
_exit:
}
return temp;
}
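
One thing to check before relying on the FPU version: fistp rounds according to the current FPU rounding mode (round-to-nearest-even by default), whereas C++ integer division truncates toward zero, so the two can disagree whenever the quotient has a fractional part. The sketch below imitates this in portable C++, using std::lrint as a stand-in for fistp and assuming the default rounding mode is in effect; fpu_style_div is a hypothetical name, and double arithmetic only approximates what the x87 does internally.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the fld1 / fidiv / fmul / fistp sequence:
// multiply by the reciprocal, then round using the current rounding mode
// (round-to-nearest-even unless it has been changed).
int fpu_style_div(int value, int rhs) {
    const double reciprocal = 1.0 / rhs;                      // fld1 ; fidiv rhs
    return static_cast<int>(std::lrint(value * reciprocal));  // fmul ; fistp
}
```

Here 7 / 2 is 3 in plain C++, but fpu_style_div(7, 2) gives 4. If truncating results are required, the rounding mode would have to be switched to truncation first, or a truncating conversion used instead.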


The SSE extensions were created specifically for vector and matrix computations, so that's a further avenue for optimization.
Please use a smaller graphic in your signature.

u

Also, when combining C++ and asm, it's nice to toy around with the asm listing the compiler generates, to see how the objects work. This way, I got rid of the useless calls inside "operator /":

Make an extras.asm, and include it in your C++ project. Paste this code in the asm file:


.386P
.model FLAT
.code
PUBLIC ??KVector3Asm@@QAE?AV0@H@Z ; Vector3Asm::operator/
;---------------------------------------------[
??KVector3Asm@@QAE?AV0@H@Z: ; Vector3Asm::operator/
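mov eax,[esp+4] ; eax = hidden pointer to the returned object; must be set on every path
cmp dword ptr [esp+8],0 ; rhs == 0 ?
je @F
fld1
fidiv dword ptr [esp+8] ; ST = 1/rhs
fild dword ptr [ecx] ; ecx = this: push x
fild dword ptr [ecx+4] ; push y
fild dword ptr [ecx+8] ; push z
fmul ST,ST(3)
fistp dword ptr [eax+8] ; result.z
fmul ST,ST(2)
fistp dword ptr [eax+4] ; result.y
fmul
fistp dword ptr [eax] ; result.x
@@:
ret 8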
;---------------------------------------------/
end



In the Project Settings of extras.asm, set-up a "Custom build":
Quote
Commands:
/masm32/bin/ml.exe /c /coff /Cp /nologo extras.asm
Outputs:
extras.obj


Comment out or remove the C++ body of your "operator /" (but leave the prototype in the class!), then compile.  :toothy
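
The external routine reads x, y and z at offsets 0, 4 and 8 from this, so it depends on the class layout. Below is a minimal sketch of what the remaining declaration might look like, with compile-time checks of those offsets. The member order, the public access and the default constructor are assumptions, since the full class is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>

class Vector3Asm {
public:
    int x, y, z;                        // assumed order: x, y, z
    Vector3Asm() : x(0), y(0), z(0) {}  // assumed default constructor
    Vector3Asm operator/(int rhs);      // prototype only; the body lives in extras.asm
};

// The asm code addresses the members as [ecx], [ecx+4] and [ecx+8].
static_assert(offsetof(Vector3Asm, x) == 0, "x expected at offset 0");
static_assert(offsetof(Vector3Asm, y) == 4, "y expected at offset 4");
static_assert(offsetof(Vector3Asm, z) == 8, "z expected at offset 8");
static_assert(sizeof(Vector3Asm) == 12, "three 32-bit ints expected");
```

If a member is later added or reordered, the static_asserts fail at compile time instead of the asm silently reading the wrong fields.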

adam23

Wow, I just finished testing the first code you posted, and the results are astounding: on my P4 3.0 GHz, the assembly code ran 1235 ms faster over 50 million trials.

I knew some of the commands you used, such as fidiv, fild, and fmul, but I'm not sure what fld1 and fistp are.

I know fld loads a value onto ST(0), but what does the 1 do?

I know that FIST stores an integer, but what does the p mean?

Thanks again for the help. I'm trying to figure out how to add the code externally now. This is probably a dumb question, but how do I include an asm file in C++? Do I link it in the project settings, or do I use #include?

EDIT: I just found that FLD1 pushes 1.0 onto the FPU stack,
and I think FISTP stores an integer and then pops the stack.

Here is how I commented the code

_asm
{
cmp rhs,0 //make sure rhs does not equal 0
je exit //if it does, jump to exit
fld1 //push 1.0 onto the FPU stack
fidiv rhs //ST(0) = 1.0 / rhs
mov ecx,this //ecx points to this
fild dword ptr [ecx] //push x; 1/rhs moves down to ST(1)
fild dword ptr [ecx+4] //push y; x is now ST(1), 1/rhs is ST(2)
fild dword ptr [ecx+8] //push z; 1/rhs is now ST(3)
fmul ST,ST(3) //ST = z * 1/rhs
fistp dword ptr temp.z //store as integer in temp.z and pop
fmul ST,ST(2) //ST = y * 1/rhs
fistp temp.y //store as integer in temp.y and pop
fmul //multiply ST(1) by ST and pop: ST = x * 1/rhs
fistp temp.x //store as integer in temp.x and pop
exit:
}


Adam
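
To follow how the register stack moves through the commented sequence above, it can help to replay it with an ordinary container. This is only a toy sketch: FpuStack and divide_like_fpu are hypothetical names, double stands in for the FPU's internal format, and std::lrint stands in for fistp under the default rounding mode.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical toy model of the x87 register stack: back() plays ST(0).
struct FpuStack {
    std::vector<double> regs;
    double& st(std::size_t i) { return regs[regs.size() - 1 - i]; }
    void push(double v) { regs.push_back(v); }  // fld1 / fild
    int store_int_and_pop() {                   // fistp
        const int r = static_cast<int>(std::lrint(st(0)));
        regs.pop_back();
        return r;
    }
};

// Replays the instruction sequence step by step; out = {x, y, z} results.
void divide_like_fpu(int x, int y, int z, int rhs, int out[3]) {
    FpuStack f;
    f.push(1.0);                     // fld1
    f.st(0) /= rhs;                  // fidiv rhs      -> ST(0) = 1/rhs
    f.push(x);                       // fild [ecx]
    f.push(y);                       // fild [ecx+4]
    f.push(z);                       // fild [ecx+8]   -> ST(0)=z, ST(3)=1/rhs
    f.st(0) *= f.st(3);              // fmul ST,ST(3)
    out[2] = f.store_int_and_pop();  // fistp temp.z
    f.st(0) *= f.st(2);              // fmul ST,ST(2)
    out[1] = f.store_int_and_pop();  // fistp temp.y
    f.st(1) *= f.st(0);              // fmul (no operands): ST(1) *= ST(0)...
    f.regs.pop_back();               // ...then pop
    out[0] = f.store_int_and_pop();  // fistp temp.x, stack is empty again
}
```

Every push is matched by a pop, which mirrors why the real sequence leaves the FPU stack empty when it finishes.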

u

Here's a workspace+project, where the .asm file lives happily next to a .cpp file :).


[attachment deleted by admin]

dsouza123

If your Vector3 really held four dwords (x, y, z and an extra), then you could use SSE2 code.
For example, suppose x, y, z, extra hold 10, 101, 202, 303 and rhs holds 5:


cmp       rhs,   0
je  exit
movd      xmm2,  rhs     ; xmm2 dword 0  gets  rhs  0,0,0,5 <- 5
mov       ebx,   this    ; ebx gets address held in this

pshufd    xmm2,  xmm2, 0 ; xmm2 dwords 3,2,1,0 all get xmm2 dword 0               5,     5,     5,    5  <--    0,    0,    0,   5
movdqu    xmm0,  [ebx]   ; xmm0 dwords 3,2,1,0 get this dwords extra,z,y,x
cvtdq2ps  xmm3,  xmm2    ; xmm3 real4 3,2,1,0 convert xmm2 dwords 3,2,1,0       5.0,   5.0,   5.0,  5.0  <--    5,    5,    5,   5
cvtdq2ps  xmm1,  xmm0    ; xmm1 real4 3,2,1,0 convert xmm0 dwords 3,2,1,0     303.0, 202.0, 101.0, 10.0  <--  303,  202,  101,  10
lea       esi,   temp    ; esi gets address of temp
divps     xmm1,  xmm3    ; xmm1 real4 3,2,1,0 div xmm3 real4 3,2,1,0           60.6,  40.4,  20.2,  2.0  <-- 303.0/5.0, 202.0/5.0, 101.0/5.0, 10.0/5.0
                         ; xmm1 = xmm1 / xmm3    on a real4 pair basis
cvttps2dq xmm0,  xmm1    ; xmm0 dword 3,2,1,0 trunc cvt xmm1 real4 3,2,1,0       60,    40,    20,    2  <-- 60.6, 40.4, 20.2, 2.0
movdqu    [esi], xmm0    ; four dwords of temp extra,z,y,x get xmm0 3,2,1,0


If the four dword vectors are 16 byte aligned the faster movdqa can be used in place of movdqu.

If real4 doesn't provide enough precision, then a version using real8 (double precision) could be written,
and two mulpd by the real8 reciprocal of rhs could be used instead of one divps.
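
For anyone working from C++ rather than raw assembly, the SSE2 sequence above maps fairly directly onto compiler intrinsics from <emmintrin.h>. The following is a sketch only: Vec4i and div_sse2 are hypothetical names, the fourth dword is assumed padding, and unaligned loads and stores are used so that no alignment is assumed.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical four-dword vector (x, y, z plus one extra dword).
struct Vec4i {
    int v[4];
};

// Divides all four dwords by rhs in single precision and truncates the
// results. Exact only while the values fit in a float's 24-bit mantissa.
Vec4i div_sse2(const Vec4i& a, int rhs) {
    Vec4i out = {{0, 0, 0, 0}};
    if (rhs == 0) return out;                              // cmp rhs,0 / je exit
    const __m128i divisor = _mm_set1_epi32(rhs);           // movd + pshufd
    const __m128i ints =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.v));  // movdqu
    const __m128 q = _mm_div_ps(_mm_cvtepi32_ps(ints),     // cvtdq2ps
                                _mm_cvtepi32_ps(divisor)); // cvtdq2ps, divps
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.v),
                     _mm_cvttps_epi32(q));                 // cvttps2dq + movdqu
    return out;
}
```

With the example values 10, 101, 202, 303 and rhs = 5, this yields 2, 20, 40, 60, matching the comments in the assembly listing.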

raymond

If you are interested in using the FPU instructions in your apps and want to learn more about the subject, you may want to look at the following tutorial.

http://www.ray.masmcode.com/fpu.html

Raymond
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

adam23

Quote from: Ultrano on December 06, 2006, 11:18:54 PM
Here's a workspace+project, where the .asm file lives happily next to a .cpp file :).


Thanks for taking the time to upload that, but I don't have Visual C++ 6.0. I've been using VS 2003 Professional, VS 2005 Standard, and Visual C++ Express Edition. When I convert it to any of these formats, I get this error:

error PRJ0019: A tool returned an error code from "Performing Custom Build Step"

Do you know what I need to change to get that to work?

Thanks again for all the help I really appreciate it.  I feel like I've learned a ton since yesterday morning.
Adam

adam23

Quote from: raymond on December 07, 2006, 04:44:30 AM
If you are interested in using the FPU instructions in your apps and want to learn more about the subject, you may want to look at the following tutorial.

http://www.ray.masmcode.com/fpu.html

Raymond


That is a great site, thanks for pointing that out to me.

adam23

Quote from: dsouza123 on December 06, 2006, 11:21:29 PM
If your Vector3 really held four dwords x,y,z and an extra then you could use SSE2 code.
If x,y,z,extra had 10,101,202,303 and rhs had 5.


Removed to save room


If the four dword vectors are 16 byte aligned the faster movdqa can be used in place of movdqu.

If real4 doesn't provide enough precision, then a version using real8 (double precision) could be written,
and two mulpd by the real8 reciprocal of rhs could be used instead of one divps.


Wow, some of those commands look foreign to me; I must not have gotten to those yet. I am going to have to look into SSE2 code, though, because it looks really interesting. I've basically got a decent understanding of all the commands in this PDF:
http://www.jegerlehner.ch/intel/IntelCodeTable.pdf
Thanks for posting the code, because that will give me a starting point for learning it.

u

Quote from: adam23 on December 07, 2006, 01:42:28 PM
When I convert it to any of these formats I get this error.
error PRJ0019: A tool returned an error code from "Performing Custom Build Step"
The problem is that VS cannot find "\masm32\bin\ml.exe". A path that starts with a backslash is resolved against the root of the current drive, so if you have masm32 installed, this error happens when your project isn't on the same HDD partition as the masm32 package. Example: I have masm32 in D:\masm32; if I put this project in "F:\project2", I'd get the same error.
The solution is either to move the project's folder to the drive where masm32 is, or to change the custom-build command from "\masm32\bin\ml ...." to "D:\masm32\bin\ml ...." (if masm32 is in D:\).

I wonder how much faster the SSE implementation is :)

adam23

Thanks, that worked perfectly. I ended up having to search for ml.exe; I found it buried in my Program Files/Visual Studio .Net 2003/Vc7/bin folder. That really helps a lot, and I am going to try the SSE2 version and see what kind of results I get.