Help with this function

adam23 · December 06, 2006, 06:19:38 PM

Okay, I have a class called Vector3Asm with three int members x, y, z that I am trying to implement almost entirely in assembly language.

I have this code running and it produces good results, but it runs about 266 ms slower over 1 million calls than my C++ code.

//====================================================
//operator / (int)
//This will probably be used more than the previous
//one. The function checks that the value is not equal
//to zero.
//=====================================================
Vector3Asm Vector3Asm::operator /(int rhs)
{
	Vector3Asm temp;
	_asm
	{
		mov ebx, rhs		//ebx = rhs
		cmp ebx, 0
		je exit			//Check for divide by zero

		mov ecx, this		//save pointer to current object
		lea esi, temp		//get pointer to temp object
		mov eax, ([ecx]).x  //load x
		cdq			//sign-extend eax into edx:eax (idiv is signed)
		idiv ebx		//x / rhs
		mov ([esi]).x, eax  //save value in temp.x
		mov eax, ([ecx]).y  //move y to eax
		cdq
		idiv ebx		//y / rhs
		mov ([esi]).y, eax  //save value in temp.y
		mov eax, ([ecx]).z  //move z to eax
		cdq
		idiv ebx		//z / rhs
		mov ([esi]).z, eax	//save value in temp.z
exit:
	}
	return temp;
}
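A note on idiv: it divides the signed 64-bit value EDX:EAX by its operand, so EDX has to hold the sign extension of EAX (which is what cdq produces). Zero in EDX is only correct for non-negative dividends. The sketch below is a hypothetical C++ model of that behaviour, not real CPU code; the names model_idiv, edx_zeroed and edx_cdq are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical model of 32-bit `idiv`: the dividend is the signed 64-bit
// value EDX:EAX. Returns false where the CPU would raise a divide error
// (zero divisor, or a quotient that does not fit in 32 bits).
bool model_idiv(std::uint32_t edx, std::uint32_t eax,
                std::int32_t divisor, std::int32_t& quotient) {
    if (divisor == 0) return false;
    const std::uint64_t bits = (static_cast<std::uint64_t>(edx) << 32) | eax;
    const std::int64_t dividend = static_cast<std::int64_t>(bits);
    if (dividend == INT64_MIN && divisor == -1) return false;  // would overflow in C++ too
    const std::int64_t q = dividend / divisor;
    if (q < INT32_MIN || q > INT32_MAX) return false;
    quotient = static_cast<std::int32_t>(q);
    return true;
}

// EDX as left by `mov edx, 0` (zero-extension of EAX).
std::uint32_t edx_zeroed(std::int32_t) { return 0u; }

// EDX as left by `cdq` (sign-extension of EAX).
std::uint32_t edx_cdq(std::int32_t eax) { return eax < 0 ? 0xFFFFFFFFu : 0u; }
```

In this model, -10 / 5 with a zeroed EDX yields 858993457 instead of -2, and -10 / 1 with a zeroed EDX overflows the 32-bit quotient, which on real hardware is a divide error.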

Here is my C++ code

//===========================================
//operator / (int rhs)
//===========================================
Vector3 Vector3::operator /(int rhs)
{
	Vector3 temp;
	if(rhs != 0)
	{
		temp.x = x/rhs;
		temp.y = y/rhs;
		temp.z = z/rhs;
	}
	return temp;
}

I created a temporary object in both functions, whereas normally in C++ I would just write
return Vector3(x / rhs, y / rhs, z / rhs);

I wrote it this way to keep the two versions as consistent as possible, so I would have an easier time beating it in assembly, if that makes sense.

So what can I do to improve my assembly version? Maybe I am missing something. I think I commented it well enough to follow.

Thanks in advance
Adam
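The thread does not show how the timings were taken. A minimal timing harness might look like the sketch below; Vec3i, divide and time_divide_ms are hypothetical names, and unlike the original operator the result is zero-initialized so the divide-by-zero path returns defined values.

```cpp
#include <chrono>

struct Vec3i { int x, y, z; };  // stand-in for the thread's Vector3

// Same logic as the C++ operator/ above, with members zero-initialized.
Vec3i divide(const Vec3i& v, int rhs) {
    Vec3i r{0, 0, 0};
    if (rhs != 0) { r.x = v.x / rhs; r.y = v.y / rhs; r.z = v.z / rhs; }
    return r;
}

// Runs `divide` `iterations` times and returns the elapsed milliseconds.
// `sink` accumulates the results so the optimizer cannot drop the loop.
double time_divide_ms(long iterations, long long& sink) {
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        volatile int rhs = 5;  // volatile: keeps the divisor from being constant-folded
        const Vec3i r = divide(Vec3i{static_cast<int>(i), 101, 202}, rhs);
        sink += r.x + r.y + r.z;
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Timing the asm and C++ versions with the same loop, in a release build, and repeating the run a few times makes a difference of a few hundred milliseconds easier to trust.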

u · December 06, 2006, 07:59:49 PM

The C++ code could be easily inlined, without shocking the compiler. In some compilers, inlining asm kills the optimization of the whole function.

Even if the compiler chokes, you can optimize this code a bit, using the fpu:

Vec3 Vec3::operator / (int rhs){
	Vec3 temp;
	_asm{
		cmp rhs,0
		je _exit
		fld1
		fidiv rhs
		mov ecx,this
		fild dword ptr [ecx]
		fild dword ptr [ecx+4]
		fild dword ptr [ecx+8]
		fmul ST,ST(3)
		fistp dword ptr temp.z
		fmul ST,ST(2)
		fistp temp.y
		fmul
		fistp temp.x
_exit:
	}
	return temp;
}

The SSE extension was created specifically to compute vectors and matrices, so that's a further optimization pointer.
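One caveat with the FPU version: under the default x87 control word, fistp rounds to nearest (ties to even), while C++ integer division truncates toward zero, so the two can disagree whenever the quotient is not a whole number. The sketch below models that difference in plain C++. It uses double rather than the FPU's extended precision and assumes the default rounding mode; the function names are hypothetical.

```cpp
#include <cmath>

// Models the FPU path: multiply by the reciprocal, then round the way
// fistp does under the default control word (round to nearest, ties to even).
// std::nearbyint follows the current rounding mode, assumed to be the default.
int div_via_reciprocal(int value, int rhs) {
    const double reciprocal = 1.0 / rhs;  // caller guarantees rhs != 0
    return static_cast<int>(std::nearbyint(value * reciprocal));
}

// Plain C++ integer division truncates toward zero.
int div_truncating(int value, int rhs) { return value / rhs; }
```

For example, 7 / 2 gives 4 through the reciprocal path but 3 with C++ division. If the results must match the C++ operator exactly, the rounding mode would have to be switched to truncation before the fistp instructions.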

u · December 06, 2006, 08:57:55 PM

Also, when combining C++ and asm, it's nice to toy around with the asm-listing, to see how the objects work. This way, I got rid of useless calls inside the "operator /" :

Make an extras.asm, and include it in your C++ project. Paste this code in the asm file:

.386P
.model FLAT
.code
PUBLIC	??KVector3Asm@@QAE?AV0@H@Z				; Vector3Asm::operator/
;---------------------------------------------[
??KVector3Asm@@QAE?AV0@H@Z: ; Vector3Asm::operator/
mov eax,[esp+4]			; eax = pointer to the returned object (must be set on every path)
cmp dword ptr[esp+8],0		; rhs == 0 ?
je @F
fld1
fidiv dword ptr [esp+8]		; ST(0) = 1/rhs
fild dword ptr [ecx]		; ecx = this: load x
fild dword ptr [ecx+4]		; load y
fild dword ptr [ecx+8]		; load z
fmul ST,ST(3)
fistp dword ptr [eax+8]		; result.z
fmul ST,ST(2)
fistp dword ptr [eax+4]		; result.y
fmul
fistp dword ptr [eax]		; result.x
@@:
ret 8
;---------------------------------------------/
end

In the Project Settings of extras.asm, set-up a "Custom build":

Quote
Commands:
\masm32\bin\ml.exe /c /coff /Cp /nologo extras.asm
Outputs:
extras.obj

Comment out or remove the C++ definition of "operator /" (but leave the prototype in the class!), then compile.

adam23 · December 06, 2006, 09:31:06 PM

Wow, I just finished testing the first code you posted and the results are astounding: on my P4 3.0 GHz, the assembly code ran 1235 ms faster over 50 million trials.

I knew some of the commands you used such as fidiv, fild, and fmul, but I'm not sure what fld1, and fistp are.

I know fld loads a value onto ST(0), but what does the 1 do?

I know that FIST stores an integer, but what does the p mean?

Thanks again for the help. I'm trying to figure out how to add the code externally now. This is probably a dumb question, but how do I include an asm file in C++? Do I link it in the project settings, or do I use #include?

EDIT: I just found that FLD1 pushes 1.0 onto the stack,
and I think fistp stores an int and pops.

Here is how I commented the code

	_asm
	{
		cmp rhs,0					//make sure rhs does not equal 0
		je exit						//if it does jump to exit
		fld1						//push 1.0 onto the stack
		fidiv rhs					//divide ST(0) by rhs
		mov ecx,this				//ecx points to this
		fild dword ptr [ecx]		//Load integer onto ST(0)
		fild dword ptr [ecx+4]		//Load integer onto ST(0) pushes others down
		fild dword ptr [ecx+8]		//Load integer onto ST(0) pushes others down
		fmul ST,ST(3)				//ST = z,  ST(3) = 1/rhs	
		fistp dword ptr temp.z		//Store value in temp.z
		fmul ST,ST(2)				//ST = y  ST(2) = 1/rhs
		fistp temp.y				//Store value in temp.y
		fmul						//ST * ST(1) = x * 1/rhs
		fistp temp.x				//Store value and pop
exit:
	}

Adam
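To follow how the values move on the register stack in the commented code above, a much-simplified C++ model can help. Everything here (FpuModel, Vec3i, divide_fpu_model) is a hypothetical sketch: it uses double instead of the FPU's extended precision and assumes the default round-to-nearest mode for fistp.

```cpp
#include <cmath>
#include <vector>

// Hypothetical, much-simplified model of the x87 register stack.
// The back of the vector plays the role of ST(0).
struct FpuModel {
    std::vector<double> st;

    double& ST(int i) { return st[st.size() - 1 - i]; }

    void fld1() { st.push_back(1.0); }                       // push the constant 1.0
    void fild(int v) { st.push_back(static_cast<double>(v)); } // push an integer as floating point
    void fidiv(int v) { ST(0) /= v; }                        // ST(0) = ST(0) / integer
    void fmul_st0(int i) { ST(0) *= ST(i); }                 // fmul ST,ST(i)
    void fmulp() { ST(1) *= ST(0); st.pop_back(); }          // bare fmul: multiply into ST(1), then pop
    int fistp() {                                            // store ST(0) as an integer, then pop
        const int v = static_cast<int>(std::nearbyint(ST(0)));
        st.pop_back();
        return v;
    }
};

struct Vec3i { int x, y, z; };

// Replays the instruction sequence; comments list the stack from ST(0) down.
Vec3i divide_fpu_model(const Vec3i& v, int rhs, FpuModel& fpu) {
    Vec3i r{0, 0, 0};
    if (rhs == 0) return r;
    fpu.fld1();          // 1.0
    fpu.fidiv(rhs);      // 1/rhs
    fpu.fild(v.x);       // x, 1/rhs
    fpu.fild(v.y);       // y, x, 1/rhs
    fpu.fild(v.z);       // z, y, x, 1/rhs
    fpu.fmul_st0(3);     // z/rhs, y, x, 1/rhs
    r.z = fpu.fistp();   // y, x, 1/rhs
    fpu.fmul_st0(2);     // y/rhs, x, 1/rhs
    r.y = fpu.fistp();   // x, 1/rhs
    fpu.fmulp();         // x/rhs
    r.x = fpu.fistp();   // stack is empty again
    return r;
}
```

The point to notice is that every push is matched by a pop (the p in fistp, and the implicit pop of the bare fmul), so the stack is empty again when the function returns.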

u · December 06, 2006, 11:18:54 PM

Here's a workspace+project, where the .asm file lives happily next to a .cpp file :).

[attachment deleted by admin]

dsouza123 · December 06, 2006, 11:21:29 PM

If your Vector3 really held four dwords (x, y, z and an extra), then you could use SSE2 code.
For example, suppose x, y, z, extra hold 10, 101, 202, 303 and rhs holds 5:

cmp       rhs,   0
je  exit
movd      xmm2,  rhs     ; xmm2 dword 0  gets  rhs  0,0,0,5 <- 5
mov       ebx,   this    ; ebx gets address held in this

pshufd    xmm2,  xmm2, 0 ; xmm2 dwords 3,2,1,0 all get xmm2 dword 0               5,     5,     5,    5  <--    0,    0,    0,   5
movdqu    xmm0,  [ebx]   ; xmm0 dwords 3,2,1,0 get this dwords extra,z,y,x 
cvtdq2ps  xmm3,  xmm2    ; xmm3 real4 3,2,1,0 convert xmm2 dwords 3,2,1,0       5.0,   5.0,   5.0,  5.0  <--    5,    5,    5,   5
cvtdq2ps  xmm1,  xmm0    ; xmm1 real4 3,2,1,0 convert xmm0 dwords 3,2,1,0     303.0, 202.0, 101.0, 10.0  <--  303,  202,  101,  10
lea       esi,   temp    ; esi gets address of temp
divps     xmm1,  xmm3    ; xmm1 real4 3,2,1,0 div xmm3 real4 3,2,1,0           60.6,  40.4,  20.2,  2.0  <-- 303.0/5.0, 202.0/5.0, 101.0/5.0, 10.0/5.0
                          ; xmm1 = xmm1 / xmm3    on a real4 pair basis
cvttps2dq xmm0,  xmm1    ; xmm0 dword 3,2,1,0 trunc cvt xmm1 real4 3,2,1,0       60,    40,    20,    2  <-- 60.6, 40.4, 20.2, 2.0
movdqu    [esi], xmm0    ; four dwords of temp extra,z,y,x get xmm0 3,2,1,0

If the four dword vectors are 16 byte aligned the faster movdqa can be used in place of movdqu.

If real4 doesn't provide enough precision then a version using real8 (double precision) could be done
and two mulpd by the real8 reciprocal of rhs could be used instead of one divps.
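The same sequence can be sketched in C++ with SSE2 intrinsics, which avoids inline asm entirely. This is a sketch under assumptions: Vec4i is a hypothetical four-int struct (the thread's Vector3 has only three members), the target is an x86 CPU with SSE2, and unaligned loads and stores are used so no alignment is required.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

struct Vec4i { int x, y, z, extra; };  // hypothetical four-dword vector

Vec4i divide_sse2(const Vec4i& v, int rhs) {
    Vec4i result{0, 0, 0, 0};
    if (rhs == 0) return result;  // same divide-by-zero guard as the asm

    const __m128i divisor_i = _mm_set1_epi32(rhs);  // movd + pshufd: rhs in all four lanes
    const __m128i values_i =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(&v));  // movdqu
    const __m128 divisor_f = _mm_cvtepi32_ps(divisor_i);         // cvtdq2ps
    const __m128 values_f = _mm_cvtepi32_ps(values_i);           // cvtdq2ps
    const __m128 quotient_f = _mm_div_ps(values_f, divisor_f);   // divps
    const __m128i quotient_i = _mm_cvttps_epi32(quotient_f);     // cvttps2dq (truncates)
    _mm_storeu_si128(reinterpret_cast<__m128i*>(&result), quotient_i);  // movdqu
    return result;
}
```

With the values from the example, divide_sse2({10, 101, 202, 303}, 5) gives {2, 20, 40, 60}. Because the conversion goes through single precision, integers with magnitude above 2^24 may not be represented exactly, which is the precision concern mentioned above.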

raymond · December 07, 2006, 04:44:30 AM

If you are interested in using the FPU instructions in your apps and want to learn more about the subject, you may want to look at the following tutorial.

http://www.ray.masmcode.com/fpu.html

Raymond

adam23 · December 07, 2006, 01:42:28 PM

Quote from: Ultrano on December 06, 2006, 11:18:54 PM
Here's a workspace+project, where the .asm file lives happily next to a .cpp file :).

Thanks for taking the time to load that, but I don't have Visual C++ 6.0. I've been using VS 2003 Prof, VS 2005 Standard, or Visual C++ Express Edition. When I convert it to any of these formats I get this error.

error PRJ0019: A tool returned an error code from "Performing Custom Build Step"

Do you know what I need to change to get that to work?

Thanks again for all the help I really appreciate it. I feel like I've learned a ton since yesterday morning.
Adam

adam23 · December 07, 2006, 01:43:38 PM

Quote from: raymond on December 07, 2006, 04:44:30 AM
If you are interested in using the FPU instructions in your apps and want to learn more about the subject, you may want to look at the following tutorial.

http://www.ray.masmcode.com/fpu.html

Raymond

That is a great site, thanks for pointing that out to me.

adam23 · December 07, 2006, 01:48:49 PM

Quote from: dsouza123 on December 06, 2006, 11:21:29 PM
If your Vector3 really held four dwords x,y,z and an extra then you could use SSE2 code.
If x,y,z,extra had 10,101,202,303 and rhs had 5.

Removed to save room

If the four dword vectors are 16 byte aligned the faster movdqa can be used in place of movdqu.

If real4 doesn't provide enough precision then a version using real8 (double precision) could be done
and two mulpd by the real8 reciprocal of rhs could be used instead of one divps.

Wow, some of those commands look foreign to me; I must not have gotten to those yet. I am going to have to look into SSE2 though, because it looks really interesting. I've basically got a decent understanding of all the commands in this PDF:
http://www.jegerlehner.ch/intel/IntelCodeTable.pdf
Thanks for posting the code, because that will give me a starting point for learning it.

u · December 08, 2006, 01:27:07 AM

Quote from: adam23 on December 07, 2006, 01:42:28 PM
When I convert it to any of these formats I get this error.
error PRJ0019: A tool returned an error code from "Performing Custom Build Step"

The problem is that VS cannot find "\masm32\bin\ml.exe". If you have masm32 installed, this error happens when your project isn't on the same drive as the masm32 package: the path has no drive letter, so it is resolved against the drive the project is on. Example: I have masm32 in D:\masm32; if I put this project in "F:\project2", I'd get the same error.
The solution is to either move the project's folder to the drive where masm32 is, or change the custom-build command from "\masm32\bin\ml ...." to "D:\masm32\bin\ml ...." (if masm32 is on D:\).

I wonder how much faster the SSE implementation is :)

adam23 · December 08, 2006, 03:02:27 AM

Thanks, that worked perfectly. I ended up having to search for ml.exe; I found it buried in my Program Files/Visual Studio .Net 2003/Vc7/bin folder. That really helps a lot, and I am going to try the SSE2 version and see what kind of results I get.
