Cycles after cycles

NightWare · July 06, 2009, 09:48:55 PM

Caught up in circles confusion
Is nothing new
Flashback, warm nights
Almost left behind
Suitcases of memories,
Cycles after
Some cycles you picture me
I'm walking too far ahead :eek OOOPS ! i forget myself... (my dark singer side, probably... :wink)

ok, here i'm going to show how to optimize simd code (not entirely, but enough to make you understand how to proceed, in general...). someone asked for a simd tutorial recently; IMHO a tutorial is useless, some examples will probably give better results... someone else was also a bit confused by the masking system, i think... this example is excellent for that... for this mini-tut, i'm going to use a part of some code i've posted here : http://www.masm32.com/board/index.php?topic=9019.0

Code:

;
; and transform to 2D coords
;
		xorps XMM6,XMM6								;; XMM6 = 0,0,0,0
		movss XMM7,DWORD PTR (A_3D_World PTR [esi]).Visual_Limit_Z_Maxi_1	;; XMM7 = 0,0,0,depth
; for the 1st point
		movhlps XMM4,XMM0								;; XMM4 = _,_,0,Z
		movss XMM5,XMM7								;; XMM5 = 0,0,0,depth
		addss XMM5,XMM4								;; XMM5 = _,_,0,Z+depth
		comiss XMM5,XMM6								;; compare Z+depth to 0
		jnc AvoidInvert09								;; if Z+depth is greater or equal  (c=lt and nc=ge with comiss), goto AvoidInvert09
		xorps XMM0,OWORD PTR Simd_Mask_Signs_x_x_1_0			;; otherwise invert signs of X and Y
AvoidInvert09:
		rcpss XMM5,XMM5								;; XMM5 = 0,0,0,1/(Z+depth)
		unpcklps XMM5,XMM5								;; XMM5 = _,_,1/(Z+depth),1/(Z+depth)
		addps XMM0,XMM0								;; XMM0 = _,_,Y*2,X*2
		mulps XMM0,XMM5								;; XMM0 = _,_,Y*2*1/(Z+depth),X*2*1/(Z+depth)
		movlhps XMM0,XMM4								;; XMM0 = _,Z,Y*2*1/(Z+depth),X*2*1/(Z+depth)
; for the 2nd point
		movhlps XMM4,XMM1								;; XMM4 = _,_,0,Z
		movss XMM5,XMM7								;; XMM5 = 0,0,0,depth
		addss XMM5,XMM4								;; XMM5 = _,_,0,Z+depth
		comiss XMM5,XMM6								;; compare Z+depth to 0
		jnc AvoidInvert10								;; if Z+depth is greater or equal  (c=lt and nc=ge with comiss), goto AvoidInvert10
		xorps XMM1,OWORD PTR Simd_Mask_Signs_x_x_1_0			;; otherwise invert signs of X and Y
AvoidInvert10:
		rcpss XMM5,XMM5								;; XMM5 = 0,0,0,1/(Z+depth)
		unpcklps XMM5,XMM5								;; XMM5 = _,_,1/(Z+depth),1/(Z+depth)
		addps XMM1,XMM1								;; XMM1 = _,_,Y*2,X*2
		mulps XMM1,XMM5								;; XMM1 = _,_,Y*2*1/(Z+depth),X*2*1/(Z+depth)
		movlhps XMM1,XMM4								;; XMM1 = _,Z,Y*2*1/(Z+depth),X*2*1/(Z+depth)
; for the 3rd point
		movhlps XMM4,XMM2								;; XMM4 = _,_,0,Z
		movss XMM5,XMM7								;; XMM5 = 0,0,0,depth
		addss XMM5,XMM4								;; XMM5 = _,_,0,Z+depth
		comiss XMM5,XMM6								;; compare Z+depth to 0
		jnc AvoidInvert11								;; if Z+depth is greater or equal  (c=lt and nc=ge with comiss), goto AvoidInvert11
		xorps XMM2,OWORD PTR Simd_Mask_Signs_x_x_1_0			;; otherwise invert signs of X and Y
AvoidInvert11:
		rcpss XMM5,XMM5								;; XMM5 = 0,0,0,1/(Z+depth)
		unpcklps XMM5,XMM5								;; XMM5 = _,_,1/(Z+depth),1/(Z+depth)
		addps XMM2,XMM2								;; XMM2 = _,_,Y*2,X*2
		mulps XMM2,XMM5								;; XMM2 = _,_,Y*2*1/(Z+depth),X*2*1/(Z+depth)
		movlhps XMM2,XMM4								;; XMM2 = _,Z,Y*2*1/(Z+depth),X*2*1/(Z+depth)
; for the 4th point
		movhlps XMM4,XMM3								;; XMM4 = _,_,0,Z
		movss XMM5,XMM7								;; XMM5 = 0,0,0,depth
		addss XMM5,XMM4								;; XMM5 = _,_,0,Z+depth
		comiss XMM5,XMM6								;; compare Z+depth to 0
		jnc AvoidInvert12								;; if Z+depth is greater or equal  (c=lt and nc=ge with comiss), goto AvoidInvert12
		xorps XMM3,OWORD PTR Simd_Mask_Signs_x_x_1_0			;; otherwise invert signs of X and Y
AvoidInvert12:
		rcpss XMM5,XMM5								;; XMM5 = 0,0,0,1/(Z+depth)
		unpcklps XMM5,XMM5								;; XMM5 = _,_,1/(Z+depth),1/(Z+depth)
		addps XMM3,XMM3								;; XMM3 = _,_,Y*2,X*2
		mulps XMM3,XMM5								;; XMM3 = _,_,Y*2*1/(Z+depth),X*2*1/(Z+depth)
		movlhps XMM3,XMM4								;; XMM3 = _,Z,Y*2*1/(Z+depth),X*2*1/(Z+depth)

Result : 43 cycles

this part transforms 3D coords (x3D,y3D,z3D) to 2D coords (x2D,y2D, with z3D preserved). i will show the operations made by the instructions in the comments (movements and operations are the essence of simd). in the comments, the 4 values of a register are listed from the highest to the lowest, and "_" means a value that is not used.
note : unless it's specified in the comment, all values are real4 (dwords), so even if you see me using integer instructions (those starting with a P), they are used on real4 values !
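
for reference, the calc done by this block for one point can be written in plain scalar C. this is only a sketch to make the comments easier to follow, not the code of the 3D core : the names `point2d` and `project_point` are made up for the example, and it uses an exact division where the asm uses the approximate rcpss/rcpps.

```c
/* Hypothetical names, for illustration only. */
typedef struct { float x, y, z; } point2d;   /* z is carried through unchanged */

/* Scalar model of one point: x2D = 2*X/(Z+depth), y2D = 2*Y/(Z+depth),
   with the signs of X and Y inverted when Z+depth is below 0. */
static point2d project_point(float x, float y, float z, float depth)
{
    float zd = z + depth;
    point2d r;

    if (zd < 0.0f) {          /* the branch that the later SIMD versions remove */
        x = -x;
        y = -y;
    }
    r.x = (x * 2.0f) / zd;    /* the asm multiplies by an approximate 1/zd instead */
    r.y = (y * 2.0f) / zd;
    r.z = z;
    return r;
}
```

since the signs of X and Y are inverted exactly when Z+depth is negative, the result is the same as 2*X/|Z+depth| and 2*Y/|Z+depth|.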

Step 1 :
--------

when i see the previous algo, one thing is evident : it sucks. why ? there are 4 conditional jumps (not always taken, so they generate mispredictions, so + 4*50 cycles for the first use).
it also reads memory (2 different data, so + 2*100 cycles). in total, about 400 cycles on the first pass... too much for something that must be fast. we can also see that the simd registers are not always fully used (all the "_"). hmm... very inefficient code... the author is clearly an incompetent... we're going to improve this code a bit :

Code:

;
; and transform to 2D coords
;
		xorps XMM6,XMM6								;; XMM6 = 0,0,0,0
		movq XMM7,QWORD PTR (A_3D_World PTR [esi]).Visual_Limit_Z_Maxi_1	;; XMM7 = 0,0,depth,depth
; for the 1st and 2nd points
		movdqa XMM4,XMM1								;; XMM4 = 0,Z2,Y2,X2 (sse2)
		movhlps XMM4,XMM0								;; XMM4 = 0,Z2,0,Z1
		movlhps XMM0,XMM1								;; XMM0 = Y2,X2,Y1,X1
		shufps XMM4,XMM4,0F8h							;; XMM4 = 0,0,Z2,Z1
		movq XMM5,XMM7									;; XMM5 = 0,0,depth,depth
		addps XMM5,XMM4								;; XMM5 = 0,0,Z2+depth,Z1+depth
		unpcklps XMM5,XMM5								;; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
		movdqa XMM1,XMM5								;; XMM1 = Z2+depth,Z2+depth,Z1+depth,Z1+depth (sse2)
		cmpltps XMM1,XMM6								;; compare Z2+depth,Z2+depth,Z1+depth,Z1+depth if lower than 0,0,0,0
		pslld XMM1,31									;; keep the mask of the signed values in XMM1 (sse2)
		rcpps XMM5,XMM5								;; XMM5 = 1/(Z2+depth),1/(Z2+depth),1/(Z1+depth),1/(Z1+depth)
		addps XMM0,XMM0								;; XMM0 = Y2*2,X2*2,Y1*2,X1*2
		xorps XMM0,XMM1								;; invert the signs of X and Y, depending of the mask in XMM1
		mulps XMM0,XMM5								;; XMM0 = Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth),Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
		movhlps XMM1,XMM0								;; XMM1 = _,_,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
		shufps XMM0,XMM4,0C4h							;; XMM0 = _,Z1,Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
		shufps XMM1,XMM4,0D4h							;; XMM1 = _,Z2,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
; for the 3rd and 4th points
		movdqa XMM4,XMM3								;; XMM4 = 0,Z4,Y4,X4 (sse2)
		movhlps XMM4,XMM2								;; XMM4 = 0,Z4,0,Z3
		movlhps XMM2,XMM3								;; XMM2 = Y4,X4,Y3,X3
		shufps XMM4,XMM4,0F8h							;; XMM4 = 0,0,Z4,Z3
		movq XMM5,XMM7									;; XMM5 = 0,0,depth,depth
		addps XMM5,XMM4								;; XMM5 = 0,0,Z4+depth,Z3+depth
		unpcklps XMM5,XMM5								;; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
		movdqa XMM3,XMM5								;; XMM3 = Z4+depth,Z4+depth,Z3+depth,Z3+depth (sse2)
		cmpltps XMM3,XMM6								;; compare Z4+depth,Z4+depth,Z3+depth,Z3+depth if lower than 0,0,0,0
		pslld XMM3,31									;; keep the mask of the signed values in XMM3 (sse2)
		rcpps XMM5,XMM5								;; XMM5 = 1/(Z4+depth),1/(Z4+depth),1/(Z3+depth),1/(Z3+depth)
		addps XMM2,XMM2								;; XMM2 = Y4*2,X4*2,Y3*2,X3*2
		xorps XMM2,XMM3								;; invert the signs of X and Y, depending of the mask in XMM3
		mulps XMM2,XMM5								;; XMM2 = Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth),Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
		movhlps XMM3,XMM2								;; XMM3 = _,_,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)
		shufps XMM2,XMM4,0C4h							;; XMM2 = _,Z3,Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
		shufps XMM3,XMM4,0D4h							;; XMM3 = _,Z4,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)

Result : 29 cycles

hmm, since speed is essential here, i've decided to use Sse2 instructions when possible... i've also grouped the points by pairs (2 points per register), to avoid useless work by the cpu.

here, to avoid the conditional jumps, we use a mask system. the principle is quite simple : we create a mask that will validate or invalidate an operation, depending on the condition. for example, we're going to add FFFFh to eax, IF eax is signed (negative) :

   mov ecx,00000FFFFh   ; here we define the operation to apply
   xor edx,edx   ; here we clear a register (it will be our mask)
   bt eax,31      ; here we test, for example, the sign of eax, the carry flag will be set if it's the case
   sbb edx,edx   ; here we create the mask (00000000h if eax is not signed, FFFFFFFFh if eax is signed)
   and ecx,edx   ; here we validate/invalidate ecx (the operation), depending of the condition
   add eax,ecx   ; FFFFh has been added to eax if eax was signed

now you know how the cmovCC instruction from intel works, it's the same thing but in hardware (+ support for all the flags). however it's important to know the principle, coz if cmovCC is useful, it's also limited to 1 conditional MOV. and if, in your algo, several operations have to be made depending on a condition, it can be useful to re-use the mask you have defined, to validate/invalidate several operations.
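
the same principle, sketched in C for those who prefer to read it that way (an illustration, not something from the original code : the function names are made up, and the second function only shows the re-use of one mask for two operations).

```c
#include <stdint.h>

/* Branchless "add 0FFFFh to v if v is signed", same idea as the
   bt / sbb / and / add sequence. */
static uint32_t add_if_signed(uint32_t v)
{
    uint32_t mask = (uint32_t)0 - (v >> 31);  /* 00000000h or FFFFFFFFh, like sbb edx,edx */
    return v + (0x0000FFFFu & mask);          /* the operand is validated/invalidated by the mask */
}

/* The same mask re-used for two operations (hypothetical example):
   add 0FFFFh AND flip the lowest bit, both only if v is signed. */
static uint32_t add_and_flip_if_signed(uint32_t v)
{
    uint32_t mask = (uint32_t)0 - (v >> 31);  /* built once from the original value */
    v += 0x0000FFFFu & mask;
    v ^= 0x00000001u & mask;
    return v;
}
```

a cmovCC could replace the first function, but not the second one without repeating the test, which is the point of keeping the mask around.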

Step 2 :
--------

now that we have eradicated the branches, the biggest part has been done. let's see if we can optimize it a bit :

Code:

;
; and transform to 2D coords (now XMM6 and XMM7 must not be modified, for another optimization)
;
		movq MM6,QWORD PTR (A_3D_World PTR [esi]).Visual_Limit_Z_Maxi_1	;; MM6 = depth,depth (here to only read what's needed)
; for the 1st and 2nd points
		movaps XMM4,XMM0								; XMM4 = 0,Z1,Y1,X1
		punpckhdq XMM4,XMM1								; XMM4 = 0,0,Z2,Z1 (sse2)
		movlhps XMM0,XMM1								; XMM0 = Y2,X2,Y1,X1
		movq2dq XMM5,MM6								; XMM5 = 0,0,depth,depth (preserve XMM7) (sse2)
		addps XMM5,XMM4								; XMM5 = 0,0,Z2+depth,Z1+depth
		unpcklps XMM5,XMM5								; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
		movaps XMM1,XMM5								; XMM1 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
		psrad XMM1,31									; ) keep the mask of the signed values in XMM1, faster and preserve XMM6 (sse2)
		pslld XMM1,31									; )
		rcpps XMM5,XMM5								; XMM5 = 1/(Z2+depth),1/(Z2+depth),1/(Z1+depth),1/(Z1+depth)
		addps XMM0,XMM0								; XMM0 = Y2*2,X2*2,Y1*2,X1*2
		xorps XMM0,XMM1								; invert the signs of X and Y, depending of the mask in XMM1
		mulps XMM0,XMM5								; XMM0 = Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth),Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
		movhlps XMM1,XMM0								; XMM1 = _,_,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
		shufps XMM0,XMM4,0C4h							; XMM0 = _,Z1,Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
		shufps XMM1,XMM4,0D4h							; XMM1 = _,Z2,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
; for the 3rd and 4th points
		movaps XMM4,XMM2								; XMM4 = 0,Z3,Y3,X3
		punpckhdq XMM4,XMM3								; XMM4 = 0,0,Z4,Z3 (sse2)
		movlhps XMM2,XMM3								; XMM2 = Y4,X4,Y3,X3
		movq2dq XMM5,MM6								; XMM5 = 0,0,depth,depth (preserve XMM7) (sse2)
		addps XMM5,XMM4								; XMM5 = 0,0,Z4+depth,Z3+depth
		unpcklps XMM5,XMM5								; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
		movaps XMM3,XMM5								; XMM3 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
		psrad XMM3,31									; ) keep the mask of the signed values in XMM3, faster and preserve XMM6 (sse2)
		pslld XMM3,31									; )
		rcpps XMM5,XMM5								; XMM5 = 1/(Z4+depth),1/(Z4+depth),1/(Z3+depth),1/(Z3+depth)
		addps XMM2,XMM2								; XMM2 = Y4*2,X4*2,Y3*2,X3*2
		xorps XMM2,XMM3								; invert the signs of X and Y, depending of the mask in XMM3
		mulps XMM2,XMM5								; XMM2 = Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth),Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
		movhlps XMM3,XMM2								; XMM3 = _,_,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)
		shufps XMM2,XMM4,0C4h							; XMM2 = _,Z3,Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
		shufps XMM3,XMM4,0D4h							; XMM3 = _,Z4,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)

Result : 22 cycles
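
what changed for the mask between step 1 and step 2 can be modelled on one value in C. a sketch with made-up function names, assuming IEEE real4 values : step 1 compares to 0 and shifts the result, step 2 takes the sign bit directly from the value.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32 bits and back (what the integer SIMD instructions see). */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Step 1 style: compare to 0, then move the all-ones result to the sign position. */
static uint32_t sign_mask_by_compare(float zd)
{
    uint32_t cmp = (zd < 0.0f) ? 0xFFFFFFFFu : 0u;  /* what cmpltps gives for one value */
    return cmp << 31;                                /* pslld 31 */
}

/* Step 2 style: no compare, the sign bit is taken from the value itself. */
static uint32_t sign_mask_by_shift(float zd)
{
    uint32_t all = (float_bits(zd) >> 31) ? 0xFFFFFFFFu : 0u;  /* psrad 31 */
    return all << 31;                                           /* pslld 31 */
}

/* xorps with the mask: inverts the sign of v only when bit 31 of the mask is set. */
static float apply_sign_mask(float v, uint32_t mask)
{
    return bits_float(float_bits(v) ^ mask);
}
```

the two versions give the same mask for every value except -0.0 (and NaNs with the sign bit set), where the sign bit is set but the compare is false.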

Step 3 :
--------

hmm, i don't see any possible optimization... but... is there something new in the simd instructions ? Sse3 ?

Code:

;
; and transform to 2D coords (now XMM6 and XMM7 must not be modified, for another optimization)
;
		movq MM6,QWORD PTR (A_3D_World PTR [esi]).Visual_Limit_Z_Maxi_1	;; MM6 = depth,depth (here the zeros for the high part will be automatically generated later by the instruction movq2dq)
; for the 1st and 2nd points
		movq2dq XMM5,MM6								; XMM5 = 0,0,depth,depth (sse2)
		movaps XMM4,XMM0								; XMM4 = 0,Z1,Y1,X1
		movlhps XMM0,XMM1								; XMM0 = Y2,X2,Y1,X1
		movhlps XMM1,XMM4								; XMM1 = 0,Z2,0,Z1
		movlhps XMM5,XMM5								; XMM5 = depth,depth,depth,depth
		addps XMM5,XMM1								; XMM5 = _,Z2+depth,_,Z1+depth
;		shufps XMM5,XMM5,0A0h							; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
;		pshufd XMM5,XMM5,0A0h							; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth (sse2)
		movsldup XMM5,XMM5								; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth (sse3)
		movaps XMM4,XMM5								; XMM4 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
		rcpps XMM5,XMM5								; XMM5 = 1/(Z2+depth),1/(Z2+depth),1/(Z1+depth),1/(Z1+depth)
		psrad XMM4,31									; ) keep the mask of the signed values in XMM4 (sse2)
		pslld XMM4,31									; )
		addps XMM0,XMM0								; XMM0 = Y2*2,X2*2,Y1*2,X1*2
		mulps XMM0,XMM5								; XMM0 = Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth),Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
		xorps XMM0,XMM4								; invert the signs of X and Y, depending of the mask in XMM4
		movhlps XMM4,XMM0								; XMM4 = _,_,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
		movlhps XMM0,XMM1								; XMM0 = _,Z1,Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
		movsd XMM1,XMM4								; XMM1 = _,Z2,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth) (sse2)
; for the 3rd and 4th points
		movq2dq XMM5,MM6								; XMM5 = 0,0,depth,depth (sse2)
		movaps XMM4,XMM2								; XMM4 = 0,Z3,Y3,X3
		movlhps XMM2,XMM3								; XMM2 = Y4,X4,Y3,X3
		movhlps XMM3,XMM4								; XMM3 = 0,Z4,0,Z3
		movlhps XMM5,XMM5								; XMM5 = depth,depth,depth,depth
		addps XMM5,XMM3								; XMM5 = _,Z4+depth,_,Z3+depth
;		shufps XMM5,XMM5,0A0h							; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
;		pshufd XMM5,XMM5,0A0h							; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth (sse2)
		movsldup XMM5,XMM5								; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth (sse3)
		movaps XMM4,XMM5								; XMM4 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
		rcpps XMM5,XMM5								; XMM5 = 1/(Z4+depth),1/(Z4+depth),1/(Z3+depth),1/(Z3+depth)
		psrad XMM4,31									; ) keep the mask of the signed values in XMM4 (sse2)
		pslld XMM4,31									; )
		addps XMM2,XMM2								; XMM2 = Y4*2,X4*2,Y3*2,X3*2
		mulps XMM2,XMM5								; XMM2 = Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth),Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
		xorps XMM2,XMM4								; invert the signs of X and Y, depending of the mask in XMM4
		movhlps XMM4,XMM2								; XMM4 = _,_,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)
		movlhps XMM2,XMM3								; XMM2 = _,Z3,Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
		movsd XMM3,XMM4								; XMM3 = _,Z4,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth) (sse2)

Result : 14 cycles
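
to check the shuffle constants used in these steps, the lane selection of shufps and movsldup can be modelled in a few lines of C. a sketch only, with made-up names ; lanes are numbered 0 to 3 from the lowest to the highest, while the asm comments list them from the highest to the lowest.

```c
/* Hypothetical model of a 4 x real4 register; lane[0] is the lowest value. */
typedef struct { float lane[4]; } vec4;

static vec4 make_vec4(float l0, float l1, float l2, float l3)
{
    vec4 v = { { l0, l1, l2, l3 } };
    return v;
}

static int vec4_equal(vec4 a, vec4 b)
{
    return a.lane[0] == b.lane[0] && a.lane[1] == b.lane[1] &&
           a.lane[2] == b.lane[2] && a.lane[3] == b.lane[3];
}

/* shufps dst,src,imm8: the two low result lanes are picked from dst,
   the two high result lanes from src, 2 bits of imm8 per lane. */
static vec4 shufps_model(vec4 dst, vec4 src, unsigned imm8)
{
    vec4 r;
    r.lane[0] = dst.lane[(imm8 >> 0) & 3];
    r.lane[1] = dst.lane[(imm8 >> 2) & 3];
    r.lane[2] = src.lane[(imm8 >> 4) & 3];
    r.lane[3] = src.lane[(imm8 >> 6) & 3];
    return r;
}

/* movsldup dst,src: lanes 0 and 2 of src are each duplicated. */
static vec4 movsldup_model(vec4 src)
{
    return make_vec4(src.lane[0], src.lane[0], src.lane[2], src.lane[2]);
}
```

with this model, shufps with 0A0h on a register with itself gives the same lanes as movsldup, which is why the commented shufps/pshufd lines and the movsldup line are interchangeable here. the 0F8h of step 1 can be checked the same way : 0,Z2,0,Z1 becomes 0,0,Z2,Z1.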

Step 4 :
--------

hmm, not bad, but when i see the code, there is something that continues to annoy me : the right shift of 31 bits, followed by the left shift of 31 bits... i know it avoids a memory access, but this work can certainly be reduced in some way... let's try :

Code:

;
; and transform to 2D coords (now XMM6 and XMM7 must not be modified, for another optimization)
;
		movq MM6,QWORD PTR (A_3D_World PTR [esi]).Visual_Limit_Z_Maxi_1	;; MM6 = depth,depth (here the zeros for the high part will be automatically generated later by the instruction movq2dq)
; for the 1st and 2nd points
		movq2dq XMM5,MM6								; XMM5 = 0,0,depth,depth (sse2)
		movaps XMM4,XMM0								; XMM4 = 0,Z1,Y1,X1
		movlhps XMM0,XMM1								; XMM0 = Y2,X2,Y1,X1
		movhlps XMM1,XMM4								; XMM1 = 0,Z2,0,Z1
		movlhps XMM5,XMM5								; XMM5 = depth,depth,depth,depth
		addps XMM5,XMM1								; XMM5 = _,Z2+depth,_,Z1+depth
;		shufps XMM5,XMM5,0A0h							; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
;		pshufd XMM5,XMM5,0A0h							; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth (sse2)
		movsldup XMM5,XMM5								; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth (sse3)
		rcpps XMM5,XMM5								; XMM5 = 1/(Z2+depth),1/(Z2+depth),1/(Z1+depth),1/(Z1+depth)
		pcmpeqd XMM4,XMM4								; XMM4 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh (sse2)
		pslld XMM4,31									; XMM4 = 080000000h,080000000h,080000000h,080000000h (sse2)
		andps XMM4,XMM5								; keep the mask of the signed values in XMM4
		addps XMM0,XMM0								; XMM0 = Y2*2,X2*2,Y1*2,X1*2
		mulps XMM0,XMM5								; XMM0 = Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth),Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
		xorps XMM0,XMM4								; invert the signs of X and Y, depending of the mask in XMM4
		movhlps XMM4,XMM0								; XMM4 = _,_,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
		movlhps XMM0,XMM1								; XMM0 = _,Z1,Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
		movsd XMM1,XMM4								; XMM1 = _,Z2,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth) (sse2)
; for the 3rd and 4th points
		movq2dq XMM5,MM6								; XMM5 = 0,0,depth,depth (sse2)
		movaps XMM4,XMM2								; XMM4 = 0,Z3,Y3,X3
		movlhps XMM2,XMM3								; XMM2 = Y4,X4,Y3,X3
		movhlps XMM3,XMM4								; XMM3 = 0,Z4,0,Z3
		movlhps XMM5,XMM5								; XMM5 = depth,depth,depth,depth
		addps XMM5,XMM3								; XMM5 = _,Z4+depth,_,Z3+depth
;		shufps XMM5,XMM5,0A0h							; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
;		pshufd XMM5,XMM5,0A0h							; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth (sse2)
		movsldup XMM5,XMM5								; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth (sse3)
		rcpps XMM5,XMM5								; XMM5 = 1/(Z4+depth),1/(Z4+depth),1/(Z3+depth),1/(Z3+depth)
		pcmpeqd XMM4,XMM4								; XMM4 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh (sse2)
		pslld XMM4,31									; XMM4 = 080000000h,080000000h,080000000h,080000000h (sse2)
		andps XMM4,XMM5								; keep the mask of the signed values in XMM4
		addps XMM2,XMM2								; XMM2 = Y4*2,X4*2,Y3*2,X3*2
		mulps XMM2,XMM5								; XMM2 = Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth),Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
		xorps XMM2,XMM4								; invert the signs of X and Y, depending of the mask in XMM4
		movhlps XMM4,XMM2								; XMM4 = _,_,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)
		movlhps XMM2,XMM3								; XMM2 = _,Z3,Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
		movsd XMM3,XMM4								; XMM3 = _,Z4,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth) (sse2)

Result : 12 cycles
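
the mask of step 4 on one value, as a C sketch (made-up names, exact division in place of rcpps) : the constant 80000000h is built without memory access, the sign is read from the reciprocal, and the xor is applied after the multiply.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32 bits and back (what the integer SIMD instructions see). */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* pcmpeqd + pslld 31: build 80000000h from nothing. */
static uint32_t sign_bit_constant(void)
{
    uint32_t all_ones = 0xFFFFFFFFu;  /* pcmpeqd x,x */
    return all_ones << 31;            /* pslld x,31 */
}

/* One coordinate (X or Y) of one point, in the order used by step 4. */
static float project_coord(float v, float zd)
{
    float rcp = 1.0f / zd;                                  /* stands in for rcpps */
    uint32_t mask = float_bits(rcp) & sign_bit_constant();  /* andps: sign of 1/zd */
    float r = (v + v) * rcp;                                /* addps, mulps */
    return bits_float(float_bits(r) ^ mask);                /* xorps, after the multiply */
}
```

it works because 1/(Z+depth) has the same sign as Z+depth, and because inverting a sign before or after a multiply gives the same result.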

Farabi · July 08, 2009, 07:09:39 AM

:clap: Amazing.
So this function is to plot a pixel from 3D coordinate to 2D coordinate?
*ahem*, can you make it as a procedure like this?

Code:


3DTo2D proc x:dword,y:dword,z:dword
 ; Your Code here

 ret
3DTo2D endp

Where the return value is edx for the x and eax for the y, or maybe you can fill a structure as a return value. :green I'd like to play with 3D but it's too hard for me.

Farabi · July 08, 2009, 07:18:57 AM

I saw your 3D core code. Not bad, it's 42 FPS on my pentium dual core with 46% CPU usage. What if you use a texture on it? I'd like to know the speed.

mitchi · July 08, 2009, 07:19:33 AM

You're a machine NightWare :cheekygreen:

NightWare · July 09, 2009, 01:24:01 AM

Quote from: Farabi on July 08, 2009, 07:09:39 AM
So this function is to plot a pixel from 3D coordinate to 2D coordinate?
*ahem*, can you make it as a procedure like this?
Code:
3DTo2D proc x:dword,y:dword,z:dword
 ; Your Code here
 ret
3DTo2D endp

hi, not exactly, it transforms 4*3D coords to 4*2D coords (for X and Y, the only useful ones).
this part of the code takes place in a process (so making a procedure of it isn't useful without the other parts of the code) : the 3D coords must be in XMM0,XMM1,XMM2 and XMM3 (in the form _,Z,Y,X), just after you have calculated the 3D coords with a global rotation matrix. here the calc converts the 3D coords to a 2D representation (you just need to add the X,Y half screen positions afterwards to obtain the 2D coords), ready to convert to integer.
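
the last step described above, as a small C sketch. `pixel`, `to_screen`, `half_w` and `half_h` are names invented for the example, and the cast truncates toward zero ; the rounding actually used by the 3D core is not shown in this thread.

```c
/* Hypothetical helper: half_w and half_h are the half screen sizes. */
typedef struct { int x, y; } pixel;

static pixel to_screen(float x2d, float y2d, float half_w, float half_h)
{
    pixel p;
    p.x = (int)(x2d + half_w);   /* add the half screen position, then convert to integer */
    p.y = (int)(y2d + half_h);
    return p;
}
```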

converting 3D to 2D coords is generally an easy calculation. here the code doesn't only do that, but also a specific work, to avoid calculating a tangent (very slow) for the points behind the view (you must calc some of them, because of their links with the points on the screen). so it's only useful in THIS context.

now concerning texturing, the 3D core posted is limited to the point calcs. and at this point a choice has to be made : you can do the stuff by yourself (slow, coz it only uses the cpu), or use directx/opengl (fast, coz cpu+gpu); the speed depends on your choice. but in both cases you need to know perfectly what you are doing.
