Cycles after cycles

Started by NightWare, July 06, 2009, 09:48:55 PM

NightWare

Caught up in circles confusion
Is nothing new
Flashback, warm nights
Almost left behind
Suitcases of memories,
Cycles after
Some cycles you picture me
I'm walking too far ahead   :eek OOOPS ! i forgot myself... (my dark singer side, probably... :wink)

ok, here i'm going to show how to optimize simd code (not exhaustively, but enough for you to understand how to proceed in general...). someone asked for a simd tutorial recently; IMHO a tutorial is useless, a few examples will probably give better results... someone else was also a bit confused by the masking system, i think, and this example is excellent for that... for this mini-tut, i'm going to use part of some code i posted here : http://www.masm32.com/board/index.php?topic=9019.0

;
; and transform to 2D coords
;
xorps XMM6,XMM6 ;; XMM6 = 0,0,0,0
movss XMM7,DWORD PTR (A_3D_World PTR [esi]).Visual_Limit_Z_Maxi_1 ;; XMM7 = 0,0,0,depth
; for the 1st point
movhlps XMM4,XMM0 ;; XMM4 = _,_,0,Z
movss XMM5,XMM7 ;; XMM5 = 0,0,0,depth
addss XMM5,XMM4 ;; XMM5 = _,_,0,Z+depth
comiss XMM5,XMM6 ;; compare Z+depth to 0
jnc AvoidInvert09 ;; if Z+depth is greater or equal  (c=lt and nc=ge with comiss), goto AvoidInvert09
xorps XMM0,OWORD PTR Simd_Mask_Signs_x_x_1_0 ;; otherwise invert signs of X and Y
AvoidInvert09:
rcpss XMM5,XMM5 ;; XMM5 = 0,0,0,1/Z+depth
unpcklps XMM5,XMM5 ;; XMM5 = _,_,1/Z+depth,1/Z+depth
addps XMM0,XMM0 ;; XMM0 = _,_,Y*2,X*2
mulps XMM0,XMM5 ;; XMM0 = _,_,Y*2*1/(Z+depth),X*2*1/(Z+depth)
movlhps XMM0,XMM4 ;; XMM0 = _,Z,Y*2*1/(Z+depth),X*2*1/(Z+depth)
; for the 2nd point
movhlps XMM4,XMM1 ;; XMM4 = _,_,0,Z
movss XMM5,XMM7 ;; XMM5 = 0,0,0,depth
addss XMM5,XMM4 ;; XMM5 = _,_,0,Z+depth
comiss XMM5,XMM6 ;; compare Z+depth to 0
jnc AvoidInvert10 ;; if Z+depth is greater or equal  (c=lt and nc=ge with comiss), goto AvoidInvert10
xorps XMM1,OWORD PTR Simd_Mask_Signs_x_x_1_0 ;; otherwise invert signs of X and Y
AvoidInvert10:
rcpss XMM5,XMM5 ;; XMM5 = 0,0,0,1/Z+depth
unpcklps XMM5,XMM5 ;; XMM5 = _,_,1/Z+depth,1/Z+depth
addps XMM1,XMM1 ;; XMM1 = _,_,Y*2,X*2
mulps XMM1,XMM5 ;; XMM1 = _,_,Y*2*1/(Z+depth),X*2*1/(Z+depth)
movlhps XMM1,XMM4 ;; XMM1 = _,Z,Y*2*1/(Z+depth),X*2*1/(Z+depth)
; for the 3rd point
movhlps XMM4,XMM2 ;; XMM4 = _,_,0,Z
movss XMM5,XMM7 ;; XMM5 = 0,0,0,depth
addss XMM5,XMM4 ;; XMM5 = _,_,0,Z+depth
comiss XMM5,XMM6 ;; compare Z+depth to 0
jnc AvoidInvert11 ;; if Z+depth is greater or equal  (c=lt and nc=ge with comiss), goto AvoidInvert11
xorps XMM2,OWORD PTR Simd_Mask_Signs_x_x_1_0 ;; otherwise invert signs of X and Y
AvoidInvert11:
rcpss XMM5,XMM5 ;; XMM5 = 0,0,0,1/Z+depth
unpcklps XMM5,XMM5 ;; XMM5 = _,_,1/Z+depth,1/Z+depth
addps XMM2,XMM2 ;; XMM2 = _,_,Y*2,X*2
mulps XMM2,XMM5 ;; XMM2 = _,_,Y*2*1/(Z+depth),X*2*1/(Z+depth)
movlhps XMM2,XMM4 ;; XMM2 = _,Z,Y*2*1/(Z+depth),X*2*1/(Z+depth)
; for the 4th point
movhlps XMM4,XMM3 ;; XMM4 = _,_,0,Z
movss XMM5,XMM7 ;; XMM5 = 0,0,0,depth
addss XMM5,XMM4 ;; XMM5 = _,_,0,Z+depth
comiss XMM5,XMM6 ;; compare Z+depth to 0
jnc AvoidInvert12 ;; if Z+depth is greater or equal  (c=lt and nc=ge with comiss), goto AvoidInvert12
xorps XMM3,OWORD PTR Simd_Mask_Signs_x_x_1_0 ;; otherwise invert signs of X and Y
AvoidInvert12:
rcpss XMM5,XMM5 ;; XMM5 = 0,0,0,1/Z+depth
unpcklps XMM5,XMM5 ;; XMM5 = _,_,1/Z+depth,1/Z+depth
addps XMM3,XMM3 ;; XMM3 = _,_,Y*2,X*2
mulps XMM3,XMM5 ;; XMM3 = _,_,Y*2*1/(Z+depth),X*2*1/(Z+depth)
movlhps XMM3,XMM4 ;; XMM3 = _,Z,Y*2*1/(Z+depth),X*2*1/(Z+depth)

Result : 43 cycles

this part transforms 3D coords (x3D,y3D,z3D) into 2D coords (x2D,y2D, with z3D preserved). i will show the operations performed by the instructions in the comments (data movement and operations are the essence of simd).
note : unless specified otherwise in a comment, all values are real4 (dwords), so even when you see me using integer instructions (those starting with a P), they are applied to real4 values !
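
For readers who find the register comments hard to follow, here is a minimal scalar sketch in C (not MASM, so it compiles anywhere) of what the first listing appears to compute for one point. `Point3` and `project_point` are hypothetical names, and the exact division stands in for the approximate `rcpss` reciprocal, so the results will differ slightly from the assembly.

```c
#include <assert.h>

/* One point, matching the _,Z,Y,X register layout (hypothetical type name). */
typedef struct { float x, y, z; } Point3;

/* Scalar reference for one point:
     x2D = 2*X / (Z+depth),  y2D = 2*Y / (Z+depth)
   with the signs of X and Y inverted first when Z+depth is negative.
   Z is passed through unchanged. */
static Point3 project_point(Point3 p, float depth)
{
    float d = p.z + depth;
    assert(d != 0.0f);          /* assumption: the caller never hits Z+depth == 0 */
    if (d < 0.0f) {             /* this is the conditional jump in the assembly */
        p.x = -p.x;
        p.y = -p.y;
    }
    Point3 out = { 2.0f * p.x / d, 2.0f * p.y / d, p.z };
    return out;
}
```

Note that flipping the signs before dividing by a negative value amounts to dividing by the absolute value of Z+depth.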

Step 1 :
--------

when i look at the previous algo, one thing is obvious: it sucks. why ? there are 4 conditional jumps (not always taken, so they generate mispredictions, so + 4*50 cycles on the first use).
it also reads memory (2 different data, so + 2*100 cycles). in total 400 cycles on the first pass... too much for something that must be fast. we can also see that the simd registers are not always fully used (all the "_"). hmm... very inefficient code... the author is clearly incompetent... we're going to improve this code a bit :

;
; and transform to 2D coords
;
xorps XMM6,XMM6 ;; XMM6 = 0,0,0,0
movq XMM7,QWORD PTR (A_3D_World PTR [esi]).Visual_Limit_Z_Maxi_1 ;; XMM7 = 0,0,depth,depth
; for the 1st and 2nd points
movdqa XMM4,XMM1 ;; XMM4 = 0,Z2,Y2,X2 (sse2)
movhlps XMM4,XMM0 ;; XMM4 = 0,Z2,0,Z1
movlhps XMM0,XMM1 ;; XMM0 = Y2,X2,Y1,X1
shufps XMM4,XMM4,0F8h ;; XMM4 = 0,0,Z2,Z1
movq XMM5,XMM7 ;; XMM5 = 0,0,depth,depth
addps XMM5,XMM4 ;; XMM5 = 0,0,Z2+depth,Z1+depth
unpcklps XMM5,XMM5 ;; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
movdqa XMM1,XMM5 ;; XMM1 = Z2+depth,Z2+depth,Z1+depth,Z1+depth (sse2)
cmpltps XMM1,XMM6 ;; compare Z2+depth,Z2+depth,Z1+depth,Z1+depth if lower than 0,0,0,0
pslld XMM1,31 ;; keep the mask of the signed values in XMM1 (sse2)
rcpps XMM5,XMM5 ;; XMM5 = 1/Z2+depth,1/Z2+depth,1/Z1+depth,1/Z1+depth
addps XMM0,XMM0 ;; XMM0 = Y2*2,X2*2,Y1*2,X1*2
xorps XMM0,XMM1 ;; invert the signs of X and Y, depending of the mask in XMM1
mulps XMM0,XMM5 ;; XMM0 = Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth),Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
movhlps XMM1,XMM0 ;; XMM1 = _,_,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
shufps XMM0,XMM4,0C4h ;; XMM0 = _,Z1,Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
shufps XMM1,XMM4,0D4h ;; XMM1 = _,Z2,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
; for the 3rd and 4th points
movdqa XMM4,XMM3 ;; XMM4 = 0,Z4,Y4,X4 (sse2)
movhlps XMM4,XMM2 ;; XMM4 = 0,Z4,0,Z3
movlhps XMM2,XMM3 ;; XMM2 = Y4,X4,Y3,X3
shufps XMM4,XMM4,0F8h ;; XMM4 = 0,0,Z4,Z3
movq XMM5,XMM7 ;; XMM5 = 0,0,depth,depth
addps XMM5,XMM4 ;; XMM5 = 0,0,Z4+depth,Z3+depth
unpcklps XMM5,XMM5 ;; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
movdqa XMM3,XMM5 ;; XMM3 = Z4+depth,Z4+depth,Z3+depth,Z3+depth (sse2)
cmpltps XMM3,XMM6 ;; compare Z4+depth,Z4+depth,Z3+depth,Z3+depth if lower than 0,0,0,0
pslld XMM3,31 ;; keep the mask of the signed values in XMM3 (sse2)
rcpps XMM5,XMM5 ;; XMM5 = 1/Z4+depth,1/Z4+depth,1/Z3+depth,1/Z3+depth
addps XMM2,XMM2 ;; XMM2 = Y4*2,X4*2,Y3*2,X3*2
xorps XMM2,XMM3 ;; invert the signs of X and Y, depending of the mask in XMM3
mulps XMM2,XMM5 ;; XMM2 = Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth),Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
movhlps XMM3,XMM2 ;; XMM3 = _,_,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)
shufps XMM2,XMM4,0C4h ;; XMM2 = _,Z3,Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
shufps XMM3,XMM4,0D4h ;; XMM3 = _,Z4,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)

Result : 29 cycles
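
The `shufps` immediates used above (0F8h, 0C4h, 0D4h) are easy to get wrong, so here is a small C model of the lane selection that can be used to check them. It is only a sketch: `Vec4` and `shufps_model` are hypothetical names, and `lane[0]` is the lowest lane, i.e. the rightmost value in the comments.

```c
#include <assert.h>

/* Four 32-bit lanes; lane[0] is the lowest one (rightmost in the comments). */
typedef struct { float lane[4]; } Vec4;

/* Model of "shufps dst,src,imm8": the two low result lanes are picked from
   dst, the two high result lanes from src, two selector bits per lane. */
static Vec4 shufps_model(Vec4 dst, Vec4 src, unsigned imm8)
{
    assert(imm8 <= 0xFFu);
    Vec4 r;
    r.lane[0] = dst.lane[(imm8 >> 0) & 3];
    r.lane[1] = dst.lane[(imm8 >> 2) & 3];
    r.lane[2] = src.lane[(imm8 >> 4) & 3];
    r.lane[3] = src.lane[(imm8 >> 6) & 3];
    return r;
}
```

For example, with XMM4 = 0,Z2,0,Z1 as both operands, the immediate 0F8h selects lanes 0, 2, 3, 3, which gives 0,0,Z2,Z1 as the comment states.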

hmm, since speed is essential here, i've decided to use Sse2 instructions where possible... i've also processed the points in pairs, to avoid useless work by the cpu.

here, to avoid the conditional jumps, we use a mask system. the principle is quite simple: we create a mask that will validate or invalidate an operation, depending on the condition. for example, we're going to add FFFFh to eax, IF eax is signed (negative) :

   mov ecx,00000FFFFh   ; here we define the operation to apply
   xor edx,edx   ; here we clear a register (it will be our mask)
   bt eax,31      ; here we test, for example, the sign of eax; the carry flag is set if eax is signed
   sbb edx,edx   ; here we create the mask (00000000h if eax is not signed, FFFFFFFFh if eax is signed)
   and ecx,edx   ; here we validate/invalidate ecx (the operation), depending on the condition
   add eax,ecx   ; FFFFh has been added to eax if eax was signed
   
now you know how intel's cmovCC instruction works: it's the same thing, but in hardware (+ support for all flags). however, it's important to know the principle, coz while cmovCC is useful, it's also limited to 1 conditional MOV. if several operations in your algo depend on the same condition, it can be useful to re-use the mask you have defined to validate/invalidate all of them.
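
The same mask principle can be sketched in C, which may help if the flag tricks are unfamiliar. The function names are hypothetical, and the second operation in `two_ops_if_signed` is arbitrary; it is only there to illustrate re-using one mask for several operations.

```c
#include <stdint.h>

/* All ones when bit 31 of x is set, zero otherwise: the value that
   "bt eax,31" followed by "sbb edx,edx" leaves in edx. */
static uint32_t sign_mask(uint32_t x)
{
    return (uint32_t)0 - (x >> 31);
}

/* Same operation as the assembly example: add 0FFFFh only if x is signed. */
static uint32_t add_ffff_if_signed(uint32_t x)
{
    return x + (0x0000FFFFu & sign_mask(x));
}

/* One mask, several conditional operations, without any branch. */
static void two_ops_if_signed(uint32_t x, uint32_t *a, uint32_t *b)
{
    uint32_t m = sign_mask(x);
    *a += 0x0000FFFFu & m;   /* conditional add */
    *b ^= 0x80000000u & m;   /* conditional flip of bit 31 */
}
```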


Step 2 :
--------

now that we have eradicated the branches, the biggest part has been done. let's see if we can optimize it a bit :

;
; and transform to 2D coords (now XMM6 and XMM7 must not be modified, for another optimization)
;
movq MM6,QWORD PTR (A_3D_World PTR [esi]).Visual_Limit_Z_Maxi_1 ;; MM6 = depth,depth (here to only read what's needed)
; for the 1st and 2nd points
movaps XMM4,XMM0 ; XMM4 = 0,Z1,Y1,X1
punpckhdq XMM4,XMM1 ; XMM4 = 0,0,Z2,Z1 (sse2)
movlhps XMM0,XMM1 ; XMM0 = Y2,X2,Y1,X1
movq2dq XMM5,MM6 ; XMM5 = 0,0,depth,depth (preserve XMM7) (sse2)
addps XMM5,XMM4 ; XMM5 = 0,0,Z2+depth,Z1+depth
unpcklps XMM5,XMM5 ; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
movaps XMM1,XMM5 ; XMM1 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
psrad XMM1,31 ; ) keep the mask of the signed values in XMM1, faster and preserve XMM6 (sse2)
pslld XMM1,31 ; )
rcpps XMM5,XMM5 ; XMM5 = 1/Z2+depth,1/Z2+depth,1/Z1+depth,1/Z1+depth
addps XMM0,XMM0 ; XMM0 = Y2*2,X2*2,Y1*2,X1*2
xorps XMM0,XMM1 ; invert the signs of X and Y, depending of the mask in XMM1
mulps XMM0,XMM5 ; XMM0 = Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth),Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
movhlps XMM1,XMM0 ; XMM1 = _,_,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
shufps XMM0,XMM4,0C4h ; XMM0 = _,Z1,Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
shufps XMM1,XMM4,0D4h ; XMM1 = _,Z2,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
; for the 3rd and 4th points
movaps XMM4,XMM2 ; XMM4 = 0,Z3,Y3,X3
punpckhdq XMM4,XMM3 ; XMM4 = 0,0,Z4,Z3 (sse2)
movlhps XMM2,XMM3 ; XMM2 = Y4,X4,Y3,X3
movq2dq XMM5,MM6 ; XMM5 = 0,0,depth,depth (preserve XMM7) (sse2)
addps XMM5,XMM4 ; XMM5 = 0,0,Z4+depth,Z3+depth
unpcklps XMM5,XMM5 ; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
movaps XMM3,XMM5 ; XMM3 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
psrad XMM3,31 ; ) keep the mask of the signed values in XMM3, faster and preserve XMM6 (sse2)
pslld XMM3,31 ; )
rcpps XMM5,XMM5 ; XMM5 = 1/Z4+depth,1/Z4+depth,1/Z3+depth,1/Z3+depth
addps XMM2,XMM2 ; XMM2 = Y4*2,X4*2,Y3*2,X3*2
xorps XMM2,XMM3 ; invert the signs of X and Y, depending of the mask in XMM3
mulps XMM2,XMM5 ; XMM2 = Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth),Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
movhlps XMM3,XMM2 ; XMM3 = _,_,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)
shufps XMM2,XMM4,0C4h ; XMM2 = _,Z3,Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
shufps XMM3,XMM4,0D4h ; XMM3 = _,Z4,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)

Result : 22 cycles
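
To see why the `psrad`/`pslld` pair can replace the compare against zero, here is a C sketch of both mask constructions for a single real4 value. The function names are hypothetical. One difference shows up in the model: the shift version looks only at the sign bit, so it also flags -0.0, which the compare does not.

```c
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a real4 value. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Step 1 style: cmpltps against 0, then pslld 31. */
static uint32_t mask_by_compare(float f)
{
    uint32_t m = (f < 0.0f) ? 0xFFFFFFFFu : 0u;   /* cmpltps */
    return m << 31;                               /* pslld 31 */
}

/* Step 2 style: psrad 31, then pslld 31. */
static uint32_t mask_by_shifts(float f)
{
    /* psrad 31 copies the sign bit into all 32 bits */
    uint32_t m = (float_bits(f) >> 31) ? 0xFFFFFFFFu : 0u;
    return m << 31;                               /* pslld 31 */
}
```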

Step 3 :
--------

hmm, i don't see any other possible optimization... but... is there something new among the simd instructions ? Sse3 ?

;
; and transform to 2D coords (now XMM6 and XMM7 must not be modified, for another optimization)
;
movq MM6,QWORD PTR (A_3D_World PTR [esi]).Visual_Limit_Z_Maxi_1 ;; MM6 = depth,depth (here the zeros for the high part will be automatically generated later by the instruction movq2dq)
; for the 1st and 2nd points
movq2dq XMM5,MM6 ; XMM5 = 0,0,depth,depth (sse2)
movaps XMM4,XMM0 ; XMM4 = 0,Z1,Y1,X1
movlhps XMM0,XMM1 ; XMM0 = Y2,X2,Y1,X1
movhlps XMM1,XMM4 ; XMM1 = 0,Z2,0,Z1
movlhps XMM5,XMM5 ; XMM5 = depth,depth,depth,depth
addps XMM5,XMM1 ; XMM5 = _,Z2+depth,_,Z1+depth
; shufps XMM5,XMM5,0A0h ; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
; pshufd XMM5,XMM5,0A0h ; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth (sse2)
movsldup XMM5,XMM5 ; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth (sse3)
movaps XMM4,XMM5 ; XMM4 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
rcpps XMM5,XMM5 ; XMM5 = 1/Z2+depth,1/Z2+depth,1/Z1+depth,1/Z1+depth
psrad XMM4,31 ; ) keep the mask of the signed values in XMM4 (sse2)
pslld XMM4,31 ; )
addps XMM0,XMM0 ; XMM0 = Y2*2,X2*2,Y1*2,X1*2
mulps XMM0,XMM5 ; XMM0 = Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth),Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
xorps XMM0,XMM4 ; invert the signs of X and Y, depending of the mask in XMM4
movhlps XMM4,XMM0 ; XMM4 = _,_,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
movlhps XMM0,XMM1 ; XMM0 = _,Z1,Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
movsd XMM1,XMM4 ; XMM1 = _,Z2,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth) (sse2)
; for the 3rd and 4th points
movq2dq XMM5,MM6 ; XMM5 = 0,0,depth,depth (sse2)
movaps XMM4,XMM2 ; XMM4 = 0,Z3,Y3,X3
movlhps XMM2,XMM3 ; XMM2 = Y4,X4,Y3,X3
movhlps XMM3,XMM4 ; XMM3 = 0,Z4,0,Z3
movlhps XMM5,XMM5 ; XMM5 = depth,depth,depth,depth
addps XMM5,XMM3 ; XMM5 = _,Z4+depth,_,Z3+depth
; shufps XMM5,XMM5,0A0h ; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
; pshufd XMM5,XMM5,0A0h ; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth (sse2)
movsldup XMM5,XMM5 ; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth (sse3)
movaps XMM4,XMM5 ; XMM4 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
rcpps XMM5,XMM5 ; XMM5 = 1/Z4+depth,1/Z4+depth,1/Z3+depth,1/Z3+depth
psrad XMM4,31 ; ) keep the mask of the signed values in XMM4 (sse2)
pslld XMM4,31 ; )
addps XMM2,XMM2 ; XMM2 = Y4*2,X4*2,Y3*2,X3*2
mulps XMM2,XMM5 ; XMM2 = Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth),Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
xorps XMM2,XMM4 ; invert the signs of X and Y, depending of the mask in XMM4
movhlps XMM4,XMM2 ; XMM4 = _,_,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)
movlhps XMM2,XMM3 ; XMM2 = _,Z3,Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
movsd XMM3,XMM4 ; XMM3 = _,Z4,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth) (sse2)

Result : 14 cycles
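
The three alternatives listed in the code (`shufps` 0A0h, `pshufd` 0A0h and `movsldup`) are meant to produce the same lane layout. A small C model, with hypothetical names and `lane[0]` as the lowest lane, lets you check that claim:

```c
/* Four 32-bit lanes; lane[0] is the lowest one (rightmost in the comments). */
typedef struct { float lane[4]; } Vec4;

/* movsldup: copy the even lanes (0 and 2) over the odd lanes (1 and 3). */
static Vec4 movsldup_model(Vec4 s)
{
    Vec4 r = {{ s.lane[0], s.lane[0], s.lane[2], s.lane[2] }};
    return r;
}

/* pshufd dst,src,imm8: every result lane is picked from src,
   two selector bits per lane. */
static Vec4 pshufd_model(Vec4 s, unsigned imm8)
{
    Vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = s.lane[(imm8 >> (2 * i)) & 3];
    return r;
}
```

With a source of _,Z2+depth,_,Z1+depth, both models return Z2+depth,Z2+depth,Z1+depth,Z1+depth. `shufps` with the same register as both operands behaves like `pshufd` here.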

Step 4 :
--------

hmm, not bad, but when i look at the code, something continues to annoy me : the right shift by 31 bits, followed by the left shift by 31 bits... i know it avoids a memory access, but this work can certainly be reduced in some way... let's try :

;
; and transform to 2D coords (now XMM6 and XMM7 must not be modified, for another optimization)
;
movq MM6,QWORD PTR (A_3D_World PTR [esi]).Visual_Limit_Z_Maxi_1 ;; MM6 = depth,depth (here the zeros for the high part will be automatically generated later by the instruction movq2dq)
; for the 1st and 2nd points
movq2dq XMM5,MM6 ; XMM5 = 0,0,depth,depth (sse2)
movaps XMM4,XMM0 ; XMM4 = 0,Z1,Y1,X1
movlhps XMM0,XMM1 ; XMM0 = Y2,X2,Y1,X1
movhlps XMM1,XMM4 ; XMM1 = 0,Z2,0,Z1
movlhps XMM5,XMM5 ; XMM5 = depth,depth,depth,depth
addps XMM5,XMM1 ; XMM5 = _,Z2+depth,_,Z1+depth
; shufps XMM5,XMM5,0A0h ; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth
; pshufd XMM5,XMM5,0A0h ; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth (sse2)
movsldup XMM5,XMM5 ; XMM5 = Z2+depth,Z2+depth,Z1+depth,Z1+depth (sse3)
rcpps XMM5,XMM5 ; XMM5 = 1/Z2+depth,1/Z2+depth,1/Z1+depth,1/Z1+depth
pcmpeqd XMM4,XMM4 ; XMM4 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh (sse2)
pslld XMM4,31 ; XMM4 = 080000000h,080000000h,080000000h,080000000h (sse2)
andps XMM4,XMM5 ; keep the mask of the signed values in XMM4
addps XMM0,XMM0 ; XMM0 = Y2*2,X2*2,Y1*2,X1*2
mulps XMM0,XMM5 ; XMM0 = Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth),Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
xorps XMM0,XMM4 ; invert the signs of X and Y, depending of the mask in XMM4
movhlps XMM4,XMM0 ; XMM4 = _,_,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth)
movlhps XMM0,XMM1 ; XMM0 = _,Z1,Y1*2*1/(Z1+depth),X1*2*1/(Z1+depth)
movsd XMM1,XMM4 ; XMM1 = _,Z2,Y2*2*1/(Z2+depth),X2*2*1/(Z2+depth) (sse2)
; for the 3rd and 4th points
movq2dq XMM5,MM6 ; XMM5 = 0,0,depth,depth (sse2)
movaps XMM4,XMM2 ; XMM4 = 0,Z3,Y3,X3
movlhps XMM2,XMM3 ; XMM2 = Y4,X4,Y3,X3
movhlps XMM3,XMM4 ; XMM3 = 0,Z4,0,Z3
movlhps XMM5,XMM5 ; XMM5 = depth,depth,depth,depth
addps XMM5,XMM3 ; XMM5 = _,Z4+depth,_,Z3+depth
; shufps XMM5,XMM5,0A0h ; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth
; pshufd XMM5,XMM5,0A0h ; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth (sse2)
movsldup XMM5,XMM5 ; XMM5 = Z4+depth,Z4+depth,Z3+depth,Z3+depth (sse3)
rcpps XMM5,XMM5 ; XMM5 = 1/Z4+depth,1/Z4+depth,1/Z3+depth,1/Z3+depth
pcmpeqd XMM4,XMM4 ; XMM4 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh (sse2)
pslld XMM4,31 ; XMM4 = 080000000h,080000000h,080000000h,080000000h (sse2)
andps XMM4,XMM5 ; keep the mask of the signed values in XMM4
addps XMM2,XMM2 ; XMM2 = Y4*2,X4*2,Y3*2,X3*2
mulps XMM2,XMM5 ; XMM2 = Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth),Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
xorps XMM2,XMM4 ; invert the signs of X and Y, depending of the mask in XMM4
movhlps XMM4,XMM2 ; XMM4 = _,_,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth)
movlhps XMM2,XMM3 ; XMM2 = _,Z3,Y3*2*1/(Z3+depth),X3*2*1/(Z3+depth)
movsd XMM3,XMM4 ; XMM3 = _,Z4,Y4*2*1/(Z4+depth),X4*2*1/(Z4+depth) (sse2)

Result : 12 cycles
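
This last version takes the sign mask from the reciprocal instead of from Z+depth, which works because 1/v always has the same sign as v. The following C sketch models the sequence for one coordinate. The names are hypothetical, and an exact division is used where the assembly uses the approximate `rcpps`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* pcmpeqd x,x gives all ones; pslld x,31 leaves only bit 31 set. */
static uint32_t sign_bit_constant(void)
{
    uint32_t ones = 0xFFFFFFFFu;   /* pcmpeqd */
    return ones << 31;             /* pslld 31 */
}

/* One coordinate, in the order used by the final listing. */
static float scale_coord(float coord, float z_plus_depth)
{
    assert(z_plus_depth != 0.0f);                           /* assumption of this sketch */
    float rcp = 1.0f / z_plus_depth;                        /* rcpps (approximate in hardware) */
    uint32_t mask = sign_bit_constant() & float_bits(rcp);  /* andps */
    float v = (coord + coord) * rcp;                        /* addps, mulps */
    return bits_float(float_bits(v) ^ mask);                /* xorps */
}
```

Since `xorps` only toggles the sign bit, applying it after the multiplication gives the same result as inverting X and Y before it.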

Farabi

 :clap: Amazing.
So this function plots a pixel by converting a 3D coordinate to a 2D coordinate?
*ahem*, can you make it as a procedure like this?

3DTo2D proc x:dword,y:dword,z:dword
; Your Code here

ret
3DTo2D


Where the return value is edx for the x and eax for the y, or maybe you can fill a structure as the return value.  :green I'd like to play with 3D but it's too hard for me.
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

Farabi

I saw your 3D core code. Not bad, it's 42 FPS on my Pentium Dual Core with 46% CPU usage. What if you use a texture on it? I'd like to know the speed.

mitchi

You're a machine NightWare  :cheekygreen:

NightWare

Quote from: Farabi on July 08, 2009, 07:09:39 AM
So this function is to plot a pixel from 3D coordinate to 2D coordinate?
*ahem*, can you make it as a procedure like this?

3DTo2D proc x:dword,y:dword,z:dword
; Your Code here

ret
3DTo2D

hi, not exactly: it transforms 4 3D coords into 4 2D coords (only X and Y are useful).
this part of the code takes place within a larger process (so turning it into a procedure isn't useful without the other parts of the code). the 3D coords must be in XMM0, XMM1, XMM2 and XMM3 (in the form _,Z,Y,X), just after you have calculated them with a global rotation matrix. here the calc converts the 3D coords to a 2D representation (you just need to add the X,Y half screen positions afterwards to obtain the 2D coords), ready to be converted to integer.
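
As a rough illustration of that last step, here is a C sketch of adding the half screen positions and converting to integer. `Pixel`, `to_screen`, `half_w` and `half_h` are hypothetical names, and truncation is an assumption: the C cast truncates toward zero, while the original code may round differently.

```c
/* Integer screen position (hypothetical type name). */
typedef struct { int x, y; } Pixel;

/* Add the half screen positions to the projected X and Y,
   then convert to integer (truncating toward zero). */
static Pixel to_screen(float x2d, float y2d, float half_w, float half_h)
{
    Pixel p = { (int)(x2d + half_w), (int)(y2d + half_h) };
    return p;
}
```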

converting 3D to 2D coords is generally an easy calculation. here the code doesn't only do that, it also does some specific work to avoid calculating a tangent (very slow) for the points behind the view (you must calculate some of them, because of their links with the points on the screen). so it's only useful in THIS context.

now, concerning texturing: the 3D core i posted is limited to the point calcs. at this point a choice has to be made : you can do the stuff yourself (slow, coz it only uses the cpu), or use directx/opengl (fast, coz cpu+gpu); the speed depends on your choice. but in both cases you need to know perfectly what you are doing.