create simd mask

NightWare · June 21, 2008, 12:04:42 AM

to avoid mem access i generate masks with simd instructions,


; create mask _,_,_,0
		xorps XMM0,XMM0								;; XMM0 = 0,0,0,0
		cmpeqss XMM0,XMM0								;; XMM0 = 0,0,0,0FFFFFFFFh

; create mask _,_,1,0 (sse)
		xorps XMM1,XMM1								;; XMM1 = 0,0,0,0
		xorps XMM0,XMM0								;; XMM0 = 0,0,0,0
		cmpeqps XMM0,XMM0								;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
		movlhps XMM0,XMM1								;; XMM0 = 0,0,0FFFFFFFFh,0FFFFFFFFh

; create mask _,_,1,0 (sse2)
		xorps XMM0,XMM0								;; XMM0 = 0,0,0,0
		cmpeqps XMM0,XMM0								;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
		movq XMM0,XMM0									;; XMM0 = 0,0,0FFFFFFFFh,0FFFFFFFFh

; create mask _,2,_,0 (sse)
		xorps XMM0,XMM0								;; XMM0 = 0,0,0,0
		cmpeqss XMM0,XMM0								;; XMM0 = 0,0,0,0FFFFFFFFh
		movlhps XMM0,XMM0								;; XMM0 = 0,0FFFFFFFFh,0,0FFFFFFFFh

; create mask _,2,_,0 (sse2)
		xorps XMM1,XMM1								;; XMM1 = 0,0,0,0
		xorps XMM0,XMM0								;; XMM0 = 0,0,0,0
		cmpeqps XMM0,XMM0								;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
		punpckldq XMM0,XMM1								;; XMM0 = 0,0FFFFFFFFh,0,0FFFFFFFFh

; create mask 3,_,1,_ (sse)
		xorps XMM1,XMM1								;; XMM1 = 0,0,0,0
		cmpeqps XMM1,XMM1								;; XMM1 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
		xorps XMM0,XMM0								;; XMM0 = 0,0,0,0
		unpcklps XMM0,XMM1								;; XMM0 = 0FFFFFFFFh,0,0FFFFFFFFh,0

; create mask 3,_,1,_ (sse2)
		xorps XMM1,XMM1								;; XMM1 = 0,0,0,0
		cmpeqps XMM1,XMM1								;; XMM1 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
		xorps XMM0,XMM0								;; XMM0 = 0,0,0,0
		punpckldq XMM0,XMM1								;; XMM0 = 0FFFFFFFFh,0,0FFFFFFFFh,0

; create mask 3,2,_,_ (sse)
		xorps XMM1,XMM1								;; XMM1 = 0,0,0,0
		cmpeqps XMM1,XMM1								;; XMM1 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
		xorps XMM0,XMM0								;; XMM0 = 0,0,0,0
		movlhps XMM0,XMM1								;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0,0

; create mask 3,2,_,_ (sse2)
		xorps XMM1,XMM1								;; XMM1 = 0,0,0,0
		xorps XMM0,XMM0								;; XMM0 = 0,0,0,0
		cmpeqps XMM0,XMM0								;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
		movsd XMM0,XMM1								;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0,0

; create mask 3,2,1,_ (sse)
		xorps XMM1,XMM1								;; XMM1 = 0,0,0,0
		cmpeqps XMM1,XMM1								;; XMM1 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
		xorps XMM0,XMM0								;; XMM0 = 0,0,0,0
		cmpeqss XMM0,XMM0								;; XMM0 = 0,0,0,0FFFFFFFFh
		xorps XMM0,XMM1								;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0

; create any mask _,_,_,0 (sse2)
		xorps XMM0,XMM0								;; XMM0 = 0,0,0,0
		cmpeqss XMM0,XMM0								;; XMM0 = 0,0,0,0FFFFFFFFh
		pshufd XMM0,XMM0,0h							;; XMM0 = _,_,_,_ (the result depends on the immediate value...)

for the previous code, the first instruction (xorps) may be omitted if an earlier operation has already left the register holding values that are not NaN (a NaN never compares equal to itself)...
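
As a rough illustration of how the pshufd immediate selects lanes, here is a small sketch in the same high-to-low lane notation as the comments above. The register choices are arbitrary and the fragment is assumed to sit in a .code section with .686/.xmm enabled. Each 2-bit field of the immediate names the source lane that is copied into the matching destination lane, lowest field first.

```asm
; imm8 = (sel3 shl 6) or (sel2 shl 4) or (sel1 shl 2) or sel0
; selN is the index of the source lane copied into destination lane N
		xorps XMM0,XMM0						;; XMM0 = 0,0,0,0
		cmpeqss XMM0,XMM0					;; XMM0 = 0,0,0,0FFFFFFFFh (only lane 0 is set)
; a field of 00b picks lane 0 (all ones), a field of 01b picks lane 1 (zero)
		pshufd XMM1,XMM0,01000100b			;; 044h : XMM1 = 0,0FFFFFFFFh,0,0FFFFFFFFh
		pshufd XMM2,XMM0,00010101b			;; 015h : XMM2 = 0FFFFFFFFh,0,0,0
```

With only lane 0 set in the source, any of the 16 possible masks can be produced this way by choosing 00b for the lanes that should be ones and 01b for the lanes that should be zero.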

but i've encountered problems with supposedly faster code (cmpeqps/ss don't react well). for example, this code doesn't work :

Code Select

		cmpeqps XMM0,XMM0								;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
		cmpneqss XMM0,XMM0								;; intended : XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0

johnsa · June 21, 2008, 04:00:08 PM

I've found that accessing memory is usually faster on my machine, might be different on a P4+

one movaps or movdqa seems to be quicker than a combination of 2/3 sse instructions.

have you checked the timings/perf of the 2 options? I'd be curious to know how it runs on a "faster" box.

On another subject.. I was thinking about the W coordinate thing, in theory your vector functions can ignore preserving the W.
Reason being that the only vector operation you would apply to a vertex would be subtract (I think), which would correctly perform w-w = 0 (as vertices have a 1 in their W).
This is correct and produces a true direction vector.
Normalize, magnitude, cross, dot and all the other funcs would be applied to a direction vector produced by subtracting two vertices.
I think...
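
A minimal sketch of the w-w = 0 point, assuming vertices are stored as x,y,z,w with w = 1.0. The names VertexA and VertexB and their values are made up for illustration, and the data segment is assumed to allow 16-byte alignment.

```asm
.data
ALIGN 16
VertexA REAL4 4.0, 6.0, 8.0, 1.0			; hypothetical vertex, memory order x,y,z,w
VertexB REAL4 1.0, 2.0, 3.0, 1.0			; hypothetical vertex, memory order x,y,z,w

.code
		mov eax,OFFSET VertexA
		mov edx,OFFSET VertexB
		movaps XMM0,[eax]					;; XMM0 = A (w = 1.0)
		subps XMM0,[edx]					;; XMM0 = A - B : x,y,z = 3.0,4.0,5.0 and w = 1.0 - 1.0 = 0.0
```

The subtraction itself leaves w = 0 in the direction vector, so no mask is needed to clear it afterwards.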

johnsa · June 21, 2008, 04:05:11 PM

have u tried using cmpss xmm,xmm,imm8 instead? or double checking that cmpeqps/cmpneqss are assembling correctly? can't see any other reason why it wouldn't work.

NightWare · June 21, 2008, 11:00:28 PM

Quote from: johnsa on June 21, 2008, 04:00:08 PM
I've found that accessing memory is usually faster on my machine, might be different on a P4+
one movaps or movdqa seems to be quicker than a combination of 2/3 sse instructions.
have you checked the timings/perf of the 2 options? I'd be curious to know how it runs on a "faster" box.

hi, the speed seems the same on my core2 (anyway, theoretically it takes 3 cycles to read mem from the l1 cache...), but the problem is somewhere else: the address list in the code cache is quite short, so avoiding mem reads and jumps avoids constantly updating this list (so it speeds things up a bit...)

Quote from: johnsa on June 21, 2008, 04:00:08 PM
On another subject.. I was thinking about the W coordinate thing, in theory your vector functions can ignore preserving the W.
Reason being that the only vector operation you would apply to a Vertex would be subtract (i think) which would correctly perform w-w = 0 (as Vertices have a 1 in their W).

yep, that's why i said "it sucks", coz it slows down the code just for vertices (and i don't see the use for them...)

Quote from: johnsa on June 21, 2008, 04:05:11 PM
have u tried using cmpss xmm,xmm,imm8 instead? or double checking that cmpeqps/cmpneqss are assembling correctly? can't see any other reason why it wouldn't work.

cmpss xmm,xmm,imm8 is the real form of cmpCCss xmm,xmm (that syntax is just a convenience supported by masm). as for the reason, it seems cmpCCps/cmpCCss (or the original form) don't cope with the changes they themselves made to the xmm registers... ironic for an instruction like that... no ? :lol
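
To make the relation between the two spellings concrete, here is a short sketch (register choices are arbitrary). The predicate values listed in the comment are the standard SSE compare encodings for imm8 values 0 to 7.

```asm
; imm8 predicates : 0 = EQ, 1 = LT, 2 = LE, 3 = UNORD, 4 = NEQ, 5 = NLT, 6 = NLE, 7 = ORD
		xorps XMM0,XMM0						;; XMM0 = 0,0,0,0
		cmpss XMM0,XMM0,0					;; same as cmpeqss XMM0,XMM0 : XMM0 = 0,0,0,0FFFFFFFFh
		xorps XMM1,XMM1						;; XMM1 = 0,0,0,0
		cmpss XMM1,XMM1,4					;; same as cmpneqss XMM1,XMM1 : 0.0 is equal to 0.0, so XMM1 = 0,0,0,0
```

Both forms assemble to the same opcode, so switching spelling cannot change the result of a compare.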

in the same spirit as my previous post, here is code that generates the rows of an identity matrix :

Code Select

; create an identity matrix
		pxor XMM0,XMM0									;; XMM0 = 0,0,0,0
		cmpeqss XMM0,XMM0								;; XMM0 = 0,0,0,0FFFFFFFFh
		pslld XMM0,25									;; XMM0 = 0,0,0,0FE000000h
		psrld XMM0,2									;; XMM0 = 0,0,0,03F800000h (1.0f)
		pshufd XMM1,XMM0,051h							;; XMM1 = 0,0,03F800000h,0 (1.0f)
		pshufd XMM2,XMM0,045h							;; XMM2 = 0,03F800000h,0,0 (1.0f)
		pshufd XMM3,XMM0,015h							;; XMM3 = 03F800000h,0,0,0 (1.0f)
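
A possible way to use the four rows, sketched under the assumption that EDX points to a 16-byte aligned, 64-byte buffer that receives a row-major 4x4 float matrix (the buffer and the use of EDX are hypothetical, not part of the code above):

```asm
; build the rows as above, then store them ; lane 0 lands first in memory
		pxor XMM0,XMM0						;; XMM0 = 0,0,0,0
		cmpeqss XMM0,XMM0					;; XMM0 = 0,0,0,0FFFFFFFFh
		pslld XMM0,25						;; XMM0 = 0,0,0,0FE000000h
		psrld XMM0,2						;; XMM0 = 0,0,0,03F800000h (1.0f)
		pshufd XMM1,XMM0,051h				;; XMM1 = 0,0,1.0f,0
		pshufd XMM2,XMM0,045h				;; XMM2 = 0,1.0f,0,0
		pshufd XMM3,XMM0,015h				;; XMM3 = 1.0f,0,0,0
		movaps [edx],XMM0					;; row 0 in memory = 1.0, 0, 0, 0
		movaps [edx+16],XMM1				;; row 1 in memory = 0, 1.0, 0, 0
		movaps [edx+32],XMM2				;; row 2 in memory = 0, 0, 1.0, 0
		movaps [edx+48],XMM3				;; row 3 in memory = 0, 0, 0, 1.0
```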

EDIT :
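now i know why it doesn't work: cmpeqps leaves 0FFFFFFFFh in every lane, which is a NaN, and a NaN compared with itself is always "not equal", so cmpneqss sets lane 0 again instead of clearing it (and per the manual, "SNaN operands always generate an Invalid Operation Exception (IE).") :red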
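
One way to sidestep the floating-point compare rules entirely is to build the ones with an integer compare. The sketch below is not taken from the macros in this thread. It swaps in the SSE2 integer instructions pcmpeqd, pslldq and psrldq, and needs no xorps first because the previous register contents do not matter.

```asm
; mask 3,2,1,_ without any floating-point compare (sse2)
		pcmpeqd XMM0,XMM0					;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh (integer compare, no NaN rules)
		pslldq XMM0,4						;; shift left by 4 bytes : XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0

; mask _,_,_,0 the same way (sse2)
		pcmpeqd XMM1,XMM1					;; XMM1 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
		psrldq XMM1,12						;; shift right by 12 bytes : XMM1 = 0,0,0,0FFFFFFFFh
```

The byte shifts only produce contiguous masks; scattered ones such as 3,_,1,_ still need a shuffle or unpack.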

EDIT 2 :
faster sse and sse2 masks, in macro form :

Code Select

; ¤¤¤¤¤¤¤
; ¤ SSE ¤
; ¤¤¤¤¤¤¤

;
CreateMask_xxx0_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
	ENDM


;
CreateMask_xx1x_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,051h						;; _XMMx_ = 0,0,0FFFFFFFFh,0
	ENDM


;
CreateMask_xx10_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,050h						;; _XMMx_ = 0,0,0FFFFFFFFh,0FFFFFFFFh
	ENDM


;
CreateMask_x2xx_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,045h						;; _XMMx_ = 0,0FFFFFFFFh,0,0
	ENDM


;
CreateMask_x2x0_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,044h						;; _XMMx_ = 0,0FFFFFFFFh,0,0FFFFFFFFh
	ENDM


;
CreateMask_x21x_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,041h						;; _XMMx_ = 0,0FFFFFFFFh,0FFFFFFFFh,0
	ENDM


;
CreateMask_x210_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,040h						;; _XMMx_ = 0,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
	ENDM


;
CreateMask_3xxx_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,015h						;; _XMMx_ = 0FFFFFFFFh,0,0,0
	ENDM


;
CreateMask_3xx0_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,014h						;; _XMMx_ = 0FFFFFFFFh,0,0,0FFFFFFFFh
	ENDM


;
CreateMask_3x1x_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,011h						;; _XMMx_ = 0FFFFFFFFh,0,0FFFFFFFFh,0
	ENDM


;
CreateMask_3x10_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,010h						;; _XMMx_ = 0FFFFFFFFh,0,0FFFFFFFFh,0FFFFFFFFh
	ENDM


;
CreateMask_32xx_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,005h						;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0,0
	ENDM


;
CreateMask_32x0_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,004h						;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0,0FFFFFFFFh
	ENDM


;
CreateMask_321x_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,001h						;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0
	ENDM


;
CreateMask_3210_Sse MACRO _XMMx_:REQ
		xorps _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		shufps _XMMx_,_XMMx_,000h						;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
	ENDM


; ¤¤¤¤¤¤¤¤
; ¤ SSE2 ¤
; ¤¤¤¤¤¤¤¤

;
CreateMask_xxx0_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
	ENDM


;
CreateMask_xx1x_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,051h						;; _XMMx_ = 0,0,0FFFFFFFFh,0
	ENDM


;
CreateMask_xx10_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqsd _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0FFFFFFFFh,0FFFFFFFFh
	ENDM


;
CreateMask_x2xx_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,045h						;; _XMMx_ = 0,0FFFFFFFFh,0,0
	ENDM


;
CreateMask_x2x0_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,044h						;; _XMMx_ = 0,0FFFFFFFFh,0,0FFFFFFFFh
	ENDM


;
CreateMask_x21x_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,041h						;; _XMMx_ = 0,0FFFFFFFFh,0FFFFFFFFh,0
	ENDM


;
CreateMask_x210_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,040h						;; _XMMx_ = 0,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
	ENDM


;
CreateMask_3xxx_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,015h						;; _XMMx_ = 0FFFFFFFFh,0,0,0
	ENDM


;
CreateMask_3xx0_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,014h						;; _XMMx_ = 0FFFFFFFFh,0,0,0FFFFFFFFh
	ENDM


;
CreateMask_3x1x_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,011h						;; _XMMx_ = 0FFFFFFFFh,0,0FFFFFFFFh,0
	ENDM


;
CreateMask_3x10_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,010h						;; _XMMx_ = 0FFFFFFFFh,0,0FFFFFFFFh,0FFFFFFFFh
	ENDM


;
CreateMask_32xx_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,005h						;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0,0
	ENDM


;
CreateMask_32x0_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,004h						;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0,0FFFFFFFFh
	ENDM


;
CreateMask_321x_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqss _XMMx_,_XMMx_							;; _XMMx_ = 0,0,0,0FFFFFFFFh
		pshufd _XMMx_,_XMMx_,001h						;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0
	ENDM


;
CreateMask_3210_Sse2 MACRO _XMMx_:REQ
		pxor _XMMx_,_XMMx_								;; _XMMx_ = 0,0,0,0
		cmpeqps _XMMx_,_XMMx_							;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
	ENDM

c0d1f1ed · June 23, 2008, 07:05:17 AM

Quote from: NightWare on June 21, 2008, 11:00:28 PM
hi, the speed seems the same on my core2 (anyway theorically there is 3 cycles to read mem from l1 cache...)

It shouldn't matter. The load will be started earlier so the data is available by the time the dependent instruction executes. By computing the masks arithmetically you clog up instruction ports. So unless the load unit is a bottleneck (almost impossible on a Core 2 since it can access up to 128-bit each cycle), I'd suggest loading masks from memory. You'll also have extra registers available for more important things like reducing dependencies in the critical path.
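
For comparison, a minimal sketch of the memory-based approach suggested here. The constant name Mask_321x is made up for illustration, and the data segment is assumed to allow 16-byte alignment (andps with a memory operand requires it).

```asm
.data
ALIGN 16
Mask_321x DWORD 0, 0FFFFFFFFh, 0FFFFFFFFh, 0FFFFFFFFh	; hypothetical constant, lane 0 first in memory

.code
		mov eax,OFFSET Mask_321x
		andps XMM0,[eax]					;; clears lane 0 of XMM0, keeps lanes 3,2,1
```

The mask is used straight from memory as an operand, so no extra register and no extra arithmetic instruction is spent on it.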

NightWare · June 23, 2008, 09:06:32 PM

Quote from: c0d1f1ed on June 23, 2008, 07:05:17 AM
By computing the masks arithmetically you clog up instruction ports.

yep, but there are several compensations: the l1 cache/address list and the l2 cache are less used, ... :wink

Mark_Larson · June 25, 2008, 09:31:27 PM

I try and pre-load registers with masks in the main() routine. If I don't have enough registers, then I just read it from memory like was previously stated. Doing it the way you are doing it will slow things down.

NightWare · June 25, 2008, 11:52:04 PM

"Reading from the level-1 cache takes approximately 3 clock cycles. Reading from the level-2 cache takes in the order of magnitude of 10 clock cycles. Reading from main memory takes in the order of magnitude of 100 clock cycles" (taken from Agner Fog's optimizing assembly)

so, unless your mask is ALREADY (and even in this case, the mask has to pass all the process, also when pre-loaded) in the l1 cache (8kb), it will not slow things down... or you have to explain to me how...

beside, there is 128bits per mask free in the cache, usable by something else.

i've made those changes in my gfx-engine and haven't noticed a slowdown, on the contrary...
like in the points coords calc (a quite critical one) and in other (less used) algos, but i must admit :
1. they are not very numerous (2 for points coords, multiplied by the number of points/4, coz simd...)
2. it can also be due to the removed jumps (where i had to use masks...)

Mark_Larson · June 27, 2008, 08:52:15 PM

Quote from: NightWare on June 25, 2008, 11:52:04 PM
"Reading from the level-1 cache takes approximately 3 clock cycles. Reading from the level-2 cache takes in the order of magnitude of 10 clock cycles. Reading from main memory takes in the order of magnitude of 100 clock cycles" (taken from Agner Fog's optimizing assembly)

AMD takes 3 cycles and Intel since P4 take 2 cycles. I am not sure what it is on a P3.

the problem is you might cause register contention if you use it a lot. does that make sense?

c0d1f1ed · June 27, 2008, 09:32:08 PM

Quote from: NightWare on June 25, 2008, 11:52:04 PM
"Reading from the level-1 cache takes approximately 3 clock cycles. Reading from the level-2 cache takes in the order of magnitude of 10 clock cycles. Reading from main memory takes in the order of magnitude of 100 clock cycles" (taken from Agner Fog's optimizing assembly)

so, unless your mask is ALREADY (and even in this case, the mask has to pass all the process, also when pre-loaded) in the l1 cache (8kb), it will not slow things down... or you have to explain to me how...

It will be in L1 cache 99% of the time or more. You'll typically use these masks in loops that iterate thousands of times, so apart from the initial cache miss all other accesses should practically be a hit. Besides, even if it has to go to L2 cache that's not a disaster. 10 clock cycles can easily be bridged by out-of-order execution. Remember that the CPU can start the load operation as soon as the instruction is decoded and the load port is free. Unless you're totally load port limited (practically only the case if you're doing a straight memmove), it's going to execute the load well in time before the logic operation that uses the mask.

Quote
beside, there is 128bits per mask free in the cache, usable by something else.

That's peanuts. Remember that on modern CPUs the L1 cache acts more like an extension of the register set than anything else, and the L2 cache holds the actual working set. So use precomputed constants whenever you need them, just don't go overboard by using lookup tables that don't fit in L1 cache.

Quote
2. it can also be due to the removed jumps (where i had to use masks...)

Jumps can have a very bad effect on performance. Even if it is well predictable, it uses a jump history slot and can cause worse predictions for other jumps. Could you benchmark it when the masks are loaded from memory?

NightWare · June 28, 2008, 01:46:54 AM

Quote from: c0d1f1ed on June 27, 2008, 09:32:08 PM
Could you benchmark it when the masks are loaded from memory?

i've tried to evaluate the two possibilities, but the 3d test scene i use doesn't have enough points to see the difference, so i've added a loop to multiply the number of points by 500 (coz i wanted to see the results in real use, and i get 2fps here...) and noted no difference (approximately 350 ticks, and in both cases there is a variation of 30 ticks). so, if there is a difference (whatever the direction) you can't see it in a 3d engine

EDIT :
since i can't see it in real use, even if i increase the number of loop, i've made a speedtest with 1000000000 iterations, and only with the code i use (generate a signs mask), results :

Code Select


3401

1065353216106535321610653532161065353216
3338

1065353216106535321610653532161065353216
3370

1065353216106535321610653532161065353216


 Press ENTER to quit...

and the order changes each time (due to GetTickCount imprecision, and probably os interaction), they're all first/second/third, but always near. so where is the slowdown ?

[attachment deleted by admin]
