create simd mask

Started by NightWare, June 21, 2008, 12:04:42 AM

NightWare

to avoid memory accesses, i generate masks with simd instructions (lanes are listed 3,2,1,0, left to right):

; create mask _,_,_,0
xorps XMM0,XMM0 ;; XMM0 = 0,0,0,0
cmpeqss XMM0,XMM0 ;; XMM0 = 0,0,0,0FFFFFFFFh

; create mask _,_,1,0 (sse)
xorps XMM1,XMM1 ;; XMM1 = 0,0,0,0
xorps XMM0,XMM0 ;; XMM0 = 0,0,0,0
cmpeqps XMM0,XMM0 ;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
movlhps XMM0,XMM1 ;; XMM0 = 0,0,0FFFFFFFFh,0FFFFFFFFh

; create mask _,_,1,0 (sse2)
xorps XMM0,XMM0 ;; XMM0 = 0,0,0,0
cmpeqps XMM0,XMM0 ;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
movq XMM0,XMM0 ;; XMM0 = 0,0,0FFFFFFFFh,0FFFFFFFFh

; create mask _,2,_,0 (sse)
xorps XMM0,XMM0 ;; XMM0 = 0,0,0,0
cmpeqss XMM0,XMM0 ;; XMM0 = 0,0,0,0FFFFFFFFh
movlhps XMM0,XMM0 ;; XMM0 = 0,0FFFFFFFFh,0,0FFFFFFFFh

; create mask _,2,_,0 (sse2)
xorps XMM1,XMM1 ;; XMM1 = 0,0,0,0
xorps XMM0,XMM0 ;; XMM0 = 0,0,0,0
cmpeqps XMM0,XMM0 ;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
punpckldq XMM0,XMM1 ;; XMM0 = 0,0FFFFFFFFh,0,0FFFFFFFFh

; create mask 3,_,1,_ (sse)
xorps XMM1,XMM1 ;; XMM1 = 0,0,0,0
cmpeqps XMM1,XMM1 ;; XMM1 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
xorps XMM0,XMM0 ;; XMM0 = 0,0,0,0
unpcklps XMM0,XMM1 ;; XMM0 = 0FFFFFFFFh,0,0FFFFFFFFh,0

; create mask 3,_,1,_ (sse2)
xorps XMM1,XMM1 ;; XMM1 = 0,0,0,0
cmpeqps XMM1,XMM1 ;; XMM1 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
xorps XMM0,XMM0 ;; XMM0 = 0,0,0,0
punpckldq XMM0,XMM1 ;; XMM0 = 0FFFFFFFFh,0,0FFFFFFFFh,0

; create mask 3,2,_,_ (sse)
xorps XMM1,XMM1 ;; XMM1 = 0,0,0,0
cmpeqps XMM1,XMM1 ;; XMM1 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
xorps XMM0,XMM0 ;; XMM0 = 0,0,0,0
movlhps XMM0,XMM1 ;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0,0

; create mask 3,2,_,_ (sse2)
xorps XMM1,XMM1 ;; XMM1 = 0,0,0,0
xorps XMM0,XMM0 ;; XMM0 = 0,0,0,0
cmpeqps XMM0,XMM0 ;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
movsd XMM0,XMM1 ;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0,0

; create mask 3,2,1,_ (sse)
xorps XMM1,XMM1 ;; XMM1 = 0,0,0,0
cmpeqps XMM1,XMM1 ;; XMM1 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
xorps XMM0,XMM0 ;; XMM0 = 0,0,0,0
cmpeqss XMM0,XMM0 ;; XMM0 = 0,0,0,0FFFFFFFFh
xorps XMM0,XMM1 ;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0

; create any mask _,_,_,0 (sse2)
xorps XMM0,XMM0 ;; XMM0 = 0,0,0,0
cmpeqss XMM0,XMM0 ;; XMM0 = 0,0,0,0FFFFFFFFh
pshufd XMM0,XMM0,0h ;; XMM0 = _,_,_,_ (which lanes are set depends on the immediate value...)

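for the previous code, the first instruction (xorps) may be dropped if an earlier operation has already left suitable values in the register (no NaN in the compared lanes, and zeros in the lanes that cmpeqss does not write)...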
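
For readers who want to check these sequences from C, here is a minimal sketch of three of the masks above written with compiler intrinsics. It assumes an x86 target with SSE2 and `<emmintrin.h>`; the function names (`mask_lane0`, `mask_lanes10`, `mask_lanes321`, `mask_to_u32`) are invented for this illustration and do not come from the thread, and a compiler is free to pick different instructions than the ones named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the SSE ones */
#include <stdint.h>

/* _,_,_,0 : xorps + cmpeqss */
__m128 mask_lane0(void)
{
    __m128 zero = _mm_setzero_ps();          /* 0,0,0,0 */
    return _mm_cmpeq_ss(zero, zero);         /* 0,0,0,0FFFFFFFFh */
}

/* _,_,1,0 : all ones, then movlhps with a zeroed register */
__m128 mask_lanes10(void)
{
    __m128 zero = _mm_setzero_ps();
    __m128 ones = _mm_cmpeq_ps(zero, zero);  /* all four lanes 0FFFFFFFFh */
    return _mm_movelh_ps(ones, zero);        /* 0,0,0FFFFFFFFh,0FFFFFFFFh */
}

/* 3,2,1,_ : all ones xor the lane-0 mask */
__m128 mask_lanes321(void)
{
    __m128 zero = _mm_setzero_ps();
    __m128 ones = _mm_cmpeq_ps(zero, zero);
    return _mm_xor_ps(ones, mask_lane0());   /* 0FFFFFFFFh x3, then 0 */
}

/* Copy the four lanes out as integers; out[0] is lane 0. */
void mask_to_u32(__m128 m, uint32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(m));
}
```

Calling `mask_to_u32` on each result gives the lanes in memory order, so `out[0]` corresponds to the rightmost value in the comments above.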

but i've encountered problems with supposedly faster code (cmpeqps/ss don't react well). for example, this code doesn't work:
cmpeqps XMM0,XMM0 ;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
cmpneqss XMM0,XMM0 ;; XMM0 = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0






johnsa

I've found that accessing memory is usually faster on my machine, might be different on a P4+

one movaps or movdqa seems to be quicker than a combination of 2/3 sse instructions.

have you checked the timings/perf of the 2 options? I'd be curious to know how it runs on a "faster" box.

On another subject.. I was thinking about the W coordinate thing, in theory your vector functions can ignore preserving the W.
The reason being that the only vector operation you would apply to a vertex is subtraction (I think), which correctly gives w-w = 0 (as vertices have a 1 in their W).
This is correct and produces a true directional vector.
Normalize, Magnitude, cross, dot and all the other funcs would be applied to a direction vector produced by subtracting two vertices.
I think...
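
The w-w = 0 argument can be checked with a tiny example. This is only a sketch, assuming points are stored as x,y,z,w in lanes 0..3 with w = 1; `direction_between` and `vec_to_floats` are made-up names, not functions from anyone's engine.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Subtracting two points (w = 1) yields a direction vector (w = 0). */
__m128 direction_between(__m128 to, __m128 from)
{
    return _mm_sub_ps(to, from);   /* subps: lane-wise to - from */
}

/* Copy the lanes out; out[0] = x, out[3] = w. */
void vec_to_floats(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}
```

Because both inputs carry w = 1, the subtraction clears W on its own and no mask is needed for that step.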

johnsa

have u tried using cmpss xmm,xmm,imm8 instead? or double checking that cmpeqps/cmpneqss are assembling correctly? can't see any other reason why it wouldn't work.

NightWare

Quote from: johnsa on June 21, 2008, 04:00:08 PM
I've found that accessing memory is usually faster on my machine, might be different on a P4+
one movaps or movdqa seems to be quicker than a combination of 2/3 sse instructions.
have you checked the timings/perf of the 2 options? I'd be curious to know how it runs on a "faster" box.

hi, the speed seems the same on my core2 (anyway, theoretically it takes 3 cycles to read from the l1 cache...), but the problem is somewhere else: the address list in the code cache is quite short, so avoiding memory reads and jumps avoids constantly updating this list (which speeds things up a bit...)

Quote from: johnsa on June 21, 2008, 04:00:08 PM
On another subject.. I was thinking about the W coordinate thing, in theory your vector functions can ignore preserving the W.
Reason being that the only vector operation you would apply to a Vertex would be subtract (i think) which would correctly perform w-w = 0 (as Vertices have a 1 in their W).

yep, that's why i said "it sucks", because it slows down the code just for vertices (and i don't see the use for them...)

Quote from: johnsa on June 21, 2008, 04:05:11 PM
have u tried using cmpss xmm,xmm,imm8 instead? or double checking that cmpeqps/cmpneqss are assembling correctly? can't see any other reason why it wouldn't work.

cmpss xmm,xmm,imm8 is the real form of cmpCCss xmm,xmm (this syntax is just supported by masm). as for the reason, it seems cmpCCps/cmpCCss (or the original form) don't fix the changes they make to the xmm registers... ironic for an instruction like that, no ?  :lol

in the same spirit as my previous post, this generates an identity matrix:
; create an identity matrix
pxor XMM0,XMM0 ;; XMM0 = 0,0,0,0
cmpeqss XMM0,XMM0 ;; XMM0 = 0,0,0,0FFFFFFFFh
pslld XMM0,25 ;; XMM0 = 0,0,0,0FE000000h
psrld XMM0,2 ;; XMM0 = 0,0,0,03F800000h (1.0f)
pshufd XMM1,XMM0,051h ;; XMM1 = 0,0,03F800000h,0 (1.0f)
pshufd XMM2,XMM0,045h ;; XMM2 = 0,03F800000h,0,0 (1.0f)
pshufd XMM3,XMM0,015h ;; XMM3 = 03F800000h,0,0,0 (1.0f)
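
The same identity-matrix trick can be sketched in C to confirm the shift counts and shuffle immediates. This assumes SSE2 and `<emmintrin.h>`; `identity_matrix` is a hypothetical helper name, and row 0 here corresponds to XMM0 in the listing above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Fill m with the 4x4 identity; m[row][0] is lane 0 of that row. */
void identity_matrix(float m[4][4])
{
    __m128  zero = _mm_setzero_ps();
    __m128i one  = _mm_castps_si128(_mm_cmpeq_ss(zero, zero)); /* 0,0,0,0FFFFFFFFh */

    one = _mm_slli_epi32(one, 25);   /* 0,0,0,0FE000000h */
    one = _mm_srli_epi32(one, 2);    /* 0,0,0,03F800000h = 1.0f */

    /* Selector 0 copies the 1.0f lane, selector 1 copies a zero lane. */
    _mm_storeu_ps(m[0], _mm_castsi128_ps(one));                          /* 0,0,0,1 */
    _mm_storeu_ps(m[1], _mm_castsi128_ps(_mm_shuffle_epi32(one, 0x51))); /* 0,0,1,0 */
    _mm_storeu_ps(m[2], _mm_castsi128_ps(_mm_shuffle_epi32(one, 0x45))); /* 0,1,0,0 */
    _mm_storeu_ps(m[3], _mm_castsi128_ps(_mm_shuffle_epi32(one, 0x15))); /* 1,0,0,0 */
}
```

The two shifts turn the all-ones lane into the bit pattern of 1.0f (sign 0, exponent 7Fh, mantissa 0) without touching memory.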


EDIT :
now i know why it doesn't work, "SNaN operands always generate an Invalid Operation Exception (IE)."  :red
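
A possible refinement of the explanation in the EDIT above, offered as one reading rather than something stated in the thread: after cmpeqps every lane holds 0FFFFFFFFh, and that bit pattern is itself a NaN (as far as I can tell a quiet one, so no exception is needed to explain the result). A NaN never compares equal to anything, itself included, so cmpneqss reports "not equal" and leaves lane 0 at 0FFFFFFFFh instead of clearing it. The sketch below, assuming SSE2 and `<emmintrin.h>`, reproduces that. It also shows one integer-domain alternative (pcmpeqd + pslldq) that does not depend on float comparison rules; that alternative and both function names are additions for illustration, not code from the thread.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Lane 0 after: xorps x,x ; cmpeqps x,x ; cmpneqss x,x */
uint32_t lane0_after_cmpneqss(void)
{
    __m128 x = _mm_setzero_ps();
    x = _mm_cmpeq_ps(x, x);     /* every lane 0FFFFFFFFh, a NaN bit pattern */
    x = _mm_cmpneq_ss(x, x);    /* NaN != NaN is true: lane 0 stays 0FFFFFFFFh */
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(x));
}

/* 3,2,1,_ built with integer instructions only (not from the thread). */
void mask_lanes321_int(uint32_t out[4])
{
    __m128i x = _mm_setzero_si128();
    x = _mm_cmpeq_epi32(x, x);  /* pcmpeqd: all ones, whatever x held */
    x = _mm_slli_si128(x, 4);   /* pslldq by 4 bytes: lane 0 becomes 0 */
    _mm_storeu_si128((__m128i *)out, x);
}
```

The integer compare has no notion of NaN, so it produces all ones for any register content, which may make it a safer starting point when the initial xorps is dropped.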

EDIT 2 :
faster sse and sse2 masks, in macro form:
; ¤¤¤¤¤¤¤
; ¤ SSE ¤
; ¤¤¤¤¤¤¤

;
CreateMask_xxx0_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
ENDM


;
CreateMask_xx1x_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,051h ;; _XMMx_ = 0,0,0FFFFFFFFh,0
ENDM


;
CreateMask_xx10_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,050h ;; _XMMx_ = 0,0,0FFFFFFFFh,0FFFFFFFFh
ENDM


;
CreateMask_x2xx_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,045h ;; _XMMx_ = 0,0FFFFFFFFh,0,0
ENDM


;
CreateMask_x2x0_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,044h ;; _XMMx_ = 0,0FFFFFFFFh,0,0FFFFFFFFh
ENDM


;
CreateMask_x21x_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,041h ;; _XMMx_ = 0,0FFFFFFFFh,0FFFFFFFFh,0
ENDM


;
CreateMask_x210_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,040h ;; _XMMx_ = 0,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
ENDM


;
CreateMask_3xxx_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,015h ;; _XMMx_ = 0FFFFFFFFh,0,0,0
ENDM


;
CreateMask_3xx0_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,014h ;; _XMMx_ = 0FFFFFFFFh,0,0,0FFFFFFFFh
ENDM


;
CreateMask_3x1x_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,011h ;; _XMMx_ = 0FFFFFFFFh,0,0FFFFFFFFh,0
ENDM


;
CreateMask_3x10_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,010h ;; _XMMx_ = 0FFFFFFFFh,0,0FFFFFFFFh,0FFFFFFFFh
ENDM


;
CreateMask_32xx_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,005h ;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0,0
ENDM


;
CreateMask_32x0_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,004h ;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0,0FFFFFFFFh
ENDM


;
CreateMask_321x_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,001h ;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0
ENDM


;
CreateMask_3210_Sse MACRO _XMMx_:REQ
xorps _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
shufps _XMMx_,_XMMx_,000h ;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
ENDM


; ¤¤¤¤¤¤¤¤
; ¤ SSE2 ¤
; ¤¤¤¤¤¤¤¤

;
CreateMask_xxx0_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
ENDM


;
CreateMask_xx1x_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,051h ;; _XMMx_ = 0,0,0FFFFFFFFh,0
ENDM


;
CreateMask_xx10_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqsd _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0FFFFFFFFh,0FFFFFFFFh
ENDM


;
CreateMask_x2xx_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,045h ;; _XMMx_ = 0,0FFFFFFFFh,0,0
ENDM


;
CreateMask_x2x0_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,044h ;; _XMMx_ = 0,0FFFFFFFFh,0,0FFFFFFFFh
ENDM


;
CreateMask_x21x_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,041h ;; _XMMx_ = 0,0FFFFFFFFh,0FFFFFFFFh,0
ENDM


;
CreateMask_x210_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,040h ;; _XMMx_ = 0,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
ENDM


;
CreateMask_3xxx_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,015h ;; _XMMx_ = 0FFFFFFFFh,0,0,0
ENDM


;
CreateMask_3xx0_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,014h ;; _XMMx_ = 0FFFFFFFFh,0,0,0FFFFFFFFh
ENDM


;
CreateMask_3x1x_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,011h ;; _XMMx_ = 0FFFFFFFFh,0,0FFFFFFFFh,0
ENDM


;
CreateMask_3x10_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,010h ;; _XMMx_ = 0FFFFFFFFh,0,0FFFFFFFFh,0FFFFFFFFh
ENDM


;
CreateMask_32xx_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,005h ;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0,0
ENDM


;
CreateMask_32x0_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,004h ;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0,0FFFFFFFFh
ENDM


;
CreateMask_321x_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqss _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0FFFFFFFFh
pshufd _XMMx_,_XMMx_,001h ;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0
ENDM


;
CreateMask_3210_Sse2 MACRO _XMMx_:REQ
pxor _XMMx_,_XMMx_ ;; _XMMx_ = 0,0,0,0
cmpeqps _XMMx_,_XMMx_ ;; _XMMx_ = 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
ENDM
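
The immediates used in the shuffle-based macros above follow a simple rule: each 2-bit selector is 0 when the lane should be set (it copies the all-ones lane 0) and 1 when it should be clear (it copies a zero lane). Here is a small sketch in plain C that derives the immediate from a 4-bit lane set and models the shuffle in scalar code. The helper names are hypothetical, and the choice of selector 1 for "zero" simply mirrors the macros (2 or 3 would work as well).

```c
#include <assert.h>
#include <stdint.h>

/* lanes: bit n set means lane n of the mask should be 0FFFFFFFFh. */
unsigned shuffle_imm_for_mask(unsigned lanes)
{
    unsigned imm = 0;

    assert(lanes <= 0xFu);
    for (unsigned lane = 0; lane < 4; ++lane) {
        /* selector 0 picks the all-ones lane 0, selector 1 picks a zero lane */
        unsigned sel = ((lanes >> lane) & 1u) ? 0u : 1u;
        imm |= sel << (2 * lane);
    }
    return imm;
}

/* Scalar model of pshufd (or shufps with the same register twice). */
void shuffle_model(const uint32_t src[4], unsigned imm, uint32_t dst[4])
{
    for (unsigned lane = 0; lane < 4; ++lane)
        dst[lane] = src[(imm >> (2 * lane)) & 3u];
}
```

For example, lanes 0 and 2 (`lanes = 0x5`) give 044h, the immediate used by CreateMask_x2x0, and lane 1 alone (`lanes = 0x2`) gives 051h as in CreateMask_xx1x.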


c0d1f1ed

Quote from: NightWare on June 21, 2008, 11:00:28 PM
hi, the speed seems the same on my core2 (anyway, theoretically it takes 3 cycles to read from the l1 cache...)

It shouldn't matter. The load will be started earlier so the data is available by the time the dependent instruction executes. By computing the masks arithmetically you clog up instruction ports. So unless the load unit is a bottleneck (almost impossible on a Core 2 since it can access up to 128 bits each cycle), I'd suggest loading masks from memory. You'll also have extra registers available for more important things like reducing dependencies in the critical path.
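
The load-from-memory approach suggested here could look like this in C. It is a sketch only: it assumes SSE2, a C11 compiler for `alignas`, and an x,y,z,w lane order; `kMaskXYZ` and `clear_w` are illustrative names rather than anything from the thread.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdalign.h>
#include <stdint.h>

/* 16-byte aligned constant: lanes 0..2 set, lane 3 (w) clear. */
static alignas(16) const uint32_t kMaskXYZ[4] =
    { 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0u };

/* Zero the w lane by ANDing with the mask loaded from memory. */
__m128 clear_w(__m128 v)
{
    __m128i mask = _mm_load_si128((const __m128i *)kMaskXYZ);  /* aligned load */
    return _mm_and_ps(v, _mm_castsi128_ps(mask));              /* andps */
}
```

A compiler will typically fold the load into the `andps` memory operand, so no register is tied up holding the mask.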

NightWare

Quote from: c0d1f1ed on June 23, 2008, 07:05:17 AM
By computing the masks arithmetically you clog up instruction ports.

yep, but there are several compensations: l1 cache/address list, l2 cache less used, ...  :wink

Mark_Larson

I try and pre-load registers with masks in the main() routine.  If I don't have enough registers, then I just read it from memory like was previously stated.  Doing it the way you are doing it will slow things down.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

NightWare

"Reading from the level-1 cache takes approximately 3 clock cycles. Reading from the level-2 cache takes in the order of magnitude of 10 clock cycles. Reading from main memory takes in the order of magnitude of 100 clock cycles" (taken from Agner Fog's "Optimizing assembly")

so, unless your mask is ALREADY in the l1 cache (8kb) (and even in that case the mask still has to go through the whole process, also when pre-loaded), it will not slow things down... or you'll have to explain to me how...

besides, there are 128 bits per mask free in the cache, usable by something else.

i've made these changes in my gfx-engine and haven't noticed a slowdown, on the contrary...
like in the point coords calc (a quite critical one) and in other (less used) algos, but i must admit:
1. they are not very numerous (2 for point coords, multiplied by the number of points/4, because of simd...)
2. it can also be due to the removed jumps (where i had to use masks...)

Mark_Larson

Quote from: NightWare on June 25, 2008, 11:52:04 PM
"Reading from the level-1 cache takes approximately 3 clock cycles. Reading from the level-2 cache takes in the order of magnitude of 10 clock cycles. Reading from main memory takes in the order of magnitude of 100 clock cycles" (taken from Agner Fog's "Optimizing assembly")

AMD takes 3 cycles and Intel since the P4 takes 2 cycles.  I am not sure what it is on a P3.

the problem is you might cause register contention if you use it a lot.  does that make sense?

c0d1f1ed

Quote from: NightWare on June 25, 2008, 11:52:04 PM
"Reading from the level-1 cache takes approximately 3 clock cycles. Reading from the level-2 cache takes in the order of magnitude of 10 clock cycles. Reading from main memory takes in the order of magnitude of 100 clock cycles" (taken from Agner Fog's "Optimizing assembly")

so, unless your mask is ALREADY in the l1 cache (8kb) (and even in that case the mask still has to go through the whole process, also when pre-loaded), it will not slow things down... or you'll have to explain to me how...

It will be in L1 cache 99% of the time or more. You'll typically use these masks in loops that iterate thousands of times, so apart from the initial cache miss all other accesses should practically be a hit. Besides, even if it has to go to L2 cache that's not a disaster. 10 clock cycles can easily be bridged by out-of-order execution. Remember that the CPU can start the load operation as soon as the instruction is decoded and the load port is free. Unless you're totally load port limited (practically only the case if you're doing a straight memmove), it's going to execute the load well in time before the logic operation that uses the mask.

Quote
besides, there are 128 bits per mask free in the cache, usable by something else.

That's peanuts. Remember that on modern CPUs the L1 cache acts more like an extension of the register set than anything else, and the L2 cache holds the actual working set. So use precomputed constants whenever you need them, just don't go overboard by using lookup tables that don't fit in L1 cache.

Quote
2. it can also be due to the removed jumps (where i had to use masks...)

Jumps can have a very bad effect on performance. Even if it is well predictable, it uses a jump history slot and can cause worse predictions for other jumps. Could you benchmark it when the masks are loaded from memory?

NightWare

Quote from: c0d1f1ed on June 27, 2008, 09:32:08 PM
Could you benchmark it when the masks are loaded from memory?

i've tried to evaluate the two possibilities, but the 3d test scene i use doesn't have enough points to see the difference, so i've written a loop to multiply the number of points by 500 (because i wanted to see the results in real use, and i get 2fps here...) and noticed no difference (approximately 350 ticks, and in both cases there is a variation of 30 ticks). so, if there is a difference (whatever the direction) you can't see it in a 3d engine

EDIT :
since i can't see it in real use, even if i increase the number of loops, i've made a speed test with 1000000000 iterations, using only the code i use (generating a signs mask). results:

3401

1065353216106535321610653532161065353216
3338

1065353216106535321610653532161065353216
3370

1065353216106535321610653532161065353216


Press ENTER to quit...

and the order changes each time (due to GetTickCount imprecision, and probably os interaction): each one comes out first/second/third at some point, but they are always close. so where is the slowdown ?

