Hey all,
Been working on various bits and pieces, and in the process I needed a fast SSE ceil function that works for both positive and negative reals.
So it must follow the rules:
ceil(1.2) = 2.0
ceil(1.0) = 1.0
ceil(-0.3) = 0.0
ceil(-1.2) = -1.0 and so on
So far this is the best I've come up with. Does anyone have ideas for improving it while still obeying the above rules?
It assumes the input (a single-precision real) is in xmm0.
(In this example I've just loaded a constant into xmm0.)
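; somewhere in the data section...
align 16
SingleOne REAL4 1.0
; code
movss xmm0,FP4(1.0)             ; example input x
movss xmm6,dword ptr SingleOne  ; xmm6 = 1.0
movaps xmm3,xmm0                ; xmm3 = copy of x
cvttss2si eax,xmm0              ; eax = trunc(x), rounds toward zero
cvtsi2ss xmm0,eax               ; xmm0 = trunc(x) as a single
movd eax,xmm3                   ; eax = raw bits of x
test eax,80000000h              ; negative: trunc(x) is already ceil(x)
jnz short @F
ucomiss xmm0,xmm3               ; trunc(x) == x: x is already an integer
je short @F
addss xmm0,xmm6                 ; positive with a fraction: round up
@@: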
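For anyone who wants to check the truncate-then-fix-up idea outside the assembler, here is a minimal C sketch using SSE intrinsics. The helper name ceil_trunc_fixup is made up for illustration. It assumes the input fits in a 32-bit integer (the same limit cvttss2si imposes), and it decides whether to add 1.0 by comparing the truncated value with the input, so treat it as a sketch of the idea rather than a line-by-line transcription of the listing.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_cvttss_si32 */

/* Hypothetical helper: ceil for values that fit in a 32-bit int. */
static float ceil_trunc_fixup(float x)
{
    int   i = _mm_cvttss_si32(_mm_set_ss(x)); /* cvttss2si: round toward zero */
    float t = (float)i;                       /* cvtsi2ss */

    /* Truncation already rounds negative values up, so only a positive
       value with a fractional part ends up below x and needs the extra 1.0. */
    if (t < x)
        t += 1.0f;
    return t;
}
```

Besides the four values from the rules, it is worth trying inputs such as 3.0 and 0.5, since whole numbers that are not powers of two (and fractions that are) are easy to get wrong when testing bit patterns.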
A slight variation, using cmovs instead of branching. This one takes 30ms for both positive and negative values on my machine (10 million iterations).
The previous version with branches took 11ms for negative numbers and 20ms for positive ones (10 million iterations), so even with the branches it's still faster.
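CEIL2 MACRO reg
mov ebx,SingleOne       ; ebx = bit pattern of 1.0
xor edx,edx             ; edx = bit pattern of 0.0
movaps xmm3,reg         ; xmm3 = copy of x
cvttss2si eax,reg       ; eax = trunc(x)
cvtsi2ss reg,eax        ; reg = trunc(x) as a single
movd eax,xmm3           ; eax = raw bits of x
test eax,80000000h      ; negative: add 0.0
cmovnz ebx,edx
ucomiss reg,xmm3        ; trunc(x) == x, already an integer: add 0.0
cmove ebx,edx
movd xmm6,ebx
addss reg,xmm6          ; add 1.0 or 0.0
ENDM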
John,
I would be inclined to go with the branching version anyway. I have almost always found branches faster, and the coming generation of Intel quads has improved the performance of jumps so that they do not stall the instruction queue like they did on the older PIVs.
johnsa,
This is a little faster on my Pentium D.
.DATA
sngMinusOneHalf REAL4 -0.5
.CODE
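movss xmm0, FP4(1.2)            ; example input x
addss xmm0, xmm0                ; xmm0 = 2x
movss xmm1, sngMinusOneHalf     ; xmm1 = -0.5
subss xmm1, xmm0                ; xmm1 = -0.5 - 2x
cvtss2si eax, xmm1              ; round to nearest (default MXCSR rounding mode)
sar eax, 1                      ; floor-divide by 2
neg eax                         ; eax = ceil(x)
cvtsi2ss xmm0, eax              ; back to a single in xmm0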
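As I read it, the sequence above computes -floor(round(-0.5 - 2x) / 2): doubling x and offsetting by a half means the round-to-nearest conversion never lands on the wrong side, and the arithmetic shift then does the floor. A minimal C sketch with SSE intrinsics, for experimenting with it. The helper name ceil_round_trick is made up, and the sketch assumes the default MXCSR rounding mode (round to nearest even), |x| below roughly 2^30, and a compiler that shifts negative ints arithmetically.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper mirroring the posted instruction sequence. */
static int ceil_round_trick(float x)
{
    __m128 v = _mm_set_ss(x);
    v = _mm_add_ss(v, v);                  /* addss: 2x */
    v = _mm_sub_ss(_mm_set_ss(-0.5f), v);  /* subss: -0.5 - 2x */
    int n = _mm_cvtss_si32(v);             /* cvtss2si: round to nearest even */

    /* sar eax,1 then neg eax. The right shift of a negative int is
       implementation-defined in C; this assumes an arithmetic shift. */
    return -(n >> 1);
}
```

For example, x = 1.0 gives -2.5, which rounds to the even neighbour -2, shifts to -1 and negates to 1, while x = 1.2 gives -2.9, which rounds to -3, shifts to -2 and negates to 2.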
Hi,
Not tested (just coded it now :P), but try this. There is no branch here (so no need to wait for Intel's improvements :wink):
movss XMM1,FP4(1.0) ; XMM1 = Val
movss XMM2,dword ptr SingleOne ; XMM2 = 1.0
cvttss2si eax,XMM1 ; ) XMM3 = tVal = trunc(Val)
cvtsi2ss XMM3,eax ; )
movaps XMM0,XMM3 ; XMM0 = tVal
subss XMM0,XMM1 ; XMM0 = tVal - Val (negative only if truncation rounded Val down)
psrad XMM0,31 ; XMM0 = all ones if negative, else 0
pand XMM0,XMM2 ; XMM0 = 1.0 or 0.0
addss XMM0,XMM3 ; XMM0 = tVal + (1.0 or 0.0)
If it works, you should be able to do the job in parallel with SSE2...
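To illustrate the "in parallel" remark, here is a rough sketch of how the same sign-mask trick might look on four singles at once, written with SSE2 intrinsics in C. The function name ceil_ps_sketch is made up, and every lane is assumed to fit in a 32-bit integer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE ones too) */

/* Hypothetical packed variant: ceil of four singles, no branches. */
static __m128 ceil_ps_sketch(__m128 x)
{
    __m128  t    = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));     /* trunc per lane */
    __m128  d    = _mm_sub_ps(t, x);                         /* negative where t < x */
    __m128i mask = _mm_srai_epi32(_mm_castps_si128(d), 31);  /* psrad: sign bit -> all ones */
    __m128  inc  = _mm_and_ps(_mm_castsi128_ps(mask),
                              _mm_set1_ps(1.0f));            /* 1.0 or 0.0 per lane */
    return _mm_add_ps(t, inc);                               /* trunc + (1.0 or 0.0) */
}
```

Loading {1.2, 1.0, -0.3, -1.2} with _mm_loadu_ps and storing the result with _mm_storeu_ps should give {2.0, 1.0, 0.0, -1.0}, matching the rules at the top of the thread.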
As it turns out, I need the result in a general-purpose register eventually anyway, so Greg, that is a bloody marvellous piece of code :U and I can even leave off the final conversion back to xmm!
I get 8ms (+/-) for 10 million iterations of that, versus 11/20ms for the branching version.
Nightware, your function seems to work well too, but comes in at 23ms for the 10 million iterations. :)
I am on a public browser. Does the MASM32 package have a ceil function?
askm,
Well, there is the crt_ceil C run-time function. It works great, but it's pretty slow when compared to the above methods.
SSE2 (double-precision) version of the code I posted above:
Ceil_SSE2 PROC pIn:PTR REAL8, pOut:PTR REAL8
.DATA
dblMinusOneHalf REAL8 -0.5
.CODE
mov eax, pIn
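movsd xmm0, [eax]               ; xmm0 = x
addsd xmm0, xmm0                ; xmm0 = 2x
movsd xmm1, dblMinusOneHalf     ; xmm1 = -0.5
subsd xmm1, xmm0                ; xmm1 = -0.5 - 2x
cvtsd2si eax, xmm1              ; round to nearest integer
sar eax, 1                      ; floor-divide by 2
neg eax                         ; eax = ceil(x)
cvtsi2sd xmm0, eax              ; back to a double
mov eax, pOut
movsd [eax], xmm0               ; store the result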
ret
Ceil_SSE2 ENDP
[edit]
If you are using ml.exe 6.15, you only need these two macros for SSE2:
MOVSD_ MACRO A, B
DB 0F2H
MOVUPS A, B
ENDM
CMPSD_ MACRO A, B, C
DB 0F2H
CMPPS A, B, C
ENDM
If you are using ml.exe 6.14, you can use the macros here (http://www.masm32.com/board/index.php?topic=973.msg7023#msg7023).