The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: johnsa on July 17, 2008, 10:10:18 PM

Title: Fast SSE Ceil() Function
Post by: johnsa on July 17, 2008, 10:10:18 PM
Hey all,

I've been working on various bits and pieces, and in the process I needed a fast SSE ceil function that works for both positive and negative reals.
So it must follow the rules:
ceil(1.2) = 2.0
ceil(1.0) = 1.0
ceil(-0.3) = 0.0
ceil(-1.2) = -1.0 and so on

So far this is the best I've come up with. Does anyone have ideas on how to improve it while still obeying the above rules?

It assumes the single-precision input is in xmm0.
(In this example I've just loaded a constant into xmm0.)



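; somewhere in the data section
align 16
SingleOne REAL4 1.0

; code: input in xmm0, result in xmm0 (value must fit in a signed 32-bit integer)
movss xmm0,FP4(1.0)

movss xmm6,dword ptr SingleOne
movaps xmm3,xmm0        ; keep the original value
cvttss2si eax,xmm0      ; truncate toward zero
cvtsi2ss xmm0,eax       ; xmm0 = trunc(x) as a single
comiss xmm0,xmm3        ; compare trunc(x) with x
jae short @F            ; trunc(x) >= x: negative or already integral, done
addss xmm0,xmm6         ; positive with a fractional part: add 1.0
@@: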
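
For readers who prefer C, the truncate-then-adjust idea can be sketched with SSE intrinsics. This is only an illustration under stated assumptions: `ceil_trunc_adjust` is a made-up name, the adjustment is decided by comparing the truncated value with the input, the input must fit in a signed 32-bit integer, and the compiler must target x86 and provide `<xmmintrin.h>`.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper (not from the thread): ceil for a single whose
   magnitude fits in a signed 32-bit integer. */
float ceil_trunc_adjust(float x)
{
    __m128 v = _mm_set_ss(x);
    int    i = _mm_cvttss_si32(v);                   /* cvttss2si: truncate toward zero */
    __m128 t = _mm_cvtsi32_ss(_mm_setzero_ps(), i);  /* cvtsi2ss: back to a single */

    /* Truncation already equals ceil for negative and integral inputs.
       Only a positive input with a fractional part leaves t < x. */
    if (_mm_comilt_ss(t, v))
        t = _mm_add_ss(t, _mm_set_ss(1.0f));

    return _mm_cvtss_f32(t);
}
```

With this sketch, `ceil_trunc_adjust(1.2f)` gives 2.0 and `ceil_trunc_adjust(-1.2f)` gives -1.0, matching the rules listed above.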
Title: Re: Fast SSE Ceil() Function
Post by: johnsa on July 18, 2008, 12:05:22 AM


Slight variation, using cmovs instead of branching. This one takes 30 ms for both positive and negative values on my machine (10 million iterations).

The previous version with branches took 11 ms for negative numbers and 20 ms for positive ones (10 million iterations), so even with the branches it's still faster.



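CEIL2 MACRO reg         ; clobbers eax, ebx, edx, xmm3, xmm6
mov ebx,SingleOne       ; bit pattern of 1.0
xor edx,edx             ; bit pattern of 0.0
movaps xmm3,reg         ; keep the original value
cvttss2si eax,reg       ; truncate toward zero
cvtsi2ss reg,eax        ; reg = trunc(x) as a single
comiss reg,xmm3         ; compare trunc(x) with x
cmovae ebx,edx          ; trunc(x) >= x: add 0.0 instead of 1.0
movd xmm6,ebx
addss reg,xmm6
ENDM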

Title: Re: Fast SSE Ceil() Function
Post by: hutch-- on July 18, 2008, 02:33:03 AM
John,

I would be inclined to go with the version that branches anyway. I have almost always found branching versions faster, and the coming generation of Intel quad-cores has improved jump performance, so jumps no longer stall the instruction queue the way they did on the older PIVs.
Title: Re: Fast SSE Ceil() Function
Post by: GregL on July 18, 2008, 03:14:59 AM
johnsa,

This is a little faster on my Pentium D.


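    .DATA
      sngMinusOneHalf REAL4 -0.5
    .CODE
      movss xmm0, FP4(1.2)          ; x
      addss xmm0, xmm0              ; 2x
      movss xmm1, sngMinusOneHalf
      subss xmm1, xmm0              ; -0.5 - 2x
      cvtss2si eax, xmm1            ; round to nearest (default MXCSR rounding mode)
      sar eax, 1                    ; arithmetic shift: floor(eax / 2) = -ceil(x)
      neg eax                       ; eax = ceil(x)
      cvtsi2ss xmm0, eax            ; back to a single in xmm0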
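
A rough C rendering of the same trick may help show why it works. The value -0.5 - 2x is rounded to the nearest integer (ties go to the even one), halved with a floor, and negated. For an integral x the tie always resolves to -2x, and for any x strictly between n-1 and n the rounded value is -2n or -2n+1, both of which halve to -n. The name `ceil_round_trick` is hypothetical, and the sketch assumes the default MXCSR rounding mode, |x| below 2^30 so the doubled value fits in 32 bits, and a compiler where `>>` on a negative `int` is an arithmetic shift (true for GCC, Clang and MSVC, but implementation-defined in C).

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper (not from the thread): integer ceil of a single
   via round-to-nearest, assuming |x| < 2^30. */
int ceil_round_trick(float x)
{
    __m128 twice = _mm_set_ss(x);
    twice = _mm_add_ss(twice, twice);                   /* addss: 2x */
    __m128 v = _mm_sub_ss(_mm_set_ss(-0.5f), twice);    /* subss: -0.5 - 2x */
    int r = _mm_cvtss_si32(v);                          /* cvtss2si: uses the current rounding mode */
    return -(r >> 1);                                   /* sar + neg (assumes arithmetic shift) */
}
```

For example, `ceil_round_trick(1.0f)` rounds -2.5 to -2, shifts to -1 and negates to 1, while `ceil_round_trick(1.2f)` rounds -2.9 to -3, shifts to -2 and negates to 2.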

Title: Re: Fast SSE Ceil() Function
Post by: NightWare on July 18, 2008, 03:19:53 AM
hi,
not tested (just coded now  :P), try this... there's no branch here (so no need to wait for Intel's improvements  :wink):
movss XMM1,FP4(1.0) ; XMM1 = Val
movss XMM2,dword ptr SingleOne ; XMM2 = 1.0
cvttss2si eax,XMM1 ; ) XMM3 = tVal (Val truncated toward zero)
cvtsi2ss XMM3,eax ; )
movaps XMM0,XMM3 ; XMM0 = tVal
subss XMM0,XMM1 ; XMM0 = tVal-Val (negative only when Val is positive with a fraction)
psrad XMM0,31 ; XMM0 = sign mask (all ones or zero)
pand XMM0,XMM2 ; XMM0 = 1.0 or 0.0
addss XMM0,XMM3 ; XMM0 = tVal + (1.0 or 0.0)
If it works, you should be able to do the job in parallel with SSE2...
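
The parallel variant hinted at here might look like the following in C with SSE2 intrinsics, processing four singles at once with the same sign-mask idea. This is a sketch, not tested code from the thread: `ceil4_mask` is a made-up name, and every lane is assumed to fit in a signed 32-bit integer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Hypothetical helper (not from the thread): ceil of four packed singles,
   each of which must fit in a signed 32-bit integer. */
__m128 ceil4_mask(__m128 x)
{
    __m128  t   = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));     /* truncate toward zero, back to float */
    __m128  d   = _mm_sub_ps(t, x);                         /* negative only where t < x */
    __m128i m   = _mm_srai_epi32(_mm_castps_si128(d), 31);  /* psrad: sign bit -> all-ones mask */
    __m128  inc = _mm_and_ps(_mm_castsi128_ps(m), _mm_set1_ps(1.0f)); /* 1.0 or 0.0 per lane */
    return _mm_add_ps(t, inc);
}
```

Loading `{1.2, 1.0, -0.3, -1.2}` with `_mm_loadu_ps`, passing it through `ceil4_mask` and storing the result with `_mm_storeu_ps` gives `{2.0, 1.0, 0.0, -1.0}`.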
Title: Re: Fast SSE Ceil() Function
Post by: johnsa on July 18, 2008, 07:34:24 AM
As it turns out, I need the result in a general-purpose register eventually anyway, so Greg... that is a bloody marvellous piece of code  :U and I can even leave off the last conversion back to xmm!
I get about 8 ms for 10 million iterations of that, versus 11/20 ms for the branching version.
Title: Re: Fast SSE Ceil() Function
Post by: johnsa on July 18, 2008, 07:40:04 AM
NightWare, your function seems to work well too, but comes in at 23 ms for the 10 million iterations. :)
Title: Re: Fast SSE Ceil() Function
Post by: askm on July 18, 2008, 03:40:39 PM
I am on a public browser. Does the MASM32 package have a ceil function?
Title: Re: Fast SSE Ceil() Function
Post by: GregL on July 19, 2008, 12:58:00 AM
askm,

Well, there is the crt_ceil C run-time function. It works great, but it's pretty slow when compared to the above methods.

Title: Re: Fast SSE Ceil() Function
Post by: GregL on February 04, 2009, 03:09:53 AM
SSE2 (double-precision) version of the code I posted above:


Ceil_SSE2 PROC pIn:PTR REAL8, pOut:PTR REAL8
    .DATA
      dblMinusOneHalf REAL8 -0.5
    .CODE
      mov eax, pIn
      movsd xmm0, [eax]
      addsd xmm0, xmm0
      movsd xmm1, dblMinusOneHalf
      subsd xmm1, xmm0
      cvtsd2si eax, xmm1
      sar eax, 1
      neg eax
      cvtsi2sd xmm0, eax
      mov eax, pOut
      movsd [eax], xmm0
      ret
Ceil_SSE2 ENDP


[edit]

If you are using ml.exe 6.15, you only need these two macros for SSE2:

MOVSD_ MACRO A, B
  DB 0F2H
  MOVUPS A, B
ENDM

CMPSD_ MACRO A, B, C
  DB 0F2H
  CMPPS A, B, C
ENDM


If you are using ml.exe 6.14, then you can use the macros here (http://www.masm32.com/board/index.php?topic=973.msg7023#msg7023).