Title: Fast SSE Ceil() Function
Post by: johnsa on July 17, 2008, 10:10:18 PM

Hey all,

I've been working on various bits and pieces, and in the process I needed a fast SSE ceil function that works for both positive and negative reals.
It must follow these rules:
ceil(1.2) = 2.0
ceil(1.0) = 1.0
ceil(-0.3) = 0.0
ceil(-1.2) = -1.0 and so on

So far this is the best I've come up with. Does anyone have any ideas on how to improve it while still obeying the above rules?

It assumes the input single-precision real is in xmm0.
(In this example I've just loaded a constant into xmm0.)

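; somewhere in the data section
align 16
SingleOne REAL4 1.0

	movss xmm0,FP4(1.0)             ; example input

	movss xmm6,dword ptr SingleOne  ; xmm6 = 1.0
	movaps xmm3,xmm0                ; keep a copy of the input
	cvttss2si eax,xmm0              ; eax = input truncated toward zero
	cvtsi2ss xmm0,eax               ; xmm0 = truncated value as a single
	movd eax,xmm3                   ; eax = raw bits of the input
	test eax,80000000h              ; sign bit set?
	jnz short @F                    ; negative: the truncated value is already the ceiling
	test eax,7fffffh                ; any mantissa bits set?
	jz short @F                     ; none: treated as a whole number (true only for powers of two; 3.0 has mantissa bits set)
	addss xmm0,xmm6                 ; otherwise add 1.0 to the truncated value
@@: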
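
One caveat with the mantissa test above: the low 23 bits are all zero only for powers of two, so an input such as 3.0 still takes the addss path and comes out as 4.0, while 0.5 skips it and comes out as 0.0. Below is a minimal sketch of a truncate-then-compare variant, written in C purely for illustration. The function name ceil_trunc_compare is hypothetical (not from the thread), and the sketch assumes the input fits in a 32-bit integer, as cvttss2si requires. In SSE terms the comparison would presumably map to a comiss against the saved input.

```c
#include <assert.h>

/* Ceiling by truncation plus a compare (illustrative model, not the poster's code).
   Assumption: x lies within the 32-bit integer range. */
float ceil_trunc_compare(float x)
{
    assert(x > -2147483648.0f && x < 2147483648.0f);

    int   t = (int)x;     /* truncates toward zero, like cvttss2si */
    float r = (float)t;   /* back to a single, like cvtsi2ss */

    /* Truncation only falls below x for a positive value with a fractional part. */
    if (r < x)
        r += 1.0f;

    return r;
}
```

Calling ceil_trunc_compare on the four values from the rules list, plus 3.0 and 0.5, gives the expected ceilings without looking at the mantissa bits at all.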

Title: Re: Fast SSE Ceil() Function
Post by: johnsa on July 18, 2008, 12:05:22 AM

Slight variation, using cmovs instead of branching. This one takes 30ms for both positive and negative values on my machine (10 million iterations).

The previous version with branches took 11ms for negative numbers and 20ms for positive ones (10 million iterations), so even with the branches it's still faster.

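CEIL2 MACRO reg
	mov ebx,SingleOne       ; ebx = bit pattern of 1.0
	xor edx,edx             ; edx = 0
	movaps xmm3,reg         ; keep a copy of the input
	cvttss2si eax,reg       ; eax = input truncated toward zero
	cvtsi2ss reg,eax        ; reg = truncated value as a single
	movd eax,xmm3           ; eax = raw bits of the input
	test eax,80000000h      ; sign bit set?
	cmovnz ebx,edx          ; negative: add 0 instead of 1.0
	test eax,7fffffh        ; any mantissa bits set?
	cmovz ebx,edx           ; none: add 0 instead of 1.0
	movd xmm6,ebx           ; xmm6 = 1.0 or 0
	addss reg,xmm6          ; reg = truncated value + adjustment
ENDM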

Title: Re: Fast SSE Ceil() Function
Post by: hutch-- on July 18, 2008, 02:33:03 AM

John,

I would be inclined to go with the version that branches anyway, as I have almost always found branching code faster. With the coming generation of Intel quads, the performance of jumps has also been improved, so they do not stall the instruction queue like they did on the older PIVs.

Title: Re: Fast SSE Ceil() Function
Post by: GregL on July 18, 2008, 03:14:59 AM

johnsa,

This is a little faster on my Pentium D.

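    .DATA
      sngMinusOneHalf REAL4 -0.5
    .CODE
      movss xmm0, FP4(1.2)           ; example input x
      addss xmm0, xmm0               ; xmm0 = 2x
      movss xmm1, sngMinusOneHalf    ; xmm1 = -0.5
      subss xmm1, xmm0               ; xmm1 = -0.5 - 2x
      cvtss2si eax, xmm1             ; eax = xmm1 rounded per MXCSR (nearest by default)
      sar eax, 1                     ; eax = floor(eax / 2)
      neg eax                        ; eax = ceil(x)
      cvtsi2ss xmm0, eax             ; xmm0 = ceil(x) as a single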
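
To see why the doubling trick produces a ceiling, it can help to model it in scalar C. The sketch below is only an illustration, not GregL's code: the names round_nearest_even and ceil_doubling are hypothetical, the rounding helper imitates cvtss2si under the default round-to-nearest-even mode, the `>> 1` assumes an arithmetic right shift on signed integers (as sar does, and as mainstream compilers implement it), and the input range check is a conservative assumption so that 2x still fits in 32 bits.

```c
#include <assert.h>

/* Round to nearest, ties to even: models cvtss2si with the default MXCSR setting. */
int round_nearest_even(float v)
{
    int   t    = (int)v;         /* truncate toward zero */
    float frac = v - (float)t;   /* remainder, strictly inside (-1, 1) */

    if (frac >  0.5f || (frac ==  0.5f && (t & 1))) return t + 1;
    if (frac < -0.5f || (frac == -0.5f && (t & 1))) return t - 1;
    return t;
}

/* Model of the sequence: n = round(-0.5 - 2x), then -(n >> 1). */
int ceil_doubling(float x)
{
    assert(x > -1.0e9f && x < 1.0e9f);   /* assumed bound so that 2x fits in an int */

    float v = -0.5f - (x + x);           /* addss, then subss from -0.5 */
    int   n = round_nearest_even(v);     /* cvtss2si */

    return -(n >> 1);                    /* sar eax,1 then neg eax */
}
```

Working through x = 1.0: v = -2.5, which rounds to the even neighbour -2, the shift gives -1, and the negation gives 1. For x = 1.2: v is about -2.9, which rounds to -3, the shift gives -2, and the negation gives 2.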

Title: Re: Fast SSE Ceil() Function
Post by: NightWare on July 18, 2008, 03:19:53 AM

hi,
not tested (just coded now :P), try this... no branch here (so no need to wait for Intel improvements :wink):

		movss XMM1,FP4(1.0)				; XMM1 = Val
		movss XMM2,dword ptr SingleOne	; XMM2 = 1.0
		cvttss2si eax,XMM1				; ) XMM3 = tVal (Val truncated toward zero)
		cvtsi2ss XMM3,eax				; )
		movaps XMM0,XMM3				; XMM0 = tVal
		subss XMM0,XMM1					; XMM0 = tVal - Val (negative only if Val > tVal)
		psrad XMM0,31					; XMM0 = all ones if negative, else 0
		pand XMM0,XMM2					; XMM0 = 1.0 or 0
		addss XMM0,XMM3					; XMM0 = tVal + (1.0 or 0)

If it works, you should be able to do the job in parallel with SSE2...
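
The sign-mask idea can be modelled in portable C to check it against the rules from the first post. This is a hedged sketch rather than NightWare's code: the name ceil_mask is hypothetical, the bit copies through memcpy stand in for treating the register as an integer, and the input is assumed to fit in a 32-bit integer. Because the sequence has no branch, the same steps should carry over to four lanes at once with the packed forms (cvttps2dq, cvtdq2ps, subps, psrad, pand, addps), although that packed version is not tested here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Branchless ceiling: add 1.0 exactly when (truncated - x) is negative. */
float ceil_mask(float x)
{
    assert(x > -2147483648.0f && x < 2147483648.0f);

    const float one = 1.0f;
    float t = (float)(int)x;   /* cvttss2si + cvtsi2ss */
    float d = t - x;           /* negative only when x > t */

    uint32_t d_bits, one_bits;
    memcpy(&d_bits, &d, sizeof d_bits);
    memcpy(&one_bits, &one, sizeof one_bits);

    /* All ones if the sign bit of d is set, else zero (same result as psrad 31). */
    uint32_t mask     = 0u - (d_bits >> 31);
    uint32_t add_bits = mask & one_bits;   /* pand: bit pattern of 1.0 or 0 */

    float add;
    memcpy(&add, &add_bits, sizeof add);
    return t + add;                        /* addss */
}
```

Unlike a test on the mantissa bits, the subtraction is zero for every whole number, so inputs such as 3.0 pass through unchanged.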

Title: Re: Fast SSE Ceil() Function
Post by: johnsa on July 18, 2008, 07:34:24 AM

As it turns out, I need the result in a general register eventually anyway, so Greg... that is a bloody marvellous piece of code :U and I can even leave off the last convert back to xmm!
I get 8ms (+/-) for 10 million iterations of that, vs. 11ms/20ms for the branching version.

Title: Re: Fast SSE Ceil() Function
Post by: johnsa on July 18, 2008, 07:40:04 AM

NightWare, your function seems to work well too, but it comes in at 23ms for the 10 million iterations. :)

Title: Re: Fast SSE Ceil() Function
Post by: askm on July 18, 2008, 03:40:39 PM

I am on a public browser. Does the MASM32 package have a ceil function?

Title: Re: Fast SSE Ceil() Function
Post by: GregL on July 19, 2008, 12:58:00 AM

askm,

Well, there is the crt_ceil C run-time function. It works great, but it's pretty slow when compared to the above methods.

Title: Re: Fast SSE Ceil() Function
Post by: GregL on February 04, 2009, 03:09:53 AM

SSE2 version of the code I posted above

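Ceil_SSE2 PROC pIn:PTR REAL8, pOut:PTR REAL8
    .DATA
      dblMinusOneHalf REAL8 -0.5
    .CODE
      mov eax, pIn
      movsd xmm0, [eax]              ; xmm0 = x
      addsd xmm0, xmm0               ; xmm0 = 2x
      movsd xmm1, dblMinusOneHalf    ; xmm1 = -0.5
      subsd xmm1, xmm0               ; xmm1 = -0.5 - 2x
      cvtsd2si eax, xmm1             ; eax = xmm1 rounded per MXCSR (nearest by default)
      sar eax, 1                     ; eax = floor(eax / 2)
      neg eax                        ; eax = ceil(x)
      cvtsi2sd xmm0, eax             ; xmm0 = ceil(x) as a double
      mov eax, pOut
      movsd [eax], xmm0              ; store the result
      ret
Ceil_SSE2 ENDP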

[edit]

If you are using ml.exe 6.15, you only need these two macros for SSE2:

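MOVSD_ MACRO A, B
  DB 0F2H           ; F2 prefix turns the MOVUPS encoding (0F 10/11) into MOVSD
  MOVUPS A, B
ENDM

CMPSD_ MACRO A, B, C
  DB 0F2H           ; F2 prefix turns the CMPPS encoding (0F C2) into CMPSD
  CMPPS A, B, C
ENDM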

If you are using ml.exe 6.14, then you can use the macros here (http://www.masm32.com/board/index.php?topic=973.msg7023#msg7023).

The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: johnsa on July 17, 2008, 10:10:18 PM