Fast SSE Ceil() Function

Started by johnsa, July 17, 2008, 10:10:18 PM

johnsa

Hey all,

I've been working on various bits and pieces, and in the process I needed a fast SSE ceil function that works for both positive and negative reals.
So it must follow the rules:
ceil(1.2) = 2.0
ceil(1.0) = 1.0
ceil(-0.3) = 0.0
ceil(-1.2) = -1.0 and so on

So far this is the best I've come up with. Does anyone have any ideas how to improve this while still obeying the above rules?

It assumes the input real single will be in xmm0.
(In this example I've just loaded a constant into xmm0.)



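; somewhere in the data section...
align 16
SingleOne REAL4 1.0

; code section: load a test value into xmm0
movss xmm0,FP4(1.0)

movss xmm6,dword ptr SingleOne ; xmm6 = 1.0
movaps xmm3,xmm0               ; keep the original value
cvttss2si eax,xmm0             ; truncate toward zero
cvtsi2ss xmm0,eax              ; xmm0 = truncated value as a real
comiss xmm0,xmm3               ; compare truncated value with original
jae short @F                   ; not below: negative or already whole
addss xmm0,xmm6                ; truncation rounded down, so add 1.0
@@: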
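
For reference, here is the truncate-then-adjust idea as a minimal C sketch. The function name ceil_trunc_adjust is made up for illustration, and the sketch assumes the input fits in a 32-bit integer. Note that a whole number such as 3.0 has nonzero mantissa bits (bit pattern 40400000h), so comparing the truncated value against the original is the safer test for "has a fractional part".

```c
#include <stdint.h>

/* Hypothetical helper: ceil via truncation plus a conditional +1.0.
   Assumes |x| < 2^31 so the float-to-int conversion is defined. */
static float ceil_trunc_adjust(float x)
{
    int32_t i = (int32_t)x;   /* truncates toward zero, like cvttss2si */
    float   t = (float)i;     /* back to a real, like cvtsi2ss */
    if (t < x)                /* true only for positive, non-whole x */
        t += 1.0f;
    return t;
}
```

Negative inputs never take the +1.0 path, because truncation toward zero already moves them up to the ceiling.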

johnsa



Slight variation, using cmov instead of branching. This one takes 30 ms for both positive and negative values on my machine (10 million iterations).

The previous version with branches took 11 ms for negative numbers and 20 ms for positive (10 million iterations), so even with the branches it's still faster.



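CEIL2 MACRO reg
mov ebx,dword ptr SingleOne ; ebx = bit pattern of 1.0
xor edx,edx                 ; edx = bit pattern of 0.0
movaps xmm3,reg             ; keep the original value
cvttss2si eax,reg           ; truncate toward zero
cvtsi2ss reg,eax            ; reg = truncated value as a real
comiss reg,xmm3             ; compare truncated value with original
cmovae ebx,edx              ; not below: add 0.0 instead of 1.0
movd xmm6,ebx
addss reg,xmm6
ENDM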


hutch--

John,

I would be inclined to go with the version that branches anyway, as I have almost always found branching code faster. With the coming generation of Intel quads, jump performance has improved, so jumps do not stall the instruction queue the way they did on the older PIVs.

GregL

johnsa,

This is a little faster on my Pentium D.


    .DATA
      sngMinusOneHalf REAL4 -0.5
    .CODE
      movss xmm0, FP4(1.2)
      addss xmm0, xmm0
      movss xmm1, sngMinusOneHalf
      subss xmm1, xmm0
      cvtss2si eax, xmm1
      sar eax, 1
      neg eax
      cvtsi2ss xmm0, eax
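
For anyone wondering why this works: ceil(x) = -floor(-x), and floor(y) can be obtained as round(2y - 0.5) shifted right by one, provided the rounding is to nearest with ties to even (the default MXCSR mode, which cvtss2si follows). Below is a minimal C model of that reasoning, as a sketch only. The names round_half_even and ceil_double_round are made up for illustration. The helper stands in for cvtss2si, and the sketch assumes |x| is small enough (conservatively below 2^22) that 2x + 0.5 is exact in single precision.

```c
#include <stdint.h>

/* Hypothetical stand-in for cvtss2si under the default rounding mode:
   round to nearest, ties to even. Intended for small |v| only. */
static int32_t round_half_even(float v)
{
    int32_t t = (int32_t)v;       /* truncate toward zero */
    float frac = v - (float)t;    /* signed fractional part */
    if (frac > 0.5f || (frac == 0.5f && (t & 1)))
        return t + 1;
    if (frac < -0.5f || (frac == -0.5f && (t & 1)))
        return t - 1;
    return t;
}

/* Hypothetical model of the sequence above; returns the ceiling as an integer. */
static int32_t ceil_double_round(float x)
{
    float two_x = x + x;                          /* addss xmm0, xmm0 */
    int32_t r = round_half_even(-0.5f - two_x);   /* subss, then cvtss2si */
    r = (r - (r & 1)) / 2;                        /* floor division by 2, like sar eax, 1 */
    return -r;                                    /* neg eax */
}
```

The ties-to-even rule is what makes whole inputs come out right: for x = 1.0 the intermediate value is -2.5, which rounds to -2 rather than -3, so the result is 1.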


NightWare

Hi,
Not tested (just coded now  :P), but try this. No branch here (so no need to wait for Intel improvements  :wink):
movss XMM1,FP4(1.0) ; XMM1 = Val
movss XMM2,dword ptr SingleOne ; XMM2 = 1
cvttss2si eax,XMM1 ; ) XMM3 = tVal
cvtsi2ss XMM3,eax ; )
movaps XMM0,XMM3 ; XMM0 = tVal
subss XMM0,XMM1 ; XMM0 = -t
psrad XMM0,31 ; XMM0 = cc-t
pand XMM0,XMM2 ; XMM0 = cc1
addss XMM0,XMM3 ; XMM0 = tVal+cc1
If it works, you should be able to do the job in parallel with SSE2...
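
The mask trick above can be modelled in C roughly as follows. This is only a sketch under stated assumptions: the name ceil_sign_mask is made up, the input is assumed to fit in a 32-bit integer, and float is assumed to be IEEE 754 single precision. The sign bit of (truncated - value) is smeared into a full mask, which then selects either 1.0 or 0.0 as the correction.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the branchless sign-mask approach. */
static float ceil_sign_mask(float x)
{
    const float one = 1.0f;
    float t = (float)(int32_t)x;   /* cvttss2si + cvtsi2ss: truncated value */
    float d = t - x;               /* negative only when t < x */
    uint32_t d_bits, one_bits, sel_bits;
    float sel;

    memcpy(&d_bits, &d, sizeof d_bits);
    memcpy(&one_bits, &one, sizeof one_bits);

    /* All ones when the sign bit of d is set, else zero (like psrad 31). */
    uint32_t mask = 0u - (d_bits >> 31);
    sel_bits = mask & one_bits;    /* pand: bits of 1.0 or of 0.0 */

    memcpy(&sel, &sel_bits, sizeof sel);
    return t + sel;                /* addss: truncated value + correction */
}
```

When the value is already whole, d is +0.0, the mask is zero, and nothing is added.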

johnsa

As it turns out I need the result in general reg eventually anyway, so Greg.... that is a bloody marvellous piece of code  :U and I can even leave off the last convert back to xmm!
I get about 8 ms for 10 million iterations of that, versus 11/20 ms for the branching version.

johnsa

NightWare, your function seems to work well too, but comes in at 23 ms for the 10 million iterations. :)

askm

I am on a public browser. Does the MASM32 package have a ceil function?

GregL

askm,

Well, there is the crt_ceil C run-time function. It works great, but it's pretty slow when compared to the above methods.


GregL

SSE2 version of the code I posted above


Ceil_SSE2 PROC pIn:PTR REAL8, pOut:PTR REAL8
    .DATA
      dblMinusOneHalf REAL8 -0.5
    .CODE
      mov eax, pIn
      movsd xmm0, [eax]
      addsd xmm0, xmm0
      movsd xmm1, dblMinusOneHalf
      subsd xmm1, xmm0
      cvtsd2si eax, xmm1
      sar eax, 1
      neg eax
      cvtsi2sd xmm0, eax
      mov eax, pOut
      movsd [eax], xmm0
      ret
Ceil_SSE2 ENDP


[edit]

If you are using ml.exe 6.15 you only need these two macros for SSE2

MOVSD_ MACRO A, B
  DB 0F2H
  MOVUPS A, B
ENDM

CMPSD_ MACRO A, B, C
  DB 0F2H
  CMPPS A, B, C
ENDM


If you are using ml.exe 6.14 then you can use the macros here.