Title: Fast SSE floor() function
Post by: GregL on July 18, 2008, 05:02:20 AM

To go along with the fast SSE ceil() function in the other post.

floor()


    .DATA
      sngMinusOneHalf REAL4 -0.5
    .CODE
      movss xmm0, FP4(1.2)
      addss xmm0, xmm0
      addss xmm0, sngMinusOneHalf
      cvtss2si eax, xmm0
      sar eax, 1
      cvtsi2ss xmm0, eax
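
For anyone who wants to check the arithmetic outside an assembler, here is a minimal C sketch of the same steps: double the value, add -0.5, round to nearest, then shift right by one. The name `floor_via_doubling` is made up for this note. The add and subtract of 1.5 * 2^23 is only a stand-in for what cvtss2si does in the default round-to-nearest mode, and it limits the sketch to small inputs; it is not part of the original code.

```c
#include <assert.h>

/* Hypothetical C model of: addss x,x / addss -0.5 / cvtss2si / sar 1. */
static int floor_via_doubling(float x)
{
    /* The rounding stand-in below is only valid for small magnitudes. */
    assert(x > -1048576.0f && x < 1048576.0f);

    volatile float y = x + x;             /* addss xmm0, xmm0            */
    y = y - 0.5f;                         /* addss xmm0, sngMinusOneHalf */

    /* Stand-in for cvtss2si (round to nearest, ties to even): near
       1.5 * 2^23 a float has no fraction bits left, so the addition
       itself rounds y to a whole number. */
    volatile float t = y + 12582912.0f;
    int r = (int)(t - 12582912.0f);

    /* sar eax, 1: assumes the compiler shifts negative ints arithmetically,
       which mainstream compilers do. */
    return r >> 1;
}
```

For 1.2 the intermediate values are 2.4, 1.9 and 2, so the result is 1. For -1.2 they are -2.4, -2.9 and -3, and the arithmetic shift gives -2.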

Title: Re: Fast SSE floor() function
Post by: chrisw on January 30, 2009, 01:39:26 PM

in my programs, i do use a version of floor(), that is somewhat faster on my system:

cvttss2si    ebx,    [float_value]
mov          eax,    [float_value]
shr          eax,    31
sub          ebx,    eax
cvtsi2ss     xmm0,   ebx

and in the packed version to process large amounts of data:

movaps       xmm0,   [float_value]
cvttps2dq    xmm1,   xmm0
psrld        xmm0,   31
psubd        xmm1,   xmm0
cvtdq2ps     xmm0,   xmm1

Both versions use faster integer arithmetic instead of float arithmetic, and they allow the conversion to integer and the shift instruction to execute concurrently.
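
The two snippets above can be modelled in portable C roughly like this. It is a sketch for checking values, not a replacement for the SSE code: the names `floor_trunc_sign` and `floor_trunc_sign_array` are invented here, the loop handles one lane at a time where the packed code handles four, and a 32-bit IEEE-754 float is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: truncate, then subtract the sign bit. */
static int32_t floor_trunc_sign(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);      /* assumes a 32-bit IEEE-754 float */
    memcpy(&bits, &x, sizeof bits);       /* mov eax, [float_value]          */

    /* cvttss2si: truncate toward zero. The value must fit in 32 bits. */
    int32_t t = (int32_t)x;

    return t - (int32_t)(bits >> 31);     /* shr eax, 31 / sub ebx, eax      */
}

/* Array form: the same steps per element, converted back to float at the end. */
static void floor_trunc_sign_array(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (float)floor_trunc_sign(in[i]);   /* cvtdq2ps */
}
```

One thing the model makes easy to see: the sign bit is subtracted unconditionally, so a negative input that is already a whole number comes out one too low (-2.0 gives -3, and -0.0 gives -1). If such inputs can occur in your data, that case needs separate handling.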

Title: Re: Fast SSE floor() function
Post by: GregL on February 01, 2009, 01:08:58 AM

Hey, that seems to work just fine. I just did some quick tests on the scalar version and I didn't test for speed on my system. What is your system?

It also works with cvtss2si instead of cvttss2si. Any reason for using cvttss2si?

Title: Re: Fast SSE floor() function
Post by: chrisw on February 03, 2009, 09:45:01 AM

Hi Greg,

i'm using Opteron K8 processors for testing the performance of my programs. In the scalar version of the algorithm, i loaded the eax register from the memory, while it could be loaded from ebx with an additional RAW dependency, too. On my AMD processor, the additional memory access is faster, but i did not test this with an Intel processor, since my target system is an Opteron K10 anyway.

But: the scalar routine (as well as the packed routine) does not work correctly with cvtss2si.

Suppose the value -1.6f as input and suppose further the round control bits of your processor to be set to 'round to nearest', which is the default value for those bits.

After conversion with rounding, the integer register ebx holds the value -2, while after conversion with truncation it holds the value -1.

In the next step of the algorithm, an additional 1 is subtracted for the negative sign of the float value. Now the rounded result is -3, while the truncated result is -2, which is correct here.
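
The -1.6 example can be replayed in C. This is only a sketch: all four function names are made up, and `convert_nearest` is a rough stand-in for cvtss2si that rounds exact .5 ties away from zero (the real instruction rounds ties to even), so it should only be trusted for non-tie inputs like the ones discussed here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of cvttss2si: drop the fraction (truncate toward zero). */
static int convert_truncate(float x)
{
    return (int)x;
}

/* Rough stand-in for cvtss2si in round-to-nearest mode (non-tie inputs only). */
static int convert_nearest(float x)
{
    return (int)(x < 0.0f ? x - 0.5f : x + 0.5f);
}

/* Model of shr eax, 31: 1 for negative inputs, 0 otherwise. */
static int sign_bit(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);      /* assumes a 32-bit IEEE-754 float */
    memcpy(&bits, &x, sizeof bits);
    return (int)(bits >> 31);
}

/* Convert with the chosen instruction model, then subtract the sign bit. */
static int floor_with(int (*convert)(float), float x)
{
    return convert(x) - sign_bit(x);
}
```

`floor_with(convert_truncate, -1.6f)` gives -2 and `floor_with(convert_nearest, -1.6f)` gives -3, matching the walk-through. Inputs whose fraction is below one half, such as 1.2 or -1.2, give the same answer on both paths, which may be why a quick test does not show the difference.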

Title: Re: Fast SSE floor() function
Post by: GregL on February 04, 2009, 02:07:38 AM

chrisw,

Ah ha!, now I see why you used cvttss2si.

Title: Re: Fast SSE floor() function
Post by: GregL on February 04, 2009, 02:46:35 AM

Here's the SSE2 version I wrote

Floor_SSE2 PROC pIn:PTR REAL8, pOut:PTR REAL8
    .DATA
      dblMinusOneHalf REAL8 -0.5
    .CODE
      mov eax, pIn
      movsd xmm0, [eax]
      addsd xmm0, xmm0
      addsd xmm0, dblMinusOneHalf
      cvtsd2si eax, xmm0
      sar eax, 1
      cvtsi2sd xmm0, eax
      mov eax, pOut
      movsd [eax], xmm0
      ret
Floor_SSE2 ENDP

[edit]

If you are using ml.exe 6.15 you only need these two macros for SSE2

MOVSD_ MACRO A, B
  DB 0F2H
  MOVUPS A, B
ENDM

CMPSD_ MACRO A, B, C
  DB 0F2H
  CMPPS A, B, C
ENDM

If you are using ml.exe 6.14 then you can use the macros here (http://www.masm32.com/board/index.php?topic=973.msg7023#msg7023).

Title: Re: Fast SSE floor() function
Post by: jj2007 on February 06, 2009, 03:39:06 PM

Quote from: chrisw on January 30, 2009, 01:39:26 PM
in my programs, i do use a version of floor(), that is somewhat faster on my system:

cvttss2si    ebx,    [float_value]
mov          eax,    [float_value]
shr          eax,    31
sub          ebx,    eax
cvtsi2ss     xmm0,   ebx

Very cute, Chris. I will grab it for my private library :wink

One question - I am a bloody beginner for SSE: What is the purpose of the cvtsi2ss instruction below? To allow further processing in an XMM register? It works just fine without... :dazzled:

; Usage:
; invoke FloorCW, MyReal4
; print str$(eax)

FloorCW proc float_value:REAL4 ; credits to chrisw (http://www.masm32.com/board/index.php?topic=9515.msg78719#msg78719)
cvttss2si eax, float_value
mov ecx, float_value
shr ecx, 31
sub eax, ecx
; cvtsi2ss xmm0, eax ; purpose??
ret
FloorCW endp

Title: Re: Fast SSE floor() function
Post by: qWord on February 06, 2009, 05:10:40 PM

Quote from: jj2007 on February 06, 2009, 03:39:06 PM
What is the purpose of the cvtsi2ss instruction below?

-> page 73: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26568.pdf

Title: Re: Fast SSE floor() function
Post by: chrisw on February 06, 2009, 05:11:29 PM

Hi jj,

nice that you like this :-)

And yes, the very last conversion keeps the floor as float in an xmm for further processing. I do use the code in a routine to calculate exp(x) as fast as possible on very large data sets and i simply copied the code snippet from there.

Title: Re: Fast SSE floor() function
Post by: GregL on February 06, 2009, 08:13:18 PM

chrisw,

Yes, your SSE version is clever. I'm going to save it too.

Title: Re: Fast SSE floor() function
Post by: jj2007 on February 06, 2009, 09:00:30 PM

Thanxalot, Chris - you inspired me to look up the double precision equivalent. I am not sure if my logic is OK, so no warranties ;-)

The macro accepts REAL4 and REAL8.
EDIT: And it makes a difference:

MyReal4 REAL4 0.99999999
MyReal8 REAL8 0.99999999

Output:
Real8= 0.999999990 floor= 0
Real4= 1.00000000 floor= 1
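
The difference comes from the literal, not from the floor code. Single precision keeps 24 significant bits, so the largest REAL4 below 1 is about 0.99999994, and 0.99999999 is closer to 1.0 than to that value. A small C check, using the same two literals (the C names are made up for this note):

```c
#include <assert.h>
#include <stdio.h>

/* Same literals as MyReal4 and MyReal8; the identifiers are hypothetical. */
static const float  my_real4 = 0.99999999f;
static const double my_real8 = 0.99999999;

/* Check and print what each variable actually holds. */
static void show_literals(void)
{
    assert(my_real4 == 1.0f);   /* the REAL4 literal was rounded up to 1.0 */
    assert(my_real8 <  1.0);    /* the REAL8 literal stays below 1.0       */

    printf("REAL4 holds %.9f\n", (double)my_real4);
    printf("REAL8 holds %.9f\n", my_real8);
}
```

So floor= 1 for the REAL4 is the correct floor of the value that is really stored.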

qword: thanks for the link, extremely valuable.

Usage:
print str$(FLOOR(MyReal4))
print str$(FLOOR(MyReal8))
mov MyInt32, FLOOR(MyReal8)
etc

.xmm

Floor4	PROTO: REAL4
Floor8	PROTO: REAL8
MbCeil	PROTO: REAL4

FLOOR MACRO arg
  if SIZEOF arg eq 8
	invoke Floor8, arg
  else
	invoke Floor4, arg
  endif
  EXITM <eax>
ENDM

.code
Floor8     proc float_value:REAL8	; credits to chrisw (http://www.masm32.com/board/index.php?topic=9515.msg78719#msg78719)
  cvttsd2si eax, float_value
  mov ecx, dword ptr float_value+4
  shr ecx, 31
  sub eax, ecx
  ; cvtsi2sd xmm0, eax		; if needed for further processing
  ret
Floor8    endp

Floor4     proc float_value:REAL4	; credits to chrisw (http://www.masm32.com/board/index.php?topic=9515.msg78719#msg78719)
  cvttss2si eax, float_value
  mov ecx, float_value
  shr ecx, 31
  sub eax, ecx
  ; cvtsi2ss xmm0, eax		; if needed for further processing
  ret
Floor4    endp
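
The only new detail in Floor8 is where the sign bit lives: on little-endian x86 the upper half of a REAL8 sits at offset +4, and its top bit is the sign. A minimal C model of that step, assuming a 64-bit IEEE-754 double and a result that fits in 32 bits (the name `floor8_model` is invented for this note):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of Floor8. */
static int32_t floor8_model(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);          /* assumes a 64-bit IEEE-754 double */
    memcpy(&bits, &x, sizeof bits);

    uint32_t high = (uint32_t)(bits >> 32);   /* mov ecx, dword ptr float_value+4 */
    int32_t  t    = (int32_t)x;               /* cvttsd2si eax, float_value       */

    return t - (int32_t)(high >> 31);         /* shr ecx, 31 / sub eax, ecx       */
}
```

With the values used in this thread, 12.34567890123456 gives 12, -12.345678901234 gives -13 and 0.99999999 gives 0.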

Title: Re: Fast SSE floor() function
Post by: jj2007 on February 06, 2009, 10:30:54 PM

Time for a game, folks??

Floor timings:

MySingle1       REAL4 12.34567890123456 floor= 12
MyDouble1       REAL8 12.34567890123456 floor= 12
MySingle2       REAL4 -12.345678901234  floor= -13
MyDouble2       REAL8 -12.345678901234  floor= -13
MySingle3       REAL4 0.99999999        floor= 1        (sic!)
MyDouble3       REAL8 0.99999999        floor= 0

84      cycles for Floor8
29      cycles for Floor4
32      cycles for Floor4a
33      cycles for Floor4b
35      cycles for Floor4c

85      cycles for Floor8
29      cycles for Floor4
31      cycles for Floor4a
33      cycles for Floor4b
34      cycles for Floor4c

84      cycles for Floor8
28      cycles for Floor4
31      cycles for Floor4a
33      cycles for Floor4b
35      cycles for Floor4c

84      cycles for Floor8
28      cycles for Floor4
32      cycles for Floor4a
33      cycles for Floor4b
34      cycles for Floor4c

Celeron M (Core 2) on XP SP2

[attachment deleted by admin]

Title: Re: Fast SSE floor() function
Post by: herge on March 12, 2009, 01:38:57 AM

Hi jj2007:

And results for floor.exe

Floor timings:

MySingle1	REAL4 12.34567890123456	floor= 12
MyDouble1	REAL8 12.34567890123456	floor= 12
MySingle2	REAL4 -12.345678901234	floor= -13
MyDouble2	REAL8 -12.345678901234	floor= -13
MySingle3	REAL4 0.99999999	floor= 1	(sic!)
MyDouble3	REAL8 0.99999999	floor= 0

96	cycles for Floor8
20	cycles for Floor4
23	cycles for Floor4a
21	cycles for Floor4b
24	cycles for Floor4c

97	cycles for Floor8
20	cycles for Floor4
22	cycles for Floor4a
20	cycles for Floor4b
23	cycles for Floor4c

97	cycles for Floor8
20	cycles for Floor4
24	cycles for Floor4a
21	cycles for Floor4b
24	cycles for Floor4c

97	cycles for Floor8
21	cycles for Floor4
22	cycles for Floor4a
20	cycles for Floor4b
22	cycles for Floor4c

Regards herge

Title: Re: Fast SSE floor() function
Post by: jj2007 on March 12, 2009, 02:38:27 PM

Quote from: herge on March 12, 2009, 01:38:57 AM
Hi jj2007:

And results for floor.exe

Thanks, herge, you are a real sportsman :U

Title: Re: Fast SSE floor() function
Post by: Mark Jones on March 12, 2009, 03:27:13 PM

Let's see if I get a "sporty" reply... :P

Quote from: AMD x64 4000+ / WinXP x32 SP3
Floor timings:

MySingle1 REAL4 12.34567890123456 floor= 12
MyDouble1 REAL8 12.34567890123456 floor= 12
MySingle2 REAL4 -12.345678901234 floor= -13
MyDouble2 REAL8 -12.345678901234 floor= -13
MySingle3 REAL4 0.99999999 floor= 1 (sic!)
MyDouble3 REAL8 0.99999999 floor= 0

137 cycles for Floor8
42 cycles for Floor4
42 cycles for Floor4a
32 cycles for Floor4b
32 cycles for Floor4c

137 cycles for Floor8
42 cycles for Floor4
42 cycles for Floor4a
32 cycles for Floor4b
32 cycles for Floor4c

138 cycles for Floor8
42 cycles for Floor4
42 cycles for Floor4a
32 cycles for Floor4b
32 cycles for Floor4c

138 cycles for Floor8
42 cycles for Floor4
42 cycles for Floor4a
32 cycles for Floor4b
37 cycles for Floor4c

Title: Re: Fast SSE floor() function
Post by: Neil on March 12, 2009, 03:55:54 PM

Intel Quad core 9550

Floor timings:

MySingle1 REAL4 12.34567890123456 floor= 12
MyDouble1 REAL8 12.34567890123456 floor= 12
MySingle2 REAL4 -12.345678901234 floor= -13
MyDouble2 REAL8 -12.345678901234 floor= -13
MySingle3 REAL4 0.99999999 floor= 1 (sic!)
MyDouble3 REAL8 0.99999999 floor= 0

95 cycles for Floor8
20 cycles for Floor4
23 cycles for Floor4a
20 cycles for Floor4b
23 cycles for Floor4c

95 cycles for Floor8
20 cycles for Floor4
22 cycles for Floor4a
20 cycles for Floor4b
22 cycles for Floor4c

96 cycles for Floor8
20 cycles for Floor4
23 cycles for Floor4a
20 cycles for Floor4b
24 cycles for Floor4c

96 cycles for Floor8
20 cycles for Floor4
22 cycles for Floor4a
20 cycles for Floor4b
22 cycles for Floor4c

Title: Re: Fast SSE floor() function
Post by: jj2007 on March 12, 2009, 04:08:49 PM

Quote from: Mark Jones on March 12, 2009, 03:27:13 PM
Let's see if I get a "sporty" reply... :P

4b seems to perform well, see also herge's and Neil's timings. Now one of the other sportsmen will probably cry foul because a fast floor() is a waste of the forum's bandwidth... :bg

(to be honest, I have never used floor() in any real programming, but I guess there are applications where huge chunks of data need to be processed fast enough)

Title: Re: Fast SSE floor() function
Post by: MichaelW on April 05, 2009, 08:14:37 AM

This procedure leaves the return value on the FPU stack, and runs in 50 cycles on a P3:

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4
_floor8 proc double:REAL8
    fld FP8(2.0)
    fmul QWORD PTR [esp+4]
    fadd FP8(-0.5)
    sub esp, 8
    fistp QWORD PTR [esp]
    sar DWORD PTR [esp+4], 1    ; arithmetic shift keeps the sign of the 64-bit integer
    rcr DWORD PTR [esp], 1
    fild QWORD PTR [esp]
    add esp, 8
    ret 8
_floor8 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: GregL on July 18, 2008, 05:02:20 AM