Fast SSE floor() function

Started by GregL, July 18, 2008, 05:02:20 AM

GregL

To go along with the fast SSE ceil() function in the other post.

floor()

    .DATA
      sngMinusOneHalf REAL4 -0.5
    .CODE
      movss xmm0, FP4(1.2)
      addss xmm0, xmm0
      addss xmm0, sngMinusOneHalf
      cvtss2si eax, xmm0
      sar eax, 1
      cvtsi2ss xmm0, eax
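
A minimal C sketch of the arithmetic in the snippet above, for checking it outside MASM. It is not from the original post: the function name `floor_2x_half` is made up, and it assumes an x86 compiler that ships `<xmmintrin.h>`, the default round-to-nearest-even mode in MXCSR, and inputs small enough that 2x - 0.5 fits in 32 bits.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86/x64 compiler */

/* floor(x) computed as round_to_nearest(2x - 0.5) shifted right by one,
   mirroring the MASM snippet step by step (hypothetical helper name). */
static int floor_2x_half(float x)
{
    __m128 v = _mm_set_ss(x);              /* movss    xmm0, x                */
    v = _mm_add_ss(v, v);                  /* addss    xmm0, xmm0   -> 2x     */
    v = _mm_add_ss(v, _mm_set_ss(-0.5f));  /* addss    xmm0, -0.5   -> 2x-0.5 */
    int r = _mm_cvtss_si32(v);             /* cvtss2si eax, xmm0    (rounds)  */
    return (r - (r & 1)) / 2;              /* same result as sar eax, 1       */
}
```

With ties rounding to even, a whole number n gives 2n - 0.5, which rounds to 2n, so the shift returns n itself.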



chrisw

In my programs I use a version of floor() that is somewhat faster on my system:


cvttss2si    ebx,    [float_value]
mov          eax,    [float_value]
shr          eax,    31
sub          ebx,    eax
cvtsi2ss     xmm0,   ebx


and in the packed version to process large amounts of data:

movaps       xmm0,   [float_value]
cvttps2dq    xmm1,   xmm0
psrld        xmm0,   31
psubd        xmm1,   xmm0
cvtdq2ps     xmm0,   xmm1


Both versions use faster integer arithmetic instead of float arithmetic, and they allow the conversion to integer and the shift instruction to execute concurrently.
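
A minimal C sketch of the two versions above (truncate, then subtract the sign bit), for trying the idea outside MASM. It is not from the original post: the function names are made up, and it assumes an x86 compiler with SSE2 intrinsics in `<emmintrin.h>` and inputs that fit in a 32-bit integer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE ones too) */
#include <stdint.h>
#include <string.h>

/* Scalar version: truncate, then subtract the sign bit of the float. */
static int32_t floor_trunc_sign(float x)
{
    int32_t  t = _mm_cvttss_si32(_mm_set_ss(x));  /* cvttss2si ebx, [float_value] */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);               /* mov eax, [float_value]       */
    return t - (int32_t)(bits >> 31);             /* shr eax, 31 / sub ebx, eax   */
}

/* Packed version: the same steps on four floats at once. */
static void floor_trunc_sign_x4(const float in[4], float out[4])
{
    __m128  v = _mm_loadu_ps(in);                         /* unaligned load; the post uses movaps */
    __m128i t = _mm_cvttps_epi32(v);                      /* cvttps2dq xmm1, xmm0 */
    __m128i s = _mm_srli_epi32(_mm_castps_si128(v), 31);  /* psrld     xmm0, 31   */
    __m128i r = _mm_sub_epi32(t, s);                      /* psubd     xmm1, xmm0 */
    _mm_storeu_ps(out, _mm_cvtepi32_ps(r));               /* convert back to float */
}
```

One thing the sketch makes easy to see: for an input that is already a negative whole number (for example -2.0), or for -0.0, these steps return one less than floor() would, because the sign bit is subtracted even though truncation changed nothing.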

GregL

Hey, that seems to work just fine. I just did some quick tests on the scalar version and I didn't test for speed on my system. What is your system?

It also works with cvtss2si instead of cvttss2si. Any reason for using cvttss2si?


chrisw

Hi Greg,

I'm using Opteron K8 processors to test the performance of my programs. In the scalar version of the algorithm I load the eax register from memory, although it could also be loaded from ebx at the cost of an additional RAW dependency. On my AMD processor the additional memory access is faster, but I have not tested this on an Intel processor, since my target system is an Opteron K10 anyway.

But: the scalar routine (as well as the packed routine) does not work correctly with cvtss2si.

Take -1.6f as input, and suppose the rounding-control bits of your processor are set to 'round to nearest', which is their default value.

After the rounding conversion (cvtss2si), the integer register ebx holds -2; after the truncating conversion (cvttss2si), it holds -1.

In the next step of the algorithm, an additional 1 is subtracted because the float value is negative. The rounded result is now -3, while the truncated result is -2, which is the correct floor.
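
The -1.6 example can be reproduced with a minimal C sketch. It is not from the original post: the helper names are made up, and it assumes an x86 compiler with `<xmmintrin.h>` and the default round-to-nearest mode.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86/x64 compiler */

/* cvtss2si: converts using the current MXCSR rounding mode (nearest by default). */
static int convert_rounded(float x)   { return _mm_cvtss_si32(_mm_set_ss(x)); }

/* cvttss2si: always truncates toward zero, whatever the rounding mode. */
static int convert_truncated(float x) { return _mm_cvttss_si32(_mm_set_ss(x)); }

/* The "subtract 1 for a negative input" step, applied to either conversion.
   (x < 0.0f) stands in for the sign-bit shift; the two differ only for -0.0. */
static int minus_one_if_negative(int converted, float x)
{
    return converted - (x < 0.0f);
}
```

For -1.6f, `convert_rounded` gives -2 and `convert_truncated` gives -1, so after the subtraction the rounded path ends at -3 and the truncated path at -2.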

GregL

chrisw,

Aha, now I see why you used cvttss2si.


GregL

Here's the SSE2 version I wrote


Floor_SSE2 PROC pIn:PTR REAL8, pOut:PTR REAL8
    .DATA
      dblMinusOneHalf REAL8 -0.5
    .CODE
      mov eax, pIn
      movsd xmm0, [eax]
      addsd xmm0, xmm0
      addsd xmm0, dblMinusOneHalf
      cvtsd2si eax, xmm0
      sar eax, 1
      cvtsi2sd xmm0, eax
      mov eax, pOut
      movsd [eax], xmm0
      ret
Floor_SSE2 ENDP


[edit]

If you are using ml.exe 6.15 you only need these two macros for SSE2

MOVSD_ MACRO A, B
  DB 0F2H
  MOVUPS A, B
ENDM

CMPSD_ MACRO A, B, C
  DB 0F2H
  CMPPS A, B, C
ENDM


If you are using ml.exe 6.14 then you can use the macros here.


jj2007

Quote from: chrisw on January 30, 2009, 01:39:26 PM
In my programs I use a version of floor() that is somewhat faster on my system:


cvttss2si    ebx,    [float_value]
mov          eax,    [float_value]
shr          eax,    31
sub          ebx,    eax
cvtsi2ss     xmm0,   ebx


Very cute, Chris. I will grab it for my private library  :wink

One question (I am a bloody beginner at SSE): what is the purpose of the cvtsi2ss instruction below? To allow further processing in an XMM register? It works just fine without... :dazzled:

; Usage:
; invoke FloorCW, MyReal4
; print str$(eax)

FloorCW proc float_value:REAL4   ; credits to chrisw
  cvttss2si eax, float_value
  mov ecx, float_value
  shr ecx, 31
  sub eax, ecx
  ; cvtsi2ss xmm0, eax      ; purpose??
  ret
FloorCW endp


chrisw

Hi jj,

nice that you like this :-)

And yes, the very last conversion keeps the floor as a float in an XMM register for further processing. I use the code in a routine that calculates exp(x) as fast as possible on very large data sets, and I simply copied the code snippet from there.

GregL

chrisw,

Yes, your SSE version is clever. I'm going to save it too.


jj2007

Thanxalot, Chris - you inspired me to look up the double precision equivalent. I am not sure if my logic is OK, so no warranties ;-)

The macro accepts REAL4 and REAL8.
EDIT: And it makes a difference:

MyReal4   REAL4 0.99999999
MyReal8   REAL8 0.99999999

Output:
Real8= 0.999999990       floor= 0
Real4= 1.00000000        floor= 1
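
The difference comes from the REAL4 constant itself: single-precision values just below 1.0 are spaced about 6e-8 apart, so 0.99999999 is stored as exactly 1.0. A minimal C sketch of that (not from the thread; the function names are made up, and IEEE 754 `float` and `double` are assumed):

```c
/* 0.99999999 is not representable as a 32-bit float; the nearest float is 1.0. */
static int real4_is_one(void)   { float  f = 0.99999999f; return f == 1.0f; }

/* Truncating the stored values reproduces the two floor results shown above. */
static int truncate_real4(void) { float  f = 0.99999999f; return (int)f; }
static int truncate_real8(void) { double d = 0.99999999;  return (int)d; }
```

`truncate_real4()` returns 1 and `truncate_real8()` returns 0, matching the output above.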

qword: thanks for the link, extremely valuable.

Usage:
print str$(FLOOR(MyReal4))
print str$(FLOOR(MyReal8))
mov MyInt32, FLOOR(MyReal8)
etc

.xmm

Floor4 PROTO: REAL4
Floor8 PROTO: REAL8
MbCeil PROTO: REAL4

FLOOR MACRO arg
  if SIZEOF arg eq 8
    invoke Floor8, arg
  else
    invoke Floor4, arg
  endif
  EXITM <eax>
ENDM

.code
Floor8     proc float_value:REAL8 ; credits to chrisw, http://www.masm32.com/board/index.php?topic=9515.msg78719#msg78719
  cvttsd2si eax, float_value          ; truncate toward zero
  mov ecx, dword ptr float_value+4    ; high dword holds the sign bit
  shr ecx, 31                         ; ecx = 1 if negative, else 0
  sub eax, ecx
  ; cvtsi2sd xmm0, eax ; if needed for further processing
  ret
Floor8    endp

Floor4     proc float_value:REAL4 ; credits to chrisw, http://www.masm32.com/board/index.php?topic=9515.msg78719#msg78719
  cvttss2si eax, float_value          ; truncate toward zero
  mov ecx, float_value
  shr ecx, 31                         ; ecx = 1 if negative, else 0
  sub eax, ecx
  ; cvtsi2ss xmm0, eax ; if needed for further processing
  ret
Floor4    endp

jj2007

Time for a game, folks??

Floor timings:


MySingle1       REAL4 12.34567890123456 floor= 12
MyDouble1       REAL8 12.34567890123456 floor= 12
MySingle2       REAL4 -12.345678901234  floor= -13
MyDouble2       REAL8 -12.345678901234  floor= -13
MySingle3       REAL4 0.99999999        floor= 1        (sic!)
MyDouble3       REAL8 0.99999999        floor= 0

84      cycles for Floor8
29      cycles for Floor4
32      cycles for Floor4a
33      cycles for Floor4b
35      cycles for Floor4c

85      cycles for Floor8
29      cycles for Floor4
31      cycles for Floor4a
33      cycles for Floor4b
34      cycles for Floor4c

84      cycles for Floor8
28      cycles for Floor4
31      cycles for Floor4a
33      cycles for Floor4b
35      cycles for Floor4c

84      cycles for Floor8
28      cycles for Floor4
32      cycles for Floor4a
33      cycles for Floor4b
34      cycles for Floor4c


Celeron M (Core 2) on XP SP2


herge

 Hi jj2007:

And results for floor.exe


Floor timings:

MySingle1 REAL4 12.34567890123456 floor= 12
MyDouble1 REAL8 12.34567890123456 floor= 12
MySingle2 REAL4 -12.345678901234 floor= -13
MyDouble2 REAL8 -12.345678901234 floor= -13
MySingle3 REAL4 0.99999999 floor= 1 (sic!)
MyDouble3 REAL8 0.99999999 floor= 0

96 cycles for Floor8
20 cycles for Floor4
23 cycles for Floor4a
21 cycles for Floor4b
24 cycles for Floor4c

97 cycles for Floor8
20 cycles for Floor4
22 cycles for Floor4a
20 cycles for Floor4b
23 cycles for Floor4c

97 cycles for Floor8
20 cycles for Floor4
24 cycles for Floor4a
21 cycles for Floor4b
24 cycles for Floor4c

97 cycles for Floor8
21 cycles for Floor4
22 cycles for Floor4a
20 cycles for Floor4b
22 cycles for Floor4c


Regards herge

// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

jj2007

Quote from: herge on March 12, 2009, 01:38:57 AM
Hi jj2007:

And results for floor.exe


Thanks, herge, you are a real sportsman :U

Mark Jones

Let's see if I get a "sporty" reply... :P

Quote from: AMD x64 4000+ / WinXP x32 SP3
Floor timings:

MySingle1       REAL4 12.34567890123456 floor= 12
MyDouble1       REAL8 12.34567890123456 floor= 12
MySingle2       REAL4 -12.345678901234  floor= -13
MyDouble2       REAL8 -12.345678901234  floor= -13
MySingle3       REAL4 0.99999999        floor= 1        (sic!)
MyDouble3       REAL8 0.99999999        floor= 0

137     cycles for Floor8
42      cycles for Floor4
42      cycles for Floor4a
32      cycles for Floor4b
32      cycles for Floor4c

137     cycles for Floor8
42      cycles for Floor4
42      cycles for Floor4a
32      cycles for Floor4b
32      cycles for Floor4c

138     cycles for Floor8
42      cycles for Floor4
42      cycles for Floor4a
32      cycles for Floor4b
32      cycles for Floor4c

138     cycles for Floor8
42      cycles for Floor4
42      cycles for Floor4a
32      cycles for Floor4b
37      cycles for Floor4c
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08