To go along with the fast SSE ceil() function in the other post.
floor()
.DATA
sngMinusOneHalf REAL4 -0.5
.CODE
movss xmm0, FP4(1.2)
addss xmm0, xmm0
addss xmm0, sngMinusOneHalf
cvtss2si eax, xmm0
sar eax, 1
cvtsi2ss xmm0, eax
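For readers who prefer C, the doubling trick in the listing above can be sketched with SSE intrinsics. This is only a sketch under assumptions: `floor_by_doubling` is a made-up name, the compiler targets x86 with SSE enabled, MXCSR is in its default round-to-nearest mode, and `>>` on a negative int behaves as an arithmetic shift (true for the usual compilers, but implementation-defined in C).

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Hypothetical helper: floor(x) computed as round(2x - 0.5) >> 1.
   Because the value is doubled first, |x| must stay below 2^30. */
static int floor_by_doubling(float x)
{
    float t = x + x;                            /* addss xmm0, xmm0 */
    t += -0.5f;                                 /* addss xmm0, sngMinusOneHalf */
    int twice = _mm_cvtss_si32(_mm_set_ss(t));  /* cvtss2si eax, xmm0 (uses MXCSR rounding) */
    return twice >> 1;                          /* sar eax, 1 (assumed arithmetic shift) */
}
```

Under those assumptions `floor_by_doubling(1.2f)` gives 1 and `floor_by_doubling(-1.6f)` gives -2. The listing converts the integer back to a float with cvtsi2ss; the sketch stops at the integer.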
In my programs I use a version of floor() that is somewhat faster on my system:
cvttss2si ebx, [float_value]
mov eax, [float_value]
shr eax, 31
sub ebx, eax
cvtsi2ss xmm0, ebx
And here is the packed version, for processing large amounts of data:
movaps xmm0, [float_value]
cvttps2dq xmm1, xmm0
psrld xmm0, 31
psubd xmm1, xmm0
cvtdq2ps xmm0, xmm1
Both versions use integer rather than floating-point arithmetic, which is faster, and they allow the conversion to integer and the shift instruction to execute concurrently.
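The packed sequence maps almost one to one onto SSE2 intrinsics. The following is a minimal sketch, not the poster's code: `floor4_packed` is a hypothetical name, an x86 compiler with SSE2 is assumed, and unaligned loads and stores are used so the caller does not have to align the arrays (the listing's movaps needs 16-byte alignment).

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target with SSE2 */

/* Hypothetical helper: processes four floats at once by truncating each
   lane and then subtracting that lane's sign bit, as the packed listing does. */
static void floor4_packed(const float in[4], float out[4])
{
    __m128  v     = _mm_loadu_ps(in);                         /* load four floats */
    __m128i trunc = _mm_cvttps_epi32(v);                      /* cvttps2dq: truncate toward zero */
    __m128i sign  = _mm_srli_epi32(_mm_castps_si128(v), 31);  /* psrld: 1 in each negative lane */
    __m128i res   = _mm_sub_epi32(trunc, sign);               /* psubd: step negative lanes down by one */
    _mm_storeu_ps(out, _mm_cvtepi32_ps(res));                 /* cvtdq2ps: back to float */
}
```

Feeding it {1.2f, -1.6f, 12.345f, -12.345f} yields {1, -2, 12, -13} as floats.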
Hey, that seems to work just fine. I just did some quick tests on the scalar version and I didn't test for speed on my system. What is your system?
It also works with cvtss2si instead of cvttss2si. Any reason for using cvttss2si?
Hi Greg,
I'm using Opteron K8 processors for testing the performance of my programs. In the scalar version of the algorithm I load the eax register from memory, although it could also be loaded from ebx at the cost of an additional RAW dependency. On my AMD processor the additional memory access is faster, but I did not test this on an Intel processor, since my target system is an Opteron K10 anyway.
But: the scalar routine (as well as the packed routine) does not work correctly with cvtss2si.
Suppose the input is -1.6f, and suppose further that the rounding control bits of your processor are set to 'round to nearest', which is the default.
After the rounding conversion (cvtss2si), ebx holds -2, while after the truncating conversion (cvttss2si) it holds -1.
In the next step of the algorithm, an additional 1 is subtracted because the float value is negative. Now the rounded result is -3, while the truncated result is -2, which is the correct floor.
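The -1.6 example can be replayed in C with the two conversion intrinsics side by side. This is a sketch under assumptions: the function names are invented for illustration, an x86 compiler with SSE is assumed, and MXCSR is left in its default round-to-nearest mode.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Sign bit of the float's bit pattern: the C equivalent of mov + shr 31. */
static int sign_bit(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (int)(bits >> 31);
}

/* Hypothetical name: truncating conversion (cvttss2si), then subtract the sign bit. */
static int floor_truncating(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x)) - sign_bit(x);
}

/* Hypothetical name: same sequence with the rounding conversion (cvtss2si),
   shown for comparison only. */
static int floor_rounding(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x)) - sign_bit(x);
}
```

For -1.6f the truncating variant returns -2 and the rounding variant returns -3, matching the explanation above. One thing the sketch also makes visible: an input that is already a negative whole number, such as -2.0f, still gets the extra subtraction and comes out as -3, so code that may see exact negative integers would need a separate check.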
chrisw,
Ah ha!, now I see why you used cvttss2si.
Here's the SSE2 version I wrote
Floor_SSE2 PROC pIn:PTR REAL8, pOut:PTR REAL8
.DATA
dblMinusOneHalf REAL8 -0.5
.CODE
mov eax, pIn
movsd xmm0, [eax]
addsd xmm0, xmm0
addsd xmm0, dblMinusOneHalf
cvtsd2si eax, xmm0
sar eax, 1
cvtsi2sd xmm0, eax
mov eax, pOut
movsd [eax], xmm0
ret
Floor_SSE2 ENDP
[edit]
If you are using ml.exe 6.15 you only need these two macros for SSE2
MOVSD_ MACRO A, B
DB 0F2H
MOVUPS A, B
ENDM
CMPSD_ MACRO A, B, C
DB 0F2H
CMPPS A, B, C
ENDM
If you are using ml.exe 6.14 then you can use the macros here (http://www.masm32.com/board/index.php?topic=973.msg7023#msg7023).
Quote from: chrisw on January 30, 2009, 01:39:26 PM
In my programs I use a version of floor() that is somewhat faster on my system:
cvttss2si ebx, [float_value]
mov eax, [float_value]
shr eax, 31
sub ebx, eax
cvtsi2ss xmm0, ebx
Very cute, Chris. I will grab it for my private library :wink
One question - I am a bloody beginner with SSE: What is the purpose of the cvtsi2ss instruction below? To allow further processing in an XMM register? It works just fine without... :dazzled:
; Usage:
; invoke FloorCW, MyReal4
; print str$(eax)
FloorCW proc float_value:REAL4 ; credits to chrisw (http://www.masm32.com/board/index.php?topic=9515.msg78719#msg78719)
cvttss2si eax, float_value
mov ecx, float_value
shr ecx, 31
sub eax, ecx
; cvtsi2ss xmm0, eax ; purpose??
ret
FloorCW endp
Quote from: jj2007 on February 06, 2009, 03:39:06 PM
What is the purpose of the cvtsi2ss instruction below?
-> page 73: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26568.pdf
Hi jj,
nice that you like this :-)
And yes, the very last conversion keeps the floor as a float in an xmm register for further processing. I use the code in a routine that calculates exp(x) as fast as possible on very large data sets, and I simply copied the code snippet from there.
chrisw,
Yes, your SSE version is clever. I'm going to save it too.
Thanxalot, Chris - you inspired me to look up the double precision equivalent. I am not sure if my logic is OK, so no warranties ;-)
The macro accepts REAL4 and REAL8.
EDIT: And it makes a difference:
MyReal4 REAL4 0.99999999
MyReal8 REAL8 0.99999999
Output:
Real8= 0.999999990 floor= 0
Real4= 1.00000000 floor= 1
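The difference comes from the storage format rather than from the floor routine: 0.99999999 has no exact REAL4 representation and the nearest single-precision value is 1.0. A small C check, with hypothetical helper names and assuming IEEE 754 float and double:

```c
#include <float.h>

/* Hypothetical helper: the value a REAL4 variable would actually hold. */
static float as_real4(double x)
{
    return (float)x;  /* rounds to the nearest single-precision value */
}

/* Spacing between adjacent floats just below 1.0, which is 2^-24. */
static double real4_step_below_one(void)
{
    return FLT_EPSILON / 2.0;
}
```

Since 1.0 - 0.99999999 is about 1e-8, which is less than half of that spacing (about 3e-8), `as_real4(0.99999999)` compares equal to 1.0f, while the same constant kept as a double stays below 1.0.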
qword: thanks for the link, extremely valuable.
Usage:
print str$(FLOOR(MyReal4))
print str$(FLOOR(MyReal8))
mov MyInt32, FLOOR(MyReal8)
etc
.xmm
Floor4 PROTO: REAL4
Floor8 PROTO: REAL8
MbCeil PROTO: REAL4
FLOOR MACRO arg
if SIZEOF arg eq 8
invoke Floor8, arg
else
invoke Floor4, arg
endif
EXITM <eax>
ENDM
.code
Floor8 proc float_value:REAL8 ; credits to chrisw (http://www.masm32.com/board/index.php?topic=9515.msg78719#msg78719)
cvttsd2si eax, float_value
mov ecx, dword ptr float_value+4
shr ecx, 31
sub eax, ecx
; cvtsi2sd xmm0, eax ; if needed for further processing
ret
Floor8 endp
Floor4 proc float_value:REAL4 ; credits to chrisw (http://www.masm32.com/board/index.php?topic=9515.msg78719#msg78719)
cvttss2si eax, float_value
mov ecx, float_value
shr ecx, 31
sub eax, ecx
; cvtsi2ss xmm0, eax ; if needed for further processing
ret
Floor4 endp
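The only real change in the REAL8 variant is where the sign bit lives: it is bit 63 of the double, which is bit 31 of the dword at float_value+4. A portable C sketch of the same idea, with a hypothetical function name and assuming IEEE 754 doubles and a 32-bit int:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper modelled on Floor8: truncate, then subtract the sign bit.
   The input must fit in a 32-bit int after truncation. */
static int floor8_trunc_sign(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    int sign = (int)(bits >> 63);  /* same bit as bit 31 of the high dword */
    return (int)x - sign;          /* the cast truncates toward zero, like cvttsd2si */
}
```

With the values from the test program, `floor8_trunc_sign(12.34567890123456)` gives 12, `floor8_trunc_sign(-12.345678901234)` gives -13 and `floor8_trunc_sign(0.99999999)` gives 0.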
Time for a game, folks??
Floor timings:
MySingle1 REAL4 12.34567890123456 floor= 12
MyDouble1 REAL8 12.34567890123456 floor= 12
MySingle2 REAL4 -12.345678901234 floor= -13
MyDouble2 REAL8 -12.345678901234 floor= -13
MySingle3 REAL4 0.99999999 floor= 1 (sic!)
MyDouble3 REAL8 0.99999999 floor= 0
84 cycles for Floor8
29 cycles for Floor4
32 cycles for Floor4a
33 cycles for Floor4b
35 cycles for Floor4c
85 cycles for Floor8
29 cycles for Floor4
31 cycles for Floor4a
33 cycles for Floor4b
34 cycles for Floor4c
84 cycles for Floor8
28 cycles for Floor4
31 cycles for Floor4a
33 cycles for Floor4b
35 cycles for Floor4c
84 cycles for Floor8
28 cycles for Floor4
32 cycles for Floor4a
33 cycles for Floor4b
34 cycles for Floor4c
Celeron M (Core 2) on XP SP2
Hi jj2007:
And results for floor.exe
Floor timings:
MySingle1 REAL4 12.34567890123456 floor= 12
MyDouble1 REAL8 12.34567890123456 floor= 12
MySingle2 REAL4 -12.345678901234 floor= -13
MyDouble2 REAL8 -12.345678901234 floor= -13
MySingle3 REAL4 0.99999999 floor= 1 (sic!)
MyDouble3 REAL8 0.99999999 floor= 0
96 cycles for Floor8
20 cycles for Floor4
23 cycles for Floor4a
21 cycles for Floor4b
24 cycles for Floor4c
97 cycles for Floor8
20 cycles for Floor4
22 cycles for Floor4a
20 cycles for Floor4b
23 cycles for Floor4c
97 cycles for Floor8
20 cycles for Floor4
24 cycles for Floor4a
21 cycles for Floor4b
24 cycles for Floor4c
97 cycles for Floor8
21 cycles for Floor4
22 cycles for Floor4a
20 cycles for Floor4b
22 cycles for Floor4c
Regards herge
Quote from: herge on March 12, 2009, 01:38:57 AM
Hi jj2007:
And results for floor.exe
Thanks, herge, you are a real sportsman :U
Let's see if I get a "sporty" reply... :P
Quote from: AMD x64 4000+ / WinXP x32 SP3
Floor timings:
MySingle1 REAL4 12.34567890123456 floor= 12
MyDouble1 REAL8 12.34567890123456 floor= 12
MySingle2 REAL4 -12.345678901234 floor= -13
MyDouble2 REAL8 -12.345678901234 floor= -13
MySingle3 REAL4 0.99999999 floor= 1 (sic!)
MyDouble3 REAL8 0.99999999 floor= 0
137 cycles for Floor8
42 cycles for Floor4
42 cycles for Floor4a
32 cycles for Floor4b
32 cycles for Floor4c
137 cycles for Floor8
42 cycles for Floor4
42 cycles for Floor4a
32 cycles for Floor4b
32 cycles for Floor4c
138 cycles for Floor8
42 cycles for Floor4
42 cycles for Floor4a
32 cycles for Floor4b
32 cycles for Floor4c
138 cycles for Floor8
42 cycles for Floor4
42 cycles for Floor4a
32 cycles for Floor4b
37 cycles for Floor4c
Intel Quad core 9550
Floor timings:
MySingle1 REAL4 12.34567890123456 floor= 12
MyDouble1 REAL8 12.34567890123456 floor= 12
MySingle2 REAL4 -12.345678901234 floor= -13
MyDouble2 REAL8 -12.345678901234 floor= -13
MySingle3 REAL4 0.99999999 floor= 1 (sic!)
MyDouble3 REAL8 0.99999999 floor= 0
95 cycles for Floor8
20 cycles for Floor4
23 cycles for Floor4a
20 cycles for Floor4b
23 cycles for Floor4c
95 cycles for Floor8
20 cycles for Floor4
22 cycles for Floor4a
20 cycles for Floor4b
22 cycles for Floor4c
96 cycles for Floor8
20 cycles for Floor4
23 cycles for Floor4a
20 cycles for Floor4b
24 cycles for Floor4c
96 cycles for Floor8
20 cycles for Floor4
22 cycles for Floor4a
20 cycles for Floor4b
22 cycles for Floor4c
Quote from: Mark Jones on March 12, 2009, 03:27:13 PM
Let's see if I get a "sporty" reply... :P
4b seems to perform well, see also herge's and Neil's timings. Now one of the other sportsmen will probably cry foul because a fast floor() is a waste of the forum's bandwidth... :bg
(to be honest, I have never used floor() in any real programming, but I guess there are applications where huge chunks of data need to be processed fast enough)
This procedure leaves the return value on the FPU stack, and runs in 50 cycles on a P3:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
_floor8 proc double:REAL8
fld FP8(2.0)
fmul QWORD PTR [esp+4]
fadd FP8(-0.5)
sub esp, 8
fistp QWORD PTR [esp]
sar DWORD PTR [esp+4], 1 ; arithmetic shift, so negative results keep their sign
rcr DWORD PTR [esp], 1
fild QWORD PTR [esp]
add esp, 8
ret 8
_floor8 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
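The halving step in this procedure works on a signed 64-bit integer stored as two 32-bit halves: the high half is shifted right by one and the bit that falls out is rotated into the top of the low half by rcr. A small C model of that step, as a sketch only: `halve_split` and its `arithmetic` flag are hypothetical, and the final cast back to a signed 64-bit value assumes the usual two's-complement behaviour.

```c
#include <stdint.h>

/* Hypothetical model: halve a signed 64-bit value held as two 32-bit halves.
   arithmetic != 0 models sar on the high half, arithmetic == 0 models shr. */
static int64_t halve_split(int64_t v, int arithmetic)
{
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)((uint64_t)v >> 32);
    uint32_t carry = hi & 1u;                               /* bit shifted out of the high half */
    uint32_t sign  = arithmetic ? (hi & 0x80000000u) : 0u;  /* sar keeps the sign bit, shr clears it */
    hi = (hi >> 1) | sign;
    lo = (lo >> 1) | (carry << 31);                         /* rcr: the carry enters bit 31 */
    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

`halve_split(-4, 1)` gives -2, whereas `halve_split(-4, 0)` gives a very large positive number, so the sign-preserving shift on the high half is the one that matters whenever the input can be negative. For non-negative values both forms agree.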