SSE floating point Math instructions

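besides the SSE versions of the usual ADD, SUB, MUL, DIV and SQRT

we have RCP, which returns the reciprocal (1/x) of its operand

and RSQRT, which returns the reciprocal square root 1/sqrt(x)

MIN returns the smaller of its two operands

MAX returns the larger of its two operands

if either operand is a NaN (or both are zero), MIN and MAX return the second (source) operand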
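as a sketch of MIN and MAX in use, here is a clamp of four floats to a range [lo, hi] with one MAXPS and one MINPS. it is written in C with the Intel SSE intrinsics from xmmintrin.h (_mm_max_ps maps to MAXPS, _mm_min_ps to MINPS) rather than raw assembly, and clamp4 is a made-up name for illustration

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_min_ps, _mm_max_ps, ... */

/* Clamp four floats to [lo, hi] with one MAXPS and one MINPS.
   clamp4 is a hypothetical helper name used only for this example. */
void clamp4(const float in[4], float lo, float hi, float out[4])
{
    __m128 x   = _mm_loadu_ps(in);   /* unaligned load of 4 floats           */
    __m128 vlo = _mm_set1_ps(lo);    /* lo broadcast to all 4 lanes          */
    __m128 vhi = _mm_set1_ps(hi);    /* hi broadcast to all 4 lanes          */
    x = _mm_max_ps(x, vlo);          /* MAXPS: raise values below lo to lo   */
    x = _mm_min_ps(x, vhi);          /* MINPS: lower values above hi to hi   */
    _mm_storeu_ps(out, x);           /* unaligned store of 4 floats          */
}
```

because MAXPS returns its second operand when the first is a NaN, the operand order here means a NaN input comes out as lo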

there are two types of instructions:

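1: scalar instructions, which have SS in their name, do the operation only on the lowest single precision float (the upper three floats of the destination are left unchanged)

2: packed instructions, which have PS in their name, do 4 parallel operations on all 4 single precision floats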
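the difference between the two types is easiest to see on the same inputs. this sketch uses the C intrinsics _mm_add_ss (ADDSS) and _mm_add_ps (ADDPS) from xmmintrin.h; add_ss_vs_ps is a made-up name. lane 0 is the lowest float, and _mm_set_ps takes its arguments from lane 3 down to lane 0

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Run ADDSS and ADDPS on the same inputs and store both results.
   add_ss_vs_ps is a hypothetical helper name used only for this example. */
void add_ss_vs_ps(float ss_out[4], float ps_out[4])
{
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);      /* lanes 0..3: 1, 2, 3, 4     */
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);  /* lanes 0..3: 10, 20, 30, 40 */
    _mm_storeu_ps(ss_out, _mm_add_ss(a, b));  /* ADDSS: 11, 2, 3, 4 (upper lanes from a) */
    _mm_storeu_ps(ps_out, _mm_add_ps(a, b));  /* ADDPS: 11, 22, 33, 44                   */
}
```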

single op on lowest single FP              packed (parallel) op on all 4 single FPs
ADDSS   xmm1,xmm2/m32  ; F3 0F 58 /r       ADDPS   xmm1,xmm2/m128 ; 0F 58 /r
SUBSS   xmm1,xmm2/m32  ; F3 0F 5C /r       SUBPS   xmm1,xmm2/m128 ; 0F 5C /r
MULSS   xmm1,xmm2/m32  ; F3 0F 59 /r       MULPS   xmm1,xmm2/m128 ; 0F 59 /r
DIVSS   xmm1,xmm2/m32  ; F3 0F 5E /r       DIVPS   xmm1,xmm2/m128 ; 0F 5E /r
RCPSS   xmm1,xmm2/m32  ; F3 0F 53 /r *     RCPPS   xmm1,xmm2/m128 ; 0F 53 /r *
RSQRTSS xmm1,xmm2/m32  ; F3 0F 52 /r **    RSQRTPS xmm1,xmm2/m128 ; 0F 52 /r **
SQRTSS  xmm1,xmm2/m32  ; F3 0F 51 /r       SQRTPS  xmm1,xmm2/m128 ; 0F 51 /r
MAXSS   xmm1,xmm2/m32  ; F3 0F 5F /r       MAXPS   xmm1,xmm2/m128 ; 0F 5F /r
MINSS   xmm1,xmm2/m32  ; F3 0F 5D /r       MINPS   xmm1,xmm2/m128 ; 0F 5D /r
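a typical use of the packed forms is processing an array four floats at a time. this is a sketch in C with SSE intrinsics, assuming a hypothetical helper add_arrays: the main loop uses ADDPS and any leftover elements fall back to ADDSS. the mem128 forms of the PS instructions need 16-byte aligned memory (an unaligned operand faults), so the sketch loads with _mm_loadu_ps (MOVUPS), which has no alignment requirement

```c
#include <stddef.h>     /* size_t */
#include <xmmintrin.h>  /* SSE intrinsics */

/* dst[i] = a[i] + b[i] for i in [0, n).
   add_arrays is a hypothetical helper name used only for this example. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    /* packed part: 4 floats per ADDPS */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);              /* MOVUPS: no alignment needed */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));   /* ADDPS */
    }
    /* scalar tail: one float per ADDSS */
    for (; i < n; i++) {
        __m128 va = _mm_load_ss(a + i);               /* MOVSS: one float into lane 0 */
        __m128 vb = _mm_load_ss(b + i);
        _mm_store_ss(dst + i, _mm_add_ss(va, vb));    /* ADDSS */
    }
}
```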

 

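* RCP performs a fast approximate reciprocal 1/x (typically read from an internal lookup table) that is accurate to only about 12 bits of the mantissa (Intel specifies a maximum relative error of 1.5*2^-12), compared to the 24 bits of a normal single precision result. so if you need a fast DIV but don't need full precision, you can replace the slow DIV with a MUL by the reciprocal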

instead of

DIVPS xmm1,xmm2 ;slow division

you write

RCPPS xmm2,xmm2 ;a fast lookup of the approximate reciprocal of the divisor (overwrites xmm2)

MULPS xmm1,xmm2 ;and a multiply by the reciprocal, which is usually much faster than the division
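to see the precision trade-off, here is the same division done three ways in C with SSE intrinsics: DIVPS, RCPPS+MULPS as above, and RCPPS refined with one Newton-Raphson step r1 = r0*(2 - b*r0), which roughly doubles the number of correct bits at the cost of two extra MULPS and one SUBPS. the Newton-Raphson refinement is a common add-on, not part of the trick described above, and the function names are made up for this sketch

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* a / b for four floats, three ways (hypothetical helper names). */
__m128 div_exact(__m128 a, __m128 b) { return _mm_div_ps(a, b); }              /* DIVPS         */
__m128 div_fast(__m128 a, __m128 b)  { return _mm_mul_ps(a, _mm_rcp_ps(b)); }  /* RCPPS + MULPS */

/* RCPPS estimate r0 plus one Newton-Raphson step: r1 = r0 * (2 - b*r0). */
__m128 div_refined(__m128 a, __m128 b)
{
    __m128 r = _mm_rcp_ps(b);                                    /* ~12-bit estimate */
    r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(b, r)));
    return _mm_mul_ps(a, r);                                     /* a * (1/b)        */
}
```

note that the refined version gives NaN when b is 0 or infinity (0*inf in b*r0), where RCPPS alone gives infinity or 0, so such inputs need special handling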

** RSQRT performs an approximate reciprocal square root 1/sqrt(x) with the same ~12 bit precision as RCP (maximum relative error 1.5*2^-12), much faster than a SQRT followed by a DIV
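the same idea works for RSQRT: a sketch in C with SSE intrinsics that takes the RSQRTPS estimate y0 and applies one Newton-Raphson step y1 = 0.5*y0*(3 - x*y0*y0) to bring it close to full single precision. as above, the refinement step is a common technique added for illustration, and rsqrt_refined is a made-up name

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Approximate 1/sqrt(x) for four floats: RSQRTPS plus one Newton-Raphson step.
   rsqrt_refined is a hypothetical helper name used only for this example. */
__m128 rsqrt_refined(__m128 x)
{
    __m128 y   = _mm_rsqrt_ps(x);                     /* RSQRTPS: ~12-bit estimate y0 */
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);     /* x * y0 * y0                  */
    __m128 t   = _mm_sub_ps(_mm_set1_ps(3.0f), xyy);  /* 3 - x*y0*y0                  */
    return _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), y), t);  /* 0.5 * y0 * (...)      */
}
```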

SSE2 instructions

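SSE2 adds the same operations for double precision floats; an XMM register holds two of them

the ones for a single op on the lower double precision float have SD in their name

the ones for parallel op's on both double precision floats have PD in their name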

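single op on low double FP                 packed (parallel) op on both double FPs
ADDSD   xmm1,xmm2/m64  ; F2 0F 58 /r       ADDPD   xmm1,xmm2/m128 ; 66 0F 58 /r
SUBSD   xmm1,xmm2/m64  ; F2 0F 5C /r       SUBPD   xmm1,xmm2/m128 ; 66 0F 5C /r
MULSD   xmm1,xmm2/m64  ; F2 0F 59 /r       MULPD   xmm1,xmm2/m128 ; 66 0F 59 /r
DIVSD   xmm1,xmm2/m64  ; F2 0F 5E /r       DIVPD   xmm1,xmm2/m128 ; 66 0F 5E /r
SQRTSD  xmm1,xmm2/m64  ; F2 0F 51 /r       SQRTPD  xmm1,xmm2/m128 ; 66 0F 51 /r
MAXSD   xmm1,xmm2/m64  ; F2 0F 5F /r       MAXPD   xmm1,xmm2/m128 ; 66 0F 5F /r
MINSD   xmm1,xmm2/m64  ; F2 0F 5D /r       MINPD   xmm1,xmm2/m128 ; 66 0F 5D /r

there is no RCPSD/RCPPD or RSQRTSD/RSQRTPD in SSE2: for doubles you compute 1/x with DIVSD/DIVPD and 1/sqrt(x) with SQRTSD/SQRTPD followed by a DIV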
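a sketch of the double precision instructions in C with the SSE2 intrinsics from emmintrin.h: rsqrt2_pd gets 1/sqrt(x) for two doubles with SQRTPD and DIVPD since SSE2 has no RSQRTPD, and min_sd_vs_pd shows that the SD form only touches the low double while the high double comes from the first operand. both function names are made up for illustration, and _mm_set_pd takes the high double first

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_sqrt_pd, _mm_div_pd, ... */

/* 1/sqrt(x) for two doubles: SQRTPD then DIVPD (no RSQRTPD in SSE2).
   rsqrt2_pd is a hypothetical helper name. */
void rsqrt2_pd(const double in[2], double out[2])
{
    __m128d x = _mm_loadu_pd(in);                 /* MOVUPD: two doubles */
    __m128d s = _mm_sqrt_pd(x);                   /* SQRTPD              */
    __m128d r = _mm_div_pd(_mm_set1_pd(1.0), s);  /* DIVPD: 1 / sqrt(x)  */
    _mm_storeu_pd(out, r);
}

/* MINSD vs MINPD on the same inputs (hypothetical helper name). */
void min_sd_vs_pd(double sd_out[2], double pd_out[2])
{
    __m128d a = _mm_set_pd(9.0, 5.0);          /* low 5.0, high 9.0 */
    __m128d b = _mm_set_pd(7.0, 2.0);          /* low 2.0, high 7.0 */
    _mm_storeu_pd(sd_out, _mm_min_sd(a, b));   /* MINSD: 2.0, 9.0 (high from a) */
    _mm_storeu_pd(pd_out, _mm_min_pd(a, b));   /* MINPD: 2.0, 7.0               */
}
```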