SSE2: pxor vs subps

Started by jj2007, July 24, 2010, 08:55:43 PM

jj2007

Simple question: Is it safe to use subps for zeroing an XMM register? Are there situations where it could fail?

pxor xmm0, xmm0 ; zero xmm0
subps xmm0, xmm0 ; one byte shorter but may raise exceptions??

KeepingRealBusy

All sorts of floating point exceptions are possible, especially if you loaded character values or packed words or dwords. I would say, "not safe". I know what you are doing, "one more byte and I can keep this loop within 16 bytes"! :lol

Dave.

jj2007

Quote from: KeepingRealBusy on July 24, 2010, 09:30:42 PM
"one more byte and I can keep this loop within 16 bytes"! :lol

Yep :bg
But I think your advice is wise. Damn it. Does anybody have experience with UCOMISD? It works but seems incredibly slow...

KeepingRealBusy

I just looked it up. It appears to work like an ordinary double-precision floating-point compare, but the result is returned in EFLAGS (RFLAGS in 64-bit mode). I didn't think any SSE instructions did this. That may be the reason for the slowness: the result has to be passed from the SSE unit to the integer flags. The unordered condition seems to be for when one or both of the operands are NaNs.

Dave.
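
For reference, the Intel manual documents the UCOMISD results as ZF/PF/CF = 1/1/1 for unordered, 0/0/0 for greater than, 0/0/1 for less than, and 1/0/0 for equal. The following is only a sketch that models that table in plain C; it does not execute UCOMISD itself, and the struct and function names are made up for illustration.

```c
#include <math.h>

/* Hypothetical model of the three flags UCOMISD writes. */
struct ucomisd_flags {
    int zf;  /* zero flag   */
    int pf;  /* parity flag */
    int cf;  /* carry flag  */
};

/* Models "ucomisd a, b" (a is the first operand) using the documented
   flag table. isunordered() is true when either argument is a NaN. */
static struct ucomisd_flags model_ucomisd(double a, double b)
{
    struct ucomisd_flags f = {0, 0, 0};   /* a > b: all three clear */

    if (isunordered(a, b)) {
        f.zf = f.pf = f.cf = 1;           /* unordered: all three set */
    } else if (a < b) {
        f.cf = 1;                         /* less than: CF only */
    } else if (a == b) {
        f.zf = 1;                         /* equal: ZF only */
    }
    return f;
}
```

Note that PF is the only flag that distinguishes "unordered" from "equal", which is why code after UCOMISD typically tests the parity flag (jp / jnp) before trusting ZF.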

dioxin

Does XORPS xmm0,xmm0 not do what you want?


jj2007

Quote from: dioxin on July 25, 2010, 11:47:56 AM
Does XORPS xmm0,xmm0 not do what you want?

Yes it does but it's one byte longer than subps xmm0, xmm0.

dioxin

Quote: it's one byte longer than subps xmm0, xmm0
No it isn't:
CPU Disasm
Address   Hex dump          Command               
00401173    0F57C0          XORPS XMM0,XMM0
00401176    0F5CC0          SUBPS XMM0,XMM0
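
The byte counts behind the confusion can be written down as a small table. This is a sketch in C, not assembler output: the bytes are the standard opcode encodings with xmm0 as both operands (ModRM byte C0), and the table and function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* One entry per zeroing candidate, operands xmm0, xmm0. */
struct zero_idiom {
    const char   *mnemonic;
    unsigned char bytes[4];
    size_t        length;   /* number of valid bytes in bytes[] */
};

static const struct zero_idiom zero_idioms[] = {
    { "pxor",  { 0x66, 0x0F, 0xEF, 0xC0 }, 4 },  /* SSE2 form needs the 66 prefix */
    { "xorpd", { 0x66, 0x0F, 0x57, 0xC0 }, 4 },  /* 66 prefix as well */
    { "xorps", { 0x0F, 0x57, 0xC0 },       3 },  /* no prefix */
    { "subps", { 0x0F, 0x5C, 0xC0 },       3 },  /* no prefix */
};

/* Returns the encoded length in bytes, or 0 if the mnemonic is not listed. */
static size_t encoding_length(const char *mnemonic)
{
    size_t i;
    for (i = 0; i < sizeof zero_idioms / sizeof zero_idioms[0]; ++i) {
        if (strcmp(zero_idioms[i].mnemonic, mnemonic) == 0)
            return zero_idioms[i].length;
    }
    return 0;
}
```

So subps is one byte shorter than pxor (and xorpd), but exactly as long as xorps, which matches the disassembly above.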

MichaelW

The encodings are the same length for ML 6.15 and 7.00. If they are different for the later versions then why not just db the shorter version?

jj2007

Quote from: dioxin on July 25, 2010, 11:47:56 AM
Does XORPS xmm0,xmm0 not do what you want?

Sorry, I mistook XORPS for PXOR. I made some tests, and it seems only subps chokes on NaNs. I'd be grateful for some tests on other CPU types (e.g. AMD).

12750 ms for psubb
15969 ms for psubq
13094 ms for xorps
13266 ms for xorpd
13250 ms for pxor
16515 ms for subps

dioxin

AMD Phenom II 3GHz
7140 ms for psubb
7188 ms for psubq
7141 ms for xorps
5718 ms for xorpd
7157 ms for pxor
6468 ms for subps

jj2007

Thanks. On my CPU, subps is the only one that does not zero xmm0 if it contains a NaN. Since you didn't see an error box, AMD also has no objection to using the 3-byte xorps xmmn, xmmn for zeroing an XMM reg. Good to know :bg

Now that I know what I was looking for, it is also clear from the documentation:

SUBPS--Packed Single-Precision Floating-Point Subtract
Quote: SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

XORPS--Bitwise Logical XOR for Single-Precision Floating-Point Values
Quote: Performs a bitwise logical exclusive-OR of the four packed single-precision floating-point values...
SIMD Floating-Point Exceptions
None.

KeepingRealBusy

Quote from: jj2007 on July 25, 2010, 09:03:22 PM
Thanks. On my CPU, subps is the only one that does not zero xmm0 if it contains a NaN. Since you didn't see an error box, AMD also has no objection to using the 3-byte xorps xmmn, xmmn for zeroing an XMM reg. Good to know :bg

And you saved yourself a byte!

Dave.

dioxin

There should be no problem with AMD processors, as it's the way AMD recommends to clear registers.

From the AMD Software Optimization Guide for AMD Family 10h Processors:
Quote: 9.9 Clearing MMX™ and XMM Registers with XOR Instructions
Optimization
Use instructions that perform XOR operations (PXOR, XORPS, and XORPD) to clear all the bits in
MMX and XMM registers.
Application
This optimization applies to:
• 32-bit software
• 64-bit software

Rationale
The PXOR, XORPS, and XORPD instructions are more efficient than loading a zero value into an
MMX or XMM register from memory and then storing it (see Appendix C, "Instruction Latencies,"
on page 227). In addition, the processor "knows" that the PXOR, XORPS and XORPD instructions
that use the same register for both source and destination do not have a real dependency on the
previous contents of the register, and thus, do not have to wait before completing.
Examples
The following examples illustrate how to clear the bits in a register using the different exclusive-OR
instructions:
; MMX
pxor mm0, mm0 ; Clear the MM0 register.
; SSE
xorps xmm0, xmm0 ; Clear the XMM0 register.
; SSE2
xorpd xmm0, xmm0 ; Clear the XMM0 register.

jj2007

Quote from: KeepingRealBusy on July 25, 2010, 09:16:08 PM
And you saved yourself a byte!

Dave.

Actually, it saved 16 bytes, as it pushed the proc from 65 to 64 bytes :bg
(not that I really believed in align 16, but many here do ::))
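
The arithmetic behind the 16 bytes: with align 16, whatever follows the proc starts at the next multiple of 16, so a 65-byte proc occupies 80 bytes while a 64-byte one occupies exactly 64. A minimal sketch in C, with a made-up helper name:

```c
#include <stddef.h>

/* Rounds size up to the next multiple of alignment (alignment must be > 0).
   This is the space a block of code occupies when the following code is
   aligned to that boundary. */
static size_t round_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) / alignment * alignment;
}
```

round_up(65, 16) gives 80 and round_up(64, 16) gives 64, so the single saved byte removes 16 bytes of footprint.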

Quote from: dioxin on July 25, 2010, 09:49:38 PM
There should be no problem with AMD processors, as it's the way AMD recommends to clear registers.

Wow, thanks for the link!