Simple question: Is it safe to use subps for zeroing an XMM register? Are there situations where it could fail?
pxor xmm0, xmm0 ; zero xmm0
subps xmm0, xmm0 ; one byte shorter but may raise exceptions??
All sorts of floating point exceptions are possible, especially if you loaded character values or packed words or dwords. I would say, "not safe". I know what you are doing, "one more byte and I can keep this loop within 16 bytes"! :lol
Dave.
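To make the failure cases concrete, here is a minimal C sketch that models a single 32-bit lane. It is only a scalar stand-in for the packed instructions, and the helper names (bits_of, float_of, lane_sub_self, lane_xor_self) are made up for illustration. Subtracting a value from itself follows IEEE 754 rules, so a NaN or an infinity in the lane does not give zero, while a bitwise XOR of the lane with itself always does.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer and back (no numeric conversion). */
static uint32_t bits_of(float f)    { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    float_of(uint32_t u) { float f;   memcpy(&f, &u, sizeof f); return f; }

/* Hypothetical model of one subps lane: IEEE 754 subtraction of the lane from itself. */
static uint32_t lane_sub_self(uint32_t lane)
{
    volatile float x = float_of(lane);   /* volatile keeps the compiler from folding x - x */
    return bits_of(x - x);
}

/* Hypothetical model of one xorps/pxor lane: plain bitwise XOR, no floating-point rules. */
static uint32_t lane_xor_self(uint32_t lane)
{
    return lane ^ lane;
}
```

With the default rounding mode, lane_sub_self(0x3FC00000) (1.5f) returns 0, but lane_sub_self(0x7FC00000) (a quiet NaN) and lane_sub_self(0x7F800000) (+infinity) both return a NaN bit pattern. lane_xor_self returns 0 for every input.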
Quote from: KeepingRealBusy on July 24, 2010, 09:30:42 PM
"one more byte and I can keep this loop within 16 bytes"! :lol
Yep :bg
But I think your advice is wise. Damn it. Does anybody have experience with UCOMISD? It works but seems incredibly slow...
I just looked it up. It appears to be an ordinary double-precision floating-point compare, but the results are returned in EFLAGS (ZF, PF and CF). I didn't think any SSE instructions did this. This may be the reason for the slowness: the flags have to be passed from the SSE unit to the integer side of the CPU. The unordered condition is for when one or both of the operands are NaNs.
Dave.
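For reference, the flag results described above can be sketched as a small C model. This is not how the hardware is implemented, and the struct and function names are invented for illustration. It only reproduces the documented outcome: UCOMISD writes ZF, PF and CF, and sets all three when either operand is a NaN.

```c
#include <stdbool.h>

/* The three EFLAGS bits that UCOMISD writes (OF, SF and AF are cleared). */
struct ucomi_flags { bool zf, pf, cf; };

/* Hypothetical model of "ucomisd a, b". */
static struct ucomi_flags ucomisd_model(double a, double b)
{
    struct ucomi_flags f = { false, false, false };   /* a > b: all three clear */
    if (a != a || b != b) {
        f.zf = f.pf = f.cf = true;                    /* unordered: at least one NaN */
    } else if (a < b) {
        f.cf = true;                                  /* less than */
    } else if (a == b) {
        f.zf = true;                                  /* equal */
    }
    return f;
}
```

Because "equal" and "unordered" both set ZF, code that has to tell them apart must also test PF (for example with jp) before trusting a je/jz.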
Does XORPS xmm0,xmm0
not do what you want?
Quote from: dioxin on July 25, 2010, 11:47:56 AM
Does XORPS xmm0,xmm0
not do what you want?
Yes it does but it's one byte longer than subps xmm0, xmm0.
Quote
it's one byte longer than subps xmm0, xmm0
No it isn't:
CPU Disasm
Address Hex dump Command
00401173 0F57C0 XORPS XMM0,XMM0
00401176 0F5CC0 SUBPS XMM0,XMM0
The encodings are the same length for ML 6.15 and 7.00. If they are different for the later versions then why not just db the shorter version?
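The byte counts can be checked directly from the encodings. The sketch below uses the xorps and subps bytes from the disassembly above. The pxor bytes are not shown in the thread; they are the standard SSE2 encoding, which adds a 66h prefix, and the array and function names are made up for illustration.

```c
#include <stddef.h>

/* Register-to-register encodings for xmm0, xmm0. */
static const unsigned char enc_xorps[] = { 0x0F, 0x57, 0xC0 };        /* from the disassembly */
static const unsigned char enc_subps[] = { 0x0F, 0x5C, 0xC0 };        /* from the disassembly */
static const unsigned char enc_pxor[]  = { 0x66, 0x0F, 0xEF, 0xC0 };  /* SSE2 form, 66h prefix */

/* Instruction lengths in bytes. */
static size_t len_xorps(void) { return sizeof enc_xorps; }
static size_t len_subps(void) { return sizeof enc_subps; }
static size_t len_pxor(void)  { return sizeof enc_pxor; }
```

So xorps and subps are both 3 bytes, and only pxor on an XMM register costs the extra byte.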
Quote from: dioxin on July 25, 2010, 11:47:56 AM
Does XORPS xmm0,xmm0
not do what you want?
Sorry, I mistook XORPS for PXOR. I made some tests, and it seems only subps chokes over NaNs. Grateful for some tests on other CPU types (e.g. AMD).
12750 ms for psubb
15969 ms for psubq
13094 ms for xorps
13266 ms for xorpd
13250 ms for pxor
16515 ms for subps
AMD Phenom II 3GHz
7140 ms for psubb
7188 ms for psubq
7141 ms for xorps
5718 ms for xorpd
7157 ms for pxor
6468 ms for subps
Thanks. On my CPU, only the subps does not zero xmm0 if it contains a NaN. Since you didn't see a box shouting error, it means AMD also has no objections against the use of the 3-byte xorps xmmn, xmmn for zeroing an XMM reg. Good to know :bg
Now that I know what I was looking for, it also becomes clear from the documentation:
SUBPS--Packed Single-Precision Floating-Point Subtract (http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc308.htm)
Quote
SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.
XORPS--Bitwise Logical XOR for Single-Precision Floating-Point Values (http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc330.htm)
Quote
Performs a bitwise logical exclusive-OR of the four packed single-precision floating-point values...
SIMD Floating-Point Exceptions
None.
Quote from: jj2007 on July 25, 2010, 09:03:22 PM
Thanks. On my CPU, only the subps does not zero xmm0 if it contains a NaN. Since you didn't see a box shouting error, it means AMD also has no objections against the use of the 3-byte xorps xmmn, xmmn for zeroing an XMM reg. Good to know :bg
And you saved yourself a byte!
Dave.
There should be no problem with AMD processors as it's the way AMD recommend to clear registers.
From the AMD Software Optimization Guide for AMD Family 10h Processors:
Quote
9.9 Clearing MMX™ and XMM Registers with XOR Instructions
Optimization
Use instructions that perform XOR operations (PXOR, XORPS, and XORPD) to clear all the bits in MMX and XMM registers.
Application
This optimization applies to:
• 32-bit software
• 64-bit software
Rationale
The PXOR, XORPS, and XORPD instructions are more efficient than loading a zero value into an MMX or XMM register from memory and then storing it (see Appendix C, "Instruction Latencies," on page 227). In addition, the processor "knows" that the PXOR, XORPS and XORPD instructions that use the same register for both source and destination do not have a real dependency on the previous contents of the register, and thus, do not have to wait before completing.
Examples
The following examples illustrate how to clear the bits in a register using the different exclusive-OR instructions:
; MMX
pxor mm0, mm0 ; Clear the MM0 register.
; SSE
xorps xmm0, xmm0 ; Clear the XMM0 register.
; SSE2
xorpd xmm0, xmm0 ; Clear the XMM0 register.
Quote from: KeepingRealBusy on July 25, 2010, 09:16:08 PM
And you saved yourself a byte!
Dave.
Actually, it saved 16 bytes, as it pushed the proc from 65 to 64 bytes :bg
(not that I really believed in align 16, but many here do ::))
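The 16-byte saving follows from the padding arithmetic, assuming the code after the proc starts on the next 16-byte boundary (as it would after an align 16). A minimal sketch, with align_up as a made-up helper name:

```c
#include <stddef.h>

/* Round size up to the next multiple of alignment (alignment must be a power of two). */
static size_t align_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}
```

A 65-byte proc occupies align_up(65, 16) = 80 bytes, while a 64-byte proc occupies align_up(64, 16) = 64 bytes, so shaving one byte frees a whole 16-byte slot.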
Quote from: dioxin on July 25, 2010, 09:49:38 PM
There should be no problem with AMD processors as it's the way AMD recommend to clear registers.
Wow, thanks for the link!
http://support.amd.com/us/Processor_TechDocs/40546-PUB-Optguide_3-11_5-21-09.pdf
Quote
thanks for the link!
You're welcome.
Quote
not that I really believed in align 16, but many here do
You don't believe in aligning code?
But it makes a big difference if you align it. I get the following results if I sprinkle a few ALIGN 16s in the code:
5748 ms for psubb
5729 ms for psubq
5730 ms for xorps
5730 ms for xorpd
5729 ms for pxor
5990 ms for subps
That's much better than the original.
Putting align 16 in front of the innermost loop has absolutely NO effect on my Celeron M... but it's true that there are more sensitive CPUs around.
Branch target alignment is not so effective on Intel (currently) due to the trace cache, but it doesn't usually hurt to align anyway.