Title: SSE2: pxor vs subps
Post by: jj2007 on July 24, 2010, 08:55:43 PM

Simple question: Is it safe to use subps for zeroing an XMM register? Are there situations where it could fail?

	pxor xmm0, xmm0		; zero xmm0
	subps xmm0, xmm0	; one byte shorter but may raise exceptions??

Title: Re: SSE2: pxor vs subps
Post by: KeepingRealBusy on July 24, 2010, 09:30:42 PM

All sorts of floating-point exceptions are possible, especially if the register holds character values or packed words or dwords. I would say "not safe". I know what you are doing: "one more byte and I can keep this loop within 16 bytes"! :lol

Dave.

Title: Re: SSE2: pxor vs subps
Post by: jj2007 on July 24, 2010, 09:38:56 PM

Quote from: KeepingRealBusy on July 24, 2010, 09:30:42 PM
"one more byte and I can keep this loop within 16 bytes"! :lol

Yep :bg
But I think your advice is wise. Damn it. Does anybody have experience with UCOMISD? It works but seems incredibly slow...

Title: Re: SSE2: pxor vs subps
Post by: KeepingRealBusy on July 24, 2010, 10:06:53 PM

I just looked it up. It appears to be an ordinary double-precision floating-point compare, but the result is returned in EFLAGS (RFLAGS in 64-bit mode). I didn't think any SSE instructions did this. This may be the reason for the slowness: the result has to be passed from the SSE unit to the integer flags. The unordered condition seems to be for when one or both of the operands are NaNs.

Dave.
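
A minimal sketch of the flag result described above, written in Python only so it can be run anywhere. It models the ZF/PF/CF settings documented for UCOMISD (unordered = 1,1,1; greater = 0,0,0; less = 0,0,1; equal = 1,0,0). The function name `ucomisd_flags` is made up for illustration, and the model says nothing about timing.

```python
import math

def ucomisd_flags(a: float, b: float) -> dict:
    """Model of the flags after `ucomisd a, b` (a is the first operand)."""
    if math.isnan(a) or math.isnan(b):
        return {"ZF": 1, "PF": 1, "CF": 1}  # unordered: at least one NaN
    if a > b:
        return {"ZF": 0, "PF": 0, "CF": 0}  # greater than
    if a < b:
        return {"ZF": 0, "PF": 0, "CF": 1}  # less than
    return {"ZF": 1, "PF": 0, "CF": 0}      # equal

for pair in [(2.0, 1.0), (1.0, 2.0), (1.0, 1.0), (float("nan"), 1.0)]:
    print(pair, ucomisd_flags(*pair))
```

Because the unordered case also sets ZF and CF, code that must tell "equal" apart from "NaN involved" would test PF first (for example with jp) before acting on je or jb.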

Title: Re: SSE2: pxor vs subps
Post by: dioxin on July 25, 2010, 11:47:56 AM

Does XORPS xmm0,xmm0 not do what you want?

Title: Re: SSE2: pxor vs subps
Post by: jj2007 on July 25, 2010, 01:28:13 PM

Quote from: dioxin on July 25, 2010, 11:47:56 AM
Does XORPS xmm0,xmm0 not do what you want?

Yes it does but it's one byte longer than subps xmm0, xmm0.

Title: Re: SSE2: pxor vs subps
Post by: dioxin on July 25, 2010, 03:24:23 PM

Quote: it's one byte longer than subps xmm0, xmm0

No it isn't:

CPU Disasm
Address   Hex dump          Command               
00401173    0F57C0          XORPS XMM0,XMM0
00401176    0F5CC0          SUBPS XMM0,XMM0
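
For comparison, a small sketch that tabulates the register-to-register encodings and their lengths. The xorps and subps bytes are the ones in the disassembly above; the pxor and xorpd bytes are not from this thread and are taken from the documented opcodes (66 0F EF /r and 66 0F 57 /r), so treat them as an assumption to verify with your own assembler.

```python
# Encodings for "<insn> xmm0, xmm0"; the ModRM byte C0 selects xmm0, xmm0.
ENCODINGS = {
    "pxor":  bytes.fromhex("660FEFC0"),  # SSE2 integer form, needs the 66 prefix
    "xorpd": bytes.fromhex("660F57C0"),  # SSE2 double form, needs the 66 prefix
    "xorps": bytes.fromhex("0F57C0"),    # SSE single form, no prefix
    "subps": bytes.fromhex("0F5CC0"),    # SSE single form, no prefix
}

for name, code in ENCODINGS.items():
    print(f"{name:5} xmm0, xmm0  {code.hex(' ').upper():12} {len(code)} bytes")
```

The one-byte difference is the 66 operand-size prefix, which is why subps looked shorter when compared against pxor rather than xorps.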

Title: Re: SSE2: pxor vs subps
Post by: MichaelW on July 25, 2010, 03:36:42 PM

The encodings are the same length for ML 6.15 and 7.00. If they are different for the later versions then why not just db the shorter version?

Title: Re: SSE2: pxor vs subps
Post by: jj2007 on July 25, 2010, 06:03:13 PM

Quote from: dioxin on July 25, 2010, 11:47:56 AM
Does XORPS xmm0,xmm0 not do what you want?

Sorry, I mistook XORPS for PXOR. I ran some tests, and it seems only subps chokes on NaNs. I'd be grateful for some tests on other CPU types (e.g. AMD).

12750 ms for psubb
15969 ms for psubq
13094 ms for xorps
13266 ms for xorpd
13250 ms for pxor
16515 ms for subps

Title: Re: SSE2: pxor vs subps
Post by: dioxin on July 25, 2010, 06:16:45 PM

AMD Phenom II 3GHz

7140 ms for psubb
7188 ms for psubq
7141 ms for xorps
5718 ms for xorpd
7157 ms for pxor
6468 ms for subps

Title: Re: SSE2: pxor vs subps
Post by: jj2007 on July 25, 2010, 09:03:22 PM

Thanks. On my CPU, only subps fails to zero xmm0 if it contains a NaN. Since you didn't see an error box, AMD also has no objections to using the 3-byte xorps xmmn, xmmn for zeroing an XMM reg. Good to know :bg

Now that I know what to look for, it also becomes clear from the documentation:

SUBPS--Packed Single-Precision Floating-Point Subtract (http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc308.htm)

Quote: SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

XORPS--Bitwise Logical XOR for Single-Precision Floating-Point Values (http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc330.htm)

Quote: Performs a bitwise logical exclusive-OR of the four packed single-precision floating-point values...
SIMD Floating-Point Exceptions
None.
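
The difference can be modelled per lane without any SSE hardware. The sketch below is only an illustration in Python, with made-up helper names: `zeroed_by_sub` applies the arithmetic rule that subps follows for one lane (x - x), and `zeroed_by_xor` applies the bitwise rule of xorps/pxor to the lane's 32-bit pattern. It assumes exceptions are masked, so it shows only the resulting value and not the exception flags.

```python
import struct

def zeroed_by_sub(x: float) -> bool:
    """True if x - x is zero, i.e. a subtract would clear this lane."""
    return (x - x) == 0.0  # NaN - NaN and inf - inf both give NaN

def zeroed_by_xor(x: float) -> bool:
    """True if XORing the lane's single-precision bit pattern with itself is 0."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits ^ bits) == 0  # holds for every bit pattern

for lane in [1.5, -0.0, float("inf"), float("nan")]:
    print(f"{lane!r:5}  sub: {zeroed_by_sub(lane)}  xor: {zeroed_by_xor(lane)}")
```

So a register holding arbitrary data (packed bytes, words or dwords that happen to look like NaN or infinity) is cleared reliably only by the XOR forms.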

Title: Re: SSE2: pxor vs subps
Post by: KeepingRealBusy on July 25, 2010, 09:16:08 PM

And you saved yourself a byte!

Dave.

Title: Re: SSE2: pxor vs subps
Post by: dioxin on July 25, 2010, 09:49:38 PM

There should be no problem with AMD processors, as it's the way AMD recommends clearing registers.

From the AMD Software Optimization Guide for AMD Family 10h Processors:

Quote: 9.9 Clearing MMX™ and XMM Registers with XOR Instructions
Optimization
Use instructions that perform XOR operations (PXOR, XORPS, and XORPD) to clear all the bits in
MMX and XMM registers.
Application
This optimization applies to:
• 32-bit software
• 64-bit software

Rationale
The PXOR, XORPS, and XORPD instructions are more efficient than loading a zero value into an
MMX or XMM register from memory and then storing it (see Appendix C, "Instruction Latencies,"
on page 227). In addition, the processor "knows" that the PXOR, XORPS and XORPD instructions
that use the same register for both source and destination do not have a real dependency on the
previous contents of the register, and thus, do not have to wait before completing.
Examples
The following examples illustrate how to clear the bits in a register using the different exclusive-OR
instructions:
; MMX
pxor mm0, mm0 ; Clear the MM0 register.
; SSE
xorps xmm0, xmm0 ; Clear the XMM0 register.
; SSE2
xorpd xmm0, xmm0 ; Clear the XMM0 register.

Title: Re: SSE2: pxor vs subps
Post by: jj2007 on July 25, 2010, 10:26:45 PM

Quote from: KeepingRealBusy on July 25, 2010, 09:16:08 PM
And you saved yourself a byte!

Dave.

Actually, it saved 16 bytes, as it pushed the proc from 65 to 64 bytes :bg
(not that I really believed in align 16, but many here do ::))

Quote from: dioxin on July 25, 2010, 09:49:38 PM
There should be no problem with AMD processors as it's the way AMD recommend to clear registers.

Wow, thanks for the link!
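
The "65 to 64 bytes saves 16" arithmetic above can be sketched as follows, assuming whatever follows the proc is placed with align 16, so the proc effectively occupies its size rounded up to the next multiple of 16. The helper name `padded_size` is hypothetical.

```python
def padded_size(size: int, alignment: int = 16) -> int:
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)

for size in (64, 65):
    print(f"{size} bytes of code occupy {padded_size(size)} bytes")

print("saved:", padded_size(65) - padded_size(64), "bytes")
```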

Title: Re: SSE2: pxor vs subps
Post by: dioxin on July 25, 2010, 10:32:51 PM

http://support.amd.com/us/Processor_TechDocs/40546-PUB-Optguide_3-11_5-21-09.pdf

Quote: thanks for the link!

You're welcome.

Title: Re: SSE2: pxor vs subps
Post by: dioxin on July 25, 2010, 11:21:17 PM

Quote: not that I really believed in align 16, but many here do

You don't believe in aligning code?
But it makes a big difference if you align it. I get the following results if I sprinkle a few ALIGN 16s in the code:

 5748 ms for psubb
 5729 ms for psubq
 5730 ms for xorps
 5730 ms for xorpd
 5729 ms for pxor
 5990 ms for subps

That's much better than the original.

Title: Re: SSE2: pxor vs subps
Post by: jj2007 on July 26, 2010, 06:49:50 AM

Putting align 16 in front of the innermost loop has absolutely NO effect on my Celeron M... but it's true that there are more sensitive CPUs around.

Title: Re: SSE2: pxor vs subps
Post by: Rockoon on July 28, 2010, 12:52:22 AM

Branch target alignment is not so effective on Intel (currently) due to the trace cache, but it doesn't usually hurt to align anyway.

The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: jj2007 on July 24, 2010, 08:55:43 PM