SSE2 or FPU for floating point calcs?

Started by jj2007, June 09, 2008, 08:03:21 AM

jj2007

I am trying to find out whether it's worth investing in SSE2 for floating point algos.

Here are some bits & pieces I found:

a) Speed: Be warned though, MMX/SSE is only fast if you vectorize. _mm_sqrt_ps is twice as fast as calling fsqrt, and it does 4 sqrts instead of one. However, _mm_load_ps + _mm_store_ps takes longer than the sqrt itself. That sounds like the FPU is faster for non-vectorised apps (which is the normal case).

b) Backward compatibility: SSE2 not supported by pre-PIVs.

c) Forward compatibility: FPU supported by Win XP64, see this post.

d) Precision: FPU 80-bit, SSE2 64-bit

e) Space: FPU code inherently a bit shorter (2 bytes each), no libs required

f) Complexity: Serious problems using approximate math library, Intel Software Network, recommended in this post by Greg. Google finds only a handful of refs to amaths.lib, which is discouraging.

Overall my gut says the FPU is good enough, especially if you leave values on the FPU stack until you're done. Other thoughts?

NightWare

It depends on your needs. If MMX has to be used in your code, or if speed or parallelism is essential, then SSE2 is the choice; if precision is essential, the FPU is the choice. It's also possible to use both to advantage (if you don't need MMX).

Neo

Quote from: jj2007 on June 09, 2008, 08:03:21 AM
I am trying to find out whether it's worth investing in SSE2 for floating point algos.

Here are some bits & pieces I found:

a) Speed: Be warned though, MMX/SSE is only fast if you vectorize. _mm_sqrt_ps is twice as fast as calling fsqrt and it does 4 sqrts instead of one. However, _mm_load_ps+_mm_store_ps takes longer than the sqrt function itself; That sounds like FPU is faster for non-vectorised apps (which is the normal case).
sqrtps isn't designed to be the fast way; you can use rsqrtps then rcpps (or rsqrtss and rcpss for scalar).  Especially don't use MMX, since it's obsolete, not as useful as SSE, and causes problems if you need to combine the code with FPU operations.  The claim that the loading+storing takes longer than doing one fsqrt is bull, unless you've got a page fault on the loading/storing.  Ideally, you'd keep the values in registers and just use them anyway, since storing them in memory is usually unnecessary.

Quote: b) Backward compatibility: SSE2 not supported by pre-PIVs.
This is a concern if you're still using a P3 or equivalent.  rsqrtps and rcpps should be SSE, not SSE2, but I don't know whether that helps much.  I tend to completely ignore legacy systems, but I realize that it's an important issue for some people.

Quote: c) Forward compatibility: FPU supported by Win XP64, see this post.
They can't prevent people from using the FPU, since the CPU has no mechanism to do so, and there are some FPU operations that can't feasibly be done using SSE, such as trigonometry and exponentiation. They could stop supporting the FPU by not saving the FPU state on task switches, which would mean you'd randomly lose all FPU state without warning; but they might as well keep it, since FXSAVE saves the FPU and SSE state together anyway.

Quote: d) Precision: FPU 80-bit, SSE2 64-bit
If you need higher precision, this is a concern.  There may be more complicated ways of simulating higher precision with SSE, but that'd be a big pain.  I found that for some image processing, 16-bit fixed point calculations were precise enough, so I can do 8 of them at once using SSE2 integer instructions.

Quote: e) Space: FPU code inherently a bit shorter (2 bytes each), no libs required
Is code space a concern for you?  If you're in the Master Boot Record, it's definitely a concern, but often data size is much more of an issue.

Quote: f) Complexity: Serious problems using approximate math library, Intel Software Network, recommended in this post by Greg. Google finds only a handful of refs to amaths.lib, which is discouraging.
I dunno much about the math library you reference.  It's true that it's tough to learn how to use SSE, though, since there isn't a very good tutorial on it out there.  :(

Quote: Overall my gut says the FPU is good enough, especially if you leave values on the FPU stack until you're done. Other thoughts?
For most things, performance isn't that important, but then you probably wouldn't be using assembly language for those things.  From experience, I've found that you can often get a massive speedup (i.e. 2 to 20 times faster) from SSE over FPU operations if that's where most of the time is spent.  Some things that don't look vectorizable often are, including some things with branches, but it can be very difficult.

Best of luck with your decision  :U

Draakie

Quote: I dunno much about the math library you reference.  It's true that it's tough to learn how to use SSE, though, since there isn't a very good tutorial on it out there.

Please see :

http://www.masm32.com/board/index.php?topic=8498.0

Does this code make me look bloated ? (wink)

jj2007

Thanks to all of you, especially Neo for his detailed post :thumbu

GregL

Quote: ... recommended in this post by Greg.

I wasn't recommending it, I was just mentioning it, because it claimed to have sine, cosine, tangent etc. functions that were supposedly faster than the equivalent FPU instructions (but only at double precision). At that time, three years ago, they were saying the FPU wasn't supported in x64, which turned out not to be the case.


jj2007

Quote from: Greg on June 13, 2008, 03:15:14 AM
I wasn't recommending it, I was just mentioning it, because it claimed to have sine, cosine, tangent etc.
Intel's approximate math library does have these functions, but apparently those who tried using them failed bitterly. Strange that there is no working SSE2 math library. Anyway, for a general purpose lib, the FPU seems fast enough, especially since >90% of code doesn't require parallel maths - the main strong point of SSE2.

c0d1f1ed

Quote from: Neo on June 11, 2008, 05:55:00 AM
sqrtps isn't designed to be the fast way; you can use rsqrtps then rcpps (or rsqrtss and rcpss for scalar).

Better yet, use rsqrtps xmm1, xmm0 | mulps xmm1, xmm0.

Neo

Quote from: c0d1f1ed on June 13, 2008, 09:07:47 AM
Quote from: Neo on June 11, 2008, 05:55:00 AM
sqrtps isn't designed to be the fast way; you can use rsqrtps then rcpps (or rsqrtss and rcpss for scalar).

Better yet, use rsqrtps xmm1, xmm0 | mulps xmm1, xmm0.
Good thinking!  Looks like the latency and throughput of mulps and rcpps are similar, though.  However, does it help that the dependency on the previous instruction is through the destination operand rather than the source in your suggestion?  I guess this could use a benchmark test, hehe.  Cool idea.  :U

GregL

Quote: Anyway, for a general purpose lib, the FPU seems fast enough ...

I agree, I would use the FPU, unless I really needed the speed. SSE/SSE2 is a good thing to be at least a little familiar with though.


c0d1f1ed

Quote from: Neo on June 13, 2008, 04:38:52 PM
Good thinking!  Looks like the latency and throughput of mulps and rcpps are similar, though.  However, does having the dependency on the previous instruction be the destination and not the source help in your suggestion?  I guess this could use a benchmark test, hehe.  Cool idea.  :U

The destination of the mulps also works as a source, so there's no performance difference compared to writing rsqrtps xmm1, xmm0 | mulps xmm0, xmm1. Today's x86 processors have register renaming and internally use three operands. Anyway, the real reason to use mulps instead of rcpps is precision. Cheers.