
mmx running in parallel

Started by thomas_remkus, August 26, 2007, 01:46:55 AM


thomas_remkus

What does it mean when someone says that MMX/SSE/SSE2 can run in parallel to give a performance boost? Parallel?

u

Those instructions do the same operation on 2/4/8 integers/floats at once (in parallel).

if you have
var0+=add0;  var1+=add1; var2+=add2; var3+=add3;
you can do it all in 1 instruction:
addps xmm0,xmm1 ; this is SSE
Please use a smaller graphic in your signature.

thomas_remkus

Ultrano, I really respect your code and your comments. I have followed some of your postings before and really respect that you know what you are doing. I am not attempting to offend you ... but I have no idea what you are talking about. Can you offer a little more description of what this does? I was thinking this was an unroll but it seems totally alien.

u

oks, np.
Consider the C code:


float var0=1.0, var1=1.0, var2=1.3, var3=2.0;
float add0=3.0, add1=7.0, add2=2.1, add3=2.2;

var0+=add0;
var1+=add1;
var2+=add2;
var3+=add3;


With asm, you do the 4 additions at once:


.data
align 16
var0 real4 1.0, 1.0, 1.3, 2.0
add0 real4 3.0, 7.0, 2.1, 2.2
.code

movaps xmm2,var0 ; loads var0,var1,var2,var3 into the XMM2 register (128-bit)
addps  xmm2,add0 ; adds to var0,var1,var2,var3 the values of add0,add1,add2,add3 respectively
movaps var0,xmm2 ; stores XMM2 into var0,var1,var2,var3



otherwise, you'd have to do:


fld var0
fadd add0
fstp var0
fld var1
fadd add1
fstp var1
fld var2
fadd add2
fstp var2
fld var3
fadd add3
fstp var3
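In C you can get the same packed add through the compiler's SSE intrinsics instead of inline asm. A rough sketch (the variable names here are made up, not from the code above):

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

float var[4] = {1.0f, 1.0f, 1.3f, 2.0f};
float add[4] = {3.0f, 7.0f, 2.1f, 2.2f};

/* Same three steps as the asm: load 4 floats, add 4 floats,
   store 4 floats -- one packed operation each. */
void add4(void)
{
    __m128 v = _mm_loadu_ps(var);         /* like movups        */
    v = _mm_add_ps(v, _mm_loadu_ps(add)); /* like addps: 4 adds at once */
    _mm_storeu_ps(var, v);                /* store the 4 results back   */
}
```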



Loop unrolling, yea. When you have to do the same simple operation on two arrays:

for(int i =0;i<size;i++)array1[i]+=array2[i];

You'll make the loop process 4 elements per iteration, like replacing one simple "rep movsb" with a "rep movsd".
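A sketch of that loop in C with SSE intrinsics (a hand-rolled illustration, not from the thread), processing 4 floats per iteration with a scalar tail for the leftovers:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* array1[i] += array2[i] for i in [0, size), 4 floats at a time. */
void add_arrays(float *array1, const float *array2, int size)
{
    int i = 0;
    for (; i + 4 <= size; i += 4) {
        __m128 a = _mm_loadu_ps(&array1[i]);          /* load 4 floats   */
        __m128 b = _mm_loadu_ps(&array2[i]);
        _mm_storeu_ps(&array1[i], _mm_add_ps(a, b));  /* 4 adds at once  */
    }
    for (; i < size; i++)   /* scalar tail: 0-3 leftover elements */
        array1[i] += array2[i];
}
```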

thomas_remkus

Ah, that really explains it well. I'm excited to try this out myself now and see what sort of performance I might get from using one method over the other. People say it's "faster" but I have not seen any stats on what that means.

Thank you again for all your help!

Rockoon

It's faster, but usually doesn't live up to expectations on modern CPUs..

..this can be explained with two facts:

A) a modern CPU is already very good at performing non-SIMD instructions in parallel
B) very few CPUs have enough execution units to perform an entire SIMD instruction all at once

If your non-SIMD code already nearly saturates the execution units of the CPU because the operations are well ordered and interleaved, then the benefits of converting to SIMD are very elusive.

There are also some fairly annoying downsides to SIMD, such as really only being worthwhile with a Structure-of-Arrays data layout rather than an Array-of-Structures data layout. Regular non-SIMD code couldn't care less (for the most part).
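To illustrate the layout point (hypothetical types, not from this thread): with AoS the four x values you want to add are sizeof(struct) bytes apart, while with SoA they sit contiguously, ready for a single movaps/addps.

```c
#define N 8  /* small example size */

/* Array-of-Structures: the fields of one point are adjacent, so a
   packed op across four consecutive x's would need shuffles first. */
struct PointAoS { float x, y, z; };

/* Structure-of-Arrays: each field is its own contiguous array, so
   four consecutive x's load straight into an XMM register. */
struct PointsSoA { float x[N], y[N], z[N]; };

/* The SoA loop walks a contiguous float array and vectorizes
   trivially; the AoS loop strides by sizeof(struct PointAoS). */
void shift_x_aos(struct PointAoS *p, float dx) {
    for (int i = 0; i < N; i++) p[i].x += dx;
}
void shift_x_soa(struct PointsSoA *p, float dx) {
    for (int i = 0; i < N; i++) p->x[i] += dx;
}
```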

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

daydreamer

Quote from: Rockoon on August 27, 2007, 06:30:09 AM
It's faster, but usually doesn't live up to expectations on modern CPUs..

..this can be explained with two facts:

A) a modern CPU is already very good at performing non-SIMD instructions in parallel
B) very few CPUs have enough execution units to perform an entire SIMD instruction all at once
NO
A) a modern CPU is actually getting slower at, for example, bit shifts; but if you choose to use floats, with SSE and SSE2 "bit shifts" can be performed 4 in parallel using Donkey's trick of subtracting from the exponent part of the float with psubb
Which is the whole point of why we want to code in assembler: to seek perfection and get the very world-best performance; otherwise we'd code in HLLs
B) stop misinforming and read up on Agner Fog's newest doc about up to 3 SSE instructions on a MODERN CPU, which these latest Core 2 Duos are.
Modern CPUs decode 3 opcodes simultaneously; the reason they don't execute 3 simultaneously is when you have a dependency on an earlier instruction
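For readers who haven't seen it: the trick mentioned above divides a float by a power of two by subtracting from its exponent field (the SSE2 form does the subtract with a packed instruction such as psubb across four floats at once). A scalar C sketch of the idea; it assumes positive, normal floats and no exponent underflow:

```c
#include <stdint.h>
#include <string.h>

/* Divide a positive, normal float by 2^n by subtracting n from the
   8-bit exponent field, which starts at bit 23 of the IEEE-754
   single-precision layout. No handling of zero, denormals, inf, NaN. */
static float div_pow2(float f, int n)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* reinterpret the bits without UB */
    bits -= (uint32_t)n << 23;       /* exponent -= n  =>  f /= 2^n     */
    memcpy(&f, &bits, sizeof bits);
    return f;                        /* e.g. div_pow2(8.0f, 1) == 4.0f  */
}
```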


Rockoon

Quote from: daydreamer on August 27, 2007, 07:54:55 AM
Quote from: Rockoon on August 27, 2007, 06:30:09 AM
It's faster, but usually doesn't live up to expectations on modern CPUs..

..this can be explained with two facts:

A) a modern CPU is already very good at performing non-SIMD instructions in parallel
B) very few CPUs have enough execution units to perform an entire SIMD instruction all at once

NO


let's see

Quote from: daydreamer on August 27, 2007, 07:54:55 AM
A) a modern CPU is actually getting slower at, for example, bit shifts; but if you choose to use floats, with SSE and SSE2 "bit shifts" can be performed 4 in parallel using Donkey's trick of subtracting from the exponent part of the float with psubb

You found a single example where SSE2 (which is not SSE) can significantly outperform regular integer instructions.

If I then show a single example where FPU code can significantly outperform SSE2 instructions does that invalidate YOUR point?

Quote from: daydreamer on August 27, 2007, 07:54:55 AM
Which is the whole point of why we want to code in assembler: to seek perfection and get the very world-best performance; otherwise we'd code in HLLs

Yes. But perfection is an *entire* algorithm performing faster than all the alternatives, and is not related to the performance of single instructions in isolation. To quote Abrash: "TANSTATFC - There Ain't No Such Thing As The Fastest Code"

Quote from: daydreamer on August 27, 2007, 07:54:55 AM
B) stop misinforming and read up on Agner Fog's newest doc about up to 3 SSE instructions on a MODERN CPU, which these latest Core 2 Duos are.
Modern CPUs decode 3 opcodes simultaneously; the reason they don't execute 3 simultaneously is when you have a dependency on an earlier instruction

Stop misinforming, huh? 

There are plenty of reasons why the execution units might not be (nearly) saturated.

This is not restricted to instruction-decode bottlenecks, and generally hasn't been on any of the Intel processors, because Intel processors have been caching decoded instructions for quite some time. The vast majority of the time, good code that does not come near saturating the execution units of a modern processor fails to because either (A) there aren't any candidate instructions within the pipeline that can be applied to specific execution units, or (B) the dreaded cache miss.

For instance, on the core2 each execution unit ("port") can only handle specific uops: all uops with latency 3 must be handled by port 1, and all uops with latencies higher than 3 must be handled by port 0. In the example you have given, the psubb instruction gets handled on either port 0 or 1 (the only 2 of the core2's *6* execution units it is allowed on), and it can block the pairing of high-latency instructions. Stalling a high-latency instruction is a disaster for the pipeline, and this can happen with or without a dependency chain on that high-latency instruction. It can be much better to perform 6 or even 10 latency-1 integer instructions in parallel with that high-latency instruction, even if it means abandoning the SIMD instructions, because the total cost for the block of code can be exactly equal to the latency of the slowest instruction, rather than causing it to stall and be even more expensive.

I would again warn you that timing code in isolation is not really a good idea, and debating the merits of single instructions without regard to the context they will be used in is a waste of time.

As a generalization, I would say that slim loops (small amounts of work per iteration) will most often benefit from the SIMD instructions, but fat loops (large amounts of work per iteration) are usually better in general (with or without SIMD), and they often exhibit pairing opportunities very similar to SIMD-specific code (sometimes better, sometimes worse). This includes pairing up upkeep instructions such as maintaining counters and pointers, address generation, branching, and memory reads and writes.

And just so you know, the core2 isn't the only modern CPU. When people discuss modern CPUs they are usually also referring to Pentium Ms, AMD64s, and to a lesser extent P4s.. And since AMD64s comprise nearly half of the high-end PC gaming market according to the latest Valve survey, it is very hard to justify arguments based on a single CPU like the core2.

Now stop being so rude, especially when you are so narrow-minded. Thanks.