Multicore theory proposal

Started by johnsa, May 23, 2008, 09:24:41 AM


c0d1f1ed

Quote from: hutch-- on June 16, 2008, 04:02:52 PM
Algos like encryption, compression, searching, and data structures (trees, hash tables) are fundamentally sequential (serial) in nature.

Encryption: CTR mode allows parallelization without weakening security (see the sketch after this list)
Compression: Huffman coding benefits from multiple processors and is optimal
Searching: the most embarrassingly parallel problem of all, ask Google
Data structures: The Art of Multiprocessor Programming has chapters on lock-free trees, hash tables, and others
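
A minimal sketch of the CTR point, with a toy mixing function standing in for a real block cipher (the function, key, nonce and sizes are all made up for illustration, not for actual security): the keystream for byte i depends only on the key, the nonce and the counter i, never on the previous ciphertext, so each thread can encrypt its own slice of the buffer independently.

#include <cstdint>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Toy stand-in for a block cipher in counter mode. Do NOT use for real security.
static uint8_t keystreamByte(uint64_t key, uint64_t nonce, uint64_t counter)
{
    uint64_t x = key ^ nonce ^ (counter * 0x9E3779B97F4A7C15ull);
    x ^= x >> 31; x *= 0xBF58476D1CE4E5B9ull; x ^= x >> 29;
    return static_cast<uint8_t>(x);
}

static void encryptRange(std::vector<uint8_t>& buf, size_t begin, size_t end,
                         uint64_t key, uint64_t nonce)
{
    for (size_t i = begin; i < end; ++i)
        buf[i] ^= keystreamByte(key, nonce, i);   // byte i is independent of byte i-1
}

int main()
{
    std::vector<uint8_t> data(1 << 20, 0xAB);     // pretend plaintext
    const uint64_t key = 0x0123456789ABCDEFull, nonce = 42;

    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;

    // Split the buffer into one contiguous slice per thread.
    std::vector<std::thread> pool;
    size_t chunk = data.size() / n;
    for (unsigned t = 0; t < n; ++t) {
        size_t begin = t * chunk;
        size_t end   = (t + 1 == n) ? data.size() : begin + chunk;
        pool.emplace_back(encryptRange, std::ref(data), begin, end, key, nonce);
    }
    for (auto& th : pool)
        th.join();

    std::printf("encrypted %zu bytes on %u threads\n", data.size(), n);
    return 0;
}

Contrast that with CBC, where each block is chained to the previous ciphertext block and encryption is forced to be serial.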

Quote
There may be alternatives but they tend to be inferior designs in terms of encryption and compression as both designs are serial chained algos.

Bollocks. But that being said, there are lots of interesting alternative algorithms that have better scaling behavior while only sacrificing a marginal bit of effectiveness.

Quote
Who cares if they cannot do the same task.

Who cares about sequential algorithms if they can't deliver.

Quote
Physical limits change as in fact they have over time, it's a brave man who predicts no speed increase in instruction throughput.

Good luck changing the speed of light.

The writing is on the wall. AnandTech's review of Nehalem shows that it is almost no faster at running single-threaded code. Any increase can entirely be attributed to the integrated memory controller, which is nothing more than an incremental improvement that won't bring back the steady single-threaded performance increases from the MHz-race days.

Quote
This IS magical technology in that it parallels single thread instructions through multiple pipelines and YES it did get faster because of it.

Sure, 15 years ago. Nowadays superscalar execution is exploited to the practical maximum, and there are no new tricks up chip designers' sleeves to further increase single-threaded instruction throughput in any substantial way.

Read Chip Multi-Processor Scalability for Single-Threaded Applications and note how the number of instructions that can execute in parallel is not much higher than 4 for realistic designs. Guess what: current CPUs already have 4 execution ports. Also note the rapidly diminishing returns for throwing more resources at it. Oh, and don't forget the conclusion, which clearly states that even the most aggressive approach to increasing single-threaded performance would run out of steam in a mere 6-8 years. Looking at Intel's and other chipmakers' roadmaps, it's pretty clear they'd rather spend their transistors on multiple cores.

Quote
Predicting the future is best done with a crystal ball, most have to be satisfied with continuity and that has been faster machines over time, multicore technology is still in its infancy in the PC market, it is useful but it's not universal in its application.

I don't need a crystal ball to see that all roadmaps are going towards massive numbers of cores. It's also clear that the continuity of the MHz-race got disrupted when Tejas got ditched in favor of multi-cores based on P6. We have a new continuity now: performance per watt. And multi-core right now gives us the biggest increase in performance for every extra transistor burning power.

Quote
Most multicore technology on current PCs is win95 multithreading technology applied to multicore hardware. Fine for where it's faster but lousy where it's not.

That's what this thread is supposed to be about. Yes, the classic approach of having one task per thread has few good applications. But by scheduling tasks within a thread, using lock-free/wait-free approaches, practically all software can scale to many cores.
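
Here's a minimal sketch of that idea, assuming a pool of worker threads that claim small tasks with a single atomic increment (the task count and the per-task work are made up):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    // A pile of small, independent tasks: far more tasks than cores,
    // so every core stays busy regardless of how many there are.
    const int taskCount = 1000;
    std::vector<double> results(taskCount);

    // Claiming the next task is a single lock-free atomic increment.
    std::atomic<int> nextTask(0);

    auto worker = [&]() {
        for (;;) {
            int i = nextTask.fetch_add(1, std::memory_order_relaxed);
            if (i >= taskCount)
                break;
            results[i] = i * 0.5;   // stand-in for the real per-task work
        }
    };

    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 4;

    std::vector<std::thread> pool;
    for (unsigned c = 0; c < cores; ++c)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();

    std::printf("ran %d tasks on %u worker threads\n", taskCount, cores);
    return 0;
}

The point is that the decomposition is into tasks, not threads: the same binary scales from 1 to N cores without being rewritten.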

hutch--

#151
 :bg

A sad tale but true, http://www.intel.com/products/processor/itanium/

In particular, view the link on this page, "Dual-Core Intel® Itanium® processor demo", to answer your questions about the EPIC architecture and what its advantages are over RISC and similar older technology.

Quote
Itanium-based servers are incredibly scalable, allowing configuration in systems of as many as 512 processors and a full petabyte (1024TB) of RAM. Together with full support for both 32-bit and 64-bit applications, that capacity provides unmatched flexibility in tailoring systems to your enterprise needs.

Quote
Dual-core processing, EPIC architecture, and Hyper-Threading Technology†: Supports massive, multi-level parallelism for today's data-intensive workloads

Quote
Support for up to 512 processors and one petabyte (1024TB) RAM: Provides scalable performance for enterprise flexibility

Quote
Up to 24MB of low-latency L3 cache: Prevents idle processing cycles with a high-bandwidth data supply to the execution cores

Quote
Core Level Lock-step: Enables one processor core to mirror the operations of the other

more ...............

You have continued to make assumptions based on current x86 architecture without realising that it is an ancient architecture. The future of genuine high performance computing is not a threaded model on low performance multiple cores, it is BOTH synchronous and asynchronous applications running on very high performance hardware.

I have said to you before, don't be without ambition, think in terms of 512 dual core Itaniums as current technology that can do things you would not believe.  :P

johnsa

On a slightly off-topic point.. why oh why haven't they updated the x86 FPU to a non-stack-based model (keeping the stack model there for compatibility if needed)? FPU performance could be increased by about 15-20%, I would reckon, just by changing the instruction set / opcodes. Every stack-FPU-based piece I've ever written would be about 20% shorter (instruction count) when implemented on a normal register-based FPU (à la 68k).

Just a thought.

c0d1f1ed

hutch, I suggest you first read the manual for the actual facts instead of throwing Itanium marketing talk at me as an argument.

Quote
In particular, view the link on this page, "Dual-Core Intel® Itanium® processor demo", to answer your questions about the EPIC architecture and what its advantages are over RISC and similar older technology.

The advantage of EPIC is that instruction dependencies are Explicit. The disadvantage of EPIC is that instruction dependencies are Explicit.

That's right. The task of instruction scheduling is entirely the compiler's responsibility (or the assembly programmer's, if you're that brave). It's an in-order architecture, so cache misses and dependencies mean stalls. So much for hardware helping you reach higher performance. Six instructions per cycle is a maximum, just like Core 2's five instructions per cycle is a maximum. And the reality is that this maximum is hardly ever reached, for the simple reason that cache misses and instruction dependencies are unavoidable. There's not enough intrinsic parallelism in a thread to sustain the maximum throughput. The only thing Itanium is rather good at is multimedia and scientific computing, but ironically that's perfectly suited for multi-core as well.

It's also interesting that with Montecito, Intel added switch-on-event multithreading, which allows it to execute another thread when a cache miss occurs. In other words, they resorted to multi-threading to increase the effective throughput. The fact that it's also a dual-core should tell you just how much of a dead end trying to increase single-threaded performance would be.

Quote
You have continued to make assumptions based on current x86 architecture without realising that it is an ancient architecture.

It hardly matters. An add, mul or div on x86 is just as good as on any other ISA. The actual architecture inside has changed tremendously from one generation to the next. What started as an in-order CISC processor became an out-of-order multi-issue RISC processor. The instruction set is nothing more than a facade, an interface used by the software. The reason x86 is still alive and kicking is that its flaws have started to matter less and less. They matter so little that Larrabee, primarily a GPU, will be x86 based.

Going multi-core is not another technology to hide any of x86's flaws. All processors are going multi-core, including IA64, ARM, SPARC, PowerPC, and many more.

Quote
I have said to you before, don't be without ambition, think in terms of 512 dual core Itaniums as current technology that can do things you would not believe.

Rest assured, I'm very ambitious, but I'm also certain that just having this kind of hardware won't automagically result in faster software that makes good use of it. It requires considerable effort in software design to scale it to such a high number of cores.

c0d1f1ed

Quote from: johnsa on June 17, 2008, 11:46:53 AM
On a slightly off-topic point.. why oh why haven't they updated the x86 FPU to a non-stack-based model (keeping the stack model there for compatibility if needed)? FPU performance could be increased by about 15-20%, I would reckon, just by changing the instruction set / opcodes. Every stack-FPU-based piece I've ever written would be about 20% shorter (instruction count) when implemented on a normal register-based FPU (à la 68k).

What's wrong with SSE(2)?

johnsa

Nothing wrong with SSE imho.. I just think they should update the std. FPU instruction set..

fmov f0,dword ptr [esi]
fsqr f0
fmov f1,f0
fsqrt f1
fmov f2,1.0
fneg f2

that sort of thing.. it produces code with, on average, about 20% fewer instructions than the whole st(n) stack model.

hutch--

This much I have learnt about Intel over time: they have an irritating habit of knowing what they are talking about with their own hardware lines, and as the world's leading chipmaker they have the proof of what they say. While the Itanium has a terrible instruction set, it is not some pie-in-the-sky future model; it is in production and is the base component for massive supercomputers from companies like SGI and others, with customers like NASA, the JPL and many research universities.

In rushing to avoid the data on production hardware already doing the job, you have missed some of its important capacities: massive extensibility, with configurations of up to 512 of the dual-core and coming quad-core versions. Then there is the existing "Core Level Lock-step" capacity, hardware synchronisation of multiple cores for mirroring. So much for the x86-based notions of the limitations of processor locking.

Then there is the notion that an Itanium is at some disadvantage in terms of stalls, yet you have to go back to a 386 or earlier to avoid stalls on x86 hardware. You may have the ambition, but you are still wearing blinkers in terms of the hardware that is coming and current hardware limits. The future holds BOTH synchronous and asynchronous parallel processing, not the multithreaded model of asynchronous processing alone.

The need for both is obvious: even if you can achieve asynchronous parallelism speed improvements, there is a limit to the number of useful splits in a task, so if you have 64 cores but can only use 8 of them, your task is limited by the core count it can use, not by the hardware. Synchronous parallel processing is the only way around this limit, so you have synchronous parallel processing being run in each thread of an asynchronous application.
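
As a back-of-envelope illustration of that split limit (made-up fractions; it is just Amdahl's law with the usable core count capped at the number of useful splits):

#include <algorithm>
#include <cstdio>

// speedup = 1 / (serial fraction + parallel fraction / usable cores),
// where 'usable' is capped by how many useful splits the task allows.
double speedup(double parallelFraction, int cores, int maxSplits)
{
    int usable = std::min(cores, maxSplits);
    double serial = 1.0 - parallelFraction;
    return 1.0 / (serial + parallelFraction / usable);
}

int main()
{
    // Made-up numbers: 90% of the work parallelises, but only into 8 pieces.
    std::printf(" 8 cores: %.2fx\n", speedup(0.9, 8, 8));    // about 4.7x
    std::printf("64 cores: %.2fx\n", speedup(0.9, 64, 8));   // still about 4.7x
    return 0;
}

Once the split count is exhausted, the extra 56 cores add nothing; only finer-grained (synchronous) parallelism inside each piece can use them.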

c0d1f1ed

Quote from: johnsa on June 17, 2008, 04:59:44 PM
Nothing wrong with SSE imho.. I just think they should update the std. FPU instruction set..

fmov f0,dword ptr [esi]
fsqr f0
fmov f1,f0
fsqrt f1
fmov f2,1.0
fneg f2

that sort of thing.. it produces code with, on average, about 20% fewer instructions than the whole st(n) stack model.

No seriously, what's wrong with SSE? :wink

movss xmm0, dword ptr [esi]   ; load x
mulss xmm0, xmm0              ; x * x
movss xmm1, xmm0
sqrtss xmm1, xmm1             ; sqrt(x * x)
movss xmm2, one               ; 'one' = a dword constant holding 1.0
xorps xmm2, sign              ; 'sign' = an xmmword mask with the sign bits set, giving -1.0

johnsa