
Multicore theory proposal

Started by johnsa, May 23, 2008, 09:24:41 AM


c0d1f1ed

Quote from: hutch-- on June 02, 2008, 12:41:17 PM
Have a look at this thread and you will see that your assumptions are unsound. In particular, look at the graphs that Greg has posted: they show clearly that a single thread is being processed by both cores. Forget abstraction, high-level tools to make it all easier and magical high-level libraries that will do it all for you; this IS being done in hardware on a modern dual-core Intel processor. The future is more of the same but with many more cores interfaced in hardware.

What are you talking about? There's no performance improvement. The O.S. simply decides to run the thread on another core from time to time.

c0d1f1ed

Quote from: johnsa on June 03, 2008, 09:02:36 PM
don't multi-thread, and let the CPU work out how best to split up the instructions it receives amongst its cores.

It's not splitting up the instructions amongst its cores.

Even if it did, that approach just doesn't scale. There's only a limited amount of instruction parallelism in any one thread, and extracting it becomes exponentially more complex. By having independent instruction streams (threads) you can get much higher throughput.
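To make the dependency problem concrete, here is a minimal C sketch (illustrative only, with arbitrary constants): the first loop is a single serial chain where every iteration needs the previous result, so spare execution units sit idle; the second splits the work into two independent chains the hardware can keep in flight at once, which is exactly the idea that independent threads generalise across cores.

#include <stdio.h>

#define N 100000000L

int main(void)
{
    long i;

    /* One long dependency chain: each step needs the previous
       result, so no amount of spare ALUs can overlap the work. */
    double serial = 0.0;
    for (i = 0; i < N; i++)
        serial = serial * 1.0000001 + 1.0;

    /* Two independent chains: the core can execute both in flight
       at once; independent threads extend the same idea across cores. */
    double a = 0.0, b = 0.0;
    for (i = 0; i < N; i += 2) {
        a = a * 1.0000001 + 1.0;
        b = b * 1.0000001 + 1.0;
    }

    printf("%f %f\n", serial, a + b);
    return 0;
}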

c0d1f1ed

Quote from: hutch-- on June 04, 2008, 12:49:33 AM
It's my guess that over time, as the core count increases, we will see a technique of core clustering in much the same way as the 4-core doubled Core 2 Duos work at the moment, but on a much larger scale.

Why would you want to do that? You seem to be missing that ALUs are pretty cheap now. Sandy Bridge will feature vector units eight elements wide, yet only for a limited range of applications. The real cost is in execution ports, and there, sharing resources between cores just doesn't help.

Quote
I would expect to see later OS versions with far more accurate thread synchronisation methods that have far finer granularity than current OS versions, which should allow greater parallelism between multiple clusters, each in itself scheduling instructions within a single thread across the number of cores in a cluster.

How, and why? First of all, there is an inherent cost to switching threads. You can have Hyper-Threading, which can switch threads on a per-clock basis, but it's too expensive to support much more than four threads, while you need a lot more to get the finer granularity you're talking about. And secondly, you don't need it if you have one thread per core that just continuously schedules and executes tasks. So why use expensive hardware solutions when there are perfectly adequate and cheap software solutions?
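As a rough sketch of that software approach (the queue size, task format and shutdown handling here are simplifications I'm assuming, not a production design), here is a minimal C/Win32 task pool with one worker thread per logical core, all pulling work items from a shared queue:

#include <windows.h>
#include <stdio.h>

#define MAX_TASKS 64

typedef void (*task_fn)(void *);
typedef struct { task_fn fn; void *arg; } Task;

static Task g_queue[MAX_TASKS];
static long g_head = 0, g_tail = 0;
static CRITICAL_SECTION g_lock;
static HANDLE g_wake;                 /* semaphore counting queued tasks */

/* Each worker loops forever: wait until a task is queued, run it.
   No thread is created or destroyed per task, so the per-task cost
   is far below that of an O.S. context switch. */
static DWORD WINAPI worker(LPVOID param)
{
    Task t;
    (void)param;
    for (;;) {
        WaitForSingleObject(g_wake, INFINITE);
        EnterCriticalSection(&g_lock);
        t = g_queue[g_head++ % MAX_TASKS];
        LeaveCriticalSection(&g_lock);
        t.fn(t.arg);
    }
}

static void submit(task_fn fn, void *arg)
{
    Task t;
    t.fn = fn;
    t.arg = arg;
    EnterCriticalSection(&g_lock);
    g_queue[g_tail++ % MAX_TASKS] = t;
    LeaveCriticalSection(&g_lock);
    ReleaseSemaphore(g_wake, 1, NULL);
}

static void hello(void *arg)
{
    printf("task %d ran on thread %lu\n", (int)(INT_PTR)arg, GetCurrentThreadId());
}

int main(void)
{
    SYSTEM_INFO si;
    DWORD i;
    int n;

    GetSystemInfo(&si);               /* one worker per logical core */
    InitializeCriticalSection(&g_lock);
    g_wake = CreateSemaphore(NULL, 0, MAX_TASKS, NULL);

    for (i = 0; i < si.dwNumberOfProcessors; i++)
        CloseHandle(CreateThread(NULL, 0, worker, NULL, 0, NULL));

    for (n = 0; n < 8; n++)
        submit(hello, (void *)(INT_PTR)n);

    Sleep(1000);                      /* crude: let the queue drain, then exit */
    return 0;
}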

And don't get me wrong, I'm not excluding some kind of high speed thread migration and such in the far future, but again it's just not the silver bullet that will free developers from writing a proper multi-core software design.

hutch--

c0d1f1ed,

The interesting part of the graphs that Greg posted is the load distribution across both cores. If you bother to look, the load is not being switched from one core to another, it's being distributed across both cores, and this is with single-threaded code. Your technical assumptions are simply incorrect, and that is because you have based your ideas on current software technology, not the emerging hardware capacity.

Nor is the capacity some pie-in-the-sky idea far in the future; it is ALREADY being done in high-end hardware, and it is being done the only way possible to pick up processor parallelism: directly in hardware. While near-4-gig clock speeds may sound fast at a computing level, at the electronic level there is hardware that is at least hundreds of times faster, and if this type of technology is built into the chip directly, for path-length reasons, the capacity to synchronise hardware will dramatically expand, and at far higher speeds.

Software multitasking is old junk from the past, about Win 3.0 technology; the future is multiprocessor/multicore hardware that can do both close-range single-thread load distribution and the current technology of multiple concurrent threads. The dustbin of history is full of hybrid stopgap technology; why waste your time with junk like this when hardware will change it all, as it always has?

johnsa

Hutch, do you think perhaps the graph is just really inaccurate? It could possibly just be switching the load from core 0 to core 1, but due to the lack of fine granularity in the graph you can't see the alternating square wave pattern, so to speak.

hutch--

John,

While it's possible, the first graph that Greg posted certainly did not look like it. The interesting part is that even if it was switching between cores alternately, it tells you something about how data is cached between the two cores and how the two cores are sharing the data. The speeds coming out of the later Core 2 Duos do not look like alternate core stalls, and with normal pipeline lengths these would be unavoidable due to result dependencies of sequential instructions.

johnsa

Agreed. In a standard two-core setup, the way I've understood it from Intel's perspective is that each core has its own cache and instruction prefetch etc. So moving a thread backwards and forwards between the two cores would cause all sorts of performance issues in terms of the instruction pipeline, re-fetches and cache updates.

So it would seem that even if the cores aren't sharing the load, so to speak, they have implemented some sort of shared prefetch/cache setup to allow code to transition seamlessly from core to core.
Perhaps this is what we could consider step 1 of something bigger down the line.

hutch--

John,

> Perhaps this is what we could consider step 1 of something bigger down the line.

It would seem so. While a much faster threaded model has clear enough logic, it does not look like the hardware is going in that direction yet. Parallel processing through multiple cores is probably where the future is going, as it can beat the clock-speed bottleneck, and if done properly, by a reasonably large margin.

What will be interesting is whether instruction order in code can affect the distribution of load across multiple cores, in much the same way as multiple pipelines respond well to preferred instruction choice and order. The PIII was going in this direction, and the PIVs certainly responded well to proper instruction choice: effectively RISC for preferred instructions, while leaving the antiques to be emulated in microcode.

The other side was that if you did mess up the instruction ordering with a pipeline design of this type, you were stung with a big performance drop. Something that most asm people are familiar with is that different encodings suit different hardware, and part of the art is to produce reasonable averages across most common hardware. I have no doubt that the coming generation of multiple-core hardware will have its own set of optimisations as well.
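For what it's worth, the usual way to settle that is empirical: a small timing harness run on each target machine, keeping whichever version averages best overall. A minimal C sketch (the two variants and iteration counts are placeholders, not real candidates):

#include <windows.h>
#include <stdio.h>

static volatile unsigned sink;   /* stops the work being optimised away */

/* Two equivalent encodings of the same job; one pipeline design may
   favour the first, another design the second. */
static void variant_a(void)
{
    unsigned x = 0;
    int i;
    for (i = 0; i < 1000; i++) x += i;
    sink = x;
}

static void variant_b(void)
{
    unsigned x = 0;
    int i;
    for (i = 0; i < 1000; i += 2) x += i + (i + 1);
    sink = x;
}

static double time_it(void (*fn)(void), int iters)
{
    LARGE_INTEGER f, t0, t1;
    int i;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    for (i = 0; i < iters; i++) fn();
    QueryPerformanceCounter(&t1);
    return (double)(t1.QuadPart - t0.QuadPart) / (double)f.QuadPart;
}

int main(void)
{
    printf("variant A: %.6f s\n", time_it(variant_a, 100000));
    printf("variant B: %.6f s\n", time_it(variant_b, 100000));
    return 0;
}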

c0d1f1ed

Quote from: hutch-- on June 06, 2008, 01:52:46 AM
The interesting part of the graphs that Greg posted is the load distribution across both cores. If you bother to look, the load is not being switched from one core to another, it's being distributed across both cores, and this is with single-threaded code.

It's not distributed across the cores. What you see is an averaging of the thread running for some time on one core and for some time on the other core, never simultaneously.

c0d1f1ed

Quote from: hutch-- on June 06, 2008, 09:10:44 AM
The interesting part is that even if it was switching between cores alternately, it tells you something about how data is cached between the two cores and how the two cores are sharing the data. The speeds coming out of the later Core 2 Duos do not look like alternate core stalls, and with normal pipeline lengths these would be unavoidable due to result dependencies of sequential instructions.

Thread switching happens either way: whether you're on a single-core or a multi-core, the O.S. schedules a different thread at every interrupt.

So you get some stalls either way, and you won't see much of a speed difference running a single thread on a single-core or a multi-core processor.
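This is easy to observe directly. A minimal sketch (assuming Windows Vista or later, which provides GetCurrentProcessorNumber): one busy thread logs the core it is on whenever that changes, and the log shows it hopping between cores over time, never running on both at once.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD last = (DWORD)-1;
    DWORD core;
    volatile unsigned x = 0;
    int chunk, i;

    for (chunk = 0; chunk < 500; chunk++) {
        for (i = 0; i < 2000000; i++)
            x += i;                          /* keep the core busy */

        /* Log only when the O.S. has moved us to a different core. */
        core = GetCurrentProcessorNumber();
        if (core != last) {
            printf("chunk %4d: now on core %lu\n", chunk, core);
            last = core;
        }
    }
    return 0;
}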

c0d1f1ed

Quote from: johnsa on June 06, 2008, 01:45:23 PM
So moving a thread backwards and forwards between the two cores would cause all sorts of performance issues in terms of the instruction pipeline, re-fetches and cache updates.

Those penalties are minor compared to the length of the time slices (on the order of 10-100 milliseconds). And even if other threads are essentially idle (there could be thousands of them), the O.S. still schedules them from time to time so they can check whether they have any new tasks waiting. This happens on a single-core too, so your thread is going to get interrupted several times per second anyway. On a dual-core, it doesn't matter much whether, when your thread gets scheduled again, it gets the first or the second core.

Quote
So it would seem that even if the cores aren't sharing the load, so to speak, they have implemented some sort of shared prefetch/cache setup to allow code to transition seamlessly from core to core.

Core 2 has a shared L2 cache, which means that the worst data access penalty you're going to get is an L2 fetch latency. That's negligible compared to the length of the time slice, and the transient is almost entirely hidden by out-of-order execution.
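That's easy to put a number on: time the same single-threaded workload once pinned to core 0 and once free to migrate between both cores. A minimal C sketch (assuming a dual-core machine; the affinity masks are hard-coded): the two times should come out nearly identical, since an occasional migration costs at worst an L2 refill, which is noise next to a 10-100 ms time slice.

#include <windows.h>
#include <stdio.h>

static double run(DWORD_PTR affinity)
{
    LARGE_INTEGER f, t0, t1;
    volatile unsigned x = 0;
    int i;

    SetThreadAffinityMask(GetCurrentThread(), affinity);
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    for (i = 0; i < 200000000; i++)
        x += i;                      /* the same workload both times */
    QueryPerformanceCounter(&t1);
    return (double)(t1.QuadPart - t0.QuadPart) / (double)f.QuadPart;
}

int main(void)
{
    printf("pinned to core 0:    %.3f s\n", run(1));   /* mask 01 */
    printf("free over cores 0-1: %.3f s\n", run(3));   /* mask 11 */
    return 0;
}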

hutch--

c0d1f1ed,

It is evident that you have not had a good look at the graph that Greg posted in a file named p4timing.jpg. The test piece is a deliberately processor-intensive single thread, yet the graph clearly shows that at startup the app loads BOTH cores for the beginning of the graph, then settles down to a reverse-symmetrical load sharing. You appear to have missed that it is running a single thread; the OS thread scheduler has nothing to do with the load distribution between the two cores.

I have for comparison a 3.2 GHz Prescott PIV, which is similar in clock speed to Greg's Duo, yet his timings are faster than the Prescott's, even allowing for the Duo having faster memory and a higher bus speed than the PIV. It is hardly a problem on a single-core PIV to raise the priority of a test piece, so the time slice is no big deal for test purposes, but the simple fact is the Duo is faster due to its multicore processing of the single-threaded test piece.

Forget abstraction, magic libraries and ring3 emulation of Win 3.0 co-operative multitasking; it's all old-hat technology from the dustbin of history. The brave new world will continue as it is developing at the moment: multiple-core processing of single-threaded code, in conjunction with the existing capacity for multiple essentially concurrent threads for tasks like servers and the like.

Vertical performance (how fast a thread will execute) will come from the former; horizontal performance (how many concurrent threads) will come from the latter.

sinsi

Quote from: hutch-- on June 07, 2008, 04:34:28 AM
the graph clearly shows that at startup the app loads BOTH cores for the beginning of the graph
Some of that can be Windows itself (looking up bizarre registry keys etc.) before the program even gets loaded.

Seems to me that one side here is fixed on hardware and the other on software. I would like an OS that will run on two of my cores and leave the other two for the one or two programs that I actually use 'simultaneously/multitaskingly'.

Famous names: Multi-threaded development joins Gates as yesterday's man

edit: interesting that 8 out of 20 new topics in the lab are about multi core/cpu...

hutch--

Thanks for the link, it's a good article. I come down on the side of knowing what you are doing, not the magic-library approach. It also comes across that this area is both in its infancy in terms of cheap PCs and that the future design direction is not all that clear. Donald Knuth's comments are indeed interesting, and the sad part is he may be right.

I wonder when we will see the first terahertz processor?  :bg

c0d1f1ed

hutch,

When you start a process, a lot more happens than setting the instruction pointer to the entry point. For every new memory page accessed, an interrupt is generated and the O.S. loads the page from disk. Disk access is controlled by a separate thread to allow asynchronous I/O and arbitration between other processes. So this is one way in which a single-threaded application can still cause multiple threads to run concurrently. But this is just a startup phenomenon; after that, you get practically zero benefit from running a single-threaded application on a multi-core.

Every few time slices the O.S. simply schedules the thread on another core, which, when averaged out over a second or so, becomes a 50/50 distribution. The reason the O.S. schedules it on different cores is that time slices expire either way, and there are many other threads wanting some execution time. With just my browser open, Task Manager tells me Windows is juggling over 500 threads. So the O.S. schedules our thread on whatever core is available at the time it decides it's our turn again.

Greg's Core 2 Duo is not faster than your Prescott at running single-threaded applications because of dual-core. It's faster because its significantly different architecture can sustain a much higher IPC, and it has a more advanced cache hierarchy. The NetBurst architecture had a low IPC by design, to allow a higher clock frequency to compensate for it (3+ GHz on 90 nm is quite impressive; we're only slowly seeing that return on 45 nm). Unfortunately they relied on clock frequency not only to compensate for the low IPC, but also to gain a competitive advantage. Fundamental physics largely prevents that from happening: above 4 GHz the power consumption becomes unmanageable. Power increases quadratically with voltage and linearly with frequency, but voltage at a given process node can't be lowered, because increasing frequency means there's less time to charge wires (make them transition from logical 0 to 1), for which you need higher voltage (any overclocker will tell you this). Voltage can only be reduced marginally at every new process node. So if you want power consumption to stay below a certain level, clock frequency can only be increased slowly in future generations of processors. The only reason the Pentium 4 was able to increase frequency aggressively from 1.3 to 3.8 GHz is that it started at 50 Watt and went up to 115 Watt, and it implemented clock gating. There is no headroom left.
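To put rough numbers on that scaling argument, using the usual dynamic-power model P ~ C * V^2 * f (the voltages and frequencies below are illustrative values, not measurements):

#include <stdio.h>

int main(void)
{
    /* Normalised baseline vs. a part clocked 50% higher that needs
       20% more voltage to charge the wires in less time. */
    double base_v = 1.0, base_f = 3.0;
    double fast_v = 1.2, fast_f = 4.5;

    double ratio = (fast_v * fast_v * fast_f) / (base_v * base_v * base_f);
    printf("power ratio: %.2fx\n", ratio);   /* ~2.16x for +50% clock */
    return 0;
}

So a 50% clock gain can cost more than double the power, which is the headroom problem in a nutshell.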

Now, the hardware technology you seem to be expecting is called Reverse Hyper-Threading. The term was coined by a site rumoring that AMD's dual-core might be able to use resources from both cores on a single thread. The fact of the matter is that Reverse Hyper-Threading is a myth; AMD's multi-core chips don't have it, and the reason they don't is that it's physically impossible. Electrical signals travel at a fraction of the speed of light, but at multi-GHz frequencies that's only enough to get from one pipeline stage to the next. Distributing instructions from one thread across different cores would require communication with execution units so distant (at this scale) that it would take multiple clock cycles of latency and high power, so you're not gaining anything. There is some very interesting research going on that uses low-power lasers to communicate between cores, and stacking of chips to bring components closer together is another interesting approach, but these are again merely incremental improvements necessary for steady advancement over the next decades. They will be most useful for scaling beyond a single-digit number of cores, not for running a single thread that much faster.

There is no way to avoid having to redesign your software to take advantage of current and future multi-core chips. Have a look at these Nehalem benchmarks: AnandTech. In the single-threaded benchmark it performs the same as a Penryn. And I see no reason to fight multi-core programming either: once you've put in the effort, you can get massive performance improvements.