
Multicore theory proposal

Started by johnsa, May 23, 2008, 09:24:41 AM


johnsa

Hey all,

I've been keeping abreast of the recent, shall I say, highly volatile (pardon the pun) conversations around multi-core programming and threading, and the subsequent performance issues around locking, thread management and the Windows kernel's ability to deal with synchronization.

In any event, I did quite a bit of thinking about the problem, more generally on how, to my mind, multi-core software design should be implemented in a perfect world. This idea/theory is quite a departure from the standard concept of threading, and while I believe that threading is a perfect candidate for some multi-processor scenarios/algorithms, it is not an ideal model for all. The theory I am proposing would ideally require new processor opcodes to facilitate it effectively; however, it may be possible, with a bit of cunning, to achieve the same result through some abusive use of current threading technologies, macros etc.

The theory is quite difficult to explain but it goes something along the following lines:
We assume here a system with N=4 cores.

There is always a master core; this is the core that your application starts up on. It runs continuously and operates exactly the way a current single-threaded process would.

This master core is responsible for designating work to the other cores which act as "Function Executors".

These additional cores, unlike in a traditional threading model, are NOT constantly running; they would perform a requested function and then HALT until another function execution is requested.
(That may not be possible currently without additional CPU opcodes or at the very least some sort of ring0 kernel code to make it happen).
Possible solution to this (just an idea): allocate as many threads as there are cores at the outset, i.e. 4. Each thread has some sort of spin loop + lookup which sits idle until a new function address is specified in some sort of almost vtable-like structure.
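
A minimal sketch of that emulation (all names here are my own hypothetical ones, and I assume one aligned dword slot per worker thread so that plain reads and writes of it are atomic):

.data
pFunc dd 0                      ; per-worker slot: address of the code block
                                ; to execute, 0 = idle
.code
WorkerLoop proc
  spin:
    mov eax, pFunc              ; poll the slot (aligned dword read is atomic)
    test eax, eax
    jz spin                     ; nothing scheduled yet, keep spinning
    call eax                    ; execute the requested "function"
    mov pFunc, 0                ; finished - mark this worker idle again
    jmp spin
WorkerLoop endp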

During execution of the "function" (I use the term loosely as it's more of a code-block) it is possible for that core to make standard branches and calls which would execute on that same core.
Upon completion of the function, instead of a RET type construct, there would be a CORE_HALT opcode/macro to either stop that core or, in our emulation, return to the wait-loop.

These multi-core "functions" would be executed/signalled by the master core using a syntax based on the traditional CALL but with an additional parameter: the core number, or perhaps a code to indicate round-robin selection of the first non-active core.
CALL ProcessBlock,CORE2   or CALL ProcessBlock,FIRST_AVAILABLE_CORE
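
In the emulated version, that syntax could be approximated with a macro along these lines (CORECALL and pFuncTable are hypothetical names of mine, extending the one-slot-per-worker idea above to a table of slots):

.data
pFuncTable dd 4 dup(0)          ; one dword slot per core, 0 = idle

.code
CORECALL MACRO fn:REQ, corenum:REQ
    ;; assumes the target slot is currently 0 (worker idle)
    mov dword ptr pFuncTable[(corenum)*4], offset fn
ENDM

; usage: CORECALL ProcessBlock, 2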

Two other opcodes/macros would be created, namely WAIT and SIGNAL.
In the scenario where a "function" on core N has to wait for operations on other cores to complete, it can use WAIT waitCode1,waitCode2,... (i.e. it can wait for multiple signals to be raised).
The "functions" on those other cores would then use SIGNAL (1..255), a byte value acting as a unique signal flag, just before their CORE_HALT.

In a number of cases we need to pay close attention to the design/implementation of a multi-core solution, as was discovered previously with multi-threaded memory fills. The limiting factor is still bus/memory bandwidth.
So no matter how many cores you have, although you might save a few clock cycles on loop overhead by filling in parallel, the end result will never be much better than allowing a single core to fill the memory. It would be better to find some other real-world task (preferably not memory intensive) that the other core could perform while core N completes the memory fill.

This also raises some concerns for me in terms of how a multi-core environment allocates memory/bus bandwidth to its individual cores. It would be great if you had dynamic control over this and were able to set some sort of minimum/maximum, to enable a memory-hungry function/core to use say 90% of the available bus/memory bandwidth while a more computationally-intensive core/function requires only the remaining 10%.

Another approach here would be to have some sort of CORE(n)-LOCAL memory, in which case "function a" (computational) could load its data into the CORE-LOCAL memory, then signal the start of the memory-intensive fill on the next core while it continues to perform its computation on its core-local data. We could perhaps use a forced cache-fetch to achieve this to some extent, if the computation process could work with a small amount of in-cache data while the memory-hungry process uses non-temporal stores.
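
To illustrate the non-temporal half of that idea, a fill loop built on SSE2 streaming stores might look like this (a sketch under my own assumptions: edi points to a 16-byte aligned buffer and ecx holds the number of 16-byte blocks to fill):

NTFill proc
    pxor xmm0, xmm0             ; the fill value (zero here)
  fill:
    movntdq [edi], xmm0         ; streaming store - bypasses the cache, so the
                                ; computing core's cached data stays resident
    add edi, 16
    dec ecx
    jnz fill
    sfence                      ; make the streaming stores globally visible
    ret
NTFill endp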

So as an example of how I see this working (3 cores):
we have two "functions" which SUM the values in 2 different buffers and then add the results.

Master Core              | Core 2                   | Core 3
-------------------------|--------------------------|---------------------------
; initiate calls         |                          |
; to other cores         |                          |
call SumIt1,CORE2        | SumIt1:                  | SumIt2:
call SumIt2,CORE3        | mov ecx,100              | mov ecx,100
                         | lea edi,[buffer1]        | lea edi,[buffer2]  ; could be buffer1 but half-way in etc.
WAIT 2, DoGarbage        | xor eax,eax              | xor eax,eax
mov eax,SumResult        | loop1:                   | loop2:
                         | add eax,[edi]            | add eax,[edi]
                         | add edi,4                | add edi,4
                         | dec ecx                  | dec ecx
                         | jnz short loop1          | jnz short loop2
                         | mov SumResult,eax        | WAIT 1
                         | SIGNAL 1                 | add SumResult,eax
                         | CORE_HALT                | SIGNAL 2
                         |                          | CORE_HALT


; in the case of core 3's WAIT we need not worry, as we can assume from the code that both loops should finish in approximately the same time
; due to each having 50% of the task. So this WAIT shouldn't take too long and doesn't need to perform any processing while waiting.

; However, WAIT 2 on the master core will wait for the entire duration of core2/core3's processing; we would like to be able to carry on with
; something else in the meantime... if we had some sort of Garbage Collection process, that would be a perfect time to do it... think of a Java/C# type
; environment. As you can see, the extra parameter on WAIT indicates a function to execute while waiting. Ideally the function wouldn't perform the full
; task but a tiny bit of it incrementally, as once it returns the WAIT will still be in effect and it can be called over and over until either it is complete
; or the WAIT is over, and we've only lost a tiny bit of time/overlap before being able to read SumResult.

; That function could be replaced with anything incremental, perhaps some data elsewhere needs sorting etc.

; Using WAIT in the above fashion, we can provide synchronized access to certain data that we can ensure will be ready and won't be concurrently accessed.


Feel free to comment, expand..  - even if you think it's complete bollocks!
John

hutch--

John,

I did in fact read your post, but too many variables came to mind. The most recent read of Intel's new hardware specs in development mentions different types of multicore hardware, from simple stripped-down cores up to more complicated fully-specced ones. Current multicore hardware apparently suits multithreaded code well for large servers that handle large counts of concurrent threads, but I have yet to see the design work for close-range algorithm code that is linear in nature.

I think it was Bogdan who mentioned that until two or more cores can access the same memory, this area may not improve in any hurry. I think the current Core 2 Duos from Intel are more like a PIII core, which makes them a bit more predictable in coding terms, but I suspect that unless multiple cores can do at least some of the things that the early PIV designs had in mind, they will not do this stuff well. Here I refer to speculative branch prediction, out-of-order execution and multiple pipelines that all work in a linear algorithm with no problems.

johnsa


F3 90 - PAUSE
Acts as a hint to the CPU (P4+) that the code is a spin-wait loop.

F4 - HLT (halt)
Forces the logical processor to halt execution; it resumes on an interrupt or NMI.

0F 01 C8 - MONITOR
0F 01 C9 - MWAIT
Used in conjunction to set up an address-range monitor and put the logical processor into an optimized wait state.

Looking at my code example and these opcodes, I reckon the process could be implemented quite effectively. It's a shame about Windows being in my way.. I need ring0 PM again.. :)
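
For what it's worth, PAUSE at least needs no ring0; only MONITOR/MWAIT are privileged. Dropping it into the emulated wait-loop I sketched earlier would just be:

  spin:
    pause                       ; hint to the CPU that this is a spin-wait
                                ; (saves power, avoids the memory-order
                                ; mis-speculation penalty on loop exit)
    mov eax, pFunc              ; poll the worker slot as before
    test eax, eax
    jz spin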

c0d1f1ed

Quote from: johnsa on May 23, 2008, 09:24:41 AM
This master core is responsible for designating work to the other cores which act as "Function Executors".

Unfortunately this is a bad idea to start with, because the master core will most of the time be waiting on other cores. So you waste a whole core. Furthermore, while it's busy giving one core a new task, other cores can finish and they all have to wait for the master core. This is called convoying, and it will drastically lower effective performance. Last but not least, this approach doesn't scale well to more cores.

Quote
In a number of cases we need to pay close attention to the design/implementation of a multi-core solution, as was discovered previously with multi-threaded memory fills. The limiting factor is still bus/memory bandwidth.
So no matter how many cores you have, although you might save a few clock cycles on loop overhead by filling in parallel, the end result will never be much better than allowing a single core to fill the memory. It would be better to find some other real-world task (preferably not memory intensive) that the other core could perform while core N completes the memory fill.

This also raises some concerns for me in terms of how a multi-core environment allocates memory/bus bandwidth to its individual cores. It would be great if you had dynamic control over this and were able to set some sort of minimum/maximum, to enable a memory-hungry function/core to use say 90% of the available bus/memory bandwidth while a more computationally-intensive core/function requires only the remaining 10%.

Most tasks, except pure memory copying or filling, are not bandwidth limited. You have to make sure that each task maximizes cache utilization (batching may help). In my experience there is no need to bother with how the CPU balances RAM accesses.
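
A sketch of what batching can mean in practice (my own hypothetical example: PassOne/PassTwo are stdcall routines and BLOCKSIZE is sized to stay resident in cache): run every pass over one cache-sized block before moving on, rather than sweeping the whole buffer once per pass.

PassOne proto :DWORD            ; hypothetical first transform over one block
PassTwo proto :DWORD            ; hypothetical second transform

.data
buffer db 65536 dup(0)          ; the data to process
BLOCKSIZE equ 8192
NUMBLOCKS equ 65536 / BLOCKSIZE

.code
    mov esi, offset buffer
    mov ecx, NUMBLOCKS
  batch:
    push ecx                    ; ecx is volatile across stdcall calls
    invoke PassOne, esi         ; first pass leaves the block in cache...
    invoke PassTwo, esi         ; ...so the second pass reads it from cache
    pop ecx
    add esi, BLOCKSIZE
    dec ecx
    jnz batch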

Quote
Feel free to comment, expand..  - even if you think it's complete bollocks!

It's great that you're giving this some thought, and some concepts are sound, but please do yourself and others a favor and get a quad-core as soon as possible to get some hands-on experience. Some ideas, even if they seem interesting in theory, simply don't work. There's little point discussing proposals for future extensions if you don't already have a firm grip on how things work today. Amazing efficiency can be achieved with lock-free task scheduling...
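
To give a flavour of what I mean (a deliberately simplified sketch of my own, not a full scheduler): every worker thread claims tasks with a single atomic instruction, and no master core is involved.

.data
TaskIndex dd 0                  ; next unclaimed task
TaskCount dd 0                  ; number of tasks queued
TaskTable dd 256 dup(0)         ; addresses of the task routines

.code
; every worker thread runs this same loop; LOCK XADD hands each
; thread a unique index, so no two threads claim the same task
TaskLoop proc
  next:
    mov eax, 1
    lock xadd TaskIndex, eax    ; eax = the index we just claimed
    cmp eax, TaskCount
    jae done                    ; queue exhausted
    call dword ptr TaskTable[eax*4]
    jmp next
  done:
    ret
TaskLoop endp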

sinsi

Quote from: c0d1f1ed on May 27, 2008, 05:54:15 AM
...the master core will most of the time be waiting on other cores.
But to use more than one core for one thing (e.g. Windows, Linux etc.) there has to be a 'master' or supervisor core (the BSP) to control the other cores (APs) and tell them where the code is, when to run, and to arrange new code.

c0d1f1ed

Quote from: hutch-- on May 24, 2008, 04:51:25 AM
I did in fact read your post, but too many variables came to mind. The most recent read of Intel's new hardware specs in development mentions different types of multicore hardware, from simple stripped-down cores up to more complicated fully-specced ones.

Both Nehalem and Sandy Bridge have complex cores. So, from a programming point of view, things won't change radically from today's architectures until at the very least 2012. Furthermore, I really don't expect them to suddenly lower single-threaded performance. They might transition to simpler cores so they can fit more on a chip, while keeping per-core performance at least at the same level.

But there are still indications that they actually want to make bigger cores. AVX doubles the width of the SSE registers, which theoretically doubles performance. FMA also increases computation density. All it lacks is scatter/gather to allow parallelization of any loop. This is more interesting than doubling the number of cores. What happens beyond that is going to be very interesting: beyond about 16 cores current programming models become unmaintainable, two SIMD units is likely the maximum, and wider vectors are also impractical...

Quote
I think it was Bogdan who mentioned that until two or more cores can access the same memory, this area may not improve in any hurry.

What are you talking about? They are perfectly capable of accessing the same memory.

Quote
I think the current Core 2 Duos from Intel are more like a PIII core, which makes them a bit more predictable in coding terms, but I suspect that unless multiple cores can do at least some of the things that the early PIV designs had in mind, they will not do this stuff well. Here I refer to speculative branch prediction, out-of-order execution and multiple pipelines that all work in a linear algorithm with no problems.

Again: what? Core 2 has speculative branching, out-of-order execution, and multiple pipelines.

sinsi

Quote from: c0d1f1ed on May 27, 2008, 06:21:23 AM
Quote
I think it was Bogdan who mentioned that until two or more cores can access the same memory, this area may not improve in any hurry.

What are you talking about? They are perfectly capable of accessing the same memory.

Intel:
Quote
Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location.

c0d1f1ed

Quote from: sinsi on May 27, 2008, 06:16:25 AM
But to use more than one core for one thing (e.g. Windows, Linux etc.) there has to be a 'master' or supervisor core (the BSP) to control the other cores (APs) and tell them where the code is, when to run, and to arrange new code.

A process can start on any core. After it has created additional threads, there is no need to designate one as the master. They can all be equivalent and share the same scheduler.

sinsi

Quote from: c0d1f1ed on May 27, 2008, 06:34:28 AM
They can all be equivalent and share the same scheduler.

But where does the scheduler run? Wouldn't it be an endless loop?

c0d1f1ed

Quote from: sinsi on May 27, 2008, 06:32:23 AM
Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location.

Where's the problem? That's for RAM access, which you obviously want to behave atomically. But threads can perfectly well read the same data in parallel when each cache has a copy. Writes obviously invalidate the copies in other caches, and this unavoidably incurs a penalty. The only solution is to avoid reading and writing the same memory simultaneously.
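
A related pitfall worth a mention is false sharing: two threads writing *different* variables still pay that invalidation penalty if the variables share a cache line. The usual fix is padding, sketched here in MASM terms (my own example; the PAGE-aligned segment makes the 64-byte line boundaries known):

PaddedData segment page 'DATA'  ; page alignment (256 bytes), so each item
Counter0  dd 0                  ; below starts on its own 64-byte cache line
          db 60 dup(0)          ; pad out the rest of the line
Counter1  dd 0                  ; written by another thread, no line sharing
          db 60 dup(0)
PaddedData ends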

c0d1f1ed

Quote from: sinsi on May 27, 2008, 06:39:57 AM
But where does the scheduler run? Wouldn't it be an endless loop?

Each thread just calls the scheduler function when it needs a new task. It runs on whatever core the thread runs on.

Maybe you're missing a basic understanding of simultaneous multi-threading here. Code does not belong to any thread or core. A thread is just a point of execution, and with four cores you can have four points of execution simultaneously in one program. So they can all run the same endless loop consisting of scheduling a new task and executing it. The stack and thread-local storage are unique to each thread; all other memory is shared (including code).

sinsi

OK, so we can have 4 cores running the same code?

Quote from: c0d1f1ed on May 27, 2008, 06:59:57 AM
Quote from: sinsi on May 27, 2008, 06:32:23 AM
Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location.

Where's the problem? That's for RAM access, which you obviously want to behave atomically. But threads can perfectly well read the same data in parallel when each cache has a copy. Writes obviously invalidate the copies in other caches, and this unavoidably incurs a penalty. The only solution is to avoid reading and writing the same memory simultaneously.
I was agreeing with you and pointing out the relevant part...

c0d1f1ed

Quote from: sinsi on May 27, 2008, 07:16:24 AM
OK, so we can have 4 cores running the same code?

Certainly. A core is just an autonomous processor reading instructions from memory and executing them. The threads running on the cores can pull instructions from the same code in memory. A thread is not much more than an instruction pointer and a set of registers. The O.S. ensures that every thread starts with its own stack, so that if they call the same functions they have their own context (i.e. as long as the function doesn't touch shared memory, multiple threads can call it without influencing each other - they share the instructions but not the stack variables).
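
As a tiny illustration of that last point (my own made-up example), a procedure that keeps all its working state in registers and stack locals can be called from all four threads at once without interference:

; reentrant: eax/ecx and the LOCAL live in registers and on the
; caller's own stack, so concurrent callers never collide
SumSquares proc uses ecx n:DWORD    ; n assumed >= 1
    LOCAL total:DWORD
    mov total, 0
    mov ecx, n
  next:
    mov eax, ecx
    mul eax                     ; edx:eax = eax*eax (clobbers edx)
    add total, eax
    dec ecx
    jnz next
    mov eax, total              ; return value in eax
    ret
SumSquares endp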

Quote
I was agreeing with you and pointing out the relevant part...

Ok. I'm just not sure what hutch is expecting here. Current multi-core CPUs are already doing everything possible.

hutch--

Having bothered to read the Intel data on the LOCK prefix: with later hardware, an increasing number of the instructions it can prefix are already locked for the duration of the instruction, which excludes close-range parallel memory operations. Here I mean something like every alternate 4 bytes being read and processed by a different core, so that the memory is treated something like 2 identical hard disks striped as RAID 0. The notion that there is such a thing as LOCK-free operation is simply incorrect.

Multithread operations as are common at the server level work fine with current multicore processors, as they run multiple concurrent threads which are then timesliced across multiple cores, but the vast number of tasks running on current computers are single-threaded in their logic and do not get faster with an increase in core count.

Put simply, not every task is as simple as a network connection, which can be paralleled to the boundary of processing power; the rest do not benefit from an increase in core count. Until the hardware supports close-range parallel computing in a single-thread algorithm, the vast majority of computer programs are not going to get much faster.

Currently the only real gain in single-thread performance is faster memory, which does reduce one of the major bottlenecks in performance. There are much faster substrates than silicon that have been around for years; cost is the major factor in not going in this direction, but they would allow running higher clock speeds than the current heat-based limits.

> Again: what? Core 2 has speculative branching, out-of-order execution, and multiple pipelines.

You appear to have missed the context of the statement:

> Here I refer to speculative branch prediction, out-of-order execution and multiple pipelines that all work in a linear algorithm with no problems.

Anything else is trivial; it is already being done with dual-processor machines and later multicore machines.