The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: johnsa on May 23, 2008, 09:24:41 AM

Title: Multicore theory proposal
Post by: johnsa on May 23, 2008, 09:24:41 AM
Hey all,

I've been keeping abreast of the recent, shall I say, highly volatile (pardon the pun) conversations around multi-core programming and threading, and the subsequent performance issues around locking, thread management and the Windows kernel's ability to deal with synchronization.

In any event, I did quite a bit of thinking about the problem, more generally about how, to my mind, multi-core software design should be implemented in a perfect world. This idea/theory is quite a departure from the standard concept of threading, and while I believe that threading is a perfect candidate for some multi-processor scenarios/algorithms, it is not an ideal model for all. The theory I am proposing would ideally require new processor opcodes to facilitate it effectively; however, it may be possible, with a bit of cunning, to achieve the same result through some abusive use of current threading technologies, macros etc.

The theory is quite difficult to explain but it goes something along the following lines:
We assume here a system with N=4 cores.

There is always a master core: this is the core that your application starts up on; it runs continuously and operates exactly the way a current single-threaded process would.

This master core is responsible for designating work to the other cores which act as "Function Executors".

These additional cores, unlike in a traditional threading model, are NOT constantly running; they would perform a requested function and then HALT until another function execution is requested.
(That may not be possible currently without additional CPU opcodes or, at the very least, some sort of ring0 kernel code to make it happen.)
Possible solution to this (just an idea): allocate as many threads as there are cores at the outset, i.e. 4. Each thread has some sort of spin loop + lookup which sits idle until a new function address is specified in some sort of almost vtable-like structure, as sketched below.
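
A rough sketch of that emulation (purely illustrative: coreFunc is a hypothetical one-slot "mailbox", and an ordinary Win32 thread created once per core stands in for the "Function Executor"):

.data?
coreFunc dd ?               ; 0 = idle, else address of the code block to run

.code
CoreWorker PROC lpParam:DWORD
  idle:
    mov eax,coreFunc        ; poll the mailbox
    test eax,eax
    jz short dopause
    call eax                ; execute the requested "function" on this core
    xor eax,eax
    mov coreFunc,eax        ; emulated CORE_HALT: mark this worker idle again
  dopause:
    db 0F3h,90h             ; pause - spin-wait hint to the CPU
    jmp idle
CoreWorker ENDP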

During execution of the "function" (I use the term loosely as it's more of a code block) it is possible for that core to make standard branches and calls, which would execute on that same core.
Upon completion of the function, instead of a RET-type construct there would be a CORE_HALT opcode/macro to either stop that core or, in our emulation, return to the wait loop.

These multi-core "functions" would be executed/signalled by the master core using a syntax based on the traditional CALL, but with an additional parameter: the core number, or perhaps a code to indicate round-robin selection of the first non-active core.
CALL ProcessBlock,CORE2   or CALL ProcessBlock,FIRST_AVAILABLE_CORE
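
In the threaded emulation, that CALL could be little more than a macro that drops the target address into the chosen core's mailbox (reusing the hypothetical coreFunc slot above; a real version would index an array of slots by core number):

CORE_CALL MACRO func
    mov coreFunc, OFFSET func   ; the worker's poll loop picks this up and runs it
ENDM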

Two other opcodes/macros would be created, namely WAIT and SIGNAL.
In the scenario where a "function" on core N has to wait for operations on other cores to complete, it can use WAIT waitCode1,waitCode2,.... (i.e. it can wait for multiple signals to be raised).
The "functions" on those other cores would then use SIGNAL (1..255), a byte value acting as a unique signal flag, just before their CORE_HALT. A possible emulation is sketched below.

In a number of cases we need to pay close attention to the design/implementation of a multi-core solution, as was discovered previously with multi-threaded memory fills. The limiting factor is still bus/bandwidth/memory.
So no matter how many cores you have, although you might save a few clock cycles of loop overhead by filling in parallel, the end result will never be much better than letting a single core fill the memory. It would be better to find some other real-world task (preferably not memory intensive) that the other core could perform while core N completes the memory fill.

This also raises some concerns for me in terms of how a multi-core environment allocates memory/bus bandwidth to its individual cores. It would be great if you had dynamic control over this and were able to set some sort of minimum/maximum, to enable a memory-hungry function/core to use, say, 90% of available bus/memory bandwidth while a more computationally intensive core/function requires only the remaining 10%.

Another approach here would be to have some sort of CORE(n)-LOCAL memory, in which case "function a" (computational) could load its data into the CORE-LOCAL memory, then signal the start of the memory-intensive fill on the next core while it continues to perform its computation on its core-local data. We could perhaps use a forced cache fetch to some extent to achieve this, if the computation process could work with a small amount of in-cache data while the memory-hungry process uses non-temporal stores.

So as an example of how I see this working (3 cores):
we have two "functions" which SUM the values in 2 different buffers and then add the results.

Master Core             | Core 2                   |  Core 3           
-------------------------|--------------------------|-------------------------------
; initiate calls         |                          |
; to other cores         |                          | 
call SumIt1,CORE2        | SumIt1:                  | SumIt2:
call SumIt2,CORE3        | mov ecx,100              | mov ecx,100
                         | lea edi,[buffer1]        | lea edi,[buffer2]  ; could be buffer1 but half-way in etc.
WAIT 2, DoGarbage        | xor eax,eax              | xor eax,eax
mov eax,SumResult        | loop:                    | loop2:
                         | add eax,[edi]            | add eax,[edi]
                         | add edi,4                | add edi,4
                         | dec ecx                  | dec ecx
                         | jnz short loop           | jnz short loop2
                         | mov SumResult,eax        | WAIT 1
                         | SIGNAL 1                 | add SumResult,eax
                         | CORE_HALT                | SIGNAL 2
                                                    | CORE_HALT


; in the case of core 3's WAIT we need not worry, as we can assume from the code that both loops should finish in approximately the same time,
; each having 50% of the task. So this WAIT shouldn't take too long and doesn't need to perform any processing while waiting.

; However, WAIT 2 on the master core will wait for the entire duration of core2/core3's processing; we would like to be able to carry on with
; something else in the meantime... if we had some sort of garbage-collection process, that would be a perfect time to run it... think of a Java/C#-type
; environment.. as you can see, the extra parameter on WAIT would indicate a function to execute while waiting. Ideally the function wouldn't perform the full
; task but a tiny bit of it incrementally: once it returns the WAIT will still be in effect, and it can be called over and over until either the work is complete
; or the WAIT is over and we've only lost a tiny bit of time/overlap before being able to read SumResult.

; That function could be replaced with anything incremental; perhaps some data elsewhere needs sorting, etc.

; Using WAIT in the above fashion we can provide synchronized access to certain data that we can ensure will be ready and won't be concurrently accessed.


Feel free to comment, expand..  - even if you think it's complete bollocks!
John
Title: Re: Multicore theory proposal
Post by: johnsa on May 23, 2008, 03:56:41 PM
http://arstechnica.com/articles/paedia/cpu/valve-multicore.ars

Interesting read, nothing earth shattering though.
Title: Re: Multicore theory proposal
Post by: hutch-- on May 24, 2008, 04:51:25 AM
John,

I did in fact read your post but too many variables came to mind; the most recent read of Intel's new hardware specs in development mentions different types of multicore hardware, from simple stripped-down cores up to more complicated fully specced ones. Current multicore hardware apparently suits multithreaded code well for large servers that handle large counts of concurrent threads, but I have yet to see the design work for close-range algorithm code that is linear in nature.

I think it was Bogdan that mentioned that until two or more cores can access the same memory this area may not improve in any hurry. I think the current Core 2 Duos from Intel are more like a PIII core, which makes them a bit more predictable in coding terms, but I suspect that unless multiple cores can do at least some of the things that the early PIV designs had in mind, they will not do this stuff well. Here I refer to speculative branch prediction, out of order execution and multiple pipelines that all work in a linear algorithm with no problems.
Title: Re: Multicore theory proposal
Post by: johnsa on May 26, 2008, 02:30:33 PM

F3 90 - PAUSE
Acts as a hint to the CPU (P4+) that the code is a spin-wait loop.

F4 - HLT (halt)
Forces the logical processor to halt execution; it is resumed by an interrupt or NMI.

0F 01 C8 - MONITOR
0F 01 C9 - MWAIT
Used in conjunction to set up an address-range monitor and put the logical processor into an optimized wait state.

Looking at my code example and these opcodes, I reckon the process could be implemented quite effectively. It's a shame about Windows being in my way.. I need ring0 PM again.. :)
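
For illustration, a ring-0 sketch of how MONITOR/MWAIT might implement the worker's idle state, hand-encoded since older assemblers lack the mnemonics (coreFunc is the hypothetical mailbox from the first post):

  waitloop:
    lea eax, coreFunc       ; EAX = start of the address range to monitor
    xor ecx, ecx            ; ECX = extensions (none)
    xor edx, edx            ; EDX = hints (none)
    db 0Fh,01h,0C8h         ; MONITOR - arm the address monitor
    cmp coreFunc, 0         ; re-check after arming, to avoid a lost wakeup
    jne gotwork
    xor eax, eax            ; EAX = hints
    xor ecx, ecx            ; ECX = extensions
    db 0Fh,01h,0C9h         ; MWAIT - sleep until the monitored line is written
    jmp waitloop
  gotwork: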
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 27, 2008, 05:54:15 AM
Quote from: johnsa on May 23, 2008, 09:24:41 AMThis master core is responsible for designating work to the other cores which act as "Function Executors".

Unfortunately this is a bad idea to start with because the master core will most of the time be waiting on other cores. So you waste a whole core. Furthermore, while it's busy giving a core a new task, other cores can finish and they all have to wait for the master core. This is called convoying, and it will drastically lower effective performance. Last but not least, this approach doesn't scale well to more cores.

QuoteIn a number of cases we need to pay close attention to the design/implementation of a multi-core solution, as was discovered previously with multi-threaded memory fills. The limiting factor is still bus/bandwidth/memory.
So no matter how many cores you have, although you might save a few clock cycles of loop overhead by filling in parallel, the end result will never be much better than letting a single core fill the memory. It would be better to find some other real-world task (preferably not memory intensive) that the other core could perform while core N completes the memory fill.

This also raises some concerns for me in terms of how a multi-core environment allocates memory/bus bandwidth to its individual cores. It would be great if you had dynamic control over this and were able to set some sort of minimum/maximum, to enable a memory-hungry function/core to use, say, 90% of available bus/memory bandwidth while a more computationally intensive core/function requires only the remaining 10%.

Most tasks, except pure memory copying or filling, are not bandwidth limited. You have to make sure that each task maximizes cache utilization (batching may help). In my experience there is no need to bother with how the CPU balances RAM accesses.

QuoteFeel free to comment, expand..  - even if you think it's complete bollocks!

It's great that you're giving this some thought, and some concepts are sound, but please do yourself and others a favor and get a quad-core as soon as possible to get some hands-on experience. Some ideas, even if they seem interesting in theory, simply don't work. There's little point discussing proposals for future extensions if you don't already have a firm grip on how things work today. Amazing efficiency can be achieved with lock-free task scheduling...
Title: Re: Multicore theory proposal
Post by: sinsi on May 27, 2008, 06:16:25 AM
Quote from: c0d1f1ed on May 27, 2008, 05:54:15 AM...the master core will most of the time be waiting on other cores.
But to use more than one core for one thing (e.g. Windows, Linux etc.) there has to be a 'master' or supervisor core (the BSP) to control the other cores (APs) and tell them where the code is, when to run, and to arrange new code.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 27, 2008, 06:21:23 AM
Quote from: hutch-- on May 24, 2008, 04:51:25 AMI did in fact read your post but too many variables came to mind; the most recent read of Intel's new hardware specs in development mentions different types of multicore hardware, from simple stripped-down cores up to more complicated fully specced ones.

Both Nehalem and Sandy Bridge have complex cores. So things won't change radically from today's architectures till at the very least 2012, from a programming point of view. Furthermore, I really don't expect them to suddenly lower single-threaded performance. They might transition to simpler cores so they can fit more on a chip, while keeping per-core performance at least at the same level.

But there are still indications that they want to actually make bigger cores. AVX doubles the width of SSE registers, which theoretically doubles performance. FMA also increases computation density. All it lacks is scatter/gather to allow parallelization of any loop. This is more interesting than doubling the number of cores. What happens beyond that is going to be very interesting. Beyond about 16 cores current programming models become unmaintainable, two SIMD units is likely the maximum, and wider vectors are also impractical...

QuoteI think it was Bogdan that mentioned that until two or more cores can access the same memory this area may not improve in any hurry.

What are you talking about? They are perfectly capable of accessing the same memory.

QuoteI think the current Core 2 Duos from Intel are more like a PIII core, which makes them a bit more predictable in coding terms, but I suspect that unless multiple cores can do at least some of the things that the early PIV designs had in mind, they will not do this stuff well. Here I refer to speculative branch prediction, out of order execution and multiple pipelines that all work in a linear algorithm with no problems.

Again: what? Core 2 has speculative branching, out-of-order execution, and multiple pipelines.
Title: Re: Multicore theory proposal
Post by: sinsi on May 27, 2008, 06:32:23 AM
Quote from: c0d1f1ed on May 27, 2008, 06:21:23 AM
QuoteI think it was Bogdan that mentioned that until two or more cores can access the same memory this area may not improve in any hurry.

What are you talking about? They are perfectly capable of accessing the same memory.

Intel:
Quote
Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed
to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 27, 2008, 06:34:28 AM
Quote from: sinsi on May 27, 2008, 06:16:25 AMBut to use more than one core for one thing (e.g. Windows, Linux etc.) there has to be a 'master' or supervisor core (the BSP) to control the other cores (APs) and tell them where the code is, when to run, and to arrange new code.

A process can start on any core. After it creates additional threads there is no need to designate one as the master. They can all be equivalent and share the same scheduler.
Title: Re: Multicore theory proposal
Post by: sinsi on May 27, 2008, 06:39:57 AM
Quote from: c0d1f1ed on May 27, 2008, 06:34:28 AM
They can all be equivalent and share the same scheduler.

But where does the scheduler run? Wouldn't it be an endless loop?
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 27, 2008, 06:59:57 AM
Quote from: sinsi on May 27, 2008, 06:32:23 AM
Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location.

Where's the problem? That's for RAM access, which you obviously want to behave atomically. But threads can perfectly well read the same data in parallel when each cache has a copy. Writes obviously invalidate the copies in other caches, and this unavoidably incurs a penalty. The only solution is to avoid reading and writing the same memory simultaneously.
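
One practical consequence, as a sketch (assuming 64-byte cache lines): give each thread's hot data its own line, so a write by one core doesn't keep invalidating a line another core is reading (false sharing).

percore_t STRUCT
    value dd ?              ; the dword this core actually writes
    pad   db 60 dup(?)      ; pad the structure out to a full 64-byte line
percore_t ENDS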
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 27, 2008, 07:09:47 AM
Quote from: sinsi on May 27, 2008, 06:39:57 AMBut where does the scheduler run? Wouldn't it be an endless loop?

Each thread just calls the scheduler function when it needs a new task. It runs on whatever core the thread runs on.

Maybe you're missing a basic understanding of simultaneous multi-threading here. Code does not belong to any thread or core. A thread is just a point of execution, and with four cores you can have four points of execution simultaneously in one program. So they can all run the same endless loop consisting of scheduling a new task and executing it. The stack and thread-local storage are unique to each thread; all other memory is shared (including code).
Title: Re: Multicore theory proposal
Post by: sinsi on May 27, 2008, 07:16:24 AM
OK, so we can have 4 cores running the same code?

Quote from: c0d1f1ed on May 27, 2008, 06:59:57 AM
Quote from: sinsi on May 27, 2008, 06:32:23 AM
Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location.

Where's the problem? That's for RAM access, which you obviously want to behave atomically. But threads can perfectly well read the same data in parallel when each cache has a copy. Writes obviously invalidate the copies in other caches, and this unavoidably incurs a penalty. The only solution is to avoid reading and writing the same memory simultaneously.
I was agreeing with you and pointing out the relevant part...
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 27, 2008, 10:40:39 AM
Quote from: sinsi on May 27, 2008, 07:16:24 AM
OK, so we can have 4 cores running the same code?

Certainly. A core is just an autonomous processor reading instructions from memory and executing them. The threads running on the cores can pull instructions from the same code in memory. A thread is not much more than an instruction pointer and a set of registers. The OS ensures that every thread starts with its own stack, so that if they call the same functions they each have their own context (i.e. as long as the function doesn't touch shared memory, multiple threads can call it without influencing each other - they share the instructions but not the stack variables).
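
A minimal sketch of that in MASM (CreateThread is the real Win32 call; the procedure name and local variable are illustrative):

.data?
tid dd ?

.code
SameCode PROC lpParam:DWORD
    LOCAL counter:DWORD             ; lives on this thread's own stack
    mov counter, 0
    ; ... work on counter: the instructions are shared, the local is not ...
    ret
SameCode ENDP

; two threads, two stacks, one copy of the code (somewhere in your startup code):
invoke CreateThread, 0, 0, ADDR SameCode, 0, 0, ADDR tid
invoke CreateThread, 0, 0, ADDR SameCode, 0, 0, ADDR tid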

QuoteI was agreeing with you and pointing out the relevant part...

Ok. I'm just not sure what hutch is expecting here. Current multi-core CPUs are already doing everything possible.
Title: Re: Multicore theory proposal
Post by: hutch-- on May 27, 2008, 11:23:02 AM
Having bothered to read the Intel data on the LOCK prefix, the range of instructions that it can be used as a prefix to are, in an increasing number of cases with later hardware, already locked for the duration of the instruction, which excludes close-range parallel memory operations. Here I mean something like every alternate 4 bytes being read and processed by a different core, so that the memory is treated something like 2 identical hard disks striped as RAID 0. The notion that there is such a thing as LOCK-free operation is simply incorrect.

Multithreaded operations, as are common at the server level, work fine with current multicore processors: they run multiple concurrent threads which are then timesliced across multiple cores. But the vast number of tasks running on current computers are single-threaded in their logic, and they do not get faster with an increase in core count.

Put simply, not every task is as simple as a network connection, which can be parallelled to the boundary of processing power, and the rest do not benefit from an increase in core count. Until the hardware supports close-range parallel computing in a single-thread algorithm, the vast majority of computer programs are not going to get much faster.

Currently the only real gain in single-thread performance is faster memory, which does reduce one of the major bottlenecks in performance. There are much faster substrates than silicon that have been around for years; cost is the major factor in not going in this direction, but it would allow running higher clock speeds than the current heat-based limits.

> Again: what? Core 2 has speculative branching, out-of-order execution, and multiple pipelines.

You appear to have missed the context of the statement:

> Here I refer to speculative branch prediction, out of order execution and multiple pipelines that all work in a linear algorithm with no problems.

Anything else is trivial; it is already being done with dual processor machines and later multicore machines.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 27, 2008, 03:18:44 PM
Quote from: hutch-- on May 27, 2008, 11:23:02 AM
Having bothered to read the Intel data on the LOCK prefix, the range of instructions that it can be used as a prefix to are, in an increasing number of cases with later hardware, already locked for the duration of the instruction, which excludes close-range parallel memory operations.

Could you rephrase that? Anyway, you might also want to read up about CAS (http://en.wikipedia.org/wiki/Compare-and-swap).
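
For reference, the heart of a CAS-based lock acquire looks something like this (a sketch; the register layout and the lock_busy label are just illustrative):

    mov eax, 0              ; value we expect to find (0 = unlocked)
    mov ecx, 1              ; value we want to store (1 = locked)
    lock cmpxchg [edi], ecx ; if [edi]==eax then [edi]=ecx and ZF=1, else eax=[edi]
    jnz lock_busy           ; ZF clear: someone else got there first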

QuoteThe notion that there is such a thing as LOCK-free operation is simply incorrect.

Are you referring to lock-free algorithms? As has been said a dozen times before, it has nothing to do with the x86 LOCK prefix. It's about not using mutual exclusion, i.e. not having to acquire and release synchronization locks.

QuoteUntil the hardware supports close-range parallel computing in a single-thread algorithm, the vast majority of computer programs are not going to get much faster.

You're dead wrong. The vast majority of software has plenty of opportunity to execute multiple tasks in parallel. A typical function contains multiple calls to other functions, and those that are independent of each other can execute in parallel. Also, as soon as you have a loop in which you process independent elements, multiple iterations could execute simultaneously. And for algorithms that at first seem serial, there are more often than not ways to rewrite them as parallel algorithms.

QuoteCurrently the only real gain in single-thread performance is faster memory, which does reduce one of the major bottlenecks in performance.

Faster memory isn't going to help one bit if you're bottlenecked by arithmetic work. Think about encoding a movie. Several GB/s of RAM bandwidth is plenty to load in uncompressed frames and output the compressed stream, and the cache is large enough to hold the entire working set. Cache is crazy fast, so no problems there. The problem is the sheer number of block compares you need to do for motion prediction. With X cores you can simply work on X regions of the frame simultaneously. No increase in clock speed can keep up with that.

QuoteThere are much faster substrates than silicon that have been around for years; cost is the major factor in not going in this direction, but it would allow running higher clock speeds than the current heat-based limits.

Please define "much faster". Do you mean signal speed, switching speed, some other parameter? For your information, switching speeds of transistors on 45 nm are 0.5 picoseconds or less. That's 2 terraherz. So that's not the limiting factor. Faster switching substrates won't help us at all.

Quoteyou appear to have missed the context of the statement

No I haven't. You were referring to NetBurst. Core 2 lacks not a single feature offered by NetBurst that could make it faster at running serial code.

QuoteAnything else is trivial; it is already being done with dual processor machines and later multicore machines.

The free lunch is over hutch. Your software is not going to get significantly faster on future architectures, unless you take full advantage of multi-core by rethinking your software design.
Title: Re: Multicore theory proposal
Post by: johnsa on May 27, 2008, 06:46:48 PM
So far I agree with both c0d1f1ed and hutch.
If multi-threading is approached from the traditional angle, c0d1f1ed is spot on.

I agree that in a lot of cases it is possible to design an algorithm which doesn't require any sort of lock at all.
In the few remaining cases where a lock is unavoidable, it would pay to have a selection of the fastest and most effective locking constructs possible.

I suggest we put together some sort of task that can be parallelized, try various approaches (some group coding) and see what we can get out of it - in terms of having one section which can be done in parallel without locking, and one component which requires a lock.

Anyone keen?

To start with, I've been looking at spin-lock implementations (for x86), and here is the quickest basic version I've been able to come up with so far. This could of course be improved in a real-world scenario by
using MAX_SPINS to determine when to either go on with other work, sleep (put the waiting thread in a queue) or switch to some sort of signalling mechanism. We can also assume that certain code can make use of a
read-write lock, where only write operations require locking.



spinlock_t STRUCT
_lock dd 0        ;0 = unlocked, 1 = locked
spinlock_t ENDS
mylock spinlock_t <0>   ;start life unlocked
MAX_SPINS equ 1000000

spinlock PROTO :DWORD
spinunlock PROTO :DWORD

align 4
spinlock PROC lockPtr:DWORD
push ebx                    ;ebx and edi are callee-saved in the Win32 ABI
push edi
mov edi,lockPtr
acquirelock:
xor ebx,ebx                 ;reset the spin counter
lock bts dword ptr [edi],0  ;atomic test-and-set of bit 0; CF=1 if already locked
jc short spinloop
pop edi                     ;got the lock
pop ebx
ret
align 4
spinloop:
dw 0f390h              ;use cpu PAUSE instruction as a spin loop hint
inc ebx
cmp ebx,MAX_SPINS           ;simply stop trying to acquire the lock at MAX_SPINS
jge short endlock
test dword ptr [edi],1      ;plain read while spinning keeps locked bus traffic down
jne short spinloop
jmp short acquirelock
endlock:                    ;note: the caller cannot tell a timeout from success here
pop edi
pop ebx
ret
spinlock ENDP

align 4
spinunlock PROC lockPtr:DWORD
mov ecx,lockPtr   ;use a volatile register; edi would otherwise need preserving
xor eax,eax
mov [ecx],eax     ;MESI protocol, it's safe to clear the lock without an atomic operation
ret
spinunlock ENDP



in main thread code....
invoke spinlock, ADDR mylock
invoke spinunlock,ADDR mylock



Title: Re: Multicore theory proposal
Post by: johnsa on May 27, 2008, 07:23:03 PM
http://msdn.microsoft.com/en-us/magazine/cc163715.aspx
Title: Re: Multicore theory proposal
Post by: hutch-- on May 28, 2008, 12:58:45 AM
Reference from the PIV manual LOCK instruction.
Quote
Note that in later IA-32 processors (including the Pentium 4, Intel Xeon, and P6 family processors),
locking may occur without the LOCK# signal being asserted. See IA-32 Architecture
Compatibility below.

This means directly that at least some of the instructions that LOCK can be prefixed to ALREADY lock for their operation. Unless you are targeting legacy Intel hardware, the use of LOCK may simply be redundant. By using the instructions WITHOUT a LOCK prefix, you already have a lock on the address for the duration of the instruction.

Now RE John's code,


lock bts dword ptr [edi],0


If the hardware you are running is not legacy Intel, and if you have a testbed set up to test the speed of the spinlock, it may be worth a try to simply remove the LOCK prefix and see if it's any faster. You may also be able to substitute any of the lockable instructions for BTS to see if they are faster, something like NEG / NEG for non-destructive results.

RE: The free lunch. The slogan has been around for years, as has repeated change in hardware, so change in itself is nothing new; it's one of the few guarantees of any developing field. My comment on the changes being proposed is that they will not matter as multicore hardware develops out of its sub-infancy, but this much: the solution to true parallelism will be in hardware, not in trying to emulate Win3 software multitasking.

I have to go somewhere, will be back later.
Title: Re: Multicore theory proposal
Post by: hutch-- on May 28, 2008, 04:16:53 AM
Thinking about it, there is another option that I have not given much consideration to for a long time, and that is the INT instruction. It can only be used at ring 0, but it will start and stop a core and, from memory, it can be done quickly with ring 0 level access. I know Linux still uses interrupts so I would imagine the technique is viable at least in some contexts.

My problem with such methods, including spinlocks, is that they knock the core they are pointed at stone dead, so it is not yielding to any other process that may need processor time. A high-level ring 3 spinlock that yields to other processes while waiting is very easy to write, and they are very efficient in terms of processor usage, but such high-level notions are close to useless in close-range algorithm processing as they are far too slow.

SUBSTRATES: A long time ago the base for transistors was germanium: lower resistance but higher leakage. Silicon has higher resistance but much lower leakage; its limiting factor on current hardware is heat from the resistance. There is no problem with winding up the clock speed on silicon except for the heat, and this is why current processors are stuck at just under 4 gig.

The military and specialised instruments have had ICs based on other substrates for many years, something you use for parts of guidance systems for missiles and the like, but the factor so far against using substrates with both lower resistance and lower leakage is cost, not performance. Sooner or later silicon will run out of puff; it already has in terms of heat, and while smaller tracks help reduce this problem in conjunction with lower voltage, this too has its limits, which are not far off.
Title: Re: Multicore theory proposal
Post by: sinsi on May 28, 2008, 05:07:43 AM
Quote from: johnsa on May 27, 2008, 06:46:48 PM    dw 0f390h              ;use cpu PAUSE instruction as a spin loop hint
That should be    db 0f3h,90h ;or dw 90f3h
Title: Re: Multicore theory proposal
Post by: johnsa on May 28, 2008, 01:56:55 PM
Good spot! My bad... db 0f3h,90h ... bad endian :) haha.. in any event I've just replaced it with the "pause" mnemonic, which assembles under ML9 .. not sure about 8 or lower.
In any event, with the pause instruction in the spin loop the code takes 3x as long; in my test it went from 2ms to 6ms.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 28, 2008, 02:54:47 PM
Quote from: hutch-- on May 28, 2008, 12:58:45 AM
Reference from the PIV manual LOCK instruction.
QuoteNote that in later IA-32 processors (including the Pentium 4, Intel Xeon, and P6 family processors), locking may occur without the LOCK# signal being asserted.

This means directly that at least some of the instructions that LOCK can be prefixed to ALREADY lock for their operation. Unless you are targeting legacy Intel hardware, the use of LOCK may simply be redundant. By using the instructions WITHOUT a LOCK prefix, you already have a lock on the address for the duration of the instruction.

You got it all wrong. The manual merely refers to a multi-processor system's ability to perform memory operations atomically without locking the bus. In particular, this is possible when the addressed memory is in the cache and in certain states of the MESI coherency protocol. Please read section 7.1.4 of the System Programming Guide (http://download.intel.com/design/processor/manuals/253668.pdf).

Note that this is a very effective optimization that ensures there is no unnecessary blocking. Today's multi-core CPUs can access the same memory at full speed and even perform atomic operations at optimal efficiency.

QuoteRE: The free lunch. The slogan has been around for years, as has repeated change in hardware, so change in itself is nothing new; it's one of the few guarantees of any developing field. My comment on the changes being proposed is that they will not matter as multicore hardware develops out of its sub-infancy, but this much: the solution to true parallelism will be in hardware, not in trying to emulate Win3 software multitasking.

You're being totally blind to the revolutionary change that mainstream multi-core CPUs bring (or in denial). Previously, you were guaranteed a steady increase in your software's performance from one CPU generation to the next. Nowadays, single-threaded software runs no faster on a Q6600 than on an E6600! However, by properly redesigning your software you can make it up to 90% faster for each doubling of the cores. So, contrary to the past, it takes effort, but you do get a lot in return.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 28, 2008, 02:57:34 PM
Quote from: johnsa on May 27, 2008, 06:46:48 PM
use cpu PAUSE instruction as a spin loop hint

It's best to avoid it. Reasons here: Multicore processor code (http://www.masm32.com/board/index.php?topic=9221.0).
Title: Re: Multicore theory proposal
Post by: johnsa on May 28, 2008, 03:09:03 PM
I've tried using xchg and cmpxchg in place of the bts, and they all perform the same. From what I read in the Intel manuals, only xchg automatically asserts a lock signal; all other instructions must have a LOCK prefix. The only real thing to bear in mind is that if the locked address is in the cache, the CPU avoids asserting the bus lock signal, which makes it considerably faster.
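
For the record, a sketch of the xchg variant (xchg with a memory operand is locked implicitly, so no LOCK prefix is needed; labels illustrative):

    mov eax,1               ;value meaning "locked"
    xchg eax,[edi]          ;atomically swap; the old lock value comes back in eax
    test eax,eax
    jnz short still_held    ;old value was non-zero: someone else owns the lock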

The pause instruction definitely seems to be pretty useless.
Title: Re: Multicore theory proposal
Post by: hutch-- on May 28, 2008, 03:18:21 PM
John,

Somewhere along the line I remember that you can use the normal db 90h (nop) or a number of them instead of pause. I have often found in the middle of an algo that 2 or 3 nops did not slow it down any, so it may be worth a try.
Title: Re: Multicore theory proposal
Post by: johnsa on May 28, 2008, 04:27:56 PM
Tried using nops... while MUCH faster than a pause instruction, it does add a fair amount of overhead to the spin loop...
Question is.. does it really matter? Tight loop with or without a nop/pause in there.. either way the result will be the same, and it will still take as much time as is necessary for the lock to become available. Assuming this thread is running on a different core from the one which owns the lock brings another question up: do we care that that specific core jams up, running the loop at full tilt, while waiting for the lock?

And.. if it does matter, why? And what would be a good option: keep the MAX_SPINS option as I have it currently and do a thread sleep.. or a thread yield?
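
Either option is a one-liner via the Win32 API; a sketch of the MAX_SPINS fallback (both calls are real kernel32 exports):

    ; after MAX_SPINS failed attempts, give the rest of this timeslice away
    invoke SwitchToThread   ; yield to any ready thread on this processor
    ; or, alternatively:
    invoke Sleep, 0         ; yield only to ready threads of equal priority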
Title: Re: Multicore theory proposal
Post by: johnsa on May 28, 2008, 05:49:46 PM
Going back to my original post.. I still hold that threads are pretty pointless for what we want to achieve with them unless each thread runs on its own core..
I can't imagine any algorithm split into two threads (for example) on a single processor being any more efficient (probably less so) than a single thread.. (assuming no Hyperthreading either).

So.. if one looks at the overhead of creating threads on a per-task basis, it makes far more sense to me to allocate a thread per core... and then in each thread routine implement some sort of workload designator which calls other routines as it needs.

I do understand that this approach doesn't really scale well to a variable number of cores, but unless you're creating a task which is essentially going to act as a template, with all threads being instances of that same code (i.e. like a socket server, web server etc.), implementing algorithms which are not only optimal but can handle a variable number of cores is almost impossible.

I would imagine that it might work reasonably well in some cases to follow my approach: say you have 4 cores, create 6 or 8 initial threads (possibly 2 on a particular core) if you can determine (at a high level) that you can create additional modules which could run in those threads, and then move those threads off the duplicate cores onto their own cores should more cores become available (perhaps a dual quad-core machine.. or in the future).
Title: Re: Multicore theory proposal
Post by: johnsa on May 28, 2008, 06:23:50 PM
http://cache-www.intel.com/cd/00/00/01/76/17689_w_spinlock.pdf

Very good read, highly recommended!!

Following this I've made a few changes to my basic algorithm, including re-use of the pause (based on their explanation) and attention to alignment / cache-line issues around the synchronizing variable / memory address.
Title: Re: Multicore theory proposal
Post by: johnsa on May 28, 2008, 07:42:26 PM
Ok... next question.. I'm presuming that any data specified in the .data or .data? sections would be considered/allocated by the OS as Write-Back and NOT Write-Combining...

1) It turns out that no atomic/locked operations should operate on variables stored in Write-Combining memory.

2) If that is the case, what would be the best way to align a variable to a cache line, assuming a 128-byte cache-line size?
It's easy to accomplish if you dynamically allocate the memory, by increasing the allocation size and ANDing off the low 7 bits,
as in:



.data

spinlock_t STRUCT
_lock dd 0
spinlock_t ENDS

; how to align this on 128byte boundary ?
mylock spinlock_t <0>



Next question.. when and why would you use VirtualAlloc to allocate a write-combining memory block? (Would it improve the performance of writes, including streaming stores, to data stored in that area?) It is assumed that no data
in this area would act as a synchronization lock. And do the streaming-store instructions automatically override the memory type (assuming you stream-store to a memory block which is Write-Back, does it become Write-Combining for that store)?

So many questions... so few answers :)
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 29, 2008, 11:01:17 AM
Quote from: hutch-- on May 28, 2008, 04:16:53 AM
SUBSTRATES: A long time ago the base for transistors was germanium: lower resistance but higher leakage. Silicon has higher resistance but much lower leakage; its limiting factor on current hardware is heat from the resistance. There is no problem with winding up the clock speed on silicon except for the heat, and this is why current processors are stuck at just under 4 gig.

Intel has used SiGe since its 90 nm process to create strained silicon with higher electron mobility. Also, high-k gate dielectrics were introduced in the 45 nm process to further reduce leakage, and 32 nm will feature metal gates to reduce resistance. The latter process will be used by Westmere and Sandy Bridge, with up to eight cores running at 4 GHz.

So even with significant engineering triumphs in process technology, there is no indication of any return to the MHz race. Instead, with every shrink we get twice the number of transistors, and these are primarily used to increase the number of cores.

It's hard to still speak of silicon technology when you have SiGe substrates, high-k gate dielectrics instead of silicon dioxide, and metal gates instead of polysilicon. Whatever other advancement you're imagining, it's not going to increase single-threaded performance much. Any future chip will be multi-core though.

QuoteThe military and specialised instruments have had ICs based on other substrates for many years, something you use for parts of guidance systems for missiles and the like, but the factor so far against using substrates with both lower resistance and lower leakage is cost, not performance.

Please specify.

Either way, hoping for any hardware technology to bring back the free lunch is a pipe dream. You can talk about faster substrates or whatever as much as you like; the reality is that if it can't be manufactured for a reasonable price it won't end up in your and everyone else's systems. Multi-core silicon-based CPUs are mainstream today and will continue to be for the foreseeable future. You have to start taking advantage of multi-core if you want your software to get any faster, or wait on a mythical fast single-core processor forever.

Besides, cost is not that much of a limitation. Nine layers of copper interconnects, strained silicon, hafnium insulators, metal gates, double patterning, immersion lithography... expensive technology hasn't prevented Intel from using it in their CPU lines. Investments of multiple billions are not an insurmountable obstacle, and are compensated by sheer production volume. The fact that they haven't used better substrates should be a sign that they're just not that spectacular and there are more effective ways of increasing performance.

QuoteSooner or later silicon will run out of puff; it already has in terms of heat, and while smaller tracks help reduce this problem in conjunction with lower voltage, this too has its limits, which are not far off.

The current semiconductor roadmap (ITRS) already includes an 11 nm process. With Intel's plans to transition to a new process every two years, this means they'll continue to shrink lithographic processes until at least 2016 (the ITRS is less aggressive and projects 11 nm for 2022). At 11 nm, 64 cores could fit on a chip (and then there's also Hyper-Threading). Heat problems can be controlled as long as the clock frequency is not increased aggressively.

So to prepare for the next 8 years you'd better have a serious look at multi-core programming if you don't want to get hopelessly behind.

After 11 nm they'll have to transition to nanoelectronics. However, this still doesn't mean the end of multi-core processing. Hypothetically, even with materials that can run at ten times higher clock frequencies while keeping heat in check, you'd still have a transistor budget of over ten billion (more than 100 times that of a Pentium 4). No matter how cleverly you invest these in a single core, you'll never get the same throughput as a multi-core processor. Code only has a limited number of nearby instructions that can execute in parallel; in practice an IPC of five is about the highest you can go. To do more work you need multiple points of execution: concurrent threads.

Software development has changed forever.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 29, 2008, 11:08:37 AM
Quote from: hutch-- on May 28, 2008, 03:18:21 PM
Somewhere along the line I remember that you can use the normal db 90h (nop) or a number of them instead of pause.

Great, you do read my posts (http://www.masm32.com/board/index.php?topic=9221.msg66782#msg66782).
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 29, 2008, 11:23:31 AM
Quote from: johnsa on May 28, 2008, 04:27:56 PM
Tried using nops... while MUCH faster than a pause instruction, it does add a fair amount of overhead to the spin loop...
Question is.. does it really matter? Tight loop with or without a nop/pause in there.. either way the result will be the same, and it will still take as much time as is necessary for the lock to become available. Assuming this thread is running on a different core from the one which owns the lock brings another question up: do we care that that specific core jams up, running the loop at full tilt, while waiting for the lock?

Yes. If you don't put a little delay between locking attempts you'll create an avalanche of synchronization traffic between the cores, and no thread can successfully acquire the lock for a long time. This only becomes a big problem with more than two cores: if one thread holds the lock, more than one other thread can start queuing up to try to acquire it. Once the first thread releases the lock, all the other threads start fighting for it at (almost) the same time.

Locks with exponential backoff have quite good scaling behavior (similar to the protocol used by Ethernet (http://en.wikipedia.org/wiki/Carrier_sense_multiple_access_with_collision_detection)), but queue-based locks are even better for processors with a MESI cache coherency protocol. I'd love to see (http://www.masm32.com/board/index.php?topic=9221.msg66782#msg66782) x86 implementations of such locks.
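
To make the backoff idea concrete, here's a sketch of test-and-set with exponential backoff (register usage and the cap are illustrative, not tuned):

    mov ebx,1                   ; initial backoff delay, in pause iterations
tryacquire:
    lock bts dword ptr [edi],0  ; edi = lock address
    jnc short acquired          ; CF=0: the bit was clear, the lock is ours
    mov ecx,ebx
delay:
    db 0F3h,90h                 ; pause
    dec ecx
    jnz short delay
    shl ebx,1                   ; double the delay after every failed attempt...
    cmp ebx,8192                ; ...up to some cap
    jbe short tryacquire
    mov ebx,8192
    jmp short tryacquire
acquired: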
Title: Re: Multicore theory proposal
Post by: hutch-- on May 29, 2008, 12:14:44 PM
 :bg

Not necessarily, but I do read the Intel manuals when I want technical data on Intel hardware.

Quote
This instruction was introduced in the Pentium 4 processors, but is backward compatible with
all IA-32 processors. In earlier IA-32 processors, the PAUSE instruction operates like a NOP
instruction. The Pentium 4 and Intel Xeon processors implement the PAUSE instruction as a
pre-defined delay. The delay is finite and can be zero for some processors. This instruction does
not change the architectural state of the processor (that is, it performs essentially a delaying no-op operation).

> Software development has changed forever.

Software development HAS been changing forever, but not all of it has lasted. Seen a 10-year-old RISC box recently? What about a modern DDE application; how much OLE have you seen lately?

The notion that multicore processing is suddenly something new is mistaken; try the 512 parallel Itaniums I mentioned earlier with recent SGI boxes. But even on x86 I remember seeing multiple-processor boards for the early Pentiums, and there was Windows OS support as early as Win2000, I think also NT4 but I forget, it was 10 years ago.

The context for the "free lunch" is also mistaken, it addressed ever slower software on ever faster hardware. There IS a solution to THAT free lunch, rewrite VB style crap in C or assembler, that avenue is far from fully exploited and modern hardware at 20 to 30 times faster is not supported by much of modern sopftware that may be a bit faster here and there. Note that this level of performance increase does not even address multicore processing yet.

RE being left behind: keep in mind that the SGI hardware I mention, which would be 3 to 4 years old now, was pelting out 20-megapixel images at over 100 frames a second back then, and this type of performance is well beyond anything that x86 and current video software can expect in the foreseeable future. The difference between SGI parallel hardware (and, for that matter, some of the multiple parallel x86 hardware that was around a few years ago) and current dual and double dual-core processors is dedicated hardware to interface between large processor counts, scaling at about 1.9 per extra processor.

SUBSTRATES.
Silicon is 40-year-old technology, and while throwing large sums of money at it has kept it going for a long time, where does it go when speed/space requirements push track widths down to under 1 nanometre? The answer is nowhere in a hurry. Now, while military suppliers are not going to start revealing their technology any time soon, I still remember ruby substrates and, somewhere along the line, sapphire/silicon junctions in high-speed instrumentation. It would indeed be a brave prediction that processor clock speeds will not go up again; it tends to sound like Bill Gates' prediction about 64k of memory.

Don't be without ambition; think in terms of 1024 parallel cores running in the terahertz range, with dedicated hardware to properly interface them, running Windows Universe 12.  :bg
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 29, 2008, 12:23:06 PM
Quote from: johnsa on May 28, 2008, 05:49:46 PM
So.. if one looks at the overhead of creating threads on a per-task basis, it makes far more sense to me to allocate a thread per core... and then in each thread routine implement some sort of workload designator which calls other routines as it needs.

Spot on.

QuoteI do understand that this approach doesn't really scale well to a variable number of cores, but unless you're creating a task which is essentially going to act as a template, with all threads being instances of that same code (i.e. like a socket server, web server etc.), implementing algorithms which are not only optimal but can handle a variable number of cores is almost impossible.

True, it quickly becomes mind-bogglingly complex to write software this way. But it does become manageable when you make use of declarative programming techniques. Have a look at SystemC for example. Basically every statement runs in parallel except if there is a dependency between them. The framework and compiler ensure that these tasks are distributed over all available cores. Each finished task can spawn new tasks, which are queued so that other threads can help process them, taking the dependencies into account to maintain correctness. This allows writing and maintaining larger projects than would otherwise be feasible with imperative languages. SystemC has a lot of overhead because it's built on top of C++, but a proper compiler for such a language could be quite revolutionary (RapidMind comes close and runs on an arbitrary number of cores).

Just-in-time compilation (JIT) also offers very interesting possibilities. Basically when you run the application it can compile the code to run optimally on whatever number of cores you have. So instead of requiring the developer to optimize specifically for every possible number of cores, it's handled automatically at run-time.

Of course declarative programming and JIT don't render assembly useless, but it will become increasingly difficult to write and maintain large projects in purely imperative languages. Instead, C and assembly remain crucial for compilers and multi-core programming frameworks. I've recently started exploring LLVM (http://llvm.org) and the possibilities are truly awesome. In fact, I haven't found a single situation yet where the generated code doesn't match or exceed the performance of hand-written code, including SIMD operations! All that is lacking is a language and a compiler combining all of this into a convenient way to write high-performance multi-core-aware software. We have exciting times ahead of us. :8)
Title: Re: Multicore theory proposal
Post by: hutch-- on May 29, 2008, 12:51:17 PM
Here is a quick scruffy on the PAUSE mnemonic. It may in fact be more efficient in terms of exit from a spinlock, but it still locks the processor stone dead until it exits. This is useful enough in terms of thread timing trimmers, but unless you are willing to take big performance hits, PAUSE needs to be supplemented with an OS-level yield so that other processes can run on the core while the lock is idling.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    pause equ <db 0F3h,90h>

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

    push esi

    mov esi, 10000000
    call spinlock

    mov esi, 20000000
    call spinlock

    mov esi, 40000000
    call spinlock

    mov esi, 80000000
    call spinlock

    pop esi
    ret

  spinlock:
    invoke GetTickCount
    push eax

  lbl1:
    pause
    sub esi, 1
    jnz lbl1

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax),13,10

    retn

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 29, 2008, 02:18:29 PM
Quote from: hutch-- on May 29, 2008, 12:14:44 PM
Software development HAS been changing forever, but not all of it has lasted. Seen a 10-year-old RISC box recently? What about a modern DDE application; how much OLE have you seen lately?

Sure, lots of technology comes and goes or ends up in a dusty corner. But you haven't given me a single sound argument yet for why multi-core won't last. Mythical semiconductor substrates don't count; they are not on any roadmap, and even if they were, there is no indication they would render multi-core useless.

Try arguing against this: maximum single-core performance = instructions per clock * clock frequency; maximum multi-core performance = core count * instructions per clock * clock frequency. It doesn't take a genius to know that balancing three parameters will result in a more optimal system than balancing two. IPC is limited and doesn't favor single-core (multi-core benefits from it equally), so you'd have to compensate for all the performance gained by having multiple cores with an aggressively higher clock frequency. Octa-cores are already on the roadmaps; now show me semiconductor technology on any roadmap that will allow a single core to keep up with that.

QuoteThe notion that multicore processing is suddenly something new is mistaken; try the 512 parallel Itaniums I mentioned earlier with recent SGI boxes. But even on x86 I remember seeing multiple-processor boards for the early Pentiums, and there was Windows OS support as early as Win2000, I think also NT4 but I forget, it was 10 years ago.

I've said this before: the revolutionary aspect is that it's coming to every mainstream system. Supercomputers with multiple processors have existed for ages, but they ran specialized applications only available to a minority. We now face the challenge of making multi-core software for the masses. And it's not a fixed number of processors programmed by specialists; it's a varying number of cores programmed by programmers with a varying level of expertise.

Furthermore, the existence of supercomputers for many decades should tell you something about the need for multiple cores to reach higher performance. They've already lasted this long, so what kind of twisted reasoning makes you think we'll be able to vastly increase performance without transitioning to multi-core for good?

QuoteThe context for the "free lunch" is also mistaken; it addressed ever slower software on ever faster hardware. There IS a solution to THAT free lunch: rewrite VB-style crap in C or assembler. That avenue is far from fully exploited, and modern hardware at 20 to 30 times faster is not exploited by much modern software, which may be a bit faster here and there. Note that this level of performance increase does not even address multicore processing yet.

It's pointless to talk about the performance of Visual Basic. Anyone programming in it doesn't aim for performance in the first place; instead they want fast development and safety, which assembly won't offer them. I'm no fan of VB either, but I'll recommend C# to anyone requiring these properties from a programming language. There are good reasons for the existence of any language. The very fact that there are so many languages is because people seek different qualities. Performance is only one of them.

That being said, the issue at hand is increasing performance of applications already written in C and assembly. The only route to further enhance performance is to take advantage of the increasing number of cores. And to not get strangled in your own code, you need some level of abstraction. High-level tools and languages can help with that. The alternative is to re-invent the wheel for every project and waste eons of time getting the design right.

QuoteRE being left behind: keep in mind that the SGI hardware I mention, which would be 3 to 4 years old now, was pelting out 20-megapixel images at over 100 frames a second back then, and this type of performance is well beyond anything that x86 and current video software can expect in the foreseeable future. The difference between SGI parallel hardware (and, for that matter, some of the multiple parallel x86 hardware that was around a few years ago) and current dual and double dual-core processors is dedicated hardware to interface between large processor counts, scaling at about 1.9 per extra processor.

I fail to see why you're so excited by that multi-core SGI hardware but mainstream multi-core CPUs leave you cold. You won't get anywhere near the performance of that SGI hardware without making use of software designed for multi-core. Are you seriously suggesting that if we're not going to beat that SGI hardware any time soon that we might as well not try going the multi-core path anyway? There are tons of other applications that can become a reality in the meantime if we take advantage of multi-core.

QuoteSilicon is 40 year old technology and while throwing large sums of money at it has kept it going for a long time, where does it go when speed / space requirements push track widths down to under 1 nanometre ? The answer is nowhere in a hurry. Now while military suppliers are not going to start revealing their technology any time soon, I still remember ruby substrates and somewhere along the line sapphire/silicon junctions in high speed instrumentation. It would indeed be a brave prediction that processor clock speeds will not go up again, it tends to sound like Bill Gates' prediction about 64k of memory.

Oh I'm not saying that clock speeds won't go up again sooner or later. I'm just saying that silicon has at least another decade to go, during which the number of cores will increase aggressively but the clock frequency won't increase that much. Even after silicon runs out of steam and we transition to other technologies, there is not a single argument supporting a return to single-core. So no matter what happens, you should invest in multi-core software development if you care about performance.

Talking about substrates, the most promising of all is probably (synthetic) diamond. Even if it can truly run at hundreds of GHz as promised, we would still have a transistor budget of many billions, and the only sane choice is to spend them on multiple cores...
Title: Re: Multicore theory proposal
Post by: johnsa on May 29, 2008, 02:36:51 PM
Ok, so how would one go about getting the structure aligned to 128 bytes (a cache line) inline, without having to dynamically allocate memory?

as in

spinlock_t STRUCT
_lock dd 0
spinlock_t ENDS

mylock spinlock_t <1>   ; align this to 128 bytes / a cache line

as opposed to using virtualalloc or something?

Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 29, 2008, 06:20:30 PM
http://msdn.microsoft.com/en-us/library/tydf8khh.aspx
http://msdn.microsoft.com/en-us/library/dwa9fwef(VS.80).aspx
Title: Re: Multicore theory proposal
Post by: johnsa on May 29, 2008, 07:07:53 PM
neither align nor struct will accept 128 as an alignment.. tried that already :)

Title: Re: Multicore theory proposal
Post by: hutch-- on May 29, 2008, 07:40:28 PM
John,

Allocate dynamic memory (any method will do as long as it is contiguous) and align the start location you want to use, which will probably be an offset from the start of the memory. This is the macro from masm32 to align memory.


    ; ********************************************
    ; align memory                               *
    ; reg has the address of the memory to align *
    ; number is the required alignment           *
    ; EXAMPLE : memalign esi, 16                 *
    ; ********************************************

      memalign MACRO reg, number
        add reg, number - 1
        and reg, -number
      ENDM


I know of one other method, but it would be unusual: you can use large alignments on object modules, and there is a tool in the masm32 project called FDA (or the GUI version, FDA2) that will allow you to create an object module with the correct alignment and whatever byte size of data you require.
Title: Re: Multicore theory proposal
Post by: johnsa on May 29, 2008, 09:08:58 PM
Not to worry, thanks for that macro info. I'll just use a dynamic allocation and align it rather than creating the lock objects in the .data section.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on May 29, 2008, 11:48:48 PM
Quote from: johnsa on May 29, 2008, 07:07:53 PM
neither align nor struct will accept 128 as an alignment.. tried that already :)

Shouldn't 64 byte alignment be sufficient?

Another option would be to have a static structure twice the cache line size and use an alignment macro like hutch's.

Or, have a structure that starts with a cache line's worth of dummy data, then the actual lock variable, and then fill the rest up to the size of another cache line. No matter where this structure ends up in memory, the lock variable is guaranteed to be all alone on a cache line (just make sure the structure is 4 or 8 byte aligned).
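
A minimal sketch of that padded layout, assuming 64-byte cache lines (the struct and field names are illustrative, not from the masm32 package):


    ; lock padded so _lock always sits alone on a cache line,
    ; assuming 64-byte lines; total size is two full lines (128 bytes)
    padded_lock STRUCT
      pad_lo db 64 dup (0)            ; leading pad, one full cache line
      _lock  dd 0                     ; the actual lock variable
      pad_hi db 60 dup (0)            ; trailing pad, fills the second line
    padded_lock ENDS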
Title: Re: Multicore theory proposal
Post by: hutch-- on May 31, 2008, 12:01:23 AM
John,

Try this, I think it's correct and it's easy enough to implement.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    spinlock_t STRUCT
      _lock dd 0
    spinlock_t ENDS

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

    LOCAL pbuffer   :DWORD          ; buffer pointer
    LOCAL pstruct   :DWORD          ; start address for structure
    LOCAL buffer[1024]:BYTE

    lea eax, buffer                 ; get the buffer address
    mov pbuffer, eax                ; write it to buffer pointer

    push esi

    mov esi, pbuffer                ; load buffer address into ESI
    memalign esi, 128               ; align ESI to 128 bytes
    mov (spinlock_t PTR [esi])._lock, 12345678  ; < load your value here

  ; -----------------------
  ; test code for alignment
  ; -----------------------
    print str$(esi),13,10
    memalign esi, 128
    print str$(esi),13,10

    pop esi
    ret

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start
Title: Re: Multicore theory proposal
Post by: hutch-- on May 31, 2008, 01:18:47 AM
c0d1f1ed,

I think you miss where I come from in relation to multicore or multiprocessor computers. Like most people with a reasonable grasp of modern computing I see multicore processing as the future, but at the common desktop PC level I see it as somewhere in the future, as I don't see the capacity in current dual-core hardware as being even vaguely near fast enough to do general purpose work.

To make a point, back in the 16-bit DOS days I had the technical data on the FLAT memory model even though it was another 7 years until an OS properly supported it at the PC level. 16-bit DOS had a 64k memory range with reasonably complicated tiling schemes for larger blocks of memory. Later in the scene you could access a few megabytes with upper memory management and later win32s, but they were both slow fudges of the complete capacity that had been in hardware since the 386DX33.

I wrote my first 32-bit FLAT memory model code on WinNT4, which had proper support for the FLAT memory model, much the same as the current 32-bit code that is still in use today.

I see multicore in much the same light as the FLAT memory model in 1988, something that will be useful in the future but not really viable at the moment. By the time multicore/processor hardware is capable of doing anything useful in terms of general purpose code, it will be all 64 bit, and the methods to make it general purpose will be very different indeed from current hardware and software techniques.

Multithread multicore processing is already with us in terms of multiple concurrent threads for things like terminal servers and web servers, as they routinely handle that type of workload, but the hardware is not yet suitable for close range high performance computing.

The distinction here is between vertical performance versus horizontal performance. Horizontal performance is well suited to current multicore hardware in terms of threads being spread across the available cores.

Now where this will make a difference is when you can approach a task that is, by its layout, not suitable for parallel processing. Compression comes to mind here, which affects not only simple data compression but formats like MP2 and MP4 video compression, which need to be linear (serial) in nature to achieve very high compression rates.

Think in terms of a 64-core x86 processor where the core design can not only handle current concurrent threads in the normal manner but can handle parallel processing on a single thread without the messy fudges that are currently required, and where you can get about a 1.9 times increase in computing power for each extra core used in the algorithm. That says for the use of 10 cores instead of one you will get about 8 to 9 times that processing power.

The action is in interfacing multiple cores in an efficient manner, and this will only ever be done at a very high speed hardware level; emulating software cooperative multitasking at a core level is doomed to failure as it can never be fast enough.
Title: Re: Multicore theory proposal
Post by: hutch-- on May 31, 2008, 01:44:55 AM
John,

I just tweaked the alignment example so it made a bit more sense. I have just come out of hospital for a day and I am still a bit wandery from the general anaesthetic.

One question: why use a structure with only one member when a single unsigned 32-bit value would do the job fine and with less code?


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    spinlock_t STRUCT
      _lock dd 0
    spinlock_t ENDS

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

    LOCAL pbuffer   :DWORD          ; buffer pointer
    LOCAL pstruct   :DWORD          ; start address for structure
    LOCAL buffer[1024]:BYTE

    lea eax, buffer                 ; get the buffer address
    mov pbuffer, eax                ; write it to buffer pointer

    push esi

    mov esi, pbuffer                ; load buffer address into ESI
    memalign esi, 128               ; align address in ESI to 128 bytes
    mov pstruct, esi
    mov (spinlock_t PTR [esi])._lock, 12345678  ; < load your value here

  ; -----------------------
  ; test code for alignment
  ; -----------------------
    lea eax, (spinlock_t PTR [esi])._lock
    print str$(eax)," aligned spinlock_t structure member address", 13,10

    print str$(esi)," ESI aligned value",13,10
    memalign esi, 128
    print str$(esi)," ESI aligned value after realigning it to 128 bytes again",13,10

    pop esi
    ret

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start
Title: Re: Multicore theory proposal
Post by: johnsa on May 31, 2008, 02:48:28 PM
Hi,

Thanks for that example!

The reason I wanted to put the lock in a structure to start with is that I was thinking of expanding the structure with some other data to indicate an R/W condition on the lock, and possibly some sort of call-back queue that would be processed as the spin-lock times out.
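
For the basic lock part, a minimal acquire/release pair might look like this (assuming ESI holds a pointer to an aligned spinlock_t; the labels are illustrative, and a real version would likely add a PAUSE and a read-only spin before retrying):


    ; call acquire_lock to take the lock, release_lock to drop it
    acquire_lock:
        mov  eax, 1
      @@:
        xchg eax, (spinlock_t PTR [esi])._lock ; atomic swap, implicit LOCK
        test eax, eax
        jnz  @B                                ; was already held, spin again
        ret

    release_lock:
        mov  (spinlock_t PTR [esi])._lock, 0   ; plain store releases on x86
        ret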
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 01, 2008, 10:22:37 PM
Quote from: hutch-- on May 31, 2008, 01:18:47 AM
I think you miss where I come from in relation to multicore or multiprocessor computers.

I don't think that's relevant, but ok...

QuoteLike most people with a reasonable grasp of modern computing I see multicore processing as the future, but at the common desktop PC level I see it as somewhere in the future, as I don't see the capacity in current dual-core hardware as being even vaguely near fast enough to do general purpose work.

Why? New projects appear every day reaching 90% higher performance on dual-core, proving the contrary.

QuoteTo make a point, back in the 16-bit DOS days I had the technical data on the FLAT memory model even though it was another 7 years until an OS properly supported it at the PC level.

You're not making any point for multi-core here. Mainstream multi-core processors were supported the day they were available in stores.

QuoteI see multicore in much the same light as the FLAT memory model in 1988, something that will be useful in the future but not really viable at the moment.

Nostalgic references to unrelated technology are not an argument. It's 2008 and budget PCs come with a very capable dual-core CPU. If you want your software to run faster and continue to become faster with newer processors you have to go multi-core sooner rather than later.

QuoteMultithread multicore processing is already with us in terms of multiple concurrent threads for things like terminal servers and web servers, as they routinely handle that type of workload, but the hardware is not yet suitable for close range high performance computing.

Again you're trying to get away with this without any arguments. Inserting and removing elements from a lock-free list takes only a few dozen clock cycles, so current hardware is very capable of running close range high performance workloads. What exactly do you think is still missing?
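
To give an idea of the cost involved, here is a minimal lock-free stack push built on LOCK CMPXCHG (the NODE layout and names are my own for illustration, not code from this thread):


    NODE STRUCT
      next_ptr dd 0                   ; link to the next node
    NODE ENDS

    .data
      ListHead dd 0                   ; shared head of the stack

    .code
    ; push the node pointed to by ESI onto the shared stack
    lf_push proc
        mov  eax, ListHead            ; snapshot the current head
      @@:
        mov  (NODE PTR [esi]).next_ptr, eax  ; link new node to snapshot
        lock cmpxchg ListHead, esi    ; if head still = EAX, store ESI
        jnz  @B                       ; lost the race, EAX holds the new head
        ret
    lf_push endp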

QuoteNow where this will make a difference is when you can approach a task that is, by its layout, not suitable for parallel processing. Compression comes to mind here, which affects not only simple data compression but formats like MP2 and MP4 video compression, which need to be linear (serial) in nature to achieve very high compression rates.

There will always be a number of algorithms not suited for parallel processing. That doesn't take away the vast benefits of multi-core for algorithms that do scale well with concurrent threads. By the way, compression parallelizes quite well: H.264 (http://www.google.com/search?q=h.264+parallel), JPEG (http://citeseer.ist.psu.edu/klein03parallel.html), Lempel Ziv (http://citeseer.ist.psu.edu/klein05parallel.html), ...

QuoteThink in terms of a 64-core x86 processor where the core design can not only handle current concurrent threads in the normal manner but can handle parallel processing on a single thread without the messy fudges that are currently required, and where you can get about a 1.9 times increase in computing power for each extra core used in the algorithm. That says for the use of 10 cores instead of one you will get about 8 to 9 times that processing power.

Please specify in great detail how that would work. Show me some references to related research.

QuoteThe action is in interfacing multiple cores in an efficient manner, and this will only ever be done at a very high speed hardware level; emulating software cooperative multitasking at a core level is doomed to failure as it can never be fast enough.

Tell me again how you're going to achieve that.

The fact of the matter is that multi-core processor architectures will not change drastically for the foreseeable future. You'll still need significant changes to the software to make use of the extra cores. And you can't avoid the issue by referring to mythic hardware.
Title: Re: Multicore theory proposal
Post by: MichaelW on June 01, 2008, 11:59:08 PM
Quote from: c0d1f1ed on June 01, 2008, 10:22:37 PM
Quote from: hutch-- on May 31, 2008, 01:18:47 AM
To make a point back in the 16 bit DOS days I had the technical data on FLAT memory model ever though it was another 7 years until an OS properly supported it at the PC level.

You're not making any point for multi-core here. Mainstream multi-core processors were supported the day they were available in stores.

As was the 386, but "proper" OS support for anything near its full capacity didn't appear until years later.

QuoteNostalgic references to unrelated technology are not an argument. It's 2008 and budget PCs come with a very capable dual-core CPU. If you want your software to run faster and continue to become faster with newer processors you have to go multi-core sooner rather than later.

Quote
The fact of the matter is that multi-core processor architectures will not change drastically for the foreseeable future. You'll still need significant changes to the software to make use of the extra cores. And you can't avoid the issue by referring to mythic hardware.

Quote from: c0d1f1ed on May 14, 2008, 09:45:29 AM
Multi-core programming is in its infancy and it's going to require innovation on the software front (both low-end and high-end), and hardware front to achieve superior results.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 02, 2008, 01:07:11 AM
c0d1f1ed,

Quote
The fact of the matter is that multi-core processor architectures will not change drastically for the foreseeable future. You'll still need significant changes to the software to make use of the extra cores. And you can't avoid the issue by referring to mythic hardware.

This is probably the major area of our disagreement. I see that multicore processing will change in almost unimaginable ways in the next 5 to 10 years, if there is not another technical breakthrough in between, and the techniques that are being proposed at the moment, as well as the current OS design techniques, will end up in the dustbin of history.

I have known for a couple of years that both Intel and AMD have 32 and 64 core processors in the early design stages, and this type of core arrangement is well beyond the co-operative multitasking you have alluded to over the range of this debate.

There is memory technology in the pipeline that is vastly faster than current memory, and it is also well known that memory is still the major bottleneck in current software performance, so there are major performance gains to be had here.

It is simply a mistake to assume that current technology contains the architecture of the future; there are lessons from the past where making the same assumption has often failed: 64k of memory, why would you ever need to go faster than a 33 MHz 486, etc etc ....

Real multicore/multiprocessor computing is ALREADY WITH US, it's just that US$120,000 tops what most people wish to spend on a desktop. But the model is clear: forget trivial fudges of up to 1.9 times faster than a single core; think of dozens to hundreds of times faster than a single core and you will have some idea of where multicore hardware is going.

I have a wait and see approach, as I don't see the magnitude of hardware improvement yet and I doubt I will see it for a few years. Let some other patsy waste their time and money on hardware and techniques that will not last, just like the win32s guys did, just like the Itanium development guys did, the RISC boxes etc .....

When the hardware and OS support is there, this stuff will work and it will be fast, not just a couple of cores but far more and far faster.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 02, 2008, 09:18:01 AM
Quote from: MichaelW on June 01, 2008, 11:59:08 PM
As was the 386, but "proper" OS support for anything near its full capacity didn't appear until years later.

Which again has nothing to do with multi-core. Multi-processor support has been widespread for some time so when multi-cores appeared they were fully supported. If you think there's still a lot missing, please specify.

Quote
Quote from: c0d1f1ed on May 14, 2008, 09:45:29 AM
Multi-core programming is in its infancy and it's going to require innovation on the software front (both low-end and high-end), and hardware front to achieve superior results.

Read the context. While multi-core hardware is going to make significant progress (more cores, faster inter-core communication, etc.), there is no magical technology that will make it as easy as sequential programming to make use of the full concurrent processing capacity. Unless you use high-level tools that help extract the parallelism.
Title: Re: Multicore theory proposal
Post by: MichaelW on June 02, 2008, 10:37:30 AM
QuoteWhich again has nothing to do with multi-core. Multi-processor support has been widespread for some time so when multi-cores appeared they were fully supported. If you think there's still a lot missing, please specify.

Please define "fully supported".
Title: Re: Multicore theory proposal
Post by: hutch-- on June 02, 2008, 12:41:17 PM
c0d1f1ed,

Have a look at this thread and you will see that your assumptions are unsound. In particular, look at the graphs that Greg has posted; they show clearly that a single thread is being processed by both cores. Forget abstraction, high-level tools to make it all easier and magical high-level libraries that will do it all for you, this IS being done in hardware on a modern dual-core Intel processor. The future is more of the same but with many more cores interfaced in hardware.

http://www.masm32.com/board/index.php?topic=9297.0
Title: Re: Multicore theory proposal
Post by: PBrennick on June 02, 2008, 01:29:59 PM
Hutch,

By now, after all the experiences we have had with that type of scenario over the years, this has really become an axiom, hasn't it?

Quote
This is probably the major area of our disagreement. I see that multicore processing will change in almost unimaginable ways in the next 5 to 10 years, if there is not another technical breakthrough in between, and the techniques that are being proposed at the moment, as well as the current OS design techniques, will end up in the dustbin of history.

Well said, any way you look at it. This happens because (not when) hardware technology out-paces software development, doesn't it? And it has happened over and over again.

Paul
Title: Re: Multicore theory proposal
Post by: johnsa on June 03, 2008, 09:02:36 PM
If I am understanding correctly ... from this graph ... what is actually going on inside the cores/OS without us even knowing (perhaps in some undocumented attempt) is that Intel/AMD/MS are already looking at ways to get the cores to automatically handle the processing load of sequential code without actually having to "multi-thread" at all... which in my mind is the perfect solution... IE: no solution :)
Don't multi-thread, and let the CPU work out how best to split the instructions it receives amongst its cores.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 04, 2008, 12:49:33 AM
It's my guess that over time, with the core count increase, we will see a technique of core clustering in much the same way as the 4-core double Core 2 Duos work at the moment, but on a much larger scale. It means that both ends of the spectrum are being approached, so that close range linear code will benefit from automatic scheduling between 2 or more cores in a cluster, while the number of clusters of cores will improve the performance of multithreaded code by at least a factor of the number of clusters.

I would expect to see later OS versions with far more accurate thread synchronisation methods that have far finer granularity than current OS versions, which should allow greater parallelism between multiple clusters, each in itself scheduling instructions within a single thread across the number of cores in a cluster.
Title: Re: Multicore theory proposal
Post by: NightWare on June 04, 2008, 02:07:14 AM
Quote from: hutch-- on June 04, 2008, 12:49:33 AM
I would expect to see later OS versions with far more accurate thread synchronisation methods that have far finer granularity than current OS versions, which should allow greater parallelism between multiple clusters, each in itself scheduling instructions within a single thread across the number of cores in a cluster.
:lol  Stop dreaming, it will not happen (I mean at OS level). MS has never used hardware improvements (like cmov, MMX, etc...) in their OS like they should, and only reserves them for extra stuff (like DirectX). Besides, threads at OS level are only made to deal with a lot of different programs, nothing more... on the contrary, at hardware level it's an attempt at parallelism... it's different. So if you place yourself in the point of view of MS, why code something if it's automatically done at hardware level?

Now, it's good for us (asm coders): it's easy to develop a system to efficiently share a task with simple boolean operators (as long as it's a power of 2), something similar to Intel's hardware system... besides, why code in asm if it's to let the OS do the job for us? (And remember, here we speak of the guys who coded Win3.11, Win95, Win98, Win98SE, WinMe and more recently WinVista !  :bg)
Title: Re: Multicore theory proposal
Post by: hutch-- on June 04, 2008, 02:19:33 AM
 :bg

We probably differ here; once a capacity is built into hardware, some years later MS tend to put it to use. 386DX multitasking was eventually put into early 32-bit NT, as the OS would not work on earlier stuff. I see multiprocessor/core hardware in much the same light: current OS versions only touch the fringe of its capacity, and until there are both major hardware and software changes, this will not change much.

Now while it may have to wait for Windows Galaxy to be properly implemented with a minimum hardware spec, it is inevitable that both approaches will see development, close range hardware controlled core synchronisation and independent threads on different clusters, and in this sense true parallelism will become a reality, just don't hold your breath waiting.  :P
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 04, 2008, 09:44:11 PM
Quote from: hutch-- on June 02, 2008, 01:07:11 AM
This is probably the major area of our disagreement. I see that multicore processing will change in almost unimaginable ways in the next 5 to 10 years, if there is not another technical breakthrough in between, and the techniques that are being proposed at the moment, as well as the current OS design techniques, will end up in the dustbin of history.

Please name these breakthroughs, or at least explain in detail how they would work in theory. You're claiming fantastic advancements in hardware that will make it unnecessary to explicitly do concurrent processing, so let's hear about them.

QuoteI have known for a couple of years that both Intel and AMD have 32 and 64 core processors in the early design stages, and this type of core arrangement is well beyond the co-operative multitasking you have alluded to over the range of this debate.

Why? With high-level languages there is plenty of opportunity to extract a high level of parallelism. Again, look at RapidMind and SystemC for inspiration. Also please explain to me how you're going to program a 64-core chip in pure and straightforward assembly if you're not even considering programming much simpler dual- and quad-cores today. Extracting parallelism beyond just a few instructions is a high-level software problem. Only the developer (or a high-level tool) knows what tasks are independent and therefore can run concurrently. Single-threaded programming has no future.

QuoteThere is memory technology in the pipeline that is vastly faster than current memory, and it is also well known that memory is still the major bottleneck in current software performance, so there are major performance gains to be had here.

Again, name that technology. Are you talking about Z-RAM, embedded RAM, etc? That's all great stuff but it will equally benefit each core and won't help single-core software reach the performance of multi-core software.

QuoteIt is simply a mistake to assume that current technology contains the architecture of the future; there are lessons from the past where making the same assumption has often failed: 64k of memory, why would you ever need to go faster than a 33 MHz 486, etc etc ....

You're referring to totally unrelated things. In the past, every upgrade of the CPU and memory made your software faster and made it easier to program for them. This just doesn't hold for multi-core. To put more transistors to work you have to explicitly make them do independent tasks.

QuoteReal multicore/multiprocessor computing is ALREADY WITH US, it's just that US$120,000 tops what most people wish to spend on a desktop. But the model is clear: forget trivial fudges of up to 1.9 times faster than a single core; think of dozens to hundreds of times faster than a single core and you will have some idea of where multicore hardware is going.

We're talking about mainstream systems here. But either way the number of cores is increasing. If you're going to wait till we have dozens of cores you'll have missed at the very least a decade of opportunity to write faster software than the competition.

QuoteI have a wait and see approach, as I don't see the magnitude of hardware improvement yet and I doubt I will see it for a few years. Let some other patsy waste their time and money on hardware and techniques that will not last, just like the win32s guys did, just like the Itanium development guys did, the RISC boxes etc .....

Feel free to wait and see. But you'll be waiting forever to make your software faster. Amateurs using the tools and frameworks for multi-core programming will write faster software than you.

QuoteWhen the hardware and OS support is there, this stuff will work and it will be fast, not just a couple of cores but far more and far faster.

That's just wishful thinking. The brightest people in the industry and the academic world haven't come up yet with a realistic way to make single-core performance scale like multi-core performance could. They also don't think any significant O.S. support is missing. A big topic right now is transactional memory, but while it might allow software to scale beyond, say, eight cores, it's not a silver bullet by any stretch. In particular, you still have at least the same architectural complexity needed for efficient dual- and quad-core programming. It's just yet another synchronization primitive that might be added to the concurrent programming toolbox, out of necessity.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 04, 2008, 09:47:37 PM
Quote from: MichaelW on June 02, 2008, 10:37:30 AM
Please define "fully supported".

You can create threads, suspend them, and resume them. That's pretty much everything you need to have a thread per core and have them schedule and process tasks. Synchronization primitives can be implemented at the application level with no need for O.S. interaction.
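
As a hedged sketch, setting up one suspended worker per core needs nothing more than this (assuming the masm32 includes as in the examples above; NUM_CORES and WorkerProc are assumed names for illustration):


    NUM_CORES equ 4

    .data
      hThreads dd NUM_CORES dup (0)   ; handles of the worker threads

    .code
        xor  ebx, ebx                 ; worker index, one per core
      make_workers:
        invoke CreateThread, NULL, 0, ADDR WorkerProc, ebx, CREATE_SUSPENDED, NULL
        mov  [hThreads+ebx*4], eax    ; keep the handle
        add  ebx, 1
        cmp  ebx, NUM_CORES
        jb   make_workers

        invoke ResumeThread, hThreads[0]  ; wake a worker when work arrives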
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 04, 2008, 09:58:35 PM
Quote from: hutch-- on June 02, 2008, 12:41:17 PM
Have a look at this thread and you will see that your assumptions are unsound. In particular, look at the graphs that Greg has posted; they show clearly that a single thread is being processed by both cores. Forget abstraction, high-level tools to make it all easier and magical high-level libraries that will do it all for you, this IS being done in hardware on a modern dual-core Intel processor. The future is more of the same but with many more cores interfaced in hardware.

What are you talking about? There's no performance improvement. The O.S. simply decides to run the thread on another core from time to time.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 05, 2008, 12:39:11 AM
Quote from: johnsa on June 03, 2008, 09:02:36 PM
don't multi-thread and let the cpu work out how best to split the instructions up that it receives amongst it's cores.

It's not splitting up the instructions amongst its cores.

Even if it did, that approach just doesn't scale. There's only a limited amount of instruction parallelism in any one thread, and extracting it becomes exponentially more complex. By having independent instruction streams (threads) you can get much higher throughput.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 05, 2008, 12:54:59 AM
Quote from: hutch-- on June 04, 2008, 12:49:33 AM
It's my guess that over time, with the core count increase, we will see a technique of core clustering in much the same way as the 4-core double Core 2 Duos work at the moment, but on a much larger scale.

Why would you want to do that? You're probably not getting that ALUs are pretty cheap now. Sandy Bridge will feature vector units of eight elements, for only a limited range of applications. The real cost is in execution ports, for which sharing of resources between cores just doesn't help.

QuoteI would expect to see later OS versions with far more accurate thread synchronisation methods that have far finer granularity than current OS versions, which should allow greater parallelism between multiple clusters, each in itself scheduling instructions within a single thread across the number of cores in a cluster.

How, and why? First of all, there is an inherent cost for switching threads. You can have Hyper-Threading, which can switch threads on a per-clock basis, but it's too expensive to have much more than four threads, while you need a lot more to get the finer granularity you're talking about. And secondly, you don't need it if you have one thread per core that just continuously schedules and executes tasks. So why use expensive hardware solutions if there are perfectly adequate and cheap software solutions?

And don't get me wrong, I'm not excluding some kind of high speed thread migration and such in the far future, but again it's just not the silver bullet that will free developers from writing a proper multi-core software design.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 06, 2008, 01:52:46 AM
c0d1f1ed,

The interesting part of the graphs that Greg posted is the load distribution across both cores. If you bother to look, the load is not being switched from one core to another, it's being distributed across both cores, and this is with single-thread code. Your technical assumptions are simply incorrect, and it is because you have based your ideas on current software technology, not the emerging hardware capacity.

Nor is the capacity some pie-in-the-sky idea far into the future; it is ALREADY being done in high end hardware, and it is being done the only way possible to pick up processor parallelism, directly in hardware. While near 4 gig clock speeds may sound fast at a computing level, at the electronic level there is hardware that is at least hundreds of times faster, and if this type of technology is built into the chip directly for path length reasons, the capacity to synchronise hardware will dramatically expand, and at far higher speeds.

Software multitasking is old junk from the past at about win 3.0 technology, the future is multiprocessor / multicore hardware that can do both close range single thread load distribution as well as the current technology of multiple concurrent threads. The dustbin of history is full of hybrid stopgap technology, why waste your time with junk like this when hardware will change it all as it always has.
Title: Re: Multicore theory proposal
Post by: johnsa on June 06, 2008, 08:00:40 AM
Hutch, do you think perhaps the graph is just really inaccurate? It could possibly just be switching the load from core 0 to core 1, but due to the lack of fine granularity in the graph you can't see the alternating square wave pattern, so to speak.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 06, 2008, 09:10:44 AM
John,

While it's possible, the first graph that Greg posted certainly did not look like it. The interesting part is that even if it was switching between cores alternately, it tells you something about how data is cached between two cores and how the two cores are sharing the data. The speeds coming out of the later Core 2 Duos do not look like alternate core stalls, and with normal pipeline lengths this would be unavoidable due to result dependencies of sequential instructions.
Title: Re: Multicore theory proposal
Post by: johnsa on June 06, 2008, 01:45:23 PM
Agreed. In a standard two core setup, the way I've understood it from Intel's perspective is that each core has its own cache and instruction prefetch etc. So moving a thread backwards and forwards between 2 cores would cause all sorts of performance issues in terms of the instruction pipeline, re-fetch and cache updates.

So it would seem that even if the cores aren't sharing the load so to speak, they have been implementing some sort of shared prefetch/cache setup to allow code to transition seamlessly from core to core.
Perhaps this is what we could consider step 1 of something bigger down the line.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 06, 2008, 03:58:03 PM
John,

> Perhaps this is what we could consider step 1 of something bigger down the line.

It would seem so. While a much faster threaded model has clear enough logic, it does not look like the hardware is going in that direction yet. Multiple parallel processing through multiple cores is probably where the future is going, as it can beat the clock speed bottleneck, and if done properly, by a reasonably large degree.

What will be interesting is whether instruction order in code can affect the distribution of load across multiple cores in much the same way as multiple pipelines respond well to preferred instruction choice and order. The PIII was going in this direction and the PIVs certainly responded well to proper instruction choice, effectively RISC for preferred instructions while leaving the antiques to be emulated in microcode.

The other end was that if you did mess up instruction ordering with a pipeline design of this type, you were stung with a big performance drop. Something that most asm people are familiar with is that different encodings suit different hardware, and part of the art is to produce reasonable averages across most common hardware. I have no doubt that the coming generation of multiple core hardware will have its own set of optimisations as well.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 06, 2008, 09:48:31 PM
Quote from: hutch-- on June 06, 2008, 01:52:46 AMThe interesting part of the graphs that Greg posted is the load distribution across both cores. If you bother to look, the load is not being switched from one core to another, it's being distributed across both cores, and this is with single-thread code.

It's not distributed across the cores. What you see is an averaging of the thread running for some time on one core and for some time on the other core, never simultaneously.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 06, 2008, 10:00:11 PM
Quote from: hutch-- on June 06, 2008, 09:10:44 AM
The interesting part is that even if it was switching between cores alternately, it tells you something about how data is cached between two cores and how the two cores are sharing the data. The speeds coming out of the later Core 2 Duos do not look like alternate core stalls, and with normal pipeline lengths this would be unavoidable due to result dependencies of sequential instructions.

Thread switching happens either way. So whether you're on a single-core or a multi-core, the O.S. schedules a different thread at every interrupt.

So you get some stalls either way and you won't see much of a speed difference running a single thread on a single-core or multi-core processor.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 06, 2008, 10:19:14 PM
Quote from: johnsa on June 06, 2008, 01:45:23 PMSo moving a thread backwards and forward between 2 cores would cause all sorts of performance issues in terms of the instruction pipeline, re-fetch and cache updates.

Those penalties are minor compared to the length of the time slices (in the order of 10-100 milliseconds). And even if other threads are essentially idle (there could be thousands of them), the O.S. still schedules them from time to time so they can see whether they got any new tasks waiting. This happens on a single-core too so your thread is going to get interrupted several times per second anyway. On a dual-core, it doesn't matter much if, when your thread gets scheduled again, it gets the first or the second core.

QuoteSo it would seem that even if the cores aren't sharing the load so to speak, that they have been implementing some sort of shared prefetch/cache setup to allow code to transition seamlessly from core to core.

Core 2 has a shared L2 cache, which means the worst data access penalty you're going to get is an L2 fetch latency. That's negligible compared to the length of the time slice. The transient is practically entirely compensated by out-of-order execution.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 07, 2008, 04:34:28 AM
c0d1f1ed,

It is evident that you have not had a good look at the graph that Greg posted in a file named p4timing.jpg. The test piece is a deliberately processor intensive single thread, yet the graph clearly shows that at startup the app loads BOTH cores for the beginning of the graph, then settles down to a reverse symmetrical load sharing. You appear to have missed that it is running a single thread; the OS thread scheduler has nothing to do with the load distribution between the two cores.

I have as comparison a 3.2 gig Prescott PIV which is similar in clock speed to Greg's Duo, yet his timings are faster than the Prescott even allowing for it having faster memory and a higher BUS speed than the PIV. It is hardly a problem on a single core PIV to set the priority of a test piece, so the time slice for test purposes is no big deal, but the simple fact is the Duo is faster due to its multicore processing of the single thread test piece.

Forget abstraction, magic libraries, ring3 emulation of Win 3.0 co-operative multitasking, it's all old hat technology from the dustbin of history. The brave new world will continue as it is developing at the moment, multiple core processing of single thread code in conjunction with the existing capacity of multiple threads, essentially concurrent threads for tasks like servers and the like.

Vertical performance (how fast a thread will execute) will come from the former, horizontal performance (how many concurrent threads) will come from the latter.
Title: Re: Multicore theory proposal
Post by: sinsi on June 07, 2008, 09:50:07 AM
Quote from: hutch-- on June 07, 2008, 04:34:28 AM
the graph clearly shows that at startup the app loads BOTH cores for the beginning of the graph
Some of that can be Windows itself (looking up bizarre registry keys etc.) before the program even gets loaded.

Seems to me that one side here is fixed on hardware and the other on software. I would like an OS that will run on
two of my cores and leave the other two for the one or two programs that I actually use 'simultaneously/multitaskingly'

Famous names: Multi-threaded development joins Gates as yesterday's man (http://www.theregister.co.uk/2008/06/06/gates_knuth_parallel/)

edit: interesting that 8 out of 20 new topics in the lab are about multi core/cpu...
Title: Re: Multicore theory proposal
Post by: hutch-- on June 07, 2008, 02:00:19 PM
Thanks for the link, it's a good article. I come down on the side of knowing what you are doing, not the magic library approach. It also comes across that this area is both in its infancy in terms of cheap PCs and that the future design direction is not all that clear. Donald Knuth's comments are indeed interesting and the sad part is he may be right.

I wonder when we will see the first terahertz processor ?  :bg
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 07, 2008, 06:55:06 PM
hutch,

When you start a process a lot more happens than setting the instruction pointer to the entry point. For every new memory page accessed, an interrupt is generated and the O.S. loads the page from disk. Disk access is controlled by a separate thread to allow asynchronous I/O and arbitration between other processes. So this is one way in which a single-threaded application can still cause multiple threads to run concurrently. But this is just a startup phenomenon. After that you get practically zero benefit from running a single-threaded application on a multi-core. Every few timeslices the O.S. simply schedules the thread on another core, which when averaged out over a second or so becomes a 50/50 distribution. The reason the O.S. schedules them on different cores is that time slices expire either way, and there are many other threads wanting some execution time. With just my browser open, Task Manager tells me Windows is juggling over 500 threads. So the O.S. schedules our thread on whatever core is available at the time it decides it's our turn again.

Greg's Core 2 Duo is not faster than your Prescott at running single-threaded applications because of dual-core. It's faster because the significantly different architecture can sustain a much higher IPC, and it has a more advanced cache hierarchy. The NetBurst architecture had a low IPC by design to allow a higher clock frequency to compensate for it (3+ GHz on 90 nm is quite impressive; we're only slowly seeing that return on 45 nm). Unfortunately they relied on clock frequency not only to compensate for the low IPC, but they also expected to get a competitive advantage. Fundamental physics largely prevents that from happening. From 4+ GHz the power consumption becomes unmanageable. Power increases quadratically with voltage and linearly with frequency. But voltage at a given process node can't be lowered, because increasing frequency means there's less time to charge wires (make them transition from logical 0 to 1), for which you need higher voltage (any overclocker will tell you this). Voltage can only be reduced marginally at every new process node. So if you want power consumption to stay below a certain level, clock frequency can only be increased slowly for future generations of processors. The only reason the Pentium 4 was able to increase frequency from 1.3 to 3.8 aggressively is that they started at 50 Watt and went up to 115 Watt, and they implemented clock gating. There is no headroom left.
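
For reference, the relation described above is the standard dynamic-power identity for CMOS logic (a textbook formula, not something specific to this thread):

$P_{dyn} \approx \alpha \, C \, V^2 \, f$

where $\alpha$ is the activity factor, $C$ the switched capacitance, $V$ the supply voltage and $f$ the clock frequency; since raising $f$ also demands raising $V$, power grows much faster than linearly with clock speed.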

Now, the hardware technology you seem to be expecting is called Reverse Hyper-Threading. The term was coined by a site rumoring that AMD's dual-core might have been able to use resources from both cores on a single thread. The fact of the matter is that Reverse Hyper-Threading is a myth, AMD's multi-core chips don't have it, and the reason they don't is that it's physically impossible. Electrical signals travel at a fraction of the speed of light, but at multi-GHz frequencies that's only enough to get from one pipeline stage to the next. Distributing instructions from one thread across different cores would require communication with execution units so distant (at this scale) that it would require multiple clock cycles of latency and high power. So you're not gaining anything. There is some very interesting research going on that uses low-power lasers to communicate between cores, and stacking of chips to bring components closer is also an interesting approach, but these are again merely incremental improvements necessary for steady advancement in the next decades. They will be most useful to scale beyond a single digit number of cores, not to run a single thread that much faster.

There is no way to avoid having to redesign your software to take advantage of current and future multi-core chips. Have a look at these Nehalem benchmarks: AnandTech (http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3326&p=7). In the single-threaded benchmark it performs the same as a Penryn. And I see no reason to fight multi-core programming either. Once you've put in the effort you can get massive performance improvements.
Title: Re: Multicore theory proposal
Post by: codewarp on June 07, 2008, 07:14:53 PM
Quote from: hutch-- on June 07, 2008, 02:00:19 PM
Thanks for the link, its a good article. i come down on the side of knowing what you are doing, not the magic library approach.

The "magic library approach"?  We call that "encapsilation of complexity" where I come from, you ought to try it sometime.  i come down on the side of knowing what you are doing, not the magic hardware approach.


Quote from: hutch-- on June 07, 2008, 02:00:19 PM
I wonder when we will see the first terahertz processor ?  :bg

And the day after they do that, someone will gang four or 16 of them together, then run them using the same multi-processor software technology you are trying to tell everybody here doesn't work.  Come on, Hutch!  You know that MHz is a different dimension from multiple CPUs; terahertz is patently irrelevant to this discussion.  That you would actually scoff (in several of your posts) at getting 1.9x effective CPU performance out of software, on a web site dedicated to cutting out clock cycles, is truly astounding to me, and lowers the credibility of your entire Masm Forum, Mr. Moderator.

The brave new world of ubiquitous multi-processors is upon us, and ripe for the taking.  I am actually doing it, and I wanted to share my experiences here.  But you and your bulldogs here just can't have anyone contradicting this "can't get there from here" mentality.  Your comments are discouraging others who are trying to learn and explore this material, and insulting to those of us actually doing it with great success.  I am sorry for you; this is a topic to be showcased, not banned and scoffed at.  You guys have severely missed the boat on this one.
Title: Re: Multicore theory proposal
Post by: askm on June 07, 2008, 07:38:04 PM
Since I am here logged in

asking

What if Intel or AMD, at the time of the 16-bit or even 8-bit CPU heyday,

multi-cored them then ?

Where would we be in the multicore software performance understanding today ?

Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 07, 2008, 07:48:17 PM
sinsi,

That's an interesting article, thanks for the link. It's not very shocking though. It's blatantly obvious programmers would want their free lunch to continue. But we don't have a choice. This isn't chip manufacturers being lazy, they're fighting physical laws the best they can - it's the software developers who have to stop being lazy!

Veterans like Knuth obviously try to be conservative. His glory days are fading fast while programming is about to change in revolutionary ways he won't be part of. It's also ironic that the article mentions he wrote "The Art of Computer Programming" and won the Turing prize, while the authors of "The Art of Multiprocessor Programming (http://www.amazon.com/Art-Multiprocessor-Programming-Maurice-Herlihy/dp/0123705916)" won the Dijkstra and Gödel prizes. There are many 'famous names' sharing their opinion, but I have yet to find one who says multi-core is a bad idea and who has an actual realistic hardware solution.

One of the problems here is that most programmers, even the seasoned ones, have become clueless about hardware behavior. Those who have a better understanding of it realize that multi-core is the way forward and hardware manufacturers are doing an amazing job. I had a course based on Digital Integrated Circuits (http://www.amazon.com/Digital-Integrated-Circuits-Printice-Electronics/dp/0130909963) and one about Advanced Computer Architecture (with a focus on CPU design), and it really broadened my perspective.

Lastly, for every pessimistic article about the challenges of multi-core programming, I can give you ten articles that regard it as a major opportunity...
Title: Re: Multicore theory proposal
Post by: codewarp on June 07, 2008, 08:05:26 PM
Quote from: sinsi on June 07, 2008, 09:50:07 AM

Seems to me that one side here is fixed on hardware and the other on software. I would like an OS that will run on
two of my cores and leave the other two for the one or two programs that I actually use 'simultaneously/multitaskingly'

Famous names: Multi-threaded development joins Gates as yesterday's man (http://www.theregister.co.uk/2008/06/06/gates_knuth_parallel/)

edit: interesting that 8 out of 20 new topics in the lab are about multi core/cpu...

Software is what this web site purports to be about.  Software is what we can actually do something about.  All this airy-fairy, triple-terahertz fantasy nonsense about wishful hardware is hardly fit for a blog on FOXNews--the hardware at any point is what it is--who cares.  Software is what we all come here for, as far as I can tell.  I like the hardware the way it is--I can get the extreme performance, and the serious development work, while much of my competition is sitting in the corner, wishing for magical hardware to come save them--it won't.  It is truly hilarious--this is a great time to be a programmer.

The two for you, two for the OS method of CPU allocation would not be very effective.  Just pull up a Windows task monitor, like taskinfo2000, and get familiar with how many things are running simultaneously in your Windows system--30 processes, 320 threads.  Reservation of shared resources like that would effectively turn a Quad back into a Dual.

You can use two cores now, all you want, just keep two threads busy doing what you need them to do, and you got 'em all to yourself, the system even shrinks down to 2 cores to let you do that.  What is the problem you are trying to solve with this proposal?  Today's system is like a genie in a bottle, your wish is its command.  What exactly is it that is missing for you, to want fixed CPU assignments, with such dire consequences?
Title: Re: Multicore theory proposal
Post by: codewarp on June 07, 2008, 08:18:06 PM
Quote from: askm on June 07, 2008, 07:38:04 PM
Since I am here logged in

asking

What if Intel or AMD, at the time of the 16-bit or even 8-bit cpu heydey,

multi-cored them then ?

Where would we be in the multicore software performance understanding today ?
And where would we be today, if the Egyptians had invented multi-layer integrated circuits 5000 years ago?  Things happen when they are good and ready to happen, and not until then.
Title: Re: Multicore theory proposal
Post by: GregL on June 08, 2008, 01:57:52 AM
This has sparked my interest in what is going on here. I never paid a whole lot of attention to it before.

There is some relevant information here Measuring Multiprocessor System Activity (http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/core/fnef_mul_qtdl.mspx) (best viewed in IE), it's a little dated but I haven't found anything newer.

Under 'Thread Partitioning':
QuoteWindows 2000 uses soft processor affinity, determining automatically which processor should service threads of a process. The soft affinity for a thread is the last processor on which the thread was run or the ideal processor of the thread. The Windows 2000 soft affinity thread scheduling algorithm enhances performance by improving the locality of reference. However, if the ideal or previous processor is busy, soft affinity allows the thread to run on other processors, allowing all processors to be used to capacity.
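
As an aside, if you want to take soft affinity out of the picture while reading such graphs, a thread can be pinned to one core. A minimal sketch (the mask value 1 selects the first processor):


    ; pin the current thread to core 0 so it cannot migrate
    invoke GetCurrentThread                ; pseudo-handle returned in EAX
    invoke SetThreadAffinityMask, eax, 1   ; bit 0 = first processor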
Title: Re: Multicore theory proposal
Post by: hutch-- on June 08, 2008, 02:22:17 AM
There are a number of things in this thread that have made me laugh: you cannot learn from history, WE already know it all; you cannot learn from existing massive parallel hardware like SGI and similar, we already know it all; you cannot learn from direct testing with graphs provided, it does not fit our theory and we know it all, etc ....

With clock speeds, would the designers of the 4 megahertz 8088 have seen x86 architecture running at just under 4 gigahertz? I seriously doubt it. Would they have seen multiple pipeline hardware from the early Pentiums upwards? Not really, as the first pipelined x86 chip was the 486.

RE the free lunch and getting left behind etc .... I did not buy an Itanium and got left behind, I did not buy a RISC box and got left behind, and I never wrote software for the early multiprocessor Mac and got left behind. With interim hardware of the type that is available in x86 I hope I get left behind, as I have almost no respect for hybrid interim fudges that end up in the dustbin of history.

Even though the starting price is a bit high for a desktop, SGI and other companies LONG AGO built hardware controlled multiprocessor computers with throughputs that are out of the PC world. Forget magic libraries running in ring3, it is far too slow as any of the driver guys will tell you; to do anything even vaguely fast you need exclusive ring0 core control, and that occurs only at an OS level of operation. Just above the system idle loop are your thread synchronisation methods, spinlocks, thread yield and so on; trying to do this in ring3 co-operative multitasking is like trying to win the Indy car race with a Model T Ford, it's ancient redundant junk from the early win3.0 days.

Codewarp,

Quote
And the day after they do that, someone will gang four or 16 of them together, then run them using same multi-processor software technology you are trying to tell everybody here doesn't work.  Come on, Hutch!  You know that mhz is a different dimension from multiple CPUs, terahz is patently irrelevant to this discussion.  That you would actually scoff (in several of your posts) at getting 1.9 effective CPU performance out of software, on a Web site dedicated to cutting out clock cycles, is truly astounding to me, and lowers the credibility of your entire Masm Form, Mr. Moderator.

Tread carefully here. While everyone gives me cheek, which I in turn sit up at night wiping away the tear stains over while wringing my hands in despair, try it out on any of the other members and this thread is dead. We have already had one that had to be closed due to the smartarse wisecracks offending members, so keep it objective or see it disappear.

While John posted code and ideas, all I have heard is dogma about high-level code, abstraction, magic libraries and old hat technology. This is, after all, the "Laboratory": for code, not anecdotal waffle and dogma.

Here is a test piece for you to deliver code for. Show us how you can run the identical code on 2 or 4 cores faster than this runs on a single core, noting that the code is memory bound and core intensive.


    LOCAL var   :DWORD

    mov var, 12345678

    push esi
    mov esi, 4000000000

  @@:
    mov eax, var
    mov ecx, var
    mov edx, var
    sub esi, 1
    jnz @B

    pop esi
Title: Re: Multicore theory proposal
Post by: xmetal on June 08, 2008, 06:10:06 AM
Quote from: hutch-- on June 08, 2008, 02:22:17 AM
Here is a test piece for you to deliver code for. Show us how you can run the identical code on 2 or 4 cores faster than this runs on a single core, noting that the code is memory bound and core intensive.


    LOCAL var   :DWORD

    mov var, 12345678

    push esi
    mov esi, 4000000000

  @@:
    mov eax, var
    mov ecx, var
    mov edx, var
    sub esi, 1
    jnz @B

    pop esi


Runs about 4000000000 times faster and does not need more than a single core...


    LOCAL var   :DWORD

    mov var, 12345678

    mov eax, var
    mov ecx, var
    mov edx, var
Title: Re: Multicore theory proposal
Post by: hutch-- on June 08, 2008, 06:52:34 AM
xmetal,

1 pass does not test anything. Look at the words "run the identical code on 2 or 4 cores" to see what the test is about.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 08, 2008, 07:37:04 AM
hutch,

Ring0 thread control is not necessary for fast multi-core processing. Instead of scheduling threads at the O.S. level, you can schedule tasks at the application level.

Think of ray-tracing. A straightforward approach would be to have one thread per pixel, or even per ray. That would require the O.S. to constantly schedule different threads, suspend some, and resume some. It will perform horribly. Indeed only ring0 access might make that a little faster. But you can get far better results by only running as many threads as there are cores, and having each of them continuously request a pixel from a shared queue and process it. And this approach works fine from ring3, the O.S. isn't even actively involved in it. 100% multi-core utilization and minimal overhead right in your lap. What more could you ask for?
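[Editor's note: a minimal sketch of the scheme described above, with one worker per core pulling pixel indices from a shared atomic counter. shadePixel is a hypothetical stand-in for the real per-pixel work; the worker loop and the counter are the point here, not the rendering:

#include <windows.h>
#include <stdio.h>

#define PIXEL_COUNT (1 << 20)

volatile LONG nextPixel = -1;   // shared work counter, no lock needed

void shadePixel(LONG i)
{
    // placeholder for the real ray-tracing work
}

unsigned long __stdcall worker(void *)
{
    for(;;)
    {
        // wait-free work grab: one atomic increment, no O.S. call
        LONG i = InterlockedIncrement(&nextPixel);
        if(i >= PIXEL_COUNT)
            return 0;           // queue exhausted, thread finishes
        shadePixel(i);
    }
}

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    DWORD cores = si.dwNumberOfProcessors;      // one thread per core
    if(cores > MAXIMUM_WAIT_OBJECTS)
        cores = MAXIMUM_WAIT_OBJECTS;

    HANDLE thread[MAXIMUM_WAIT_OBJECTS];
    for(DWORD t = 0; t < cores; t++)
        thread[t] = CreateThread(0, 0, worker, 0, 0, 0);

    WaitForMultipleObjects(cores, thread, TRUE, INFINITE);

    for(DWORD t = 0; t < cores; t++)
        CloseHandle(thread[t]);

    printf("Processed %d pixels on %lu cores\n", PIXEL_COUNT, cores);
    return 0;
}

In practice each grab would take a tile or scanline rather than a single pixel, to amortize the increment, but the scaling behaviour is the same.]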

Of course, for applications where there isn't a clear pile of independent tasks to execute, things get a little more complicated. But that's exactly what this thread is about. It's not going to get solved by hardware, as there is no way to identify independent work more than a couple of instructions away. It has to happen at the software level, using frameworks, tools and languages that help extract the parallel tasks.

By the way, your code is trivial to run 4 times faster on a quad-core. It's not memory limited or anything. The stack variable will be in each of the core's L1 caches.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 08, 2008, 10:06:44 AM
It appears you don't understand the difference in speed between memory writes and register writes. The memory read speed is the limiting factor in the simple test piece, and the loop is short enough to saturate the core's capacity to run the instructions in the loop, thus it is both memory bound and core intensive.

Current multi-threading as per your example is uncontentious as parallel asynchronous operations, but you are synchronising them with the shared queue for pixel data. Saturate the capacity of all 4 cores and you will not see any speed gain. You may get some speed gain if your threads are nowhere near instruction saturation.

The discussion has not been about asynchronous parallel operation; it's been about the "free lunch" notion that you have taken out of the context of slow bloated high-level languages running out of puff as processor speeds fail to outpace the software slowdown of lousy code. Until you can display improved performance in highly processor-intensive code, you're just dabbling with internet-download parallelism technology.

There are enough people here who own late model Core 2 Duos to see the results of single-thread testing, and they are faster than a PIV at a matching clock speed. While bus speed and memory read/write speed have a little to do with this, the load sharing that occurs across both cores is the real difference.

This type of result is showing up in objective testing, whereas I have yet to see the software multitasking you have mentioned actually perform. This is why I suggested a simple test piece run across multiple threads on multiple cores to show us how it's done. I doubt you can deliver.
Title: Re: Multicore theory proposal
Post by: zooba on June 08, 2008, 11:33:49 AM
Okay, I haven't been following this thread the entire time but I dropped in to have a read and decided I should speak up.

The screenshots posted by Greg show only that Windows uses soft affinities (as mentioned earlier). The thread keeps switching between cores - it's an OS function and can be 'disabled' by using SetThreadAffinityMask() or SetProcessAffinityMask(). My own benchmarking counterintuitively showed a slight decrease in speed when memory bound processing (array arithmetic using SSE) was restricted to one core. It also showed (and others have found this as well, though I'm coming up blank for sources right now, so feel free to take this with a grain of salt) that the best number of threads is 1.5x the number of cores (without forcing the affinity).

Personally I am quite happy with my code not automatically being shared between processors. Having a second core to keep running while the other is hung adds a huge amount of stability. Previously a completely hung process would require a restart - now I can actually terminate it. Also, I am quite confident in my ability (after a lot of failures, mind you) to create multithreaded code where it is useful to take advantage of parallel processing and to get it right.

The apparent better performance of single-threaded code on many-core CPUs is most likely due to the operating system being able to use a separate core. A genuinely processor intensive process can actually get 100% of a core without the OS interrupting it (or locking up the system).

Cheers,

Zooba :U
Title: Re: Multicore theory proposal
Post by: johnsa on June 08, 2008, 11:45:23 AM
c0d1f1ed, I agree with most of what you are saying: multi-core/multi-processor is the future, and there is no escaping this fact. It is highly unlikely that anything significant is going to happen architecturally or OS-wise, so the only option we as programmers are left with is to work out how to make the best use of the tools we are given.

That being said, perhaps we should take Hutch's example code, and possibly a few other similar ones which try to create load in different areas (processing, memory etc.), and try to multi-thread/multi-core enable them and see what comes of it. Perhaps we can find some hybrid solutions to common programming tasks (after all, there should be some good brains in this forum). Perhaps even start working towards putting together an additional library for MASMv11/v12 that provides some multi-core "helper" routines.

Perhaps something similar to Intel's TBB for C++..
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 08, 2008, 02:35:19 PM
hutch,

Memory read speed of a shared variable that is not being written to is no different from that of a single thread. Each core simply has a copy of the data in its cache, and because it's in the Shared state of the MESI protocol you don't get any performance penalty. So your code runs four times faster on a quad-core.

Regarding ray-tracing, getting a new pixel is as simple as atomically incrementing a counter. That only takes a dozen cycles or so and can be done in ring3. It's not just lock-free it's wait-free synchronization, giving the best possible scaling behavior. You might want to read chapter 1.1 from The Art of Multiprocessor Programming (http://www.amazon.com/Art-Multiprocessor-Programming-Maurice-Herlihy/dp/0123705916) (just click on SEARCH INSIDE! and go to the Excerpt) for an introduction too.

Proof of multi-core speedups for "highly processor intensive code" is everywhere. Just read AnandTech's Nehalem preview again.

The one and only reason Core 2 Duo is faster for a single thread than a Pentium 4 at the same clock speed is its higher IPC per core. Core 2 can retire 4 instructions per clock, Pentium 4 only 3. Add to this shorter pipelines with lower latencies and you have a clear winner. Reverse Hyper-Threading is a myth, it's physically impossible.

As for "delivering", check my profile.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 08, 2008, 05:32:39 PM
The reason why I suggested objective testing is that I have only heard opinion on what results will be, not what they are. Any processor-intensive piece of code will do, as long as it both processes instructions and performs read/write of memory that is in cache, as using an extended memory range introduces the extra variable of page-table load times if the addresses are far enough apart.

Now the way to prove the theory you have been putting forward is to show an identical piece of code running on one core, complete with timings, then run it on dual cores to show the speed improvement.

Now in relation to your description of your shared memory queue, the problem is that, even if it works on processor- and memory-intensive code, it stops waiting threads, and thus the cores they are running on, stone dead. While this may be typical gamer brain, it's hardly useful on a modern computer. This is exactly why this type of work needs to be done at the OS level just above the idle loop, where the wait time can be passed off to another running process or, for that matter, another thread on the same core controlled by the running application.
Title: Re: Multicore theory proposal
Post by: codewarp on June 08, 2008, 05:48:19 PM
Quote from: johnsa on June 08, 2008, 11:45:23 AM
That being said, perhaps we should take Hutch's example code and possibly a few other similar ones which try to create load in different areas, processing, memory etc and try to multi-thread multi-core enable them and see what comes of it. Perhaps we can find some hybrid solutions to common programming tasks (after all there should be some good brains in this forum). Perhaps even start working towards putting together an additional library for MASMv11/v12 that provides some multi-core "helper" routines.

Hutch is right, his code cannot be helped by using MPs without changing the code, but so what?  Nobody is trying to do that using this technology, except for Hutch.  I'm afraid Hutch's example does not address any issues at all.  He insists on treating threads as synchronous tools for him to switch on and off, like calling a subroutine.  Since that doesn't work, and never will, he declares the technology worthless.  The truth is that he just doesn't like it if you have to change the code--he states as much in the "rules" governing his example.  Too bad, you are just gonna have ta learn new programming models, that's just the way it is.  We did it in the structured programming days.  We did it in the object-oriented days.  We did it with networks, and with the Internet, and with multiple threads and other things, and now with widespread MPs.

The marketplace wants multiple processors, multiple processors require fundamental changes in the programming models at the application level, and the marketplace knows what to do with code and programmers who refuse to change.  Hutch's rule, "You can't change this", is his own stumbling block, a limitation of his own that he is going to have to struggle with by himself.  Sorry Hutch, when the stakes are this high, we will be changing the code; rules are made to be broken.

Those that try to hold back the dawn are doomed to failure.  
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 08, 2008, 06:00:25 PM
hutch,

I have done objective testing. It's four times faster on my quad-core.

Wait-free synchronization doesn't stop anything, that's why it's called wait-free. It takes a dozen cycles to do the atomic increment and that's it. Any approach in ring0 is going to take at least this long (or should I say short).

By the way, I'm not a gamer, I'm an engineer.
Title: Re: Multicore theory proposal
Post by: codewarp on June 08, 2008, 06:58:47 PM
Quote from: c0d1f1ed on June 08, 2008, 06:00:25 PM
hutch,

I have done objective testing. It's four times faster on my quad-core.

Wait-free synchronization doesn't stop anything, that's why it's called wait-free. It takes a dozen cycles to do the atomic increment and that's it. Any approach in ring0 is going to take at least this long (or should I say short).

By the way, I'm not a gamer, I'm an engineer.

If I might be so bold, I don't think that is what Hutch is after.  He wants to see that one loop sped up with the application of MPs (multi-processors), without changing the code.  Of course, the whole challenge is a contrived set up.  It absurdly tries to hold MPs to a ridiculous and laughable synchronous standard, like task switches in one cycle.  Besides, if nobody can change the code, then the code can't be written in the first place without violating its own rules.

Until you completely surrender to the reality that threads are asynchronous and out of time with one another, you will remain ineffective with MPs.  Once this fundamental truth sinks in, you can start to compute on-the-fly, using programming models designed to tolerate asynchronous computation.  Those of you expecting synchronous behavior from MPs have seriously bet on the wrong horse, and you have to work extra hard to overcome your erroneous expectation.  No matter how hard and fast the clock speeds get, MPs and all the same issues are present all the way up--resistance is futile.

Holding MPs to a synchronous standard is like that old joke: "Stop staring at your radio".  Hutch, if you want synchronous parallelism, learn about SSE, and stop staring at MPs.
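[Editor's note: the synchronous parallelism being pointed at here is SIMD: one instruction stream operating on several data elements in lock-step. A trivial sketch with SSE intrinsics, purely illustrative:

#include <xmmintrin.h>
#include <stdio.h>

int main()
{
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 5.0f, 6.0f, 7.0f, 8.0f };
    float r[4];

    __m128 va = _mm_loadu_ps(a);        // load four floats
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb);     // four additions, one instruction
    _mm_storeu_ps(r, vr);

    printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
    return 0;
}

Unlike threads on separate cores, the four lanes cannot drift out of step with one another, which is what makes SIMD the natural tool for synchronous parallelism.]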
Title: Re: Multicore theory proposal
Post by: hutch-- on June 08, 2008, 07:00:56 PM
c0d1f1ed,

I labelled the approach as the same as a gamer brain for a reason: to synchronise your parallel threads you either use the OS, which is far too slow, or you run a spinlock wait loop; there is no sub-millisecond method available to yield the wasted time, so you lock up the processor for the synchronisation wait time. Your notion of "wait free" is a myth here, it does not happen by magic.

Now, I have heard you state that you have tested intensive code and it goes faster, but we have yet to see it from you. Any processor-intensive process that addresses memory will do; try a memory fill algo, which is commonly used in the graphics/games area, and see if you can get an identical algorithm to go two or four times faster than its single-thread timings using multiple cores. A ratio of 1.9 would be fine.

codewarp,

> The marketplace wants multiple processors

It already has them; you just don't like the starting price. What the market in fact wants is cheap high-performance processors, but PC processors are only in their infancy with multiple processors. One of the few things Itaniums are good for is multiple parallel processing, if they have high-performance dedicated hardware to support them.

Quote
We did it in the structured programming days.  We did it in the object-oriented days.  We did it with networks, and with the Internet, and with multiple threads and other things, and now with widespread MPs.

This tells me much more than the preceding anecdotal waffle. Flitting from one trend to another, complete with the bloat, performance degradation and free lunch assumptions that go with it, and now, with the current hiatus in clock speeds, the free lunch movement is trying to flog the coming multicore hardware as the next free lunch for lousy slow bloated code. How many cores will you need to make the first terabyte "Hello World" run fast?

> Those that try to hold back the dawn are doomed to failure.

Those who want the water to run back uphill to where it was in the past will be disappointed.

LATER: Here is an SGI cheapie built with x86-64 hardware.

http://www.sgi.com/products/servers/altix/xe/

Have a look at the spec sheet of different configurations, clustering and the like. The stuff you have in mind is kids' stuff.
Title: Re: Multicore theory proposal
Post by: codewarp on June 08, 2008, 07:41:29 PM
Quote from: hutch-- on June 08, 2008, 07:00:56 PM
This tells me much more than the preceding anecdotal waffle. Flitting from one trend to another, complete with the bloat, performance degradation and free lunch assumptions that go with it, and now, with the current hiatus in clock speeds, the free lunch movement is trying to flog the coming multicore hardware as the next free lunch for lousy slow bloated code. How many cores will you need to make the first terabyte "Hello World" run fast?

This is a technical topic, deserving of a serious discussion.  Hutch, you are obviously emotionally invested in being right about this.  The rest of the world disagrees with you.  Since we have heard nothing from you on the serious aspects of this discussion, and hear the same "can't get there from here" message over and over again, I would like you to excuse yourself from this thread until you have something more constructive to contribute.  The quote above makes this painfully plain and obvious for anyone to see.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 08, 2008, 07:44:15 PM
Any more wisecracks and I will excuse you from the thread. It's a case of put up or shut up, and I have yet to see any working code from you.
Title: Re: Multicore theory proposal
Post by: sinsi on June 09, 2008, 02:57:59 AM
Quote from: c0d1f1ed on June 08, 2008, 06:00:25 PM
I have done objective testing. It's four times faster on my quad-core.

Post some code, then I can test it on my quad...
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 09, 2008, 01:40:47 PM
Quote from: hutch-- on June 08, 2008, 07:00:56 PM
I labelled the approach as the same as a gamer brain for a reason: to synchronise your parallel threads you either use the OS, which is far too slow, or you run a spinlock wait loop; there is no sub-millisecond method available to yield the wasted time, so you lock up the processor for the synchronisation wait time. Your notion of "wait free" is a myth here, it does not happen by magic.

Now, I have heard you state that you have tested intensive code and it goes faster, but we have yet to see it from you. Any processor-intensive process that addresses memory will do; try a memory fill algo, which is commonly used in the graphics/games area, and see if you can get an identical algorithm to go two or four times faster than its single-thread timings using multiple cores. A ratio of 1.9 would be fine.

No, I do not use the O.S. for synchronization, nor do I use a spinlock wait loop. Read chapter 1.1 of The Art of Multiprocessor Programming. The atomic increment is the only form of synchronization, and it takes nanoseconds, not milliseconds. Wait-free synchronization is not a myth; it's backed by peer-reviewed research (http://cs.brown.edu/people/mph/Herlihy91/p124-herlihy.pdf).

Didn't I tell you exactly how to trivially turn your code into multi-threaded code? You don't even need to bother with spin locks or lock-free or wait-free approaches. Anyway, for the lazy:

#include <windows.h>
#include <stdio.h>

int n;

HANDLE done[4];

void hutchTask()
{
    int var = 12345678;

    for(unsigned int i = 0; i < 4000000000 / n; i++)
    {
        __asm
        {
            mov eax, var
            mov ecx, var
            mov edx, var
        }
    }
}

unsigned long __stdcall threadRoutine(void *parameter)
{
    hutchTask();

    SetEvent(done[*(int*)parameter]);

    return 0;
}

int main()
{
    DWORD elapsedMilliseconds[4];

    for(int threads = 1; threads <= 4; threads++)
    {
        n = threads;

        for(int i = 0; i < n; i++)
        {
            done[i] = CreateEvent(0, FALSE, FALSE, 0);
        }

        HANDLE threadHandle[4];
        int parameter[4] = {0, 1, 2, 3};

        DWORD startTime = GetTickCount();

        for(int i = 0; i < n; i++)
        {
            threadHandle[i] = CreateThread(0, 0, threadRoutine, &parameter[i], 0, 0);
        }

        WaitForMultipleObjects(n, done, true, INFINITE);

        elapsedMilliseconds[n - 1] = GetTickCount() - startTime;

        for(int i = 0; i < n; i++)
        {
            CloseHandle(done[i]);
            CloseHandle(threadHandle[i]);
        }

        printf("Milliseconds for %d threads: %d, multi-thread speedup: %f\n", n, elapsedMilliseconds[n - 1], (float)elapsedMilliseconds[0] / elapsedMilliseconds[n - 1]);
    }
}


Running this on my Q6600 gives me:

Quote
Milliseconds for 1 threads: 61090, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 30904, multi-thread speedup: 1.976767
Milliseconds for 3 threads: 21247, multi-thread speedup: 2.875229
Milliseconds for 4 threads: 17800, multi-thread speedup: 3.432022

Quote from: hutch--
Show us how you can run the identical code on 2 or 4 cores faster than this runs on a single core.

Now, will you please show me a peer-reviewed article that shows Reverse Hyper-Threading is real? If you can't, stating that you might be wrong will do.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 09, 2008, 02:07:54 PM
Can you post your build info, as I get the following trying to build your example with the vctoolkit. Alternatively, post the complete project with its makefile or project file so someone else can build it.


@echo off

set lib=h:\vctoolkit\lib\
set include=h:\vctoolkit\include\

if exist mproc.exe del mproc.exe
if exist mproc.obj del mproc.obj

h:\vctoolkit\bin\cl /c /G7 /O2 /Ot /GA /TC /W3 /FA mproc.c
h:\vctoolkit\bin\Link /SUBSYSTEM:WINDOWS /libpath:h:\vctoolkit\lib gdi32.lib kernel32.lib user32.lib mproc.obj

dir mproc.*

pause




Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3052 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

mproc.c
mproc.c(12) : error C2143: syntax error : missing ';' before 'type'
mproc.c(12) : error C2143: syntax error : missing ';' before 'type'
mproc.c(12) : error C2143: syntax error : missing ')' before 'type'
mproc.c(12) : error C2143: syntax error : missing ';' before 'type'
mproc.c(12) : error C2065: 'i' : undeclared identifier
mproc.c(12) : warning C4018: '<' : signed/unsigned mismatch
mproc.c(12) : warning C4552: '<' : operator has no effect; expected operator with side-effect
mproc.c(12) : error C2059: syntax error : ')'
mproc.c(13) : error C2143: syntax error : missing ';' before '{'
mproc.c(36) : error C2143: syntax error : missing ';' before 'type'
mproc.c(36) : error C2143: syntax error : missing ';' before 'type'
mproc.c(36) : error C2143: syntax error : missing ')' before 'type'
mproc.c(36) : error C2143: syntax error : missing ';' before 'type'
mproc.c(36) : error C2065: 'threads' : undeclared identifier
mproc.c(36) : warning C4552: '<=' : operator has no effect; expected operator with side-effect
mproc.c(36) : error C2059: syntax error : ')'
mproc.c(37) : error C2143: syntax error : missing ';' before '{'
mproc.c(40) : error C2143: syntax error : missing ';' before 'type'
mproc.c(40) : error C2143: syntax error : missing ';' before 'type'
mproc.c(40) : error C2143: syntax error : missing ')' before 'type'
mproc.c(40) : error C2143: syntax error : missing ';' before 'type'
mproc.c(40) : warning C4552: '<' : operator has no effect; expected operator with side-effect
mproc.c(40) : error C2059: syntax error : ')'
mproc.c(41) : error C2143: syntax error : missing ';' before '{'
mproc.c(45) : error C2275: 'HANDLE' : illegal use of this type as an expression
        h:\vctoolkit\include\WinNT.h(342) : see declaration of 'HANDLE'
mproc.c(45) : error C2146: syntax error : missing ';' before identifier 'threadHandle'
mproc.c(45) : error C2144: syntax error : '<Unknown>' should be preceded by '<Unknown>'
mproc.c(45) : error C2144: syntax error : '<Unknown>' should be preceded by '<Unknown>'
mproc.c(45) : error C2143: syntax error : missing ';' before 'identifier'
mproc.c(45) : error C2065: 'threadHandle' : undeclared identifier
mproc.c(45) : error C2109: subscript requires array or pointer type
mproc.c(46) : error C2143: syntax error : missing ';' before 'type'
mproc.c(48) : error C2275: 'DWORD' : illegal use of this type as an expression
        h:\vctoolkit\include\WinDef.h(141) : see declaration of 'DWORD'
mproc.c(48) : error C2146: syntax error : missing ';' before identifier 'startTime'
mproc.c(48) : error C2144: syntax error : '<Unknown>' should be preceded by '<Unknown>'
mproc.c(48) : error C2144: syntax error : '<Unknown>' should be preceded by '<Unknown>'
mproc.c(48) : error C2143: syntax error : missing ';' before 'identifier'
mproc.c(48) : error C2065: 'startTime' : undeclared identifier
mproc.c(50) : error C2143: syntax error : missing ';' before 'type'
mproc.c(50) : error C2143: syntax error : missing ';' before 'type'
mproc.c(50) : error C2143: syntax error : missing ')' before 'type'
mproc.c(50) : error C2143: syntax error : missing ';' before 'type'
mproc.c(50) : warning C4552: '<' : operator has no effect; expected operator with side-effect
mproc.c(50) : error C2059: syntax error : ')'
mproc.c(51) : error C2143: syntax error : missing ';' before '{'
mproc.c(52) : error C2109: subscript requires array or pointer type
mproc.c(52) : error C2065: 'parameter' : undeclared identifier
mproc.c(52) : error C2109: subscript requires array or pointer type
mproc.c(52) : error C2198: 'CreateThread' : too few arguments for call through pointer-to-function
mproc.c(55) : error C2065: 'true' : undeclared identifier
mproc.c(59) : error C2143: syntax error : missing ';' before 'type'
mproc.c(59) : error C2143: syntax error : missing ';' before 'type'
mproc.c(59) : error C2143: syntax error : missing ')' before 'type'
mproc.c(59) : error C2143: syntax error : missing ';' before 'type'
mproc.c(59) : warning C4552: '<' : operator has no effect; expected operator with side-effect
mproc.c(59) : error C2059: syntax error : ')'
mproc.c(60) : error C2143: syntax error : missing ';' before '{'
mproc.c(62) : error C2109: subscript requires array or pointer type
mproc.c(62) : error C2198: 'CloseHandle' : too few arguments for call through pointer-to-function
Microsoft (R) Incremental Linker Version 7.10.3052
Copyright (C) Microsoft Corporation.  All rights reserved.

LINK : fatal error LNK1181: cannot open input file 'mproc.obj'
Volume in drive H is WIN2K_H
Volume Serial Number is 20E8-3719

Directory of H:\vctoolkit\multiproc

06/09/2008  11:46p               1,413 mproc.c
06/09/2008  11:59p               1,413 mproc.cpp
               2 File(s)          2,826 bytes
               0 Dir(s)  17,639,866,368 bytes free
Press any key to continue . . .
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 09, 2008, 02:46:43 PM
Visual C++ Express (http://www.microsoft.com/express/vc/Default.aspx)
Title: Re: Multicore theory proposal
Post by: hutch-- on June 09, 2008, 02:53:25 PM
I didn't ask for a link, I asked for your build information.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 09, 2008, 02:57:27 PM
Just hit F5.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 09, 2008, 02:59:57 PM
Spare us the smartarse wisecracks and just post your build information. This code looks like it will work, so why the kiddies' games?
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 09, 2008, 03:01:22 PM
Download Visual C++ Express, copy/paste the code and hit F5. Where's the problem?
Title: Re: Multicore theory proposal
Post by: hutch-- on June 09, 2008, 03:09:08 PM
While we are waiting for your buildable source, here is the partial conversion in MASM.

Timing

4515 MS Single thread timing
Press any key to continue ...

5 year old PIV 2.8 gig Northwood.


Your timings on quad core Intel.

Milliseconds for 1 threads: 61090, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 30904, multi-thread speedup: 1.976767
Milliseconds for 3 threads: 21247, multi-thread speedup: 2.875229
Milliseconds for 4 threads: 17800, multi-thread speedup: 3.432022


Why is there such a major timing difference between a 5-year-old PIV and your quad core?

Source

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    threadRoutine PROTO :DWORD

    .data
      elapsedMilliseconds dd 0,1,2,3
      done dd 0,0,0,0

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

  ; --------------------
  ; single thread timing
  ; --------------------
    invoke GetTickCount
    push eax

    invoke threadRoutine,0

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," MS Single thread timing",13,10

    ret

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

threadRoutine proc parameter:DWORD

    LOCAL var   :DWORD

    mov var, 12345678

    push esi
    mov esi, 4000000000

  align 16
  @@:
    mov eax, var
    mov ecx, var
    mov edx, var
    sub esi, 1
    jnz @B

    pop esi

    invoke SetEvent,parameter
    mov done, eax

    ret

threadRoutine endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 09, 2008, 03:24:07 PM
Replace /TC by /TP.

The difference is due to the division. Without it I get:

Quote
Milliseconds for 1 threads: 5055, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 2543, multi-thread speedup: 1.987810
Milliseconds for 3 threads: 1700, multi-thread speedup: 2.973529
Milliseconds for 4 threads: 1326, multi-thread speedup: 3.812217
Press any key to continue . . .

Interestingly it gets even closer to a 4x speedup, which was the whole point of the exercise anyway...

By the way, where's that scientific Reverse Hyper-Threading paper?
Title: Re: Multicore theory proposal
Post by: hutch-- on June 09, 2008, 03:37:13 PM
This is fine, but we still don't have a buildable source. Is there some reason why you won't post your build information? What happened to ANSI portable C?
Title: Re: Multicore theory proposal
Post by: askm on June 09, 2008, 03:39:36 PM
How many total instructions are being executed in either of the tests you are all running?
c0d1f1ed, are you using Express 2005 or Express 2008, or somewhere in between? Does the 2008 version do a better job if (all else being equal) its base compiler is better, for that matter?

Does anyone in the forum use the latest Intel compiler (as an option in the latest Visual Studio)? I read it's supposed to be of great use for multicore.

(I can only multidream, as my hardware+software+experience "is not there yet".)
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 09, 2008, 04:05:51 PM
Quote from: hutch-- on June 09, 2008, 03:37:13 PM
This is fine but we still don't have a buildable source. Is there some reason why you won't post your build information ? What happened to ANSI portable C ?

What errors do you get now? I use Visual C++ and avoid all the hassle of command line options. Anyway, if it helps you: '/Ox /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /Fo"Release\\" /Fd"Release\vc80.pdb" /W3 /nologo /c /Wp64 /Zi /TP /errorReport:prompt'. It's not C, it's C++.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 09, 2008, 04:30:22 PM
Thanks,

It builds with "/TP". What I don't understand is why it takes 44 seconds (timed watching the system clock) to run a single thread on this box before the additional threads are run, when the asm single-thread test piece runs in about 4.5 seconds.

I tracked down why your test code was so slow: VC had made a mess of the timings with the "for" loop.

Replace the code as follows.


// void hutchTask()
// {
//     int var = 12345678;
//
//     for(unsigned int i = 0; i < 4000000000 / n; i++)
//     {
//         __asm
//         {
//             mov eax, var
//             mov ecx, var
//             mov edx, var
//         }
//     }
// }

void hutchTask()
  {
      int var = 12345678;
      __asm
      {
          push esi
          mov esi, 4000000000
        lbl0:
          mov eax, var
          mov ecx, var
          mov edx, var
          sub esi, 1
          jnz lbl0
          pop esi
      }
  }


This yields the following timings on my single-core PIV, which are predictable...


Milliseconds for 1 threads: 4532, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 9062, multi-thread speedup: 0.500110
Milliseconds for 3 threads: 13531, multi-thread speedup: 0.334935
Milliseconds for 4 threads: 18032, multi-thread speedup: 0.251331


You are using a tail end synchronisation technique,


SetEvent(done[*(int*)parameter]);    // each thread on exit
....
WaitForMultipleObjects(n, done, true, INFINITE);  // wait for all to finish


I have attached the fixed version of your test piece with a working binary so other people can test your code on either a dual core or a quad core.

[attachment deleted by admin]
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 09, 2008, 04:44:45 PM
Quote from: hutch-- on June 09, 2008, 04:30:22 PM
I tracked down why your test code was so slow: VC had made a mess of the timings with the "for" loop.

Like I said, it's the division. In Visual C++, place a breakpoint at the loop and press Alt+F8 to see the disassembly during debugging. Do the division outside of the loop and on a Q6600 you get the last results I posted. Your version doesn't do the division at all so the speedup factor is wrong.
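[Editor's note: the fix both posters converge on amounts to hoisting the division out of the loop condition. A sketch of the hoisted version; n is the thread-count global from the code above:

void hutchTask()
{
    int var = 12345678;

    // n is a global, and the __asm block could touch memory, so the
    // compiler re-evaluates 4000000000 / n on every iteration of the
    // original loop; computing it once up front removes that cost.
    unsigned int iterations = 4000000000u / n;

    for(unsigned int i = 0; i < iterations; i++)
    {
        __asm
        {
            mov eax, var
            mov ecx, var
            mov edx, var
        }
    }
}
]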
Title: Re: Multicore theory proposal
Post by: hutch-- on June 09, 2008, 04:57:45 PM
At least it gets the timings right. The quad-core results should show very close to the same timing for all 4 tests.

Built as ANSI code with CL from the VCTOOLKIT as a C file. Disassembled with DUMPBIN from VC2005 with no magic libraries, abstraction or any other high level claptrap.

Now I wonder how well this open thread startup with a synchronised tail end to display the results scales to a task like repeatedly filling a buffer in one thread while writing to the buffer in the calling thread? For 50 frames a second this needs to be done in 20 ms, the more so if the two operations do not take the same time.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 09, 2008, 05:38:04 PM
Quote from: hutch-- on June 09, 2008, 04:57:45 PM
Now I wonder how well this open thread startup with a synchronised tail end to display the results scales to a task like repeatedly filling a buffer in one thread while writing to the buffer in the calling thread? For 50 frames a second this needs to be done in 20 ms, the more so if the two operations do not take the same time.

As I've been saying all along, O.S.-level synchronization is terribly slow, and we shouldn't expect any revolutionary ring0 multi-threading solution. It just happens to work OK for this code because the task can be split into equal-sized subtasks of several seconds. As soon as you do something more interesting this fails. Have you finally read chapter 1.1 of The Art of Multiprocessor Programming yet? The solution, which I've also been repeating over and over again, is to keep the threads running and schedule the tasks with lock-free or, better yet, wait-free synchronization.

Reading and writing buffers that are larger than the cache may seem to defeat it. But the trick is to subdivide the buffers and treat the processing of the sections as separate tasks, and to perform as many tasks as possible on a given section before you go to the next. Dataflow programming (http://en.wikipedia.org/wiki/Dataflow_programming) paradigms are very useful here.
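[Editor's note: a minimal sketch of that subdivision idea, assuming two hypothetical passes stageA and stageB over a buffer much larger than the cache; the 16 KB section size is an assumption:

#include <stdio.h>
#include <stdlib.h>

#define SECTION (16 * 1024 / sizeof(float))     // assumed to fit in L1

void stageA(float *p, size_t n) { for(size_t i = 0; i < n; i++) p[i] *= 2.0f; }
void stageB(float *p, size_t n) { for(size_t i = 0; i < n; i++) p[i] += 1.0f; }

void processBuffer(float *buffer, size_t count)
{
    for(size_t offset = 0; offset < count; offset += SECTION)
    {
        size_t chunk = (count - offset < SECTION) ? count - offset : SECTION;

        // Both passes touch the same section back to back, so the
        // second pass reads from cache instead of main memory.
        stageA(buffer + offset, chunk);
        stageB(buffer + offset, chunk);
    }
}

int main()
{
    size_t count = 16 * 1024 * 1024;                    // 64 MB of floats
    float *buffer = (float*)calloc(count, sizeof(float));
    processBuffer(buffer, count);
    printf("done: %f\n", buffer[0]);
    free(buffer);
    return 0;
}

Each section then becomes an independent task that can be handed to whichever core is free, which is where the dataflow scheduling comes in.]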

So can we finally come to a consensus that multi-core is very useful, even though it takes some programming effort to maximize efficiency?
Title: Re: Multicore theory proposal
Post by: hutch-- on June 09, 2008, 06:01:58 PM
My problem is not with multiple processors or even multicore processors, they have been around for a very long time; it's with how useful they are, given the vast range of code types that get written on a daily basis and the level of OS control in current OS versions. Non-synched threads have a very limited range of tasks that they can perform, and a vast range of applications do not suit that type of separation.

Multiple-pipeline hardware already parallels instructions when they are scheduled correctly, and this makes each thread faster on a normal PC. But I will make the point again that vastly larger high-processor-count hardware uses dedicated hardware synchronisation to parallel up to 1024 Itaniums, and they can produce massive throughput many times faster than a single Itanium. I have not read all of the tech data for the x86-64 cheapies from SGI, but with the 8-core dual-quad option I have no doubt the throughput is competitive for the core count.

What is missing in your example is the need for abstraction, magic libraries and the pile of claptrap that comes with bloated high-level tools; it's straight API code using tail end synchronisation to display the results.

I will be interested to see the results of other members with dual or quad core hardware.
Title: Re: Multicore theory proposal
Post by: sinsi on June 09, 2008, 09:55:04 PM
From the EXE in your attachment Hutch,

Milliseconds for 1 threads: 5000, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 5016, multi-thread speedup: 0.996810
Milliseconds for 3 threads: 5000, multi-thread speedup: 1.000000
Milliseconds for 4 threads: 5000, multi-thread speedup: 1.000000

????????
Title: Re: Multicore theory proposal
Post by: GregL on June 09, 2008, 10:15:13 PM
Here's what I'm seeing with hutch's code (Pentium D 940, Vista SP1):


Milliseconds for 1 threads: 3938, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 4203, multi-thread speedup: 0.936950
Milliseconds for 3 threads: 6250, multi-thread speedup: 0.630080
Milliseconds for 4 threads: 8203, multi-thread speedup: 0.480068


They're all distributed to both cores. Graph attached.



[attachment deleted by admin]
Title: Re: Multicore theory proposal
Post by: MichaelW on June 09, 2008, 11:17:35 PM
Quote from: c0d1f1ed on June 09, 2008, 03:24:07 PM
Replace /TC by /TP.

The difference is due to the division. Without it I get:

Quote
Milliseconds for 1 threads: 5055, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 2543, multi-thread speedup: 1.987810
Milliseconds for 3 threads: 1700, multi-thread speedup: 2.973529
Milliseconds for 4 threads: 1326, multi-thread speedup: 3.812217
Press any key to continue . . .

How can this be? Without the division each thread will do 4 billion iterations, so if each thread is running on a different core they should all complete in approximately the same time, independent of the number of threads.

And why exactly is it that you did not post an EXE and/or made it difficult for us to create our own EXE from your source? And while I'm asking questions, why not a source in the preferred language of this forum?
Title: Re: Multicore theory proposal
Post by: hutch-- on June 10, 2008, 12:47:04 AM
OK,

Here is my awake version in MASM: abstraction free, magic-library free and bloat free. First I must thank c0d1f1ed for providing some working code that broke the gabfest and other related waffle. I think I have the tail end synch working correctly, and the times on my PIV reflect the increased workload from 1 to 2 to 4 threads.

These are the timings I get on a PIV which are predictable with a single core.


===========================================
Run a single thread on fixed test procedure
===========================================
4516 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
9000 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
17969 MS Four thread timing

Press any key to continue ...


The dual-thread code should produce the same timing as the single thread on a 2-core processor; the 4-thread code should produce the same timing as the single-thread version on a quad core.

[attachment deleted by admin]
Title: Re: Multicore theory proposal
Post by: GregL on June 10, 2008, 01:12:46 AM
I seem to do best with a single thread. ??

(Pentium D 940, Vista SP1).

===========================================
Run a single thread on fixed test procedure
===========================================
3891 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
4578 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
8531 MS Four thread timing



Same deal on the graph.

Title: Re: Multicore theory proposal
Post by: NightWare on June 10, 2008, 01:21:29 AM
Quote from: Greg on June 10, 2008, 01:12:46 AM
I seem to do best with a single thread. ??
of course... hardware always faster than software  :bg

my results on core2duo T7300 2ghz :
===========================================
Run a single thread on fixed test procedure
===========================================
5990 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
6100 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
12152 MS Four thread timing

Press any key to continue ...
Title: Re: Multicore theory proposal
Post by: hutch-- on June 10, 2008, 01:35:03 AM
These results make sense: the single thread has no extra thread overhead as it does not need it. The 2-thread test does twice the work and the 4-thread test does 4 times the work, so allowing for the thread overhead of the latter two tests you are getting close to a two times speedup. It should also show a 4 times speedup on a double dual core.
Title: Re: Multicore theory proposal
Post by: sinsi on June 10, 2008, 01:37:44 AM
Q6600 2.4 GHz

===========================================
Run a single thread on fixed test procedure
===========================================
5016 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
5000 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
5000 MS Four thread timing

Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 10, 2008, 02:18:31 PM
Quote from: MichaelW on June 09, 2008, 11:17:35 PM
How can this be? Without the division each thread will do 4 billion iterations, so if each thread is running on a different core they should all complete in approximately the same time, independent of the number of threads.

It's without the division inside the loop. I obviously still do it outside the loop, to divide up the work among the threads and get the correct results.

Quote
And why exactly is it that you did not post an EXE and/or made it difficult for us to create our own EXE from your source? And while I'm asking questions, why not a source in the preferred language of this forum?

I didn't post an executable because then I could have been accused of cheating or whatever. Plus they asked for source, which is exactly what I gave them. And I wrote it in C++ because that was faster to write and it's trivial to understand and modify. Last but not least I've proven the point which is the only thing relevant right now.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 10, 2008, 03:33:13 PM
hmmmm,

> Last but not least I've proven the point which is the only thing relevant right now.

Which one? The bloat, abstraction, magic libraries, "it can't be done in assembler"? There have been many points made but few successfully. Your example worked after it was fixed and then rewritten, which was a change from the gabfest and waffle, but let's face it, it's 1995 Windows thread technology with tail end, system-method, wait-based synchronisation. So much for ring3 co-operative multitasking; it's normal Windows ring0 thread synchronisation.

I think your code was worthwhile and it has been a useful contribution, but it has extremely limited application to the vast majority of code written on a day to day basis, as unsynchronised concurrent threads are very poorly suited to general purpose programming.
Title: Re: Multicore theory proposal
Post by: NightWare on June 10, 2008, 09:45:20 PM
Since the single thread isn't shared across both cores here (otherwise the speed would be divided by 2), maybe filling a small area should show when the threads start to be shared (16/32 KB, 2x or 4x the L1 cache should be fine); then we will see if the cache is the factor...
Title: Re: Multicore theory proposal
Post by: johnsa on June 10, 2008, 10:26:01 PM
I agree. We've gotten a base here that proves it is possible to set up multiple threads assigned to cores (perhaps we should set the thread affinity mapping in the example) and that each thread is capable of running code in parallel while accessing a core-local (L1-cached) variable in each thread.

So now, to up the ante, we should create a single large data structure in memory (too large for the L1 caches) and have each thread perform a function on that structure, making sure that the calculation is constant-time and not dependent on the position within the data, etc.
Each thread could take 1/2 or 1/4 of the total data (chunks or interleaved?).

Then we compare that against the single thread, to see how much the shared memory/caching implications affect the result.
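[Editor's note: a sketch of what that test might look like with a contiguous (chunked) split; normalizeSlice and the vector data are hypothetical stand-ins for the constant-time work. An interleaved split would only change the index arithmetic, at the cost of cores sharing cache lines:

#include <windows.h>
#include <math.h>
#include <stdio.h>

#define THREADS 4
#define VECTORS (4 * 1024 * 1024)   // 4M vectors of 4 floats = 64 MB

static float (*vec)[4];

unsigned long __stdcall normalizeSlice(void *param)
{
    int t = (int)(INT_PTR)param;
    int begin = VECTORS / THREADS * t;
    int end   = (t == THREADS - 1) ? VECTORS : begin + VECTORS / THREADS;

    // Each thread streams through its own contiguous quarter of the data.
    for(int i = begin; i < end; i++)
    {
        float len = sqrtf(vec[i][0] * vec[i][0] +
                          vec[i][1] * vec[i][1] +
                          vec[i][2] * vec[i][2]);
        if(len > 0.0f)
        {
            vec[i][0] /= len;
            vec[i][1] /= len;
            vec[i][2] /= len;
        }
    }
    return 0;
}

int main()
{
    vec = (float(*)[4])VirtualAlloc(0, VECTORS * sizeof *vec,
                                    MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    for(int i = 0; i < VECTORS; i++)
    {
        vec[i][0] = 1.0f; vec[i][1] = 2.0f; vec[i][2] = 3.0f; vec[i][3] = 0.0f;
    }

    DWORD start = GetTickCount();

    HANDLE h[THREADS];
    for(int t = 0; t < THREADS; t++)
        h[t] = CreateThread(0, 0, normalizeSlice, (void*)(INT_PTR)t, 0, 0);
    WaitForMultipleObjects(THREADS, h, TRUE, INFINITE);

    printf("%lu ms with %d threads\n", GetTickCount() - start, THREADS);

    for(int t = 0; t < THREADS; t++)
        CloseHandle(h[t]);
    VirtualFree(vec, 0, MEM_RELEASE);
    return 0;
}

Comparing this against a single thread doing all VECTORS should show how much the shared memory bus limits the scaling once the working set blows the caches.]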
Title: Re: Multicore theory proposal
Post by: johnsa on June 10, 2008, 10:28:56 PM
Refer to my SSE Weirdness thread... we could make the data structure a huge list of vectors and run through and normalize them... then at the same time we can test on different processors what might be causing the weird movaps/movdqa behaviour I mentioned in that thread. Perhaps it's just a byproduct of my crufty Pentium M :)
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 11, 2008, 07:00:54 AM
Quote from: hutch-- on June 10, 2008, 03:33:13 PM
Which one? The bloat, abstraction, magic libraries, "it can't be done in assembler"? There have been many points made but few successfully. Your example worked after it was fixed and then rewritten, which was a change from the gabfest and waffle, but let's face it, it's 1995 Windows thread technology with tail end, system-method, wait-based synchronisation.

It worked flawlessly the way it was. The only one having trouble building it was you, and ironically it was me who pointed out how to get it to build your way. The division within the loop didn't change a damn thing about the validity of the test.

Quote
So much for ring3 co-operative multitasking; it's normal Windows ring0 thread synchronisation.

Duh! I wasn't going to write an example that was ten times longer if I could prove the point of multi-core throughput with a trivial example.

Now it's your turn to prove things. Write a reusable framework that can perform different tasks of varying execution time and scales well with the number of cores, without using lock-free synchronization and completely in assembly. As long as you can't show me that I'll have every reason to assume it can't be done...

Oh, and what about your Reverse Hyper-Threading claims? Given up on that already as well? And here's some more facts for you to consider: InfiniBand (http://en.wikipedia.org/wiki/InfiniBand), used to interconnect Opteron-based supercomputer nodes, has an end-to-end latency on the order of a microsecond and up. So much for magical hardware offering fast synchronization.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 11, 2008, 08:14:00 AM
c0d1f1ed,

You're trying to pull my leg now; the code you posted ran on my box in 44 seconds where the test on a single core ran in 4.5 seconds. You had an error in your "for" loop that made it run ten times slower. Your published numbers were meaningless, and this is exactly the problem of gabfest versus writing some code.

Then your refusal to supply the build specs simply did not make sense; it appeared that you did not want anyone to build and test the code.

Now come back to the dogma you were trying to inflict about the high-level advantage, magic helper libraries, "this is too difficult to do in assembler" and other related waffle, and see why it has not been taken seriously. The example, when rewritten in MASM, was less than a tenth of the size and it did what it claimed to be able to do, and the funny part is I don't even have a multi-core processor to try it out on.

All you managed to prove, apart from delivering a working example, is that you don't need high-level code or magic libraries, and that the task was trivial in MASM.

Maybe you should stick to the code and spare us all the dogma; at least it was not much work to rewrite it so it worked properly in MASM. With quads common, 6-core versions in the pipeline and much more powerful stuff in the near future, I will happily use my oldest PIV until the wheels fall off it, as the longer it lasts, the faster and cheaper the multicore stuff will get, and it's not like 1995 Win95 technology in thread manipulation is any big deal to write.

RE: AMD quad latency problems, they are still about a year off delivering competitive performance although they do have some legs in the FP area.

Reverse Hyper-Threading is not one of my expressions; the closest I have come to "Reverse Hyper-Threading" is turning Hyper-Threading OFF in the BIOS of my 3 PIVs, as it interfered with algorithm timing and nothing went faster with it. Multiple pipeline out-of-order instruction scheduling has been with us since the early PIV days.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 11, 2008, 10:19:20 AM
Quote from: hutch-- on June 11, 2008, 08:14:00 AM
You're trying to pull my leg now; the code you posted ran on my box in 44 seconds where the test on a single core ran in 4.5 seconds. You had an error in your "for" loop that made it run ten times slower. Your published numbers were meaningless, and this is exactly the problem of gabfest versus writing some code.

There was no error in the for loop. It simply included the division. The results perfectly proved the 4x speedup and you were free to throw in or leave out any instruction from the loop to get the same result. The total run time was entirely irrelevant and you should get yourself a quad-core to see it divided by four.

Quote
Then your refusal to supply the build specs simply did not make sense; it appeared that you did not want anyone to build and test the code.

That doesn't make any sense. First I provide a trivial piece of code to prove a point, and then I don't want you to build it? I even told you early on to use Visual C++. Don't blame me for your incompetence to build these few lines of code.

Quote
Now come back to the dogma you were trying to inflict about the high-level advantage, magic helper libraries, "this is too difficult to do in assembler" and other related waffle, and see why it has not been taken seriously. The example, when rewritten in MASM, was less than a tenth of the size and it did what it claimed to be able to do, and the funny part is I don't even have a multi-core processor to try it out on.

All you managed to prove, apart from delivering a working example, is that you don't need high-level code or magic libraries, and that the task was trivial in MASM.

The need for abstraction is in no way what I claimed to prove with that example. You do need abstraction though to do more complicated things with different tasks that have varying execution time. The only way to prove me wrong is to code something yourself. I have yet to see something complex that scales convincingly to quad-core, written entirely in assembly.

And before you start trying to tell me that anything is possible in assembly, let me add that it should be coded in a timely fashion. FYI, I have worked on industry quality software that scales up to quad-core, released months ago. Good luck trying to catch up with that.

Quote
With quads common, 6-core versions in the pipeline and much more powerful stuff in the near future, I will happily use my oldest PIV until the wheels fall off it, as the longer it lasts, the faster and cheaper the multicore stuff will get, and it's not like 1995 Win95 technology in thread manipulation is any big deal to write.

Wait as long as you like, but O.S. level thread synchronization isn't going to give you good scaling for anything except the most trivial code like that which I posted. You'll need every trick in the book (The Art of Multiprocessor Programming will do), or use a framework or language that provides the same functionality to get good speedups.

It's pretty ironic that you're trying to teach me things about multi-core programming while you don't even own one yourself.

Quote
RE: AMD quad latency problems, they are still about a year off delivering competitive performance although they do have some legs in the FP area.

Who was talking about quad latency? I was talking about InfiniBand, used by Cray to interconnect nodes. Even if you can point me to any interconnect technology that is an order of magnitude faster, that's nowhere near what would be needed for magical speedups provided by hardware solutions. The technology for maximizing concurrency is entirely in the software. So there's no reason to wait for mythical hardware advancements; they don't exist. You can start multi-core development today. If performance is the reason you code in assembly, don't leave multi-core aside, because you'll easily get beaten by people who do master multi-core development, using high-level tools and languages where necessary.

Quote
Reverse Hyper-Threading is not one of my expressions; the closest I have come to "Reverse Hyper-Threading" is turning Hyper-Threading OFF in the BIOS of my 3 PIVs, as it interfered with algorithm timing and nothing went faster with it. Multiple pipeline out-of-order instruction scheduling has been with us since the early PIV days.

You referred to a multi-core being able to execute instructions from one thread on multiple cores simultaneously. That's called Reverse Hyper-Threading. But no matter what you want to call it, you still haven't given me any proof of its existence or even future feasibility. So either provide it or admit you were dead wrong. Superscalar execution has existed since the first Pentium (for x86 at least), yet we still only have four execution ports. There are two reasons for this: 1) It's technically infeasible to have many more execution ports (let alone double their number every silicon node) due to exponentially growing dependencies. 2) There are hardly ever more independent instructions close by in straight-line code. You have to seek concurrency higher up.

Anyway, I'm going to stop wasting my time in this thread, unless you can actually prove something tangible to me.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 11, 2008, 12:39:36 PM
 :bg

> There was no error in the for loop. It simply included the division.

There was an error in your loop: it contained a division that should not have been there. I dumped your test piece with dumpbin, identified the problem in assembler, then rewrote the proc without the error. Your reported results were not accurate; each test ran 10 times longer than the test piece, and this is normally why you post an example that can be built, so other people can test it.

Posting an example then giving the advice that all you needed to do was download a 938 meg ISO from Microsoft, to install a pile of crap that most people would not want on their machine, says you tried to make it difficult to build. Then you held off providing the build data that should have accompanied your example in the first place, when in fact it built with VCtoolkit2003 and the versions of CL and LINK from vc2005.

The problem as I see it is that you were willing to wade into a whole field of people who have been programming for many years, trying to tell them that their choice to program in a particular language was mistaken and that they would be left behind, or unable to code multicore applications without high level junk, additional magic libraries and the like. All you have done there is prove you were wrong.

If you had posted that code about 6 to 8 pages earlier, sparing us all the dogma and waffle about high level languages and the like, much of the nonsense could have been avoided and a lot of time could have been saved, as many people are in fact interested in this style of programming.

Quote
And before you start trying to tell me that anything is possible in assembly, let me add that it should be coded in a timely fashion. FYI, I have worked on industry quality software that scales up to quad-core, released months ago. Good luck trying to catch up with that.

There are many things in the dustbin of history that I never felt compelled to catch up on, but just to confuse you further, disassemble ntoskrnl.exe and hal.dll and have a look at the MASM code in them. You tend to find it by the LEAVE mnemonic. This code in fact scales perfectly to quad core hardware.

Quote
You can start multi-core development today. If performance is the reason you code in assembly, don't leave multi-core aside because you'll easily get beaten by people who do master multi-core development, using high-level tools and languages where necessary.

Here you in fact mean multithreading; welcome to Windows 95.

Quote
You referred to a multi-core being able to execute instructions from one thread on multiple cores simultaneously. That's called Reverse Hyper-Threading

No, in fact it's called multiprocessing; it is a mistake to assume that the history of computer hardware is contained in x86 desktops. See IBM, SGI and the other large computer manufacturers. Is a 1024-processor Itanium SGI superbox capable only of 1024 concurrent threads, or can it deliver the computing power in its spec sheets?

The PC market has shifted to multicore at the moment due to clock speed limitations, not any desperate need for parallelism, but the market will keep demanding improvements, which will require higher performance per thread than is current. That will either produce faster cores or faster synchronisation of multiple cores, and probably both over time.
Title: Re: Multicore theory proposal
Post by: Bill Cravener on June 11, 2008, 07:41:57 PM
Quote from: c0d1f1ed

It's pretty ironic that you're trying to teach me things about multi-core programming while you don't even own one yourself.

That reminds me, so there I was pigging out on a big roast beef sandwich with all the fixens and grabbing handfuls of Snyder's original Bar-B-Q chips between bites while enjoyably reading this message thread when I suddenly felt a sharp pain in my chest. No worry, it was just gas, freaking had me scared though.

Got me to thinking, imagine my family finding me here dead at my computer seat with half a big roast beef sandwich and a Snyder's original Bar-B-Q chips bag almost empty. I mean after all I'm 57 years old and I like to eat and drink.

Anyway they, the family, then look over at my PC computer screen and there's these folks talking about duo core processor thingies and single-multi-threaded hicky-ma-bobs doing things in itty-bitty seconds.

I don't know, just seemed funny!  :lol
Title: Re: Multicore theory proposal
Post by: GregL on June 11, 2008, 08:46:38 PM
Cycle Saddles,

:lol  It made me laugh out loud. :lol 

Title: Re: Multicore theory proposal
Post by: Bill Cravener on June 11, 2008, 09:16:53 PM
Greg,

Laughing's a good thing, buddy. Hutch is my friend and in my book he's always right. Just thought this was a good spot for a laugh. I have a twisted way of seeing things. :bg
Title: Re: Multicore theory proposal
Post by: hutch-- on June 12, 2008, 12:51:35 AM
Bill,

Apart from all of the extremely serious considerations in this thread, how do you organise the important things of life, like a big roast beef sandwich with all the fixens and grabbing handfuls of Snyder's original Bar-B-Q chips? I suffer the historical programmer's problem of forgetting to eat and wondering why you start to feel seedy after a couple of days.

Since I am the world's lousiest cook, as well as having a few dietary limitations imposed by old age and bad habits, the current indulgence is to boil a dozen eggs at a time until they are like bullets, put them in the refrigerator and, next morning, shell 3 of them, dice them with an egg slicer, and add a generous sprinkle of salt and a light dusting of a very mild curry powder.

It tastes like the curried egg sandwiches that old ladies used to make for church fund raisers back in the 50s minus the sliced bread.

Many of the things addressed in this thread have a history that is yet to be written. Quad cores are becoming common, there are 6 core versions in the works, and a number of interesting techniques allow an ever increasing number of transistors while reducing substrate leakage: narrower tracks, a new wafer doping technique and some metal tracks for lower resistance.

AMD are currently playing catch-up to Intel on the 4 cores but apparently have some good designs in the pipeline and plenty of headroom to wind it up higher, so if my old PIV lasts another year or so there is a good chance that there will be dual quad cores on chip with the price coming down at the same time.
Title: Re: Multicore theory proposal
Post by: Bill Cravener on June 12, 2008, 09:08:54 AM
QuoteI suffer the historical programmers problem of forgetting to eat and wondering why you start to feel seedy after a couple of days.

Steve, I know what you mean; for many years I suffered the same problem. I was often referred to as the bean-pole, stick-man or pencil-necked geek by my peers. Back then food wasn't important, and at 6'2" and 165 pounds I was pretty skinny. Now that I've slowed down I love to eat and drink, and I'm not particular as to what, as long as it's eatable and the drink has alcohol in it.  :lol

You stay healthy and keep posting interesting topics. I like to read while feeding my face.  :bg
Title: Re: Multicore theory proposal
Post by: NightWare on June 12, 2008, 10:12:50 PM
I've made some modifications to allow me to compile it with the old MASM32 v8 I use... I've added thread priority to obtain better results, I've changed the algo to a fillmem one, and I've made 4 different threads (because you can't pass parameters as you should with the synchronising technique used)... when I test it, I have:
===========================================
Run a single thread on fixed test procedure
===========================================
1280 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
1295 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
2590 MS Four thread timing



Press ENTER to quit...
(here, it looks similar to before...),


but after that, I've given the corresponding size/start address to the threads (uncomment lines 307, 314, 359, 370, 372, 374), for the same amount of work in all cases... and the results:
===========================================
Run a single thread on fixed test procedure
===========================================
1279 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
1170 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
312 MS Four thread timing



Press ENTER to quit...

there is a problem somewhere... (all of them should give similar results...) now, as I've said somewhere else, I'm not very familiar with the Win32 API, so maybe I've made a mistake somewhere...

[attachment deleted by admin]
Title: Re: Multicore theory proposal
Post by: zooba on June 12, 2008, 11:45:59 PM
Your code looks fine as far as I can tell. Are you running on a quad-core processor? The results of the first test look like you have a dual-core, but the results of the second look like a quad-core. I get almost identical results to yours and I know I'm only using a dual-core.

Also, you've almost got the parameter passing right. Try including the events in each thread's data structure and passing a pointer to the entire structure in lParam. What would be a better way of passing parameters than a pointer?
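
Something like this minimal sketch (MASM syntax; THREAD_DATA, FillProc, hDone and friends are placeholder names, and the work itself is just a rep stosb fill):

THREAD_DATA STRUCT
  hDone   DWORD ?                           ; event this thread sets when done
  pBlock  DWORD ?                           ; start address of this thread's share
  dwSize  DWORD ?                           ; bytes to fill
THREAD_DATA ENDS

FillProc proc uses edi lpParm:DWORD
    mov edx, lpParm                         ; edx -> this thread's THREAD_DATA
    mov edi, [edx].THREAD_DATA.pBlock
    mov ecx, [edx].THREAD_DATA.dwSize
    mov al, 0FFh
    rep stosb                               ; fill only this thread's share
    invoke SetEvent, [edx].THREAD_DATA.hDone
    xor eax, eax
    ret
FillProc endp

Each CreateThread call then passes the ADDR of that thread's own structure as the lParam, with pBlock/dwSize carved out of the full buffer, and the main thread waits on all the hDone events with WaitForMultipleObjects.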

Cheers,

Zooba :U
Title: Re: Multicore theory proposal
Post by: NightWare on June 13, 2008, 12:40:33 AM
hi,
Like you, I have a Core2, so the 2nd results are... weird... it looks like only one thread is used...
Title: Re: Multicore theory proposal
Post by: zooba on June 13, 2008, 03:55:10 AM
I just had another look and noticed that you're passing the thread ID around (SetThreadPriority, etc.) when it should be the thread handle (the return value of CreateThread). Not that this would be causing such a huge difference in timings.
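
For reference, the distinction looks like this (hThread and dwTID being assumed variables, FillProc/tdata as in the sketch above):

; CreateThread returns the HANDLE in eax; the thread ID is only written
; through the last (out) parameter. SetThreadPriority wants the handle.
invoke CreateThread, NULL, 0, ADDR FillProc, ADDR tdata, 0, ADDR dwTID
mov hThread, eax
invoke SetThreadPriority, hThread, THREAD_PRIORITY_ABOVE_NORMAL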

I changed the fill data and set a breakpoint on the return from WaitForMultipleObjects and the memory was filled correctly. Something strange seems to be going on here...

I have attached a slightly modified version of NightWare's original code, fixing the handle problem I mentioned above and enabling the work-division. (There is also an executable for people who can't be bothered building it :bg ) Getting some more validation of these numbers would be great.

Cheers,

Zooba :U

My results:
===========================================
Run a single thread on fixed test procedure
===========================================
1295 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
1139 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
328 MS Four thread timing

[attachment deleted by admin]
Title: Re: Multicore theory proposal
Post by: sinsi on June 13, 2008, 04:38:02 AM
Q6600, 2.4GHz, XPSP3

===========================================
Run a single thread on fixed test procedure
===========================================
1265 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
203 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
110 MS Four thread timing

Title: Re: Multicore theory proposal
Post by: lingo on June 13, 2008, 10:19:29 PM
Core2 E8500, 4 GHz, Vista64-SP1:
===========================================
Run a single thread on fixed test procedure
===========================================
405 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
156 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
140 MS Four thread timing



Press ENTER to quit...
Title: Re: Multicore theory proposal
Post by: NightWare on June 13, 2008, 10:43:35 PM
Here's my opinion:
the OS (because of protected mode) loops every time slice over all the programs (and allocates more or less time to each task depending on its priority). When you add threads, you just add more tasks (the threads) to the loop, so you just double or quadruple the amount of time allocated to your app (and reduce the time for all the other tasks, including the OS...), nothing more... it seems quite logical... we've probably been misdirected by pseudo speed-test results, SetThreadAffinityMask, etc...  :lol
Title: Re: Multicore theory proposal
Post by: zooba on June 13, 2008, 10:56:36 PM
You're probably right. I can't think of any other explanation. Windows is designed to execute everything fairly, rather than giving all power to whichever thread/process asks for it (though it can be coerced).

Clearly though, dividing this sort of work up amongst multiple threads is where multi-core speed gains come from.

Cheers,

Zooba :U
Title: Re: Multicore theory proposal
Post by: johnsa on June 15, 2008, 06:42:44 PM
http://www.informationweek.com/shared/printableArticle.jhtml?articleID=208403616

An interesting article about Intel Ct (extensions to the C++ language) to facilitate multi-core development.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 16, 2008, 12:17:10 AM
Thanks for the link John, it looks like in the near future this type of code will ease the entry of higher level languages into multicore programming. One of the commercial offerings I have seen recently (whose name escapes me) got under the threaded model and does this type of synchronisation at a lower level, which apparently improves the performance somewhat.

I suspect that later OS versions may have this type of option built in, but there also appear to be some fast moving changes in the hardware area going on as well, so some of this type of technology may not last all that long. In current multicore development at a hardware level, multiple identical cores are what is happening at the moment, but I have heard talk of different core types as clusters that have diminished capacity but faster, better integrated performance on serial code.

The holy grail will be improvements in both serial and parallel performance, which will probably take both major and minor cores to deliver. All parallel programming involves running serial code in parallel, so an improvement in both will see far larger performance gains than either individually.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 16, 2008, 11:37:42 AM
Quote from: hutch-- on June 16, 2008, 12:17:10 AM
Thanks for the link John, it looks like in the near future this type of code will ease the entry of higher level languages into multicore programming. One of the commercial offerings I have seen recently (whose name escapes me) got under the threaded model and does this type of synchronisation at a lower level, which apparently improves the performance somewhat.

Ct already does task synchronization at a lower level. Read the section Tasks in Ct (http://techresearch.intel.com/articles/Tera-Scale/1514.htm). RapidMind works similarly.

Quote...but there also appear to be some fast moving changes in the hardware area going on as well so some of this type of technology may not last all that long.

Please name those changes.

As the number of cores increases we'll become more dependent on this type of technology, not less.

QuoteIn current multicore development at a hardware level, multiple identical cores are what is happening at the moment, but I have heard talk of different core types as clusters that have diminished capacity but faster, better integrated performance on serial code. The holy grail will be improvements in both serial and parallel performance, which will probably take both major and minor cores to deliver. All parallel programming involves running serial code in parallel, so an improvement in both will see far larger performance gains than either individually.

Heterogeneous architectures actually have lower per-core instruction throughput. Look at Cell and Larrabee for instance. Their cores can't do out-of-order execution, don't do register renaming, there's no speculative execution with branch prediction, etc. But because they got rid of this 'hardware bloat' which only offers minor IPC improvements they can spend the extra transistors on more of these simple cores. Just look at the Cell die (http://www.research.ibm.com/cell/cell_chip.html). Instead of having room for only about three PowerPC cores, it has one complex PowerPC core and eight simple SPE cores. Despite somewhat slower sequential code execution, four SPEs can still deliver much higher combined throughput than one complex core. The complexity shifts to the software though, as you have to maximize thread concurrency.

So don't expect sequential code performance to increase significantly any time soon.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 16, 2008, 12:50:35 PM
 :bg

> So don't expect sequential code performance to increase significantly any time soon.

Who was it who said you will never need more memory than 64k?

> Ct already does task synchronization at a lower level. Read the section Tasks in Ct. RapidMind works similarly.

How much faster does it make a serial task like chained encryption?

0.000000000000000%

The problem is when the task has no inherent parallelism to distribute across multiple cores, and there is a massive number of tasks like this.
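
The chaining looks like this (a rough sketch in MASM syntax; EncryptBlock(pSrc, pDst), XorBlock(pDst, pSrc), CopyBlock(pDst, pSrc) and prevC are placeholder names, with prevC holding the IV before the first block). Block i cannot even start until block i-1 is finished, so a second core has nothing to do:

; CBC: C(i) = E(P(i) xor C(i-1)) - a strict dependency chain
CbcEncrypt proc uses esi pBuf:DWORD, dwBlocks:DWORD
    mov esi, pBuf
next_block:
    invoke XorBlock, esi, ADDR prevC       ; P(i) ^= C(i-1)
    invoke EncryptBlock, esi, esi          ; C(i) = E(...), in place
    invoke CopyBlock, ADDR prevC, esi      ; C(i) chains into block i+1
    add esi, 16                            ; next 16 byte block
    dec dwBlocks
    jnz next_block
    ret
CbcEncrypt endp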

Even in the parallel model you have in mind with current technology, it still runs serial code in parallel; the need for fast serial code will always exist, and sooner or later the hardware will address it again. The current hiatus is due to clock speed limitations imposed by heat.

In a world where you already have multiple pipelines with out of order execution to improve throughput on a single core, it would indeed be a brave man who predicts the end of linear (serial) improvements.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 16, 2008, 02:44:12 PM
Quote from: hutch-- on June 16, 2008, 12:50:35 PM
Who was it who said you will never need more memory than 64k?

Nobody did. It's a popular misinterpretation (http://www.nybooks.com/articles/15180). And you're misinterpreting me as well. I didn't say you don't need higher sequential performance; I said don't expect it any time soon. Nor did I say overall performance won't increase rapidly, as you're going to get it in the form of extra cores.

QuoteHow much faster does it make a serial task like chained encryption?

0.000000000000000%

The problem is when the task has no inherent parallelism to distribute across multiple cores, and there is a massive number of tasks like this.

This is why we have other encryption modes than CBC (http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation). There might be a massive number of algorithms with no inherent parallelism, but there's an even more massive number of algorithms suited to parallelisation. And the number of parallel algorithms is still growing, as more developers get the opportunity to develop on multi-core systems. You also have to stop thinking in terms of performing just one kind of task on one block of data. Instead of encrypting one file, encrypt multiple files divided into multiple sections. The opportunity for parallelism is so substantial that it can even be done on the GPU (http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/proceedings/&toc=comp/proceedings/mue/2008/3134/00/3134toc.xml&DOI=10.1109/MUE.2008.94). And instead of having your application do just encryption, allow other things to run concurrently as well. For instance, an image editing application can compress and encrypt a number of images in the background while allowing the user to keep working on his image in the foreground, while generating previews and updating animated GUI elements... You have to start thinking outside the box.

What you also seem to keep forgetting is that there is no other solution than to start using new, parallel algorithms. There is no way to make a sequential algorithm run faster other than increasing the clock speed, which is hitting physical limits whether you like it or not. Even if they suddenly have a breakthrough in technology and can bump it up tenfold, you'll still have multiple cores. The only way to take advantage of them is with parallel algorithms, so stop talking about sequential algorithms; they belong in the "dustbin of history".

QuoteIn a world where you already have multiple pipelines with out of order execution to improve throughput on a single core, it would indeed be a brave man who predicts the end of linear (serial) improvements.

There is nothing left to do for single-threaded code at the architectural level. As Moore's Law allowed more and more transistors, they added pipelining, they added branch prediction, they added superscalar execution, they added out-of-order execution, etc. Now they've simply gotten to the point where you can't churn through instructions from a single thread any faster. If we're missing any technology that is yet to be implemented, specify it. So far you haven't named any of your magical technology.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 16, 2008, 04:02:52 PM
The 64k quote was from about 10 years earlier, when an IBM PC had an amazing 64k and earlier PCs had 2, 4 or even 16k. I doubt it gets said much any more.

Algos like encryption, compression, searching and data structures (trees, hash tables) are fundamentally sequential (serial) in nature. There may be alternatives, but they tend to be inferior designs in terms of encryption and compression, as both designs are serially chained algos. Easy parallelism belongs to gaming and multimedia code, and some engineering calculations, as long as they are not sequentially dependent. The dustbin of history is full of many things but serial algorithms are not one of them. Parallel processing is still done with serial processing on each core.

> What you also seem to keep forgetting is that there is no other solution than to start using new, parallel algorithms.

Who cares if they cannot do the same task.

Quote
There is no way to make a sequential algorithm run faster other than increasing the clock speed, which is hitting physical limits whether you like it or not.

This is claptrap; instruction throughput is the action, not clock speed, which was just one way to get more instructions through. Physical limits change, as in fact they have over time; it's a brave man who predicts no speed increase in instruction throughput.

Quote
There is nothing left to do for single-threaded code at the architectural level.  As Moore's Law allowed more and more transistors, they added pipelining, they added branch prediction, they added superscalar execution, they added out-of-order execution, etc. Now they've simply gotten to the point where you can't churn through instructions from a single thread any faster. If we're missing any technology that is yet to be implemented, specify. So far you haven't named any of your magical technology.

This IS magical technology, in that it runs single-thread instructions in parallel through multiple pipelines, and YES, it did get faster because of it.

> If we're missing any technology that is yet to be implemented, specify.

Predicting the future is best done with a crystal ball; most have to be satisfied with continuity, and that has been faster machines over time. Multicore technology is still in its infancy in the PC market; it is useful but it's not universal in its application. Most multicore technology on current PCs is Win95 multithreading technology applied to multicore hardware.

Fine where it's faster but lousy where it's not.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 16, 2008, 07:23:22 PM
Quote from: hutch-- on June 16, 2008, 04:02:52 PM
Algos like encryption, compression, searching and data structures (trees, hash tables) are fundamentally sequential (serial) in nature.

Encryption: CTR (http://www.cs.ucdavis.edu/~rogaway/papers/ctr.pdf) mode allows parallelization without weakening security (see the sketch after this list)
Compression: Huffman coding (http://ce.et.tudelft.nl/publicationfiles/1296_754_PDP4485.pdf) benefits from multiple processors and is optimal
Searching: the most embarrassingly parallel problem of all; ask Google
Data structures: The Art of Multiprocessor Programming has chapters on lock-free trees, hash tables and others
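
To make the CTR point concrete, contrast the chained loop above with this minimal sketch (MASM syntax; EncryptBlock, XorBlock, CopyBlock and nonce are the same placeholder names for a block cipher and its 16-byte helpers). Each block's keystream depends only on the counter value, so any counter range can go to any core:

; CTR: C(i) = P(i) xor E(nonce||i) - no block depends on another
CtrEncrypt proc uses edi ebx pBuf:DWORD, dwFirst:DWORD, dwCount:DWORD
    LOCAL ctrBlock[16]:BYTE                ; this thread's nonce||counter
    LOCAL ksBlock[16]:BYTE                 ; this thread's keystream
    invoke CopyBlock, ADDR ctrBlock, ADDR nonce
    mov edi, pBuf
    mov ebx, dwFirst
next_block:
    mov dword ptr ctrBlock[12], ebx        ; drop the counter into the block
    invoke EncryptBlock, ADDR ctrBlock, ADDR ksBlock ; keystream = E(nonce||i)
    invoke XorBlock, edi, ADDR ksBlock     ; C(i) = P(i) xor keystream
    add edi, 16
    inc ebx                                ; independent counter - no chaining
    dec dwCount
    jnz next_block
    ret
CtrEncrypt endp

Give four threads a quarter of the counter range each and you get exactly the same ciphertext as a single-threaded run.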

QuoteThere may be alternatives, but they tend to be inferior designs in terms of encryption and compression, as both designs are serially chained algos.

Bollocks. That being said, there are lots of interesting alternative algorithms that have better scaling behavior while sacrificing only a marginal bit of effectiveness.

QuoteWho cares if they cannot do the same task.

Who cares about sequential algorithms if they can't deliver.

QuotePhysical limits change, as in fact they have over time; it's a brave man who predicts no speed increase in instruction throughput.

Good luck changing the speed of light.

The writing is on the wall. AnandTech's review of Nehalem shows that it is almost no faster at running single-threaded code. Any increase can be attributed entirely to the integrated memory controller, which is nothing more than an incremental improvement that won't bring back the steady single-threaded performance increases of the MHz-race days.

QuoteThis IS magical technology, in that it runs single-thread instructions in parallel through multiple pipelines, and YES, it did get faster because of it.

Sure, 15 years ago. Nowadays superscalar execution is exploited to the practical maximum, and there's no new trick up the chip designers' sleeves to further increase single-threaded instruction throughput in any substantial way.

Read Chip Multi-Processor Scalability for Single-Threaded Applications (http://liberty.princeton.edu/Publications/dascmp05_scalability.pdf) and note how the number of instructions that can execute in parallel is not much higher than 4 for realistic designs. Guess what: current CPUs already have 4 execution ports. Also note the rapidly diminishing returns for throwing more resources at it. Oh, and don't forget the conclusion, which clearly states that even the most aggressive approach to increasing single-threaded performance would run out of steam in a mere 6-8 years. Looking at Intel's and other chipmakers' roadmaps, it's pretty clear they'd rather spend their transistors on multiple cores.

QuotePredicting the future is best done with a crystal ball; most have to be satisfied with continuity, and that has been faster machines over time. Multicore technology is still in its infancy in the PC market; it is useful but it's not universal in its application.

I don't need a crystal ball to see that all roadmaps are going towards massive numbers of cores. It's also clear that the continuity of the MHz-race got disrupted when Tejas (http://en.wikipedia.org/wiki/Tejas_and_Jayhawk) got ditched in favor of multi-cores based on P6. We have a new continuity now: performance per watt. And multi-core right now gives us the biggest increase in performance for every extra transistor burning power.

QuoteMost multicore technology on current PCs is Win95 multithreading technology applied to multicore hardware. Fine where it's faster but lousy where it's not.

That's what this thread is supposed to be about. Yes, the classic approach of having one task per thread has few good applications. But by scheduling tasks within a thread, using lock/wait-free approaches, practically all software can scale to many cores.
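
In concrete terms, "scheduling tasks within a thread" is a persistent worker loop per core, something like this minimal sketch (WORK_ITEM, PopWork and hWorkEvent are assumed names; PopWork stands in for any lock-free dequeue):

WORK_ITEM STRUCT
  pFunc DWORD ?                  ; task entry point (stdcall, one argument)
  pArg  DWORD ?                  ; the task's argument
WORK_ITEM ENDS

WorkerProc proc lpParm:DWORD
sleep_loop:
    invoke WaitForSingleObject, hWorkEvent, INFINITE
drain:
    invoke PopWork               ; lock-free pop: eax -> WORK_ITEM, or 0 if empty
    test eax, eax
    jz sleep_loop                ; queue drained - back to sleep
    push [eax].WORK_ITEM.pArg
    call [eax].WORK_ITEM.pFunc   ; run the task on this core
    jmp drain
WorkerProc endp

The threads are created once and reused, so the O.S. scheduler and its synchronization primitives stay off the hot path.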
Title: Re: Multicore theory proposal
Post by: hutch-- on June 17, 2008, 01:42:26 AM
 :bg

A sad tale but true, http://www.intel.com/products/processor/itanium/

In particular, view the link on this page, "Dual-Core Intel® Itanium® processor demo", to answer your questions about the EPIC architecture and its advantages over RISC and similar older technology.

Quote
Itanium-based servers are incredibly scalable, allowing configuration in systems of as many as 512 processors and a full petabyte (1024TB) of RAM. Together with full support for both 32-bit and 64-bit applications, that capacity provides unmatched flexibility in tailoring systems to your enterprise needs.

Quote
Dual-core processing, EPIC architecture, and Hyper-Threading Technology†      Supports massive, multi-level parallelism for today's data-intensive workloads

Quote
Support for up to 512 processors and one petabyte (1024TB) RAM      Provides scalable performance for enterprise flexibility

Quote
Up to 24MB of low-latency L3 cache      Prevents idle processing cycles with a high-bandwidth data supply to the execution cores

Quote
Core Level Lock-step     Enables one processor core to mirror the operations of the other

more ...............

You have continued to make assumptions based on current x86 architecture without realising that it is an ancient architecture. The future of genuine high performance computing is not a threaded model on low performance multiple cores; it is BOTH synchronous and asynchronous applications running on very high performance hardware.

I have said to you before, don't be without ambition; think in terms of 512 dual core Itaniums as current technology that can do things you would not believe.  :P
Title: Re: Multicore theory proposal
Post by: johnsa on June 17, 2008, 11:46:53 AM
On a slightly off-topic point... why oh why haven't they updated the x86 FPU to a non-stack-based model (keeping the stack one there for compatibility if needed)? FPU performance could be increased in the margin of about 15-20%, I would reckon, just by changing the instruction set / opcodes. Every stack-FPU-based piece I've ever written would be about 20% shorter (instruction count) when implemented on a normal register-based FPU (a la 68k).

Just a thought.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 17, 2008, 03:50:59 PM
hutch, I suggest you first read the manual (http://download.intel.com/design/Itanium2/manuals/25111003.pdf) for the actual facts instead of throwing Itanium marketing talk at me as an argument.

QuoteIn particular, view the link on this page, "Dual-Core Intel® Itanium® processor demo", to answer your questions about the EPIC architecture and its advantages over RISC and similar older technology.

The advantage of EPIC is that instruction dependencies are Explicit. The disadvantage of EPIC is that instruction dependencies are Explicit.

That's right. The task of instruction scheduling is entirely the compiler's responsibility (or the assembly programmer's, if you're that brave). It's an in-order architecture, so cache misses and dependencies mean stalls. So much for hardware helping you reach higher performance. The six instructions per cycle is a maximum, just like Core 2's five instructions per cycle is a maximum. And the reality is that this maximum is hardly ever reached, for the simple reason that cache misses and instruction dependencies are unavoidable. There's not enough intrinsic parallelism in a thread to sustain the maximum throughput. The only thing Itanium is rather good at is multimedia and scientific computing, but ironically that's perfectly suited to multi-core as well.

It's also interesting that with Montecito, Intel added switch-on-event multithreading. This allows it to execute another thread when a cache miss occurs. In other words, they resorted to multi-threading to increase the effective throughput. The fact that it's also a dual-core should tell you just how much of a dead end trying to increase single-threaded performance would be.

QuoteYou have continued to make assumptions based on current x86 architecture without realising that it is an ancient architecture.

It hardly matters. An add, mul or div on x86 is just as good as on any other ISA. The actual architecture inside has changed tremendously from one generation to the next: what started as an in-order CISC processor became an out-of-order multi-issue RISC processor. The instruction set is nothing more than a facade, an interface used by the software. The reason x86 is still alive and kicking is that its flaws have started to matter less and less. They matter so little that Larrabee, primarily a GPU, will be x86 based.

Going multi-core is not another technology to hide any of x86's flaws. All processors are going multi-core, including IA64, ARM (http://www.theregister.co.uk/2004/05/18/arm_multicore/), SPARC (http://www.sun.com/processors/throughput/faqs.html), PowerPC (http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars), and many more (http://en.wikipedia.org/wiki/Multi-core_(computing)#Hardware).

QuoteI have said to you before, don't be without ambition; think in terms of 512 dual core Itaniums as current technology that can do things you would not believe.

Rest assured, I'm very ambitious, but I'm also certain that just having this kind of hardware won't automagically result in faster software that makes good use of it. It requires considerable effort in software design to scale it to such a high number of cores.
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 17, 2008, 03:52:36 PM
Quote from: johnsa on June 17, 2008, 11:46:53 AM
On a slightly off-topic point... why oh why haven't they updated the x86 FPU to a non-stack-based model (keeping the stack one there for compatibility if needed)? FPU performance could be increased in the margin of about 15-20%, I would reckon, just by changing the instruction set / opcodes. Every stack-FPU-based piece I've ever written would be about 20% shorter (instruction count) when implemented on a normal register-based FPU (a la 68k).

What's wrong with SSE(2)?
Title: Re: Multicore theory proposal
Post by: johnsa on June 17, 2008, 04:59:44 PM
Nothing wrong with SSE, imho... I just think they should update the standard FPU instruction set...

fmov f0,dword ptr [esi]   ; load straight into any register
fsqr f0                   ; square in place
fmov f1,f0                ; register-to-register copy
fsqrt f1                  ; square root
fmov f2,1.0               ; load an immediate
fneg f2                   ; f2 = -1.0

that sort of thing... it produces code which is on average about 20% fewer instructions than the whole st(n) stack model.
Title: Re: Multicore theory proposal
Post by: hutch-- on June 17, 2008, 07:56:18 PM
This much I have learnt about Intel over time: they have an irritating habit of knowing what they are talking about with their own hardware lines, and as the world's leading chipmaker they have the proof of what they say. While the Itanium has a terrible instruction set, they are not some pie-in-the-sky future model; they are in production and are the base component for massive supercomputers from companies like SGI and others, with customers like NASA, the JPL and many research universities.

In rushing to avoid the data on production hardware already doing the job, you have missed some of its important capacities: massive extendability, with added cores up to 512 dual and the coming quad core versions. Then there is the existing "Core Level Lock-step" capacity, hardware synchronisation of multiple cores for mirroring. So much for the x86-based notions of the limitations of processor locking.

Then there is the notion that an Itanium is at some disadvantage in terms of stalls, yet you have to go back to a 386 or earlier to avoid stalls on x86 hardware. You may have the ambition but you are still wearing blinkers in terms of hardware that is coming and current hardware limits. The future holds BOTH synchronous and asynchronous parallel processing, not multithreaded-model asynchronous processing alone.

The need for both is obvious: even if you can achieve asynchronous parallelism speed improvements, there is a limit to the number of useful splits in the task, so if you have 64 cores but can only use 8 of them, your task is limited by the core count it can use, not the hardware. Synchronous parallel processing is the only way around this limit, so you have synchronous parallel processing being run in each thread of an asynchronous application.
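
(That limit is just Amdahl's law: if a fraction p of a task can be split across n cores, the overall speedup is 1 / ((1 - p) + p/n), so with p = 7/8 even an infinite number of cores caps you at 8x, which is exactly the "useful splits" ceiling described above.)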
Title: Re: Multicore theory proposal
Post by: c0d1f1ed on June 17, 2008, 09:44:13 PM
Quote from: johnsa on June 17, 2008, 04:59:44 PM
Nothing wrong with SSE, imho... I just think they should update the standard FPU instruction set...

fmov f0,dword ptr [esi]
fsqr f0
fmov f1,f0
fsqrt f1
fmov f2,1.0
fneg f2

that sort of thing... it produces code which is on average about 20% fewer instructions than the whole st(n) stack model.

No seriously, what's wrong with SSE? :wink

movss xmm0, dword ptr [esi] ; load the float
mulss xmm0, xmm0            ; square it
movss xmm1, xmm0            ; register-to-register copy
sqrtss xmm1, xmm1           ; square root
movss xmm2, one             ; 'one' being REAL4 1.0 in .data
xorps xmm2, sign            ; 'sign' an aligned 80000000h mask: flips the sign bit
Title: Re: Multicore theory proposal
Post by: johnsa on June 17, 2008, 10:33:04 PM
Point :)