
Multicore theory proposal

Started by johnsa, May 23, 2008, 09:24:41 AM


c0d1f1ed

Quote from: hutch-- on May 28, 2008, 04:16:53 AM
SUBSTRATES: A long time ago the base for transistors was germanium, lower resistance but higher leakage. Silicon has higher resistance but much lower leakage, its limiting factor on current hardware is heat from the resistance. There is no problem with winding up the clock speed on silicon except for the heat and this is why current processors are stuck at just under 4 gig.

Intel has used SiGe since its 90 nm process to create strained silicon with higher electron mobility. Also, high-k gate dielectrics have been introduced in the 45 nm process to further reduce leakage, and 32 nm will feature metal gates to reduce resistance. The latter process will be used by Westmere and Sandy Bridge, with up to eight cores running at 4 GHz.

So even with significant engineering triumphs in process technology there is no indication of any return to the MHz-race. Instead, with every shrink we get twice the number of transistors, and these are primarily used to increase the number of cores.

It's hard to still speak of silicon technology if you have SiGe substrates, high-k gate dielectrics instead of silicon dioxide, and metal gates instead of polysilicon. Whatever other advancement you're imagining, it's not going to increase single-threaded performance much. Any future chip will be multi-core though.

Quote
The military and specialised instruments have had ICs based on other substrates for many years, something you use for parts of guidance systems for missiles and the like, but the factor so far against using substrates with both lower resistance and lower leakage is a factor of cost, not performance.

Please specify.

Either way, hoping for any hardware technology to bring back the free lunch is a pipe dream. You can talk about faster substrates or whatever as much as you like; the reality is that if it can't be manufactured for a reasonable price it won't end up in your and everyone else's system. Multi-core silicon-based CPUs are mainstream, today, and will continue to be for the foreseeable future. You have to start taking advantage of multi-core if you want your software to get any faster, or wait forever on a mythical fast single-core processor.

Besides, cost is not that much of a limitation. Nine layers of copper interconnects, strained silicon, hafnium insulators, metal gates, double patterning, immersion lithography... expensive technology hasn't prevented Intel from using it in their CPU lines. Investments of multiple billions are not an insurmountable obstacle, and are compensated by sheer production volume. The fact that they haven't used better substrates should be a sign that they're just not that spectacular and there are more effective ways of increasing performance.

Quote
Sooner or later silicon will run out of puff, it already has in terms of heat and while smaller tracks help reduce this problem in conjunction with lower voltage, this too has its limits which are not far off.

The current semiconductor roadmap (ITRS) already includes an 11 nm process. With Intel's plans of transitioning to a new process every two years, this means they'll continue to shrink lithographic processes till at least 2016 (the ITRS is less aggressive and projects 11 nm for 2022). With 11 nm, 64 cores could fit on a chip (and then there's also Hyper-Threading). Heat problems can be controlled as long as the clock frequency is not increased aggressively.

So to prepare for the next 8 years you'd better have a serious look at multi-core programming if you don't want to get hopelessly behind.

After 11 nm they'll have to transition to nanoelectronics. However, this still doesn't mean the end of multi-core processing. Hypothetically, even with materials that can run at ten times higher clock frequencies while keeping heat in check, you'd still have a transistor budget of over ten billion (more than 100 times that of a Pentium 4). No matter how cleverly you invest these in a single core, you'll never get the same throughput as a multi-core processor. Code only has a limited number of nearby instructions that can execute in parallel. In practice an IPC of five is about the highest you can go. To do more work you need multiple points of execution: concurrent threads.

Software development has changed forever.

c0d1f1ed

Quote from: hutch-- on May 28, 2008, 03:18:21 PM
Somewhere along the line I remember that you can use the normal db 90h (nop) or a number of them instead of pause.

Great, you do read my posts.

c0d1f1ed

Quote from: johnsa on May 28, 2008, 04:27:56 PM
Tried using nops... while MUCH faster than a pause instruction it does add a fair amount of overhead to the spin loop...
Question is.. does it really matter.. tight loop with no nop/pause or nop in there.. either way the result will be the same and it will still take as much time as is necessary for the lock to become available..assuming this thread is running on a different core from the one which owns the lock brings another question up.. do we care that that specific core jams up running the loop at full-tilt while waiting for the lock?

Yes, if you don't put a little delay between locking attempts you'll create an avalanche of synchronization traffic between the cores, and neither thread can successfully acquire the lock for a long time. This only becomes a big problem with more than two cores. If one thread holds the lock, more than one other thread can start queuing up to try to acquire the lock. Once the first thread releases the lock, all the other threads start fighting for it at (almost) the same time.

Locks with exponential backoff have quite good scaling behavior (it's similar to the protocol used by Ethernet), but queue-based locks are even better for processors with a MESI cache coherency protocol. I'd love to see x86 implementations of such locks.
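
A rough sketch of what a test-and-test-and-set lock with exponential backoff could look like in 32-bit MASM is below; the lock variable, labels and the backoff ceiling are purely illustrative and the delay constants would need tuning on real hardware (EAX, ECX and EDX are clobbered):


    .data
      align 4
      the_lock dd 0                   ; 0 = free, 1 = owned

    .code

spin_acquire proc
    mov edx, 1                        ; initial backoff count

  try_again:
    cmp the_lock, 0                   ; read-only test first, so waiters spin
    jne back_off                      ; in their own cache, not on the bus
    mov eax, 1
    xchg the_lock, eax                ; xchg with memory is implicitly LOCKed
    test eax, eax
    jnz back_off                      ; someone beat us to it
    ret                               ; lock acquired

  back_off:
    mov ecx, edx
  delay:
    db 0F3h,90h                       ; pause
    sub ecx, 1
    jnz delay
    add edx, edx                      ; double the delay for the next attempt
    cmp edx, 1024                     ; arbitrary ceiling on the backoff
    jbe try_again
    mov edx, 1024
    jmp try_again

spin_acquire endp

spin_release proc
    mov the_lock, 0                   ; a plain store releases the lock on x86
    ret
spin_release endp


The read-only test before the xchg keeps the waiting cores spinning in their own cache line instead of generating bus traffic, and the doubling delay spreads out the retry attempts, which is the whole point of the exercise.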

hutch--

 :bg

Not necessarily, but I do read the Intel manuals when I want technical data on Intel hardware.

Quote
This instruction was introduced in the Pentium 4 processors, but is backward compatible with all IA-32 processors. In earlier IA-32 processors, the PAUSE instruction operates like a NOP instruction. The Pentium 4 and Intel Xeon processors implement the PAUSE instruction as a pre-defined delay. The delay is finite and can be zero for some processors. This instruction does not change the architectural state of the processor (that is, it performs essentially a delaying noop operation).

> Software development has changed forever.

Software development HAS been changing forever but not all of it has lasted. Seen a 10 year old RISC box recently? What about a modern DDE application, how much OLE have you seen lately?

The notion that multicore processing is suddenly something new is mistaken. Try the 512 parallel Itaniums I mentioned earlier in recent SGI boxes, but even on x86 I remember seeing multiple-processor boards for the early Pentiums, and there was Windows OS support as early as Win2000, I think also NT4 but I forget, it was 10 years ago.

The context for the "free lunch" is also mistaken: it addressed ever slower software on ever faster hardware. There IS a solution to THAT free lunch: rewrite VB-style crap in C or assembler. That avenue is far from fully exploited, and modern hardware that is 20 to 30 times faster is not matched by much of modern software, which may be a bit faster here and there. Note that this level of performance increase does not even address multicore processing yet.

RE being left behind, keep in mind that the SGI hardware I mention, which would be 3 to 4 years old now, was pelting 20 megapixel images at over 100 frames a second back then, and this type of performance is well beyond anything that x86 and current video software can expect in the foreseeable future. The difference between SGI parallel hardware (and for that matter some of the multiple parallel x86 hardware that was around a few years ago) and current dual and double dual core processors is dedicated hardware to interface between large processor counts, scaling at about 1.9 times per extra processor.

SUBSTRATES.
Silicon is 40 year old technology and while throwing large sums of money at it has kept it going for a long time, where does it go when speed / space requirements push track widths down to under 1 nanometre? The answer is nowhere in a hurry. Now while military suppliers are not going to start revealing their technology any time soon, I still remember ruby substrates and somewhere along the line sapphire/silicon junctions in high speed instrumentation. It would indeed be a brave prediction that processor clock speeds will not go up again; it tends to sound like Bill Gates' prediction about 640k of memory.

Don't be without ambition, think in terms of 1024 parallel cores running in the terahertz range, with dedicated hardware to properly interface them, running Windows Universe 12.  :bg

c0d1f1ed

Quote from: johnsa on May 28, 2008, 05:49:46 PM
So.. assuming that if one looks at the overhead of creating threads on a per-task basis, it makes far more sense to me to allocate a thread-per-core... and then in each thread routine implement some sort of workload designator which calls other routines as it needs.

Spot on.

Quote
I do understand how that approach doesn't really scale well in terms of a variable number of cores possibly, but unless you're creating a task which is essentially going to act as a template and all threads are instances of that same code (IE: like a socket server, web server etc) then implementing algorithms which are not only optimal but can handle a variable number of cores is almost impossible.

True, it quickly becomes mind-bogglingly complex to write software this way. But it does become manageable when you make use of declarative programming techniques. Have a look at SystemC for example. Basically every statement runs in parallel unless there is a dependency between them. The framework and compiler ensure that these tasks are distributed over all available cores. Each finished task can spawn new tasks, which are queued so that other threads can help process them, taking the dependencies into account to maintain correctness. This allows you to write and maintain larger projects than would otherwise be feasible with imperative languages. SystemC has a lot of overhead because it's built on top of C++, but a proper compiler for such a language could be quite revolutionary (RapidMind comes close and runs on an arbitrary number of cores).

Just-in-time compilation (JIT) also offers very interesting possibilities. Basically when you run the application it can compile the code to run optimally on whatever number of cores you have. So instead of requiring the developer to optimize specifically for every possible number of cores, it's handled automatically at run-time.

Of course declarative programming and JIT don't render assembly useless, but it will become increasingly difficult to write and maintain large projects in purely imperative languages. Instead, C and assembly remain crucial for compilers and multi-core programming frameworks. I've recently started exploring LLVM and the possibilities are truly awesome. In fact, I haven't found a single situation yet where the generated code doesn't match or exceed the performance of hand-written code, including SIMD operations! All that is lacking is a language and a compiler combining all of this into a convenient way to write high-performance multi-core aware software. We have exciting times ahead of us. :8)

hutch--

Here is a quick scruffy on the PAUSE mnemonic. It may in fact be more efficient in terms of exit from a spinlock, but it still locks the processor stone dead until it exits. This is useful enough in terms of thread timing trimmers, but unless you are willing to take big performance hits, PAUSE needs to be supplemented with an OS level yield so that other processes can run on the core while the lock is idling.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    pause equ <db 0F3h,90h>

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

    push esi

    mov esi, 10000000
    call spinlock

    mov esi, 20000000
    call spinlock

    mov esi, 40000000
    call spinlock

    mov esi, 80000000
    call spinlock

    pop esi
    ret

  spinlock:
    invoke GetTickCount
    push eax

  lbl1:
    pause
    sub esi, 1
    jnz lbl1

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax),13,10

    retn

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start
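
For what it's worth, here is a rough sketch of the pause-plus-yield idea: spin on the lock a fixed number of times with PAUSE and, if it is still owned, hand the rest of the time slice back to the OS with Sleep(0). The lock variable, labels and the spin budget of 64 are illustrative only, and the fragment assumes the same masm32rt.inc console template as the timing code above.


    .data
      align 4
      the_lock dd 0               ; 0 = free, 1 = owned

    .code

wait_acquire proc
    push esi

  try_again:
    mov esi, 64                   ; spin budget before yielding

  spin:
    cmp the_lock, 0               ; cheap read-only test
    je grab                       ; looks free, try to take it
    db 0F3h,90h                   ; pause
    sub esi, 1
    jnz spin
    invoke Sleep, 0               ; out of budget, give the core away
    jmp try_again                 ; (SwitchToThread is the gentler option)

  grab:
    mov eax, 1
    xchg the_lock, eax            ; implicitly LOCKed exchange
    test eax, eax
    jnz try_again                 ; lost the race, go around again

    pop esi
    ret

wait_acquire endp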

c0d1f1ed

Quote from: hutch-- on May 29, 2008, 12:14:44 PM
Software development HAS been changing forever but not all of it has lasted. Seen a 10 year old RISC box recently? What about a modern DDE application, how much OLE have you seen lately?

Sure, lots of technology comes and goes or ends up in a dusty corner. But you haven't given me a single sound argument yet why multi-core won't last. Mythic semiconductor substrates don't count: they are not on any roadmap, and even if they were, there is no indication they would render multi-core useless.

Try arguing against this: maximum single-core performance = instructions per clock * clock frequency; maximum multi-core performance = core count * instructions per clock * clock frequency. It doesn't take a genius to know that balancing three parameters will result in a better system than balancing two parameters. IPC is limited and doesn't favor single-core (multi-core benefits from it equally), so you'd have to compensate for all the performance gained by having multiple cores with an aggressively higher clock frequency. Octa-core chips are already on the roadmaps; now show me semiconductor technology that is on any roadmap that will allow a single core to keep up with that.
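
To put rough numbers on it (picked purely for the sake of the arithmetic): with IPC capped at around 5, an 8-core chip at 3 GHz peaks at 8 * 5 * 3 = 120 billion instructions per second, while a single core with the same IPC would need a 24 GHz clock to match it, and nothing remotely like that is on any roadmap.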

Quote
The notion that multicore processing is suddenly something new is mistaken. Try the 512 parallel Itaniums I mentioned earlier in recent SGI boxes, but even on x86 I remember seeing multiple-processor boards for the early Pentiums, and there was Windows OS support as early as Win2000, I think also NT4 but I forget, it was 10 years ago.

I've said this before: the revolutionary aspect is that it's coming to every mainstream system. Supercomputers with multiple processors have existed for ages but they ran specialized applications only available to a minority. We now face the challenge of making multi-core software for the masses. And it's not a fixed number of processors programmed by specialists, it's a varying number of cores programmed by programmers with a varying level of expertise.

Furthermore, the existence of supercomputers for many decades should tell you something about the need for multiple cores to reach higher performance. They've already lasted this long, so what kind of twisted reasoning makes you think we'll be able to vastly increase performance without transitioning to multi-core for good?

Quote
The context for the "free lunch" is also mistaken: it addressed ever slower software on ever faster hardware. There IS a solution to THAT free lunch: rewrite VB-style crap in C or assembler. That avenue is far from fully exploited, and modern hardware that is 20 to 30 times faster is not matched by much of modern software, which may be a bit faster here and there. Note that this level of performance increase does not even address multicore processing yet.

It's pointless to talk about the performance of Visual Basic. Anyone programming in it doesn't aim for performance in the first place; instead they want fast development and safety, which assembly won't offer them. I'm no fan of VB either, but I'll recommend C# to anyone requiring these properties from a programming language. There are good reasons for the existence of any language. The very fact that there are so many languages is because there are people who seek different qualities. Performance is only one of them.

That being said, the issue at hand is increasing performance of applications already written in C and assembly. The only route to further enhance performance is to take advantage of the increasing number of cores. And to not get strangled in your own code, you need some level of abstraction. High-level tools and languages can help with that. The alternative is to re-invent the wheel for every project and waste eons of time getting the design right.

Quote
RE being left behind, keep in mind that the SGI hardware I mention, which would be 3 to 4 years old now, was pelting 20 megapixel images at over 100 frames a second back then, and this type of performance is well beyond anything that x86 and current video software can expect in the foreseeable future. The difference between SGI parallel hardware (and for that matter some of the multiple parallel x86 hardware that was around a few years ago) and current dual and double dual core processors is dedicated hardware to interface between large processor counts, scaling at about 1.9 times per extra processor.

I fail to see why you're so excited by that multi-core SGI hardware while mainstream multi-core CPUs leave you cold. You won't get anywhere near the performance of that SGI hardware without making use of software designed for multi-core. Are you seriously suggesting that because we're not going to beat that SGI hardware any time soon, we might as well not take the multi-core path at all? There are tons of other applications that can become a reality in the meantime if we take advantage of multi-core.

Quote
Silicon is 40 year old technology and while throwing large sums of money at it has kept it going for a long time, where does it go when speed / space requirements push track widths down to under 1 nanometre? The answer is nowhere in a hurry. Now while military suppliers are not going to start revealing their technology any time soon, I still remember ruby substrates and somewhere along the line sapphire/silicon junctions in high speed instrumentation. It would indeed be a brave prediction that processor clock speeds will not go up again; it tends to sound like Bill Gates' prediction about 640k of memory.

Oh, I'm not saying that clock speeds won't go up again sooner or later. I'm just saying that silicon has at least another decade to go, during which the number of cores will increase aggressively but the clock frequency won't increase that much. Even after silicon runs out of steam and we transition to other technologies, there is not a single argument supporting a return to single-core. So no matter what happens, you should invest in multi-core software development if you care about performance.

Talking about substrates, the most promising of all is probably (synthetic) diamond. Even if it can truly run at hundreds of GHz as promised, we'd still have a transistor budget of many billions, and the only sane choice is to spend them on multiple cores...

johnsa

Ok, so how would one go about getting the structure aligned to 128 bytes / a cache line inline, without having to dynamically allocate memory?

as in

spinlock_t STRUCT
_lock dd 0
spinlock_t ENDS

mylock spinlock_t <1>   ; align this at 128byte / cache line

as opposed to using VirtualAlloc or something?



johnsa

neither align nor struct will accept 128 as an alignment.. tried that already :)


hutch--

John,

Allocate dynamic memory (any method will do as long as it is contiguous) and align the start location you want to use, which will probably be an offset from the start of the memory. This is the macro from masm32 to align memory.


    ; ********************************************
    ; align memory                               *
    ; reg has the address of the memory to align *
    ; number is the required alignment           *
    ; EXAMPLE : memalign esi, 16                 *
    ; ********************************************

      memalign MACRO reg, number
        add reg, number - 1
        and reg, -number
      ENDM


I know of one other method, but it would be unusual: you can use large alignments on object modules, and there is a tool in the masm32 project called FDA (or the GUI version FDA2) that will let you create an object module with the correct alignment and whatever byte size of data you require.

johnsa

not to worry, thanks for that macro info. I'll just use a dynamic allocation and align it rather than creating the lock objects in the .data

c0d1f1ed

Quote from: johnsa on May 29, 2008, 07:07:53 PM
neither align nor struct will accept 128 as an aligment.. tried that already :)

Shouldn't 64 byte alignment be sufficient?

Another option would be to have a static structure twice the cache line size and use an alignment macro like hutch's.

Or, have a structure that starts with a cache line's worth of dummy data, then the actual lock variable, and then fill the rest up to the size of another cache line. No matter where this structure ends up in memory, the lock variable is guaranteed to be all alone on a cache line (just make sure the structure is 4- or 8-byte aligned).
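
A padded lock structure along those lines might look something like this in MASM; the names and the 64 byte line size are assumptions (use 128 if you also want to stay clear of adjacent-line prefetching), and it can sit in .data without any VirtualAlloc:


    CACHE_LINE equ 64                         ; assumed cache line size

    padded_lock STRUCT
      pad_lo db CACHE_LINE dup (0)            ; keeps whatever precedes us off the line
      _lock  dd 0                             ; the actual lock variable
      pad_hi db CACHE_LINE-4 dup (0)          ; keeps whatever follows us off the line
    padded_lock ENDS

    .data
      align 4
      mylock padded_lock <>

    .code
      ; the lock is always addressed as mylock._lock, for example
      ;   mov eax, 1
      ;   xchg mylock._lock, eax


Since there are CACHE_LINE bytes of padding in front of the lock and CACHE_LINE-4 bytes behind it, the cache line holding the lock falls entirely inside the structure wherever it lands.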

hutch--

John,

Try this, I think it's correct and it's easy enough to implement.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    spinlock_t STRUCT
      _lock dd 0
    spinlock_t ENDS

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

    LOCAL pbuffer   :DWORD          ; buffer pointer
    LOCAL pstruct   :DWORD          ; start address for structure
    LOCAL buffer[1024]:BYTE

    lea eax, buffer                 ; get the buffer address
    mov pbuffer, eax                ; write it to buffer pointer

    push esi

    mov esi, pbuffer                ; load buffer address into ESI
    memalign esi, 128               ; align ESI to 128 bytes
    mov (spinlock_t PTR [esi])._lock, 12345678  ; < load your value here

  ; -----------------------
  ; test code for alignment
  ; -----------------------
    print str$(esi),13,10
    memalign esi, 128
    print str$(esi),13,10

    pop esi
    ret

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start

hutch--

c0d1f1ed,

I think you miss where I come from in relation to multicore or multiprocessor computers. Like most people with a reasonable grasp of modern computing I see multicore processing as the future, but at the common desktop PC level I see it as somewhere in the future, as I don't see the capacity in current dual core hardware as being even vaguely near fast enough to do general purpose work.

To make a point, back in the 16 bit DOS days I had the technical data on the FLAT memory model even though it was another 7 years until an OS properly supported it at the PC level. 16 bit DOS had a 64k memory range, with reasonably complicated tiling schemes for larger blocks of memory. Later in the scene you could access a few megabytes with upper memory management and later win32s, but they were both slow fudges of the complete capacity that had been in hardware since the 386DX33.

I wrote my first 32 bit FLAT memory model code on WinNT4, which had proper support for the FLAT memory model, much the same as the current 32 bit code that is still in use today.

I see multicore in much the same light as the FLAT memory model in 1988, something that will be useful in the future but not really viable at the moment. By the time multicore/processor hardware is capable of doing anything useful in terms of general purpose code, it will be all 64 bit, and the methods to make it general purpose will be very different indeed from current hardware and software techniques.

Multithread multicore processing is already with us in terms of multiple concurrent threads for things like terminal servers and web servers, as they routinely handle that type of workload, but the hardware is not yet suitable for close range high performance computing.

The distinction here is between vertical performance versus horizontal performance. Horizontal performance is well suited to current multicore hardware in terms of threads being spread across the available cores.

Now where this will make a difference is when you can approach a task that is by its layout not suitable for parallel processing. Compression comes to mind here, which affects not only simple data compression but formats like MP2 and MP4 video compression, which need to be linear (serial) in nature to achieve very high compression rates.

Think in terms of a 64 core x86 processor where the core design can not only handle current concurrent threads in the normal manner but can handle parallel processing on a single thread without the messy fudges that are currently required, and where you can get about a 1.9 times increase in computing power for each extra core used in the algorithm. That says that for the use of 10 cores instead of one you will get about 8 to 9 times that processing power.

The action is in interfacing multiple cores in an efficient manner, and this will only ever be done at a very high speed hardware level; emulating software cooperative multitasking at a core level is doomed to failure as it can never be fast enough.