why is multicore such a load of junk?

Started by johnsa, October 02, 2008, 10:20:42 AM


johnsa

So.. a few months ago we all had a lot of discussions around multi-core, multi-threading, parallel architectures, lock-free algorithms etc. I experimented with quite a few ideas at the time, but I only had a single-core machine so it was all theoretical. Since then I've got a Core 2 Duo (2 cores) and a quad core, and I've revisited all the old ideas to see what really happens in practice.

I've come to this defining conclusion:
multi-core is a scam..

I've taken a couple of different simple examples and "multi-cored/multi-threaded" them in the simplest ways to ensure full code and data independence, with no locking, and tried both large data sets and tiny data that doesn't change.

The end result.. no matter what you do: the 2nd core gives you around 15-20% improvement, and the 3rd and 4th cores absolutely nothing.
I've tried it on various machines, in asm and in C# using the Parallel Extensions library, and so on. The same holds true every time.

The one thing I see as a major problem with multi-core is that although you have increased computational ability, no computation or "REAL" work is possible without some sort of data, and it's the access to that data in memory that puts an upper bound on your performance. Even a single core can reach that upper bound, yielding little or no gain from subsequent cores.

That said, I thought it would be worth testing something that takes some TINY data, say a few hundred bytes, and works on that same data repeatedly in 2 threads, with affinity set to each core, to remove the memory-access issue from the equation and see if 2 cores truly run twice as fast... result... the same 20% improvement, even though that data should reside in each core's local cache.

Unless I'm missing something very obvious, it seems to me that multiple cores just don't do what they say.

Thoughts?

sinsi

I think multi-core works best with multi-tasking, so a single program won't show much improvement but (maybe) Windows/Linux work better overall.
Let's face it, Windows shows me 403 threads running, so even a 64-core CPU (or 64 CPUs) won't be that great...

Has anyone seen http://blogs.technet.com/markrussinovich/archive/2008/07/21/3092070.aspx ?
64 CPUs and 2TB of memory... mmm
Light travels faster than sound, that's why some people seem bright until you hear them.

zooba

It is a very inexact science. I have seen massive improvements in performance for video processing algorithms, and I have also seen algorithms that, while theoretically highly parallelisable, don't perform any better. I believe you are correct in that memory is now the limit. I would suggest an equally suitable test would be to run multiple instances of a single-threaded algorithm (that is, two separate processes, not threads). I have found, in general, that neither process suffers speed-wise as a result of the second process doing the same thing on a different file.

I also very much agree with sinsi: multi-tasking is considerably better on a multi-core system. I would personally prefer that most applications didn't automatically try to thrash every single core, because that way I can have heavy processing going on in the background without affecting system responsiveness. Also, as Mark Russinovich points out in one of his recent blogs, you gain the ability to close a program that would totally lock up a single-core PC.

Cheers,

Zooba :U

johnsa

I agree at the process level: I do find that two processes running concurrently can produce the same results that would previously only be possible one at a time. Which is all well and fine from a user perspective, making the overall OS experience a bit smoother.. but from a development point of view multiple processes aren't really an option. If we want to write code which is "faster" we need some way to make use of this extra processing power, and it just doesn't seem to work, no matter what type of data or algorithm it is. The second you use threads to accomplish a task in parallel (even by simple subdivision, e.g. taking a set of 10,000 records and doing 5,000 per core) the best you get is at most 20% from the 2nd core, and thereafter nothing.

Based on the industry trend at the moment towards ever-increasing core counts, and the CPU companies' inability so far to achieve higher clock speeds, what future is there for "faster" code? It's almost as if we've reached the "this is as good as it's going to get" point... at least until there is some innovation and we start seeing 4, 5, 6+ GHz cores.

The memory upper-bound limit aside, surely if 2 threads worked on two totally independent pieces of data, both equal in size (say 64 bytes.. something absolutely tiny and guaranteed to be in cache), and both threads ran the exact same code, one would expect performance to increase by at least 80-90%... perhaps Windows is at fault here and the threads aren't making use of the 2 cores correctly... or is it the cores themselves that aren't working as one would think?
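One thing worth ruling out in that 64-byte test (a guess on my part, not something established in this thread): if the two blocks are declared next to each other, they can land on the same 64-byte cache line, and the cores will then fight over that line even though the threads never touch each other's bytes - the usual name for this is false sharing. A minimal data layout that avoids it, with made-up names and a plain 64-byte spacer so the blocks can never share a line regardless of where .data starts:

    .data
    block1 db 64 dup(0)   ; thread 1's private data
    spacer db 64 dup(0)   ; padding: any cache line touching block1
                          ; now ends before block2 begins
    block2 db 64 dup(0)   ; thread 2's private data

If the ~20% ceiling survives with the blocks kept a full line apart and affinity pinned, false sharing can be crossed off the list.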

sinsi

The trouble is that Windows imposes its own priorities on any process. You can set thread affinity and priority, but Windows decides when things happen.
It gets even worse when threads have to sync - Windows may be multi-tasking, but multi-threading within a process seems to be badly handled.

dsouza123

Running 2 copies of a program, each with its affinity set to a different core, can yield twice
the throughput of 1 copy running on 1 core, as long as the program isn't memory-bandwidth limited/data starved.

I wrote a program that did exactly that, factoring integers, where the memory needed easily fit in the L1 data cache.
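A minimal sketch of that setup, assuming the MASM32 SDK's masm32rt.inc (the names and mask values are illustrative, not from the original program): the program pins its own process to one core at startup, so two copies started with masks 1 and 2 each own a core.

    include \masm32\include\masm32rt.inc

    .code
    start:
        ; pin this whole process to core 0 (mask bit 0);
        ; run the second copy with mask 2 for core 1
        invoke GetCurrentProcess
        invoke SetProcessAffinityMask, eax, 1
        ; ... the factoring work would go here ...
        invoke ExitProcess, 0
    end start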
---------------------------------------------------

A more extensive example has been tested on numerous x86 systems with 1 to 16 cores.
Some results from the table, for two similar CPUs, a dual and a quad core at the same 2500 MHz clock speed.

Kümmel's Mandelbrot Benchmark Version 0.53H-32b-MT results

http://www.mikusite.de/pages/x86.htm

Optimized x86 assembly (written in FASM), with linear scaling from extra cores.


Intel Core 2 Quad Q9300, 2500 MHz, 4 cores, FPU 1450.238 Mil Iter/sec, SSE2 3320.891 Mil Iter/sec
Intel Core 2 Duo  E9300, 2500 MHz, 2 cores, FPU  697.183 Mil Iter/sec, SSE2 1607.021 Mil Iter/sec


The source and 3 versions of the executable (FPU, SSE2, SSE2 Pentium M), plus readme.txt and results.xls,
are in the zip download from the above page.

When run, it displays 8 successive zoom-ins of the Mandelbrot set,
repeats the sequence 10 times, and gives the iter/sec result in a message box.

A well-written multithreaded assembly program can scale linearly with more cores,
and multiple copies of a program on different cores can also scale linearly.

Clarification: the Mandelbrot program is an example of multithreading,
with multiple cores used by one process, the threads working separately in parallel.

Normally the most effective use of multiple cores on a CPU is to have
separate processes working independently on a problem, with no communication
between the processes, and with the only multithreading in a process being two
threads: an interface thread and a worker thread. Ideally the memory needed
at any one time is held to some fraction of L1, and the thread either works
entirely within it or, if more is needed, reloads it only occasionally.

A big issue is memory access. Even with multiple processes, if the
demands on memory cause continual cache misses, or if the processes'
memory use so far exceeds the cache sizes that the caches are effectively
bypassed and main memory is relied on, bandwidth limits and access
contention will occur.

Draakie

Quote from: Hubby
Well I bought a quad and boy is it fast as hell - that's running 64-bit XP Pro
with the correct RAM-to-FSB ratio - that is a flat-out 1066 MHz - 4 gigs of
Kingston Hyper - and an AMD Phenom clocked to 2.8 GHz per core, and SATA
sitting at 3 Gb/sec. Performance is, to say the least - AWESOME!
3DSMAX / MAYA / SERIOUS RPG GAMES run like butter in a 400 deg oven.

Draakie
Does this code make me look bloated ? (wink)

hutch--

Having bothered to read the specs for the upcoming Intel quads, this stuff will continue to become more general purpose over time, with better instructions, reduced task-switching latency, better thread synchronisation and the like ..... The megabuck end of computing has much of this now, and it is within the foreseeable future that synchronous parallel processing will become a reality, just as asynchronous parallel computing became viable with 32-bit x86.

Interestingly enough, memory speed and processor throughput have improved over the last few years even though clock speeds have not gone up much, so hardware is still getting faster. If all else fails there is still an excellent old solution for big tasks: have a fast network between computers and delegate complete tasks to another box. The granularity is terrible, but for tasks that take minutes to hours you just dump the work somewhere else and let the other machine bash away at it while you are doing something else, and you certainly do not suffer from core contention, memory saturation and so on.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

sinsi

Real-world test... I had to copy 3 digital video tapes (90 minutes each) and burn them to DVD.
Copied with Movie Maker via FireWire (at last, found a use for the bloody thing, but had to enable it in the BIOS :redface:), burned with Nero.
computer1: Athlon 2600+ 2.13GHz, XP Home SP3, 1.5GB memory
computer2: Q6600 2.4GHz, XP Home SP3, 2GB memory (my baby - with an 8800GT it effing flies)

computer1 took about 7 bloody hours to do it (crashed the first time with 70% done  :bdg)
computer2 took about 2 hours


A well-written single program can work well, if the threads do totally different things and talk to each other minimally - not at all is best.

Quote from: dsouza123 on October 03, 2008, 12:09:30 AM
Normally the most effective use of multiple cores on a CPU is to have
separate processes working independently on a problem, with no communication
between the processes, and with the only multithreading in a process being two
threads: an interface thread and a worker thread.
Yes, like one thread for DoEvents (is that it?) to process user clicks/keys, and the other for the main program.

Quote from: dsouza123
Ideally the memory needed at any one time is held to some fraction of L1, and the thread either works entirely within it or, if more is needed, reloads it only occasionally.

A big issue is memory access. Even with multiple processes, if the demands on memory cause continual cache misses, or if the processes' memory use so far exceeds the cache sizes that the caches are effectively bypassed and main memory is relied on, bandwidth limits and access contention will occur.
With the number of threads running nowadays, memory/cache thrashing is going to happen anyway, so I think
there's nothing we can really do about it.

draakie, I haven't found a game yet that really taxes my system - the only thing I need now is
a widescreen LCD (I've only got a 19 inch CRT) but I will survive  :bg

dsouza123

I wonder if Nero used multiple threads, multiple (child) processes, or both.

johnsa
When you did
Quote from: johnsa on October 02, 2008, 10:20:42 AM
a couple of different simple examples and "multi-cored/multi-threaded" them in the simplest ways to ensure full code and data independence, with no locking, and tried both large data sets and tiny data that doesn't change.
did you use SetProcessAffinityMask (for all cores) and SetPriorityClass,
and for the threads SetThreadAffinityMask (specific core) and/or SetThreadIdealProcessor (specific core) and SetThreadPriority,
so each thread would/should run on a different core?

If you didn't specify, then it is whatever the OS defaults to -
maybe all threads on one core, or the process and/or threads bouncing from core to core.

What about running multiple copies of the tiny-data version with only one worker thread,
with and without SetThreadAffinityMask, or the multithreaded version using SetThreadAffinityMask? (A sketch of the pinned multithreaded variant follows.)
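For reference, a bare-bones MASM32 sketch of the pinned two-thread setup being suggested (untested; WorkerProc is a stand-in for the real computation and error checking is omitted). The threads are created suspended so the affinity mask is in place before either executes an instruction:

    .686
    .model flat, stdcall
    option casemap:none

    include \masm32\include\windows.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\kernel32.lib

    .data?
    hThread1 dd ?
    hThread2 dd ?
    tid      dd ?

    .code

    WorkerProc proc lpParam:DWORD
        ; the real per-core computation goes here
        xor eax, eax
        ret
    WorkerProc endp

    start:
        invoke CreateThread, NULL, 0, addr WorkerProc, 0, CREATE_SUSPENDED, addr tid
        mov hThread1, eax
        invoke CreateThread, NULL, 0, addr WorkerProc, 0, CREATE_SUSPENDED, addr tid
        mov hThread2, eax

        invoke SetThreadAffinityMask, hThread1, 1   ; bit 0 = core 0
        invoke SetThreadAffinityMask, hThread2, 2   ; bit 1 = core 1

        invoke ResumeThread, hThread1
        invoke ResumeThread, hThread2

        ; wait for both workers, then clean up
        invoke WaitForSingleObject, hThread1, INFINITE
        invoke WaitForSingleObject, hThread2, INFINITE
        invoke CloseHandle, hThread1
        invoke CloseHandle, hThread2
        invoke ExitProcess, 0
    end start

Timing the resume-to-wait span against a single-threaded run of the same WorkerProc would show directly whether the second core contributes.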
-----------------------------------------------------------

As a reference, I quickly looked at the assembly code for the Mandelbrot benchmark 0.53H,
and Kümmel uses SetThreadAffinityMask to control which core a thread will use.

I don't see a quick way to remove/substitute the full-screen DirectDraw graphics
(e.g. display a small subset with GDI in a window) so core utilization can be determined.

A possible workaround: start Windows Task Manager on the Users tab and minimize it
before running the benchmark, then restore it and switch to the Performance tab when the benchmark finishes;
the CPU usage graphs will show the history.

dsouza123

A Google search for thread affinity came up with
http://www.flounder.com/affinity.htm

The developer, Joseph Newcomer, wrote an interesting testbed GUI application:
the process priority and thread core affinities can be selected, configurations can be saved,
and it works on a PC with 1 to 8 cores. It tests 1 to 8 worker threads and 1 GUI thread.

What it does: each worker thread runs some looped computation, and when
all worker threads finish it creates a graph showing when each thread
was working. It also displays the elapsed time for each thread.

It is written in C++ and comes with the source, resources and exe. The code is
well commented and its style will look somewhat familiar to MASM programmers,
using API calls (in C++ syntax) such as ::SetThreadAffinityMask and
::SetPriorityClass, and types like DWORD, DWORD_PTR, WPARAM and LPARAM -
possibly because the developer co-authored the book
Developing Windows NT Device Drivers.

I wish it had been done in MASM; then alternative computations could be
plugged in and evaluated.

vanjast

Yup, it's the amount of cache allocated to each core that will make the difference, added to which, the method used to transfer data between the cores.
All these things were already proven in the early 90s when we used to dabble with Transputer systems - now those were nice toys.  :boohoo:
It's such a pity that they never went further, as their last chipset was set to kill Intel.
:8)


Mark_Larson

Quote from: johnsa on October 02, 2008, 10:20:42 AM
I've come to this defining conclusion: multi-core is a scam.. [...] The end result.. no matter what you do: the 2nd core gives you around 15-20% improvement, 3rd and 4th core.. absolutely nothing. [...] Unless I'm missing something very obvious, it seems to me that multiple cores just don't do what they say.

Thoughts?


Intel has slides on their site. The more cores you add, the less benefit each new core has.

It is also very algorithm dependent. For example, in ray-tracing you get an almost 2x speed-up, because it is so math intensive. So the harder your code drives the CPU, the more benefit you get from multi-core.
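Put loosely in Amdahl's-law terms (my framing, not Mark's): if a fraction p of the work parallelises and the rest is serial or bandwidth-bound, the speed-up on n cores is capped at

    S(n) = \frac{1}{(1-p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1-p}

With p = 0.9 (a highly parallel ray tracer) two cores give about 1.8x and four about 3.1x; johnsa's 15-20% gain from a second core corresponds to p around 0.3, which also predicts almost nothing from cores 3 and 4 (the ceiling is roughly 1.4x).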

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

johnsa

The ray-tracing example is another one I find hard to believe. Yes, it is mathematically complex, but every single one of those calculations relies on object geometry, vectors, scene data etc., which is all memory bound.
I reckon you could probably get 1 single core ray-tracing at around 75% of the possible maximum given the memory bandwidth constraint; your second core may give you that extra 15-20% as I've noted before, but no more. There is no such thing as computationally complex without data... all computation relies on data somewhere.. this is my argument... even a simple add requires two elements of data. In 95% of all real-world cases your "data" is not going to fit in the measly amount of CPU cache available. Add to this the fact that I don't think the caches behave well between cores either.
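One hedged way to reconcile the two views (a sketch of the standard bandwidth-versus-compute bound, not something either poster wrote): the throughput of n cores is capped by whichever resource saturates first,

    T(n) \le \min\left( n \cdot C_{\mathrm{core}},\ \frac{B}{q} \right)

where C_core is one core's compute throughput in items/sec, B is the memory bandwidth in bytes/sec, and q is the main-memory traffic per item in bytes. If q is large because scene data streams from RAM (johnsa's picture), the B/q term dominates and extra cores are wasted; if the working set stays resident in per-core cache, q is nearly zero, the n·C_core term governs, and scaling is close to linear (Mark's picture).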

Mark_Larson

Quote from: johnsa on November 19, 2008, 09:08:35 AM
The ray-tracing example is another one I find hard to believe. [...] In 95% of all real-world cases your "data" is not going to fit in the measly amount of CPU cache available.

You are making an incorrect assumption. If you run your program and it's not close to maxing out the CPU, then you don't have a good example. Your assumption that it is just moving data around is incorrect. Different programs use more or less of the processor; in typical ray-tracing algos it's 100%, so you get a big speed-up from going multi-core. For most programs that only use 30% or 15% of the CPU, you won't see much of a speed-up. Does that make better sense?