The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: johnsa on October 02, 2008, 10:20:42 AM

Title: why is multicore such a load of junk?
Post by: johnsa on October 02, 2008, 10:20:42 AM
So.. a few months ago we all had a lot of discussions around multi-core, multi-threading, parallel architectures, lock-free algorithms etc. I experimented with quite a few ideas at the time, but I only had a single-core machine so it was all theoretical. Since then I've got a Core2Duo (2 cores) and a quad core, and have re-visited all the old ideas to see what really happens in practice.

I've come to this defining conclusion:
multi-core is a scam..

I've taken a couple of different simple examples and "multi-cored/multi-threaded" them in the simplest ways to ensure full code and data independence, with no locking, and tried both large data sets and tiny data that doesn't change.

The end result.. no matter what you do: the 2nd core gives you around 15-20% improvement, and the 3rd and 4th cores.. absolutely nothing.
I've tried it on various machines, in asm and in C# using the Parallel Extensions library, and so on. The same holds true every time.

The one thing I see as a major problem with multi-core is that although you have increased computational ability, no computation or "REAL" work is possible without some sort of data, and it's the access of that data from memory that provides an upper bound on your performance. Even with one single core that upper bound can be reached, yielding little or no gain from subsequent cores.

That said, I thought it would be worth testing something that takes some TINY data, say a few hundred bytes, and works on that same data repeatedly in 2 threads, with affinity set to a different core for each thread, to remove the memory-access issue from the equation and see if 2 cores truly run twice as fast... result... the same 20% improvement, even though that data should reside in each core's local cache.

Unless I'm missing something very obvious, to me it seems that multiple cores just don't do what they say.

Thoughts?
Title: Re: why is multicore such a load of junk?
Post by: sinsi on October 02, 2008, 10:50:23 AM
I think multi-core works best with multi-tasking, so a single program won't show much improvement, but (maybe) overall Win/Lin work better.
Let's face it, Windows shows me 403 threads running, so even a 64-core CPU (or 64 CPUs) won't be that great...

Has anyone seen http://blogs.technet.com/markrussinovich/archive/2008/07/21/3092070.aspx
64 CPUs and 2 TB of memory... mmm
Title: Re: why is multicore such a load of junk?
Post by: zooba on October 02, 2008, 11:57:24 AM
It is a very inexact science. I have seen massive improvements in performance for video processing algorithms, and also seen algorithms that, while theoretically highly parallelisable, don't perform any better. I believe you are correct in that memory is now the limit. I would suggest that an equally suitable test would be to run multiple instances of a single-threaded algorithm (that is, two separate processes, not threads). I have found, in general, that neither process suffers speed-wise as a result of the second process doing the same thing on a different file.

I also very much agree with Sinsi: multi-tasking is considerably better on a multi-core system. I personally would prefer that most applications didn't automatically try and thrash every single core, because that way I can have heavy processing going on in the background without affecting the system responsiveness. Also, as Mark Russinovich points out in one of his most recent blogs, you gain the ability to close a program that would totally lock up a single-core PC.

Cheers,

Zooba :U
Title: Re: why is multicore such a load of junk?
Post by: johnsa on October 02, 2008, 12:13:21 PM
I agree at the process level: I do find that two processes running concurrently can produce the same results that would previously only be possible one at a time, which is all well and fine from a user perspective, making the overall OS experience a bit smoother. But from a development point of view multiple processes aren't really an option. If we want to write code which is "faster" we need some way to make use of this extra processing power, and it just doesn't seem to work, no matter what type of data or algorithm it is. The second you use threads to accomplish a task in parallel (even by simple subdivision, e.g. taking a set of 10,000 records and doing 5,000 per core), the best you get is at most 20% from the 2nd core and thereafter nothing.

Based on the industry trend at the moment towards ever-increasing core counts, and the CPU companies' inability so far to achieve higher clock speeds, what future is there for "faster" code? It's almost as if we've reached the "this is as good as it's going to get" point... at least until there is some innovation and we start seeing 4, 5, 6+ GHz cores.

The memory upper-bound limit aside, surely if 2 threads worked on two totally independent pieces of data, both equal in size (say 64 bytes.. something absolutely tiny and guaranteed to be in cache), and both threads ran the exact same code, one would expect performance to increase by at least around 80-90%... perhaps Windows is at fault here and the threads aren't making use of the 2 cores correctly... or is it the cores themselves which aren't working as one would think.
Title: Re: why is multicore such a load of junk?
Post by: sinsi on October 02, 2008, 12:39:57 PM
The trouble is that Windows imposes its own priorities on any process. You can set thread affinity and priority, but Windows decides when things happen.
It gets even worse when threads have to sync - Windows may be multi-tasking, but multi-threading within a process seems to be badly handled.
Title: Re: why is multicore such a load of junk?
Post by: dsouza123 on October 03, 2008, 12:09:30 AM
Running 2 copies of a program, each with its affinity set to a different core, can yield twice
the throughput of 1 copy running on 1 core, as long as the program isn't memory-bandwidth limited/data starved.

I wrote a program that did it, factoring integers, but the memory needed easily fit in the L1 data cache.
---------------------------------------------------

A more extensive example has been tested on numerous x86 systems with 1 to 16 cores.
Some results from the table for two similar CPUs, a dual and a quad core, at the same 2500 MHz clock speed:

Kümmel's Mandelbrot Benchmark Version 0.53H-32b-MT results

http://www.mikusite.de/pages/x86.htm

Optimized x86 assembly (written in FASM), with linear scaling from extra cores.


Intel Core 2 Quad Q9300, 2500 MHz, 4 Cores, FPU 1450.238 Mil Iter/sec, SSE2 3320.891 Mil Iter/sec
Intel Core 2 Duo  E9300, 2500 MHz, 2 Cores, FPU  697.183 Mil Iter/sec, SSE2 1607.021 Mil Iter/sec


The source, 3 versions of the executable (FPU, SSE2, SSE2 Pent M), readme.txt and results.xls
are in the zip download from the above page.

When run, it displays 8 successive zoom-ins of the Mandelbrot set,
repeats this 10 times, and gives the iter/sec result in a messagebox.

A well-written multithreaded assembly program can scale linearly with more cores,
or multiple copies of a program on different cores can also scale linearly.

Clarification: the Mandelbrot program is an example of multithreading,
with multiple cores used by one process working separately in parallel.

Normally the most effective use of multiple cores on a CPU is to have
separate processes working independently on a problem with no communication
between the processes, with the only multithreading in a process being two
threads, an interface thread and a worker thread. Ideally the memory needed
at any one time can be held to some fraction of L1, so the thread either just
works within it or, if more is needed, only occasionally reloads it.

A big issue is memory access, even with multiple processes: if there
are demands on memory that cause continual cache misses, or if the
processes' memory use far exceeds the cache sizes so the caches are
effectively bypassed and main memory is relied on, then bandwidth limits
and access contention will occur.
Title: Re: why is multicore such a load of junk?
Post by: Draakie on October 03, 2008, 03:46:42 PM
Quote from: Hubby

Well I bought a quad and boy is it fast as hell - that's running 64-bit XP Pro
with the correct RAM to FSB ratio - that is flat out 1066 MHz - 4 gigs of
Kingston Hyper - and an AMD Phenom clocked to 2.8 GHz per core and SATA
sitting at 3 Gb/sec. Performance is, to say the least - AWESOME!
3DSMAX / MAYA / SERIOUS RPG GAMES run like butter in a 400 deg oven.

Draakie
Title: Re: why is multicore such a load of junk?
Post by: hutch-- on October 03, 2008, 11:33:52 PM
Having bothered to read the specs for the upcoming Intel quads, this stuff will continue to become more general purpose over time, with better instructions, reduced task-switching latency, better thread synchronisation and the like. The megabuck end of computing has much of this now, and it is within the foreseeable future that synchronous parallel processing will become a reality, just like asynchronous parallel computing became viable with 32-bit x86.

Interestingly enough, memory speed and processor throughput have improved over the last few years even though clock speeds have not gone up much, so hardware is still getting faster. If all else fails there is still an excellent old solution for big tasks: have a fast network between computers and just delegate complete tasks off to another box. The granularity is terrible, but for tasks that take minutes to hours you just dump the work somewhere else and let it bash away while you are doing something else, and you certainly do not suffer from core contention, memory saturation and so on.
Title: Re: why is multicore such a load of junk?
Post by: sinsi on October 04, 2008, 06:36:37 AM
Real world test...had to copy 3 digital video tapes (90 minutes each) and burn to DVD.
Copied with movie maker via firewire (at last, found a use for the bloody thing, but had to enable it in the BIOS :redface:), burned with nero.
computer1: athlon 2600+ 2.13GHz, XP home sp3, 1.5GB memory
computer2: q6600 2.4GHz, XP home sp3, 2GB memory (my baby - with an 8800GT it effing flies)

computer1 took about 7 bloody hours to do it (crashed the first time with 70% done  :bdg)
computer2 took about 2 hours


A well-written single program can work well, if the threads do totally different things and talk to each other minimally - not at all is best.

Quote from: dsouza123 on October 03, 2008, 12:09:30 AM
Normally the most effective use of multiple cores on a CPU is to have
separate processes working independently on a problem with no communication
between the processes, with the only multithreading in a process being two
threads, an interface thread and a worker thread.
Yes, like one for DoEvents (?is it?) to process user clicks/keys, and the other for the main program.

QuoteIdeally the memory needed
at any one time can be held to some fraction of L1, so the thread either just
works within it or, if more is needed, only occasionally reloads it.

A big issue is memory access, even with multiple processes: if there
are demands on memory that cause continual cache misses, or if the
processes' memory use far exceeds the cache sizes so the caches are
effectively bypassed and main memory is relied on, then bandwidth limits
and access contention will occur.
With the number of threads running nowadays, memory/cache thrashing is going to happen anyway, so I think
there's nothing we can really do about it.

draakie, I haven't found a game yet that really taxes my system - the only thing I need now is
a widescreen LCD (I've only got a 19 inch CRT) but I will survive  :bg
Title: Re: why is multicore such a load of junk?
Post by: dsouza123 on October 04, 2008, 01:42:59 PM
I wonder if Nero used multiple threads, multiple (child) processes, or both.

johnsa
When you did
Quote from: johnsa on October 02, 2008, 10:20:42 AM
a couple of different simple examples and "multi-cored/multi-threaded" them in the simplest ways
to ensure full code and data independence, with no locking, and tried both large data sets and tiny data that doesn't change.
did you use SetProcessAffinityMask (for all cores), SetPriorityClass
and for the threads SetThreadAffinityMask (specific core) and/or SetThreadIdealProcessor (specific core) and SetThreadPriority,
so each thread would/should run on a different core ?

If you didn't specify, then it is whatever the OS defaults to:
maybe all threads on one core, or the process and/or threads bouncing from core to core.

What about running multiple copies of the tiny-data version with only one worker thread,
with and without SetThreadAffinityMask, or the multithreaded version using SetThreadAffinityMask?
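For reference, the pattern I mean looks roughly like this, masm32-style (a minimal untested sketch; thread1 and thread2 are assumed to hold the handles returned by CreateThread, on a 2-core box):

invoke GetCurrentProcess
invoke SetProcessAffinityMask,eax,3          ; process may use cores 0 and 1
invoke SetThreadAffinityMask,thread1,1       ; worker 1 locked to core 0
invoke SetThreadAffinityMask,thread2,2       ; worker 2 locked to core 1
invoke SetThreadPriority,thread1,THREAD_PRIORITY_HIGHEST
invoke SetThreadPriority,thread2,THREAD_PRIORITY_HIGHEST

If the threads are created CREATE_SUSPENDED, setting the masks before ResumeThread guarantees neither one ever runs on the wrong core.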
-----------------------------------------------------------

As a reference I quickly looked at the assembly code for the Mandelbrot benchmark 0.53H,
and Kümmel uses SetThreadAffinityMask to control which core a thread will use.

I don't see a quick way to remove or substitute the full-screen DirectDraw graphics
(e.g. displaying a small subset with GDI in a window) so that core utilization can be determined.

A possible workaround: start Windows Task Manager on the Users tab, then minimize it
before running the benchmark; restore it and switch to the Performance tab when the benchmark
finishes, and the CPU usage graphs will show the history.
Title: Re: why is multicore such a load of junk?
Post by: dsouza123 on October 04, 2008, 06:31:47 PM
  A google search for thread affinity came up with
http://www.flounder.com/affinity.htm

  The developer, Joseph Newcomer, wrote an interesting testbed GUI application:
the process priority and thread core affinities can be selected, configurations saved,
and it works on a PC with 1 to 8 cores. It tests 1 to 8 worker threads and 1 GUI thread.

  What it does: each worker thread runs some looped computation, and
when all worker threads finish it creates a graph showing when each thread
was working. It also displays the elapsed time for each thread.

  It is written in C++ and includes the source, resources and exe. The code is
well commented and has a style somewhat familiar to MASM programmers,
using API calls (in C++ syntax) such as ::SetThreadAffinityMask and
::SetPriorityClass and mentioning DWORD, DWORD_PTR, WPARAM and LPARAM,
possibly because the developer co-authored a book,
Developing Windows NT Device Drivers.

  I wish it had been done in MASM; then alternative computations could be
plugged in and evaluated.
Title: Re: why is multicore such a load of junk?
Post by: vanjast on November 10, 2008, 06:15:13 AM
Yup, it's the amount of cache allocated to each core that will make the difference, added to which is the method used to transfer data between the cores.
All these things were already proven in the early 90s when we used to dabble with Transputer systems - now those were nice toys.  :boohoo:
It's such a pity that they never went further, as their last chipset was set to kill Intel.
:8)

Title: Re: why is multicore such a load of junk?
Post by: Mark_Larson on November 18, 2008, 04:46:26 PM
Quote from: johnsa on October 02, 2008, 10:20:42 AM
I've come to this defining conclusion:
multi-core is a scam..

The end result.. no matter what you do: the 2nd core gives you around 15-20% improvement, and the 3rd and 4th cores.. absolutely nothing.

Unless I'm missing something very obvious, to me it seems that multiple cores just don't do what they say.


  Intel has slides on their site.  The more cores you add, the less benefit each new core brings.

  It is also very algorithm dependent.  For example, in ray-tracing you get an almost 2x speed up, because it is so math intensive.  So the harder your code works the CPU, the more benefit you get from multi-core.
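  (The usual way to formalize that diminishing return is Amdahl's law: if a fraction p of the work is parallelizable, n cores give a speedup of 1/((1-p) + p/n). With p = 0.9, for example, 2 cores give about 1.8x, but even infinitely many cores top out at 10x.)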

Title: Re: why is multicore such a load of junk?
Post by: johnsa on November 19, 2008, 09:08:35 AM
The ray-tracing example is another one I find hard to believe. Yes, it is mathematically complex, but every single one of those calculations relies on object geometry, vectors, scene data etc., which is all memory bound.
I reckon that you could probably get 1 single core ray-tracing at around 75% of the possible maximum given the memory bandwidth constraint; your second core may give you that extra 15-20% as I've noted before, but no more. There is no such thing as computationally complex without data... all computation relies on data somewhere.. this is my argument... even a simple add requires two elements of data. In 95% of all real-world cases your "data" is not going to fit in the measly amount of CPU cache available. Add to this the fact that I don't think the caches behave well between cores either.
Title: Re: why is multicore such a load of junk?
Post by: Mark_Larson on November 19, 2008, 02:30:32 PM
Quote from: johnsa on November 19, 2008, 09:08:35 AM
The ray-tracing example is another one I find hard to believe. Yes, it is mathematically complex, but every single one of those calculations relies on object geometry, vectors, scene data etc., which is all memory bound.
I reckon that you could probably get 1 single core ray-tracing at around 75% of the possible maximum given the memory bandwidth constraint; your second core may give you that extra 15-20% as I've noted before, but no more. There is no such thing as computationally complex without data... all computation relies on data somewhere.. this is my argument... even a simple add requires two elements of data. In 95% of all real-world cases your "data" is not going to fit in the measly amount of CPU cache available. Add to this the fact that I don't think the caches behave well between cores either.

you are making an incorrect assumption.  If you run your program and it's not close to maxing out the CPU, then you don't have a good example.  Your assumption that it is just moving data around is incorrect.  Different programs use more or less of the processor.  In typical raytracing algos it's 100%.  So you get a big speed up from going multicore.  For most programs that only use 30% or 15%, you won't see much of a speed up.  Does that make better sense?
Title: Re: why is multicore such a load of junk?
Post by: bozo on November 19, 2008, 08:07:56 PM
IMHO, it all depends on your algorithm.
if you can split the work up between 2 cores, you'll see twice the speed..split it between 50 computers, it'll be 50 times (or more, depending on specs) faster.

the problem most face is how to calculate what each of those cpus does.

in the colleges, they're only starting to teach this..
Title: Re: why is multicore such a load of junk?
Post by: johnsa on November 19, 2008, 09:04:09 PM
I'm still not convinced.. I understand that for purely CPU-driven code, more cores (assuming you can parallelise the algorithm) will give more performance.

But take ray-tracing as the example...

1. Calculation of camera rays to project into your scene - these could be calculated faster with more cores, as the data necessary is small and remains relatively constant.
2. Rotations/transformations of scene data into world/camera space - this is 50/50 to my mind: doing a matrix multiply or vector calc is 50% CPU and 50% memory, perhaps even more on the memory side.. from testing I can state that having 2 cores perform matrix multiplies is only about 20% faster than 1..
3. Testing for intersections with objects/polys in the scene.. what's the most intensive part of this? A simple intersection algo (assuming line/plane), or the fact that the scene data could be several hundred meg? To me there is more load on memory for this operation, even with BSP trees etc., than there is on the CPU.. cache will be an issue here too.
4. Performing the materials/texture/properties lookup for that intersection point... almost entirely memory driven..
etc.. etc..

So certain parts will benefit, but overall the net result will be about 20% for core 2, and less for core 3, and so on...
Title: Re: why is multicore such a load of junk?
Post by: Mark_Larson on November 20, 2008, 12:36:39 AM
  you are over-thinking it.

  raytracing scales linearly and thus it makes good sense to use multi-core.

  not ALL algorithms scale linearly. 

  In raytracing, if you draw HALF of the screen size the FPS doubles. 

  so let's assume the FULL frame size is 640x480 and HALF is 320x480. Clear so far?



  So say at HALF resolution we are getting 60 FPS, and at FULL resolution 30 FPS.  What would happen if you rendered the full frame, but did one half on one core and the other half on the other core?  You'd have 60 FPS again, but at the full 640x480.

make better sense?
Title: Re: why is multicore such a load of junk?
Post by: johnsa on November 20, 2008, 08:07:18 AM

Hehe.. I get what you're saying, and in a perfect world I would agree 100%.

If we had 4 completely independent computers/cores, each with their own memory and a full copy of the scene data in each memory, then yes.. we could subdivide the screen into quarters and get 4x the framerate and so on. This method works well for rendering non-interactive stuff, like the work done by Pixar etc. They can farm out the rendering either sub-frame or have different machines rendering frames concurrently.

However, in the average PC this isn't going to work because of the previously stated reasons.

I'll bet 20 dollars to anyone who can write an RT ray-tracer and prove beyond a reasonable doubt that dividing the area to be rendered between 1/2/4 cores shows the sort of performance increase that would make it worth the extra effort. I.e.:
1 core (640x480) at 10fps
2 cores (640x480 halves) at 20fps
4 cores (640x480 quarters) at 40fps

My feeling is that the results would look more like:
1 core (640x480) at 10fps
2 cores (640x480 halves) at 15fps
4 cores (640x480 quarters) at 17fps

The bottom line for me (and this is how I think about each algo):
If I have a machine with memory bandwidth of 1 gig/sec, can this loop iterate/process through 1 gig/sec of data on a single core? If it can, then adding cores will yield nothing.
If the algorithm can process through 800 MB/sec, then I'll see a 20% increase from the 2nd core (assuming overheads are taken into account), and so on..

In my experience VERY few algorithms end up in a situation where memory usage per second is something low like 200 MB while the CPU is taking strain. Don't get me wrong, there are cases, and for those multiple cores are brilliant. But in general it's not all that helpful.

As an example.. one could try something like the following:
Take a grid of 1000x1000 points between -1,-1,-1 and 1,1,-1, calculate the vector from 0,0,0 to each point, normalize it, and maybe additionally calculate the dot product with the vector 0,0,-1.
I'd be surprised if that got any faster with more cores.. and that's a pretty real-world example.
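(Rough numbers, by my estimate: 1000x1000 points at 16 bytes per vector is about 16 MB, several times larger than any Core 2's L2 cache, so both cores end up streaming over the same front-side bus, and the second one spends most of its time waiting.)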

Note: One of the little gems for improving cache coherency when ray-tracing (and this applies to many algos) is to render in small blocks, to ensure that the rays which are cast mostly end up hitting geometry and data with good locality, which should be cached. Just that one simple trick can yield substantial improvements, and this would indicate that memory and caching are of far more importance to overall RT performance.
Title: Re: why is multicore such a load of junk?
Post by: Mark_Larson on November 20, 2008, 11:54:17 PM
http://ompf.org/forum/
they have forums

they have plenty of raytracers that go 2x on dual core.

You are way off base with your assumption again on raytracers.  Please read up more about it on that website before you post any more way-off data on raytracers.
Title: Re: why is multicore such a load of junk?
Post by: Mark_Larson on November 21, 2008, 12:06:34 AM
this should make it easier for you to understand

pdf
http://www.devmaster.net/articles/raytracing_series/State-of-the-Art%20in%20interactive%20ray%20tracing.pdf

it's written by several guys with Doctorates who do real time raytracing research.

there is a page on the benefit of multi-core.

it is figure 14.  Go and look at the figure since it has a graph that helps explain.

up to 7 processors they are getting about a 7x speed up.

here is the quote beneath the figure.

Quote
Figure 14: Our implementation shows almost perfect scalability
of from 1 to 7 dual CPU PCs if the caches are already
filled.With empty caches we see some network contention effects
with 4 clients but scalability is still very good. Beyond 6
or 7 clients we start saturating the network link to the model
server.
Title: Re: why is multicore such a load of junk?
Post by: BogdanOntanu on November 21, 2008, 05:43:46 AM
Quote
With empty caches we see some network contention effects
with 4 clients but scalability is still very good. Beyond 6
or 7 clients we start saturating the network link to the model
server.

Apparently they are using a client-server network model. This suggests multiple computers, each with its own CPU, cache, memory, hard disk and network.

Yes, in this case the only problem is the network, and IF you divide the job into a small part for each CPU and copy the whole scene model onto each computer, THEN you will see an almost linear speed-up with each CPU/computer added, until the time needed to copy the scene and return the rendering results to the central server grows too big.

I have seen this done with 4-5 computers even back in 199x, when rendering 3D films in DOS with 3D Studio MAX, and it worked OK.  3D Studio MAX for DOS used to have a queue manager that would distribute the scene and frames to each computer via the network and collect the rendering results. It works, but you need separate computers.

I kind of agree with what johnsa presents here; it is logical, and I kind of remember a university lecture on multiple CPUs (before the multicore hype) that demonstrated exactly this: after a few CPUs are added the benefits become minimal, lost in inter-CPU communication and waiting. The benefits remain high only for a limited set of algorithms, and then only with costly preparation phases.

In a multi-core system there are too many things "shared": the cache, the RAM, some buses, the disk drive, sometimes the network, and all of those become new bottlenecks. It does not matter that the second CPU could perform some useful work IF in real-life situations it has to WAIT for a shared resource to be freed by the "other" CPU. Of course the greatest penalty comes from the cache, the RAM and the shared buses rather than from the HDD or network, but it is still better to have those devices separate.

I have written a minimal realtime raytracer in ASM... I will convert it to multicore and see the results... if any ;)

-----

As a side note, I would never ever trust a guy with a doctorate.

I proved them wrong again and again at university once I started doing my own research and experiments, and when I confronted them with my results they moved from corner to corner, showing that they had no real knowledge but only pretended to, copy-pasting concepts and research from other, "non-doctorate" people.

In the end they started to threaten me as the "last logical argument", and as a final-year student I had to "bend" IF I wanted to get my diploma... In the end I realized that I was "bending" the truth just in order to get a better life and a doctorate diploma, and hence I quit university in my last year for this reason... a diploma is valuable in our society, but in the face of the truth it is an abomination.

Later in my work experience I have met a lot of western "doctors" of science and technology and IT with impressive diplomas and "research", but when I started to talk and discuss with them I realized that they had no clue whatsoever despite their social status and references... of course I had to be polite, but still.

Real education never ever started on this planet.

I am sure that at my physical death I shall regret the 5 years I lost in university study...
Title: Re: why is multicore such a load of junk?
Post by: johnsa on November 21, 2008, 07:46:03 AM
Bogdan, I concur 100%.

The articles I've read on the topic of RT ray-tracing, like the one above, gain the needed performance through multiple fully independent computers, each with its own RAM, cache, CPUs etc.

I did in fact mention that in my previous post as a perfectly valid way to get almost-linear scaling of performance with cores (assuming each core comes with its own cache and memory) :)

Once again, I would challenge anyone to write some code (either a full RT ray-tracer, or perhaps just a block of sample code doing some scene or vector work as I mentioned before) and get it to run at 400% (even 300%) of original performance when comparing 1 to 4 cores. Obviously in the same machine, with no cheating, and not using one of the limited number of algorithms which DO genuinely benefit from (scale linearly with) core count... It has to be a real-world example.

I personally think that multi-core is all marketing hype to try and cover up the fact that chip makers have hit stumbling blocks in terms of performance gains.
In the long run it's still good that we have gone through the multi-core process, as it teaches one a lot, and given the right peripheral architecture in the machine to support it (think Cell processors) it could be viable for task separation within one application, or at the very least make general multi-tasking a bit more slick.

On a side note.. I suspect that if they increased the size of the independent cache per core, and the OS ensured that thread affinities were preserved and that the cache wasn't thrashed, then you'd see more and more benefit from the cores. I wouldn't say multi-core is a bad idea; I just think that in its current implementation within consumer PCs it's not very effective.

Title: Re: why is multicore such a load of junk?
Post by: johnsa on November 21, 2008, 09:13:56 AM



BNT MACRO ; emit 2Eh prefix byte: static branch-not-taken hint
db 2eh
ENDM

BTK MACRO ; emit 3Eh prefix byte: static branch-taken hint
db 3eh
ENDM

include \masm32\include\masm32rt.inc
include c:\dataengine\timers.asm

.686p
.mmx
.k3d
.xmm
option casemap:none

Vector3D_Normalize     PROTO ptrVR:DWORD, ptrV1:DWORD
Vector3D_Normalize_FPU PROTO ptrVR:DWORD, ptrV1:DWORD

test_thread1 PROTO :DWORD
test_thread2 PROTO :DWORD

Vector3D STRUCT
x REAL4 0.0
y REAL4 0.0
z REAL4 0.0
w REAL4 0.0
Vector3D ENDS

;###############################################################################################################
; DATA SECTION
;###############################################################################################################
.const

.data?

align 4
thread1  dd ?
thread2  dd ?
var      dd ?
var2 dd ?

.data

VECTOR_COUNT equ (20000000) ; Number of vectors in the list.

align 16
TestVector REAL4 2.4,4.3,1.9,1.0 ; A test vector to fill the structure with.
VectorListPtr dd 0 ; A pointer to a list of VECTOR_COUNT vectors (X,Y,Z,W) AOS format.

objcnt2  dd 0,0 ; Event Handles.

;###############################################################################################################
; CODE SECTION
;###############################################################################################################
.code

start:

; Allocate the vector list memory.
invoke GetProcessHeap
invoke HeapAlloc,eax,HEAP_ZERO_MEMORY,(16*VECTOR_COUNT)
mov VectorListPtr,eax

; Fill the VectorList with some data.
mov edi,VectorListPtr
mov ecx,VECTOR_COUNT
movaps xmm0,TestVector
@@:
movaps [edi],xmm0
add edi,16
dec ecx
jnz short @B

; Time how long it takes to normalize this list with 1 core.
timer_begin 1, HIGH_PRIORITY_CLASS

mov edi,VectorListPtr
mov ecx,VECTOR_COUNT
align 16
@@:
invoke Vector3D_Normalize,edi,edi
add edi,16
dec ecx
BTK
jnz short @B

timer_end
print ustr$(eax)
    print chr$(" ms",13,10)

; Fill the VectorList with some data again.
mov edi,VectorListPtr
mov ecx,VECTOR_COUNT
movaps xmm0,TestVector
@@:
movaps [edi],xmm0
add edi,16
dec ecx
jnz short @B

; Time how long it takes to normalize this list with 2 cores.
    mov esi,offset objcnt2
    invoke CreateEvent,0,FALSE,FALSE,0
    mov [esi],eax
    invoke CreateEvent,0,FALSE,FALSE,0
    mov [esi+4],eax
    mov thread1,rv(CreateThread,NULL,NULL,ADDR test_thread1,[esi],CREATE_SUSPENDED,ADDR var)
    mov thread2,rv(CreateThread,NULL,NULL,ADDR test_thread2,[esi+4],CREATE_SUSPENDED,ADDR var2)
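    ; NOTE: no thread affinity is set here, so Windows is free to schedule
    ; both workers wherever it likes; SetThreadAffinityMask,thread1,1 and
    ; SetThreadAffinityMask,thread2,2 would pin them to separate cores.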

timer_begin 1, HIGH_PRIORITY_CLASS

invoke ResumeThread,thread1
invoke ResumeThread,thread2

    invoke WaitForMultipleObjects,2,OFFSET objcnt2,TRUE,INFINITE

timer_end
print ustr$(eax)
    print chr$(" ms",13,10)
   
mov esi,offset objcnt2
    invoke CloseHandle,[esi]
    invoke CloseHandle,[esi+4]
    invoke CloseHandle,thread1
    invoke CloseHandle,thread2
   
; Free Vector List Memory.
invoke GetProcessHeap
invoke HeapFree,eax,HEAP_NO_SERIALIZE,VectorListPtr
   
invoke ExitProcess,0

;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
; Normalize Vector V1 into VR (AOS Format).
; VR = 1/||V1|| * V1.
;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
align 16
Vector3D_Normalize PROC ptrVR:DWORD, ptrV1:DWORD
push esi
push edi
mov esi,ptrV1
mov edi,ptrVR
movaps xmm0,[esi] ; xmm0 [ w | z | y | x ]
movaps xmm3,xmm0 ; xmm3 [ w | z | y | x ]
mulps xmm0,xmm0 ; xmm0 [ w^2 | z^2 | y^2 | x^2 ]
pshufd xmm1,xmm0,00000001b ; xmm1 [ x^2 | x^2 | x^2 | y^2 ]
pshufd xmm2,xmm0,00000010b          ; xmm2 [ x^2 | x^2 | x^2 | z^2 ]
addss xmm0,xmm1 ; xmm0 [                 | x^2 + y^2 ]
addss xmm0,xmm2                  ; xmm0 [                 | x^2 + y^2 + z^2 ]
rsqrtss xmm1,xmm0                ; xmm1 [ 1 | 1 | 1 | 1/|v| ]
pshufd xmm1,xmm1,00000000b      ; xmm1 [ 1/|v| | 1/|v| | 1/|v| | 1/|v| ]
mulps xmm3,xmm1                  ; xmm3 [ w*1/|v| | z*1/|v| | y*1/|v| | x*1/|v| ]
movaps [edi],xmm3
pop edi
pop esi
ret
Vector3D_Normalize ENDP

;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
; Normalize Vector (AOS) using FPU.
;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
align 16
Vector3D_Normalize_FPU PROC ptrVR:DWORD, ptrV1:DWORD
push esi
push edi
mov esi,ptrV1
mov edi,ptrVR
fld dword ptr (Vector3D PTR [esi]).x ;st0=x
fmul st,st(0) ;st0=x^2
fld dword ptr (Vector3D PTR [esi]).y ;st0=y | st1=x^2
fmul st,st(0) ;st0=y^2 | st1=x^2
faddp st(1),st ;st0=x^2 + y^2
fld dword ptr (Vector3D PTR [esi]).z ;st0=z | st1 = x^2+y^2
fmul st,st(0) ;st0=z^2 | st1 = x^2+y^2
faddp st(1),st ;st0=z^2+y^2+x^2
fsqrt                        ;st0=len
fld1 ;st0=1.0 | st1=len
fdivr ;st0=1.0/len
fld dword ptr (Vector3D PTR [esi]).x ;st0=x | st1=1/len
fmul st,st(1) ;st0=x*1/len | st1=1/len
fstp dword ptr (Vector3D PTR [edi]).x ;st0=1/len
fld dword ptr (Vector3D PTR [esi]).y ;st0=y | st1=1/len
fmul st,st(1) ;st0=y*1/len | st1=1/len
fstp dword ptr (Vector3D PTR [edi]).y ;st0=1/len
fmul dword ptr (Vector3D PTR [esi]).z ;st0=z*1/len
fstp dword ptr (Vector3D PTR [edi]).z ;--
pop edi
pop esi
ret
Vector3D_Normalize_FPU ENDP

;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
; THREAD 1 (does first half of the list)
;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
align 16
test_thread1 PROC nThread:DWORD
push edi
push ecx
mov edi,VectorListPtr
mov ecx,VECTOR_COUNT/2
align 16
NormVectors1:
invoke Vector3D_Normalize,edi,edi ; or use FPU version.. same result
add edi,16
dec ecx
BTK
jnz short NormVectors1
invoke SetEvent,nThread
pop ecx
pop edi
ret
test_thread1 ENDP

;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
; THREAD 2 (does second half of the list)
;=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
align 16
test_thread2 PROC nThread:DWORD
push edi
push ecx
mov edi,VectorListPtr
add edi,(VECTOR_COUNT/2)*16
mov ecx,VECTOR_COUNT/2
align 16
NormVectors2:
invoke Vector3D_Normalize,edi,edi ; or use FPU version same result
add edi,16
dec ecx
BTK
jnz short NormVectors2
invoke SetEvent,nThread
pop ecx
pop edi
ret
test_thread2 ENDP

end start



So there is an example: normalizing a list of 20 million vectors (X,Y,Z,W) on 1 core, and then on 2 cores via threads... in this case the threads using 2 cores are actually slower. It's a bit of a simplified example, and the list of vectors
is just divided into 2 halves.. but if something like this gets no benefit from 2+ cores, then it follows logically that 90% of all the code you're going to be using/writing, especially the time-critical parts, won't either.
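(A back-of-envelope check on why: 20 million vectors x 16 bytes is 320 MB, and each pass reads and writes every vector, i.e. roughly 640 MB of bus traffic. If a single core can already stream close to the machine's practical memory bandwidth - a few GB/s on a Core 2 era front-side bus - the run time is set by the bus, and a second core sharing that bus has nothing left to contribute.)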
Title: Re: why is multicore such a load of junk?
Post by: johnsa on November 21, 2008, 11:18:28 AM
Quote
Parallel Scalability Ray tracing is known for being "trivially
parallel" as long as a high enough bandwidth to the
scene data is provided. Given the exponential growth of
available hardware resources, ray tracing should be better
able to utilize it than rasterization, which has been difficult
to scale efficiently 8. However, the initial resources
required for a hardware ray tracing engine are higher than
those for a rasterization engine.

Page 2 of the PDF Mark posted. Pay close attention to the part "as long as a high enough bandwidth to the scene data is provided". This implies the memory-intensive nature I was referring to.
Title: Re: why is multicore such a load of junk?
Post by: johnsa on November 21, 2008, 11:22:22 AM
Quote
Contrary to general opinion a ray tracer is not bound by
CPU speed, but is usually bandwidth-limited by access to
main memory. Especially shooting rays incoherently, as done
in many global illumination algorithms, results in almost
random memory accesses and bad cache performance. On
current PC systems, bandwidth to main memory is typically
up to 8-10 times less than to primary caches. Even more importantly,
memory latency increases by similar factors as we
go down the memory hierarchy.

From Page 7 of that same PDF...
I think my case is clear.
Title: Re: why is multicore such a load of junk?
Post by: Mirno on November 21, 2008, 01:02:09 PM
The top 3 places on the top 500 supercomputers disagree with you.
Roadrunner (dual-core Opterons), Jaguar (quad-core Opterons), and Pleiades (quad-core Xeons) all seem to run pretty well and use multicore processors.

Parallel algorithms may well be less efficient than monolithic algorithms, but if you want ridiculous speeds, then the history of supercomputing seems to show it's the way to go.

You've got a choice: develop something faster than silicon (and cheaper than gallium arsenide), or go parallel. Economics points us to multi-core. Them's the breaks.
Title: Re: why is multicore such a load of junk?
Post by: johnsa on November 21, 2008, 01:48:27 PM
Quote from: Mirno on November 21, 2008, 01:02:09 PM
The top 3 places on the top 500 supercomputers disagree with you.
Roadrunner (dual-core Opterons), Jaguar (quad-core Opterons), and Pleiades (quad-core Xeons) all seem to run pretty well and use multicore processors.

Parallel algorithms may well be less efficient than monolithic algorithms, but if you want ridiculous speeds, then the history of supercomputing seems to show it's the way to go.

You've got a choice: develop something faster than silicon (and cheaper than gallium arsenide), or go parallel. Economics points us to multi-core. Them's the breaks.


100% .. but there is a HUGE difference between a shared-memory or distributed grid model and a single PC with multiple cores using limited cache and shared memory bandwidth.. that is my point. We're talking consumer-level, general-purpose desktop/laptop here (the sort of thing you're going to run games or business apps on), and this is where multi-core is not effective. Not even in mid-range servers for enterprise environments (as long as they're based on the same general PC architecture, albeit with swappable power and RAID).

If we had a PC with memory divided into say 4 blocks of 4 gig each, or 1 block per core where each block can be any size (64-bit computing+), and some new instructions to manage them so that you could do something like:
special_mov [block0_edi],[block1_edi] .. to move data from block to block via a separate shared memory controller... and then each core had full exclusive access to its block with a full-bandwidth pipe.. then and only then would parallel / multi-core become truly effective.

Yes synchronization might still be necessary, data will still need to be moved from block to block.. but in critical sections where it counts each core can sub-divide / scatter-gather internally its processing.

The key is to have MEMORY_BANDWIDTH x N cores... every core must have its own full pipe to memory.
Title: Re: why is multicore such a load of junk?
Post by: Mirno on November 21, 2008, 02:06:36 PM
All 3 of those machines are using "consumer grade" parts (albeit validated for use in server-level environments - there being no real difference between Opterons and Athlons, or Xeons and Core processors).
I agree that memory bandwidth does become a bottleneck, which is why Intel moved away from the FSB with the latest Nehalem processors, and AMD has used HyperTransport to avoid (memory) bandwidth contention from inter-processor communication.

Multi-socket AMD systems have supported NUMA for a long time because of this (and I guess Intel will do the same), increasing bandwidth as they go. Any processor is hobbled by a lack of memory bandwidth, and perhaps the current generation of multi-core processors is underfed; however, that isn't an artifact of multi-core processor design, and future generations of multi-core processors may well be designed with wider memory interfaces because of this.

Essentially I'm saying that multi-core isn't a bad idea (as the thread title suggests), although current implementations may not be as effective as we would like. The top500 machines (where the cost of high-performance memory subsystems is no obstacle) show that multi-core designs offer a good way of upping CPU power without a big increase in space or electrical power.

I use some fairly high-powered server-grade hardware (quad socket, dual core, 16 GB of RAM), and it performs as well as required (IO over the network to the databases seems to be our bottleneck). Of course if your problem was already IO bound, then more CPU power will never help you, but then again that's true of single-core machines too.

Mirno
Title: Re: why is multicore such a load of junk?
Post by: Mark_Larson on November 21, 2008, 05:32:22 PM
Quote from: johnsa on November 21, 2008, 09:13:56 AM



So there is an example: normalizing a list of 20 million vectors (X,Y,Z,W) on 1 core, and then on 2 cores via threads... in this case the threads using 2 cores are actually slower. It's a bit of a simplified example, and the list of vectors
is just divided into 2 halves.. but if something like this gets no benefit from 2+ cores, then it follows logically that 90% of all the code you're going to be using/writing, especially the time-critical parts, won't either.

Does it hit 100% on one processor with one thread?  Or even close?  That is what raytracers do.  Simply doing a normalization isn't a good enough example.  And yes, ray-tracers are a SPECIAL case; not all code parallelizes as easily.
Title: Re: why is multicore such a load of junk?
Post by: johnsa on November 23, 2008, 11:38:32 AM

With 1 thread the first core is maxed out (50% of total available cpu). With both threads 2 cores are fully maxed out (100% of cpu used) and the final timings are identical (in fact 2 threads is slightly slower).

I've tried this same approach to testing various other algorithms, and so far, as I mentioned previously, not one has gained more than 20% by having 2 cores thrown at it.

So if normalizing vectors is a bad example (I thought it's a pretty real-world example), then how about doing the same test but calculating Fresnel factors or something? I can tell you now.. it will end in tears.. with no performance gain from multi-core :)
Title: Re: why is multicore such a load of junk?
Post by: Mark_Larson on November 23, 2008, 05:04:53 PM
Quote from: johnsa on November 23, 2008, 11:38:32 AM

With 1 thread the first core is maxed out (50% of total available cpu). With both threads 2 cores are fully maxed out (100% of cpu used) and the final timings are identical (in fact 2 threads is slightly slower).

I've tried this same approach to testing various other algorithms, and so far, as I mentioned previously, not one has gained more than 20% by having 2 cores thrown at it.

So if normalizing vectors is a bad example (I thought it's a pretty real-world example), then how about doing the same test but calculating Fresnel factors or something? I can tell you now.. it will end in tears.. with no performance gain from multi-core :)

there is actually an open-source raytracer that is done as a tutorial on devmaster.net

I'll send you the link.  They actually have 7 tutorials with source code, but I could only get the first 3 to compile.


That way we have a real raytracing example, and then we can see what happens.  Yeah, Quake 3 had the same problem: they only got a 20-30% speed up from going multi-core.  So in general your statement is accurate.  But there are special cases.


this is the first tutorial.  That tutorial links to the rest.  The pdf I posted I got from this tutorial.
http://www.devmaster.net/articles/raytracing_series/part1.php

Title: Re: why is multicore such a load of junk?
Post by: vanjast on November 24, 2008, 11:36:24 PM
Has anybody mentioned yet that with a dual/quad core you can get the same amount of work done with less power wasted?
The CPU runs cooler..
That's if the OS is optimised for this, that is
:8)
Title: Re: why is multicore such a load of junk?
Post by: BogdanOntanu on November 25, 2008, 07:57:02 AM
Quote from: vanjast on November 24, 2008, 11:36:24 PM
Has anybody mentioned yet that with a dual/quad core you can get the same amount of work done with less power wasted?
The CPU runs cooler..
That's if the OS is optimised for this, that is
:8)

No, nobody mentioned that, because it is NOT true.

Think a little... if you have a dual-core CPU with a fixed number of transistors, it consumes a certain amount of power, say 40W.

It is then simple logic that IF you cut the chip in half THEN it will consume 40W/2 = 20W, dissipate less power, and run much cooler than the dual core. The same logic works for 4x or 8x cores. Yes, you can stop parts of the CPU, BUT by design in a multi-core CPU some stuff is shared, and you cannot stop those parts because they are needed by the one core that is still running. And if you do stop them, then you lose the "multi-core advantage".

Hence what you say is illogical.

However, it is true that the faster but "older" P4 CPUs are usually made in a less advanced process technology than the one used for the "new" dual/quad cores, and because of this they consume much more power. But this is done on purpose, in order to promote the new multicore CPUs and to diminish the importance of faster single cores, because apparently they consume more and are harder to produce. Simple market manipulation.

Yes, in practice the faster you switch (depending on technology), the more power "might" be lost in the transition phases. However, adding 2x or 4x the number of switching elements and packing them on the same chip is not going to help either. Having shared parts that can never be switched off, but have to be designed to accommodate 2x or 4x the load, does not help either.

One way to reduce internet power consumption would be to use a much smaller picture in your signature and eventually a less offending one. More space for text information... :P
Title: Re: why is multicore such a load of junk?
Post by: Mirno on November 25, 2008, 12:16:32 PM
The last set of comparable single- and dual-core processors I could find was the Yonah core (not Core 2) Duo vs Solo.
The 65nm Core Solo is quoted with a TDP of 27 watts, while its dual-core equivalent is only 4 watts more.

So it appears that while it is "illogical", it is the case that there are power savings. Further to this, both Intel and AMD have learnt much from their initial forays into the dual-core world, and now have separate power planes for each core (as well as the memory controller and IOs), so that power saving should be reflected more evenly if the load on the two cores is unbalanced.


Mirno
Title: Re: why is multicore such a load of junk?
Post by: BogdanOntanu on November 25, 2008, 12:52:28 PM
Quote
So it appears that while it is "illogical" it is the case that there are power savings.

This is like this because "they" want it to be so ;)

The manufacturers can and in fact will promote what they want to produce in the future, and will "leave behind" in more or less "subtle" ways the products that they do not want to sell anymore.

Anyway, the "writing is on the wall": good or bad, the near future belongs to multi-core CPU designs, because the producers say so and the consumers have no way to influence this. And it might well be the "whole" future, not only the "near" future.

"Junk" or "not junk" it does not really matter. Just accept it as it is and make the best out of it because you can not produce your own CPU :D

Title: Re: why is multicore such a load of junk?
Post by: Mirno on November 25, 2008, 01:05:18 PM
If by "they", you mean the laws of physics, then yes.
Die shrinkage isn't the solution it once was - electrical leakage becomes more and more dominant with each successive manufacturing process, and we can't just ramp the clocks up anymore.
Materials science has been pushed pretty far, but we simply cannot dissipate the heat that the silicon would produce at higher clock levels. Intel's and AMD's latest architectures both hit 6+ GHz, but they need liquid nitrogen cooling to do it. There isn't any more speed available with air cooling.

The fact that even ARM is looking at dual-core processor designs says a lot about the direction of the industry, and I'm inclined to believe they (and the VHDL engineers I worked with, and all the other industry analysts) have a good reason for supporting multi-core.

Mirno
Title: Re: why is multicore such a load of junk?
Post by: vanjast on November 25, 2008, 02:09:56 PM
Quote from: BogdanOntanu on November 25, 2008, 07:57:02 AM
Yes, in practice the faster you switch (depending on technology), the more power "might" be lost in the transition phases.
A Freudian slip here maybe...  :wink I realise the numbers, but haven't looked properly into this. I vaguely remember seeing it mentioned someplace.
Maybe it was the power dissipation efficiency... lower clock speeds + slightly bigger die area = better cooling (or running cooler)

Quote from: BogdanOntanu on November 25, 2008, 07:57:02 AM
One way to reduce internet power consumption would be to use a much smaller picture in your signature and eventually a less offending one. More space for text information... :P
...the exact reason why I didn't post a pic of myself...  :bg :green2
Title: Re: why is multicore such a load of junk?
Post by: Rockoon on November 27, 2008, 04:30:48 AM
It is not johnsa's benchmark that I disagree with, but rather his conclusion:

Quote... but if something like this has no benefit from 2+ cores.. then it follows logically that 90% of all the code you're going to be using/writing especially in the time critical parts won't either.

It follows logically that the specific algorithm in question is bottlenecked on something...

...and until we identify what that something is, we cannot attribute it to 90% of all algorithms, or even 1%.

Both Intel and AMD provide tools for identifying such things (VTune and CodeAnalyst).


Someone had asked if the single-process version used 100% (50%) of a core..

..he was barking up the wrong tree, because no time at all is ever spent in the task scheduler's idle loop on the core running it (the thread never sleeps or waits for an event), so by definition it's at 100% usage until it's done.


Title: Re: why is multicore such a load of junk?
Post by: johnsa on November 27, 2008, 08:01:35 AM
Ok.. here is how I see it..

This debate has gone on for quite some time and across numerous threads. I am not saying that what I've said is gospel; it is merely my opinion (albeit based on a fair amount of experimentation).
I would like to approach this in a scientific manner, and am open to debate. I'll be the first to admit if I am wrong (and gladly so, as it would imply a brighter future for development).
So with that said, the only thing I can suggest is that we come up with an agreed-upon test bench to prove or disprove my argument. I have presented a small sample above, which you may or may not agree is a good test.
I believe it to be a real-world example; so far no one who has disagreed with me has presented anything that we can all re-compile/assemble and test with multi-core options to prove that there is performance to be gained from doing so.

I'm up for the challenge. Anyone else want to jump in and help put something together that we can switch between 1/2/4 cores to see the benefit (something that everyone can agree is a "real-world" example)?

The RT ray-tracer would be good, but is quite a large undertaking to prove the case; perhaps something on a smaller scale but more inclusive than my simple vector sample.
Title: Re: why is multicore such a load of junk?
Post by: Rockoon on November 27, 2008, 07:33:14 PM
If you want to approach it scientifically, then fire up VTune or CodeAnalyst and gather some data. Scientists measure.
Title: Re: why is multicore such a load of junk?
Post by: johnsa on November 28, 2008, 07:54:14 AM
I have been measuring.. using VTune and custom timers...

At the end of the day, if an algorithm completes a finite amount of work in a set time, then if that time decreases it will be perceived as increased performance. That's a fact.. and it can be measured in MANY ways: timers, visually, using a watch, using profiling tools (which will obviously give more detail as to where and how the time is used).

Nobody has been able to prove me wrong, apart from little jests and sarcastic comments.

In any event, thanks for your very un-helpful post. :)
Title: Re: why is multicore such a load of junk?
Post by: sinsi on November 28, 2008, 08:07:22 AM
Have a look at stuff like Hyper-V, each server running by itself - that's why new CPUs have virtualization. Instead of 4 servers, each with its own computer,
we have one quad-core running all 4. A single point of failure, sure, but hardware seems to last longer nowadays.
Title: Re: why is multicore such a load of junk?
Post by: Rockoon on November 28, 2008, 09:20:01 AM
Quote from: johnsa on November 28, 2008, 07:54:14 AM
I have been measuring.. using VTune and custom timers...

Timers cannot answer the question posed. VTune should be able to tell you how often cache misses are happening, how often branches are mispredicted, and so forth.

Where is the extra 100% of the time spent?


Title: Re: why is multicore such a load of junk?
Post by: nuckfuts on November 30, 2008, 01:35:43 PM
Quote from: sinsi on November 28, 2008, 08:07:22 AM
Have a look at stuff like Hyper-V, each server running by itself - that's why new CPUs have virtualization. Instead of 4 servers, each with its own computer,
we have one quad-core running all 4. A single point of failure, sure, but hardware seems to last longer nowadays.

A single point of failure, yeah, but if you *need* all four servers, 1 point is better than 4 points.  Not to mention it's virtualized, so moving it to other hardware could happen pretty darn fast, if not automatically too.