Multicore theory proposal

Started by johnsa, May 23, 2008, 09:24:41 AM


codewarp

Quote from: hutch-- on June 07, 2008, 02:00:19 PM
Thanks for the link, it's a good article. I come down on the side of knowing what you are doing, not the magic library approach.

The "magic library approach"?  We call that "encapsulation of complexity" where I come from, you ought to try it sometime.  I come down on the side of knowing what you are doing, not the magic hardware approach.


Quote from: hutch-- on June 07, 2008, 02:00:19 PM
I wonder when we will see the first terahertz processor ?  :bg

And the day after they do that, someone will gang four or 16 of them together, then run them using the same multi-processor software technology you are trying to tell everybody here doesn't work.  Come on, Hutch!  You know that megahertz is a different dimension from multiple CPUs; terahertz is patently irrelevant to this discussion.  That you would actually scoff (in several of your posts) at getting 1.9 CPUs' worth of effective performance out of software, on a web site dedicated to cutting out clock cycles, is truly astounding to me, and lowers the credibility of your entire Masm Forum, Mr. Moderator.

The brave new world of ubiquitous multi-processors is upon us, and ripe for the taking.  I am actually doing it, and I wanted to share my experiences here.  But you and your bulldogs here just can't have anyone contradicting this "can't get there from here" mentality.  Your comments are discouraging others who are trying to learn and explore this material, and insulting to those of us actually doing it with great success.  I am sorry for you; this is a topic to be showcased, not banned and scoffed at.  You guys have severely missed the boat on this one.

askm

Since I am here logged in

asking

What if Intel or AMD, at the time of the 16-bit or even 8-bit CPU heyday,

multi-cored them then ?

Where would we be in the multicore software performance understanding today ?


c0d1f1ed

sinsi,

That's an interesting article, thanks for the link. It's not very shocking though. It's blatantly obvious that programmers would want their free lunch to continue. But we don't have a choice. This isn't chip manufacturers being lazy; they're fighting physical laws as best they can - it's the software developers who have to stop being lazy!

Veterans like Knuth obviously try to be conservative. His glory days are fading fast while programming is about to change in revolutionary ways he won't be part of. It's also ironic that the article mentions he wrote "The Art of Computer Programming" and won the Turing prize, while the authors of "The Art of Multiprocessor Programming" won the Dijkstra and Gödel prizes. There are many 'famous names' sharing their opinion, but I have yet to find one who says multi-core is a bad idea and who has an actual realistic hardware solution.

One of the problems here is that most programmers, even seasoned ones, have become clueless about hardware behavior. Those who have a better understanding of it realize that multi-core is the way forward and hardware manufacturers are doing an amazing job. I had a course based on Digital Integrated Circuits and one about Advanced Computer Architecture (with a focus on CPU design), and it really broadened my perspective.

Lastly, for every pessimistic article about the challenges of multi-core programming, I can give you ten articles that regard it as a major opportunity...

codewarp

Quote from: sinsi on June 07, 2008, 09:50:07 AM

Seems to me that one side here is fixed on hardware and the other on software. I would like an OS that will run on
two of my cores and leave the other two for the one or two programs that I actually use 'simultaneously/multitaskingly'

Famous names: Multi-threaded development joins Gates as yesterday's man

edit: interesting that 8 out of 20 new topics in the lab are about multi core/cpu...

Software is what this web site purports to be about.  Software is what we can actually do something about.  All this airy-fairy, triple-terahertz fantasy nonsense about wishful hardware is hardly fit for a blog on FOXNews--the hardware at any point is what it is--who cares.  Software is what we all come here for, as far as I can tell.  I like the hardware the way it is--I can get the extreme performance, and the serious development work, while much of my competition is sitting in the corner, wishing for magical hardware to come save them--it won't.  It is truly hilarious--this is a great time to be a programmer.

The two for you, two for the OS method of CPU allocation would not be very effective.  Just pull up a Windows task monitor, like taskinfo2000, and get familiar with how many things are running simultaneously in your Windows system--30 processes, 320 threads.  Reservation of shared resources like that would effectively turn a Quad back into a Dual.

You can use two cores now, all you want, just keep two threads busy doing what you need them to do, and you got 'em all to yourself, the system even shrinks down to 2 cores to let you do that.  What is the problem you are trying to solve with this proposal?  Today's system is like a genie in a bottle, your wish is its command.  What exactly is it that is missing for you, to want fixed CPU assignments, with such dire consequences?

codewarp

Quote from: askm on June 07, 2008, 07:38:04 PM
Since I am here logged in

asking

What if Intel or AMD, at the time of the 16-bit or even 8-bit CPU heyday,

multi-cored them then ?

Where would we be in the multicore software performance understanding today ?

And where would we be today, if the Egyptians had invented multi-layer integrated circuits 5000 years ago?  Things happen when they are good and ready to happen, and not until then.

GregL

This has sparked my interest in what is going on here. I never paid a whole lot of attention to it before.

There is some relevant information here Measuring Multiprocessor System Activity (best viewed in IE), it's a little dated but I haven't found anything newer.

Under 'Thread Partitioning':
Quote: Windows 2000 uses soft processor affinity, determining automatically which processor should service threads of a process. The soft affinity for a thread is the last processor on which the thread was run or the ideal processor of the thread. The Windows 2000 soft affinity thread scheduling algorithm enhances performance by improving the locality of reference. However, if the ideal or previous processor is busy, soft affinity allows the thread to run on other processors, allowing all processors to be used to capacity.

hutch--

There are a number of things in this thread that have made me laugh: you cannot learn from history, WE already know it all; you cannot learn from existing massively parallel hardware like SGI and similar, we already know it all; you cannot learn from direct testing with graphs provided, it does not fit our theory and we know it all, etc ....

With clock speeds, would the designers of the 4 megahertz 8088 have seen x86 architecture running at just under 4 gigahertz? I seriously doubt it. Would they have seen multiple-pipeline hardware from early Pentiums upwards? Not really, as the first pipelined x86 chip was the 486.

RE the free lunch and getting left behind etc .... I did not buy an Itanium and got left behind, and I did not buy a RISC box and got left behind, and I never wrote software for the early multiprocessor Mac and got left behind, and with interim hardware of the type that is available in x86 I hope I get left behind, as I have almost no respect for hybrid interim fudges that end up in the dustbin of history.

Even though the starting price is a bit high for a desktop, SGI and other companies LONG AGO built hardware-controlled multiprocessor computers with throughputs that are out of the PC world. Forget magic libraries running in ring3; it is far too slow, as any of the driver guys will tell you. To do anything even vaguely fast you need exclusive ring0 core control, and that occurs only at an OS level of operation. Just above the system idle loop are your thread synchronisation methods, spinlocks, thread yield and so on. Trying to do this with ring3 co-operative multitasking is like trying to win the Indy car race with a Model T Ford; it's ancient redundant junk from the early Win 3.0 days.

Codewarp,

Quote
And the day after they do that, someone will gang four or 16 of them together, then run them using the same multi-processor software technology you are trying to tell everybody here doesn't work.  Come on, Hutch!  You know that megahertz is a different dimension from multiple CPUs; terahertz is patently irrelevant to this discussion.  That you would actually scoff (in several of your posts) at getting 1.9 CPUs' worth of effective performance out of software, on a web site dedicated to cutting out clock cycles, is truly astounding to me, and lowers the credibility of your entire Masm Forum, Mr. Moderator.

Tread carefully here, while everyone gives me cheek which I in turn sit up at night wiping away the tear stains while wringing my hands in despair, try it out on any of the other members and this thread is dead. We have already had one that had to be closed due to the smartarse wisecracks offending members, keep it objective or see it disappear.

While John posted code and ideas, all I have heard is dogma about high level code, abstraction, magic libraries and old hat technology. This is finally the "Laboratory" for code, not anecdotal waffle and dogma.

Here is a test piece for you to deliver code for. Show us how you can run the identical code on 2 or 4 cores faster than this runs on a single core, noting that the code is memory bound and core intensive.


    LOCAL var   :DWORD

    mov var, 12345678

    push esi
    mov esi, 4000000000

  @@:
    mov eax, var
    mov ecx, var
    mov edx, var
    sub esi, 1
    jnz @B

    pop esi
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

xmetal

Quote from: hutch-- on June 08, 2008, 02:22:17 AM
Here is a test piece for you to deliver code for. Show us how you can run the identical code on 2 or 4 cores faster than this runs on a single core, noting that the code is memory bound and core intensive.


    LOCAL var   :DWORD

    mov var, 12345678

    push esi
    mov esi, 4000000000

  @@:
    mov eax, var
    mov ecx, var
    mov edx, var
    sub esi, 1
    jnz @B

    pop esi


Runs about 4000000000 times faster and does not need more than a single core...


    LOCAL var   :DWORD

    mov var, 12345678

    mov eax, var
    mov ecx, var
    mov edx, var

hutch--

xmetal,

1 pass does not test anything. Look at the words "run the identical code on 2 or 4 cores" to see what the test is about.

c0d1f1ed

hutch,

Ring0 thread control is not necessary for fast multi-core processing. Instead of scheduling threads at the O.S. level, you can schedule tasks at the application level.

Think of ray-tracing. A straightforward approach would be to have one thread per pixel, or even per ray. That would require the O.S. to constantly schedule different threads, suspend some, and resume some. It will perform horribly. Indeed only ring0 access might make that a little faster. But you can get far better results by only running as many threads as there are cores, and having each of them continuously request a pixel from a shared queue and process it. And this approach works fine from ring3, the O.S. isn't even actively involved in it. 100% multi-core utilization and minimal overhead right in your lap. What more could you ask for?

Of course, for applications where there isn't a clear pile of independent tasks to execute things get a little more complicated. But that's exactly what this thread is about. It's not going to get solved by hardware, as there is no way to identify independent work more than a couple instructions away. It has to happen at the software level, using frameworks, tools and languages that help extract the parallel tasks.

By the way, your code is trivial to run 4 times faster on a quad-core. It's not memory limited or anything. The stack variable will be in each of the core's L1 caches.

hutch--

It appears you don't understand the difference in speed between register access and memory access; the memory read speed is the limiting factor in the simple test piece. The loop is short enough to saturate the core's capacity to run the instructions in the loop, thus it is both memory bound and core intensive.

Multi-threading as per your example is uncontentious as parallel asynchronous operation, but you are synchronising the threads with the shared queue for pixel data. Saturate the capacity of all 4 cores and you will not see any speed gain. You may get some speed gain if your threads are nowhere near instruction saturation.

The discussion has not been about asynchronous parallel operation; it's been about the "free lunch" notion that you have taken out of the context of slow bloated high level languages running out of puff as processor speeds fail to outpace the software slowdown of lousy code. Until you can display improved performance in highly processor intensive code, you're just dabbling with internet-download parallelism technology.

There are enough people here who own late model Core 2 Duos to see the results of single-thread testing, and they are faster than a matching clock speed PIV, and while bus speed and memory read/write speed will have a little to do with this, the load sharing that occurs across both cores is the real difference.

This type of result is showing up in objective testing, whereas I have yet to see the software multitasking you have mentioned actually perform. This is why I suggested a simple test piece run across multiple threads on multiple cores to show us how it's done. I doubt you can deliver.

zooba

Okay, I haven't been following this thread the entire time but I dropped in to have a read and decided I should speak up.

The screenshots posted by Greg show only that Windows uses soft affinities (as mentioned earlier). The thread keeps switching between cores - it's an OS function and can be 'disabled' by using SetThreadAffinityMask() or SetProcessAffinityMask(). My own benchmarking counterintuitively showed a slight decrease in speed when memory bound processing (array arithmetic using SSE) was restricted to one core. It also showed (and others have found this as well, though I'm coming up blank for sources right now, so feel free to take this with a grain of salt) that the best number of threads is 1.5x the number of cores (without forcing the affinity).

Personally I am quite happy with my code not automatically being shared between processors. Having a second core to keep running while the other is hung adds a huge amount of stability. Previously a completely hung process would require a restart - now I can actually terminate it. Also, I am quite confident in my ability (after a lot of failures, mind you) to create multithreaded code where it is useful to take advantage of parallel processing and to get it right.

The apparent better performance of single-threaded code on many-core CPUs is most likely due to the operating system being able to use a separate core. A genuinely processor intensive process can actually get 100% of a core without the OS interrupting it (or locking up the system).

Cheers,

Zooba :U

johnsa

c0d1f1ed, I agree with most of what you are saying: multi-core/processor is the future, and there is no escaping this fact. It is highly unlikely that anything significant is going to happen architecturally or OS-wise, so the only option we as programmers are left with is to work out how to make best use of the tools we are given.

That being said, perhaps we should take Hutch's example code, and possibly a few other similar ones which try to create load in different areas (processing, memory, etc.), and try to multi-thread/multi-core enable them and see what comes of it. Perhaps we can find some hybrid solutions to common programming tasks (after all, there should be some good brains in this forum). Perhaps we could even start working towards putting together an additional library for MASM v11/v12 that provides some multi-core "helper" routines.

Perhaps something similar to Intel's TBB for C++..

c0d1f1ed

hutch,

Memory read speed of a shared variable that is not being written to is no different from that of a single thread. Each core simply has a copy of the data in their cache and because it's in the Shared state of the MESI protocol you don't get any performance penalty. So your code runs four times faster on a quad-core.

Regarding ray-tracing, getting a new pixel is as simple as atomically incrementing a counter. That only takes a dozen cycles or so and can be done in ring3. It's not just lock-free, it's wait-free synchronization, giving the best possible scaling behavior. You might want to read chapter 1.1 of The Art of Multiprocessor Programming (just click on SEARCH INSIDE! and go to the Excerpt) for an introduction, too.

Proof of multi-core speedups for "highly processor intensive code" is everywhere. Just read AnandTech's Nehalem preview again.

The one and only reason Core 2 Duo is faster for a single thread than a Pentium 4 at the same clock speed is its higher IPC per core. Core 2 can retire 4 instructions per clock, Pentium 4 only 3. Add to this shorter pipelines with lower latencies and you have a clear winner. Reverse Hyper-Threading is a myth, it's physically impossible.

As for "delivering", check my profile.

hutch--

The reason why I suggested objective testing is that I have only heard opinion on what the results will be, not what they are. Any processor intensive piece of code will do, as long as it both processes instructions and performs read/write of memory that is in cache, as using an extended memory range introduces the extra variable of page table load times if the addresses are far enough apart.

Now, the way to prove the theory you have been putting forward is to show an identical piece of code running on one core, complete with timing, then run it on dual cores to show the speed improvement.

Now, in relation to your description of your shared memory queue: the problem, even if it works on processor and memory intensive code, is that it stops waiting threads, and thus the cores they are running on, stone dead, and while this may be typical gamer brain, it's hardly useful on a modern computer. This is exactly why this type of work needs to be done at the OS level, just above the idle loop, where the wait time can be passed off to another running process, or even, for that matter, another thread on the same core controlled by the running application.