Multicore theory proposal

Started by johnsa, May 23, 2008, 09:24:41 AM


hutch--

These results make sense: the single thread has no extra thread overhead because it does not need any. The two-thread test does twice the work and the four-thread test does four times the work, so allowing for the thread overhead of the latter two tests you are getting close to a two-times speedup. It should also show a four-times speedup on a double dual core.

sinsi

Q6600 2.4 GHz

===========================================
Run a single thread on fixed test procedure
===========================================
5016 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
5000 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
5000 MS Four thread timing


c0d1f1ed

Quote from: MichaelW on June 09, 2008, 11:17:35 PM
How can this be? Without the division each thread will do 4 billion iterations, so if each thread is running on a different core they should all complete in approximately the same time, independent of the number of threads.

It's without the division inside the loop; I obviously still do the division outside the loop, to divide the work up among the threads and get the correct results.
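Something like this, a minimal C++ sketch of the scheme being described (the names, the thread count and the iteration total are illustrative, not the original source):

    #include <windows.h>

    const unsigned __int64 TOTAL = 4000000000;   // total iterations, illustrative

    DWORD WINAPI Worker(LPVOID param)
    {
        unsigned __int64 count = *(unsigned __int64 *)param;
        // volatile keeps the compiler from optimizing the empty loop away
        for (volatile unsigned __int64 i = 0; i < count; i++)
        {
            // per-iteration work goes here; no division inside the loop
        }
        return 0;
    }

    int main()
    {
        const int THREADS = 4;
        unsigned __int64 share = TOTAL / THREADS; // the one division, outside the loop
        HANDLE h[THREADS];
        for (int t = 0; t < THREADS; t++)
            h[t] = CreateThread(NULL, 0, Worker, &share, 0, NULL);
        WaitForMultipleObjects(THREADS, h, TRUE, INFINITE);
        return 0;
    }

With the work split this way, the wall-clock time should drop roughly in proportion to the number of cores.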

Quote
And why exactly is it that you did not post an EXE and/or made it difficult for us to create our own EXE from your source? And while I'm asking questions, why not a source in the preferred language of this forum?

I didn't post an executable because then I could have been accused of cheating or whatever. Plus, they asked for source, which is exactly what I gave them. And I wrote it in C++ because that was faster to write and it's trivial to understand and modify. Last but not least, I've proven the point, which is the only thing relevant right now.

hutch--

hmmmm,

> Last but not least I've proven the point which is the only thing relevant right now.

Which one? The bloat, the abstraction, the magic libraries, the "it can't be done in assembler"? There have been many points made but few successfully. Your example worked after it was fixed and then rewritten, which was a change from the gabfest and waffle, but let's face it, it's 1995 Windows thread technology with tail-end system wait-based synchronisation. So much for ring3 co-operative multitasking; it's normal Windows ring0 thread synchronisation.

I think your code was worthwhile and it has been a useful contribution, but it has extremely limited application to the vast majority of code written on a day-to-day basis, as unsynchronised concurrent threads are very poorly suited to general-purpose programming.

NightWare

Since the single thread isn't being shared across both cores here (otherwise the time would be divided by 2), maybe filling a small memory area would show when the thread starts to be shared (16/32 KB, 2x or 4x the L1 cache, should be fine); then we will see if the cache is the factor...

johnsa

I agree. We've got a base here that proves it is possible to set up multiple threads assigned to cores (perhaps we should set the thread affinity mapping in the example, as sketched below) and that each thread is capable of running code in parallel, each accessing its own core-local (L1-cached) variable.
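For the affinity part, a minimal Win32 sketch in C++, assuming numCores is known and ThreadProc is a placeholder worker:

    #include <windows.h>

    // Pin each worker thread to its own core so the scheduler
    // cannot migrate it between cores mid-test.
    for (DWORD core = 0; core < numCores; core++)
    {
        HANDLE h = CreateThread(NULL, 0, ThreadProc,
                                (LPVOID)(DWORD_PTR)core, 0, NULL);
        SetThreadAffinityMask(h, (DWORD_PTR)1 << core);  // bit n allows core n
    }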

So now, to up the ante, we should create a single large data structure in memory (too large for the L1 caches) and have each thread perform a function on that structure, making sure that the calculation is constant-time and not dependent on the position within the data, etc.
Each thread could take 1/2 or 1/4 of the total data (chunks or interleaved? both layouts are sketched below).

Then we compare that against the single thread, to see how much the shared-memory/caching implications affect the result.
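For reference, the two obvious layouts for the split, sketched in C++ (N, T, data and Process are placeholders, not agreed code):

    // Chunked: thread t of T takes one contiguous slice,
    // so each core streams through its own region of memory.
    for (size_t i = t * (N / T); i < (t + 1) * (N / T); i++)
        Process(data[i]);

    // Interleaved: thread t takes every T-th element. Adjacent
    // threads then touch adjacent elements, so the same cache
    // line ping-pongs between cores if the threads also write.
    for (size_t i = t; i < N; i += T)
        Process(data[i]);

If the caching implications are what we want to measure, the difference between these two layouts should itself be telling.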

johnsa

Refer to my SSE Weirdness thread... we could make the data structure a huge list of vectors and run through and normalize them... then at the same time we can test on different processors what might be causing the weird movaps/movdqa behaviour I mentioned in that thread. Perhaps it's just a byproduct of my crufty Pentium M :)
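Such a normalization pass might look like this in C++ with SSE intrinsics (a sketch only, assuming 16-byte-aligned x,y,z,w vectors and plain SSE1; _mm_rsqrt_ps is the fast approximate reciprocal square root):

    #include <xmmintrin.h>

    // Normalize 'count' four-float (x,y,z,w) vectors in place.
    void Normalize(float *v, int count)
    {
        for (int i = 0; i < count; i++, v += 4)
        {
            __m128 a  = _mm_load_ps(v);       // movaps: aligned load
            __m128 sq = _mm_mul_ps(a, a);     // x*x, y*y, z*z, w*w
            // horizontal add via shuffles (no SSE3 assumed)
            __m128 t   = _mm_add_ps(sq,
                _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1)));
            __m128 dot = _mm_add_ps(t,
                _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
            __m128 inv = _mm_rsqrt_ps(dot);   // ~1/sqrt(len^2) in all lanes
            _mm_store_ps(v, _mm_mul_ps(a, inv));  // movaps: aligned store
        }
    }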

c0d1f1ed

Quote from: hutch-- on June 10, 2008, 03:33:13 PM
Which one? The bloat, the abstraction, the magic libraries, the "it can't be done in assembler"? There have been many points made but few successfully. Your example worked after it was fixed and then rewritten, which was a change from the gabfest and waffle, but let's face it, it's 1995 Windows thread technology with tail-end system wait-based synchronisation.

It worked flawlessly the way it was. The only one having trouble building it was you, and ironically it was me who pointed out how to get it to build your way. The division within the loop didn't change a damn thing about the validity of the test.

Quote
So much for ring3 co-operative multitasking; it's normal Windows ring0 thread synchronisation.

Duh! I wasn't going to write an example that was ten times longer if I could prove the point of multi-core throughput with a trivial example.

Now it's your turn to prove things. Write a reusable framework that can perform different tasks of varying execution time and scales well with the number of cores, without using lock-free synchronization and completely in assembly (the shape of what I mean is sketched below). Until you can show me that, I'll have every reason to assume it can't be done...
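For concreteness, the skeleton of such a framework with plain OS synchronisation might look like this in C++ (a sketch under stated assumptions only: Task, MAX_TASKS and the fixed-size ring are illustrative, and the challenge as posed asks for the equivalent in assembly):

    #include <windows.h>

    #define MAX_TASKS 256

    typedef void (*Task)(void *);
    struct Item { Task fn; void *arg; };

    static Item             queue[MAX_TASKS];   // fixed-size ring buffer
    static int              head = 0, tail = 0;
    static CRITICAL_SECTION lock;
    static HANDLE           pending;            // semaphore counting queued tasks

    static DWORD WINAPI Worker(LPVOID)
    {
        for (;;)
        {
            WaitForSingleObject(pending, INFINITE); // sleep until work arrives
            EnterCriticalSection(&lock);
            Item it = queue[head++ % MAX_TASKS];
            LeaveCriticalSection(&lock);
            it.fn(it.arg);                          // run the task outside the lock
        }
    }

    void Submit(Task fn, void *arg)
    {
        EnterCriticalSection(&lock);
        queue[tail % MAX_TASKS].fn  = fn;
        queue[tail % MAX_TASKS].arg = arg;
        tail++;
        LeaveCriticalSection(&lock);
        ReleaseSemaphore(pending, 1, NULL);         // wake one worker
    }

    void InitPool(int workers)
    {
        InitializeCriticalSection(&lock);
        pending = CreateSemaphore(NULL, 0, MAX_TASKS, NULL);
        for (int i = 0; i < workers; i++)
            CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    }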

Oh, and what about your Reverse Hyper-Threading claims? Given up on that already as well? And here are some more facts for you to consider: InfiniBand, used to interconnect Opteron-based supercomputer nodes, has an end-to-end latency of 1 millisecond and up. So much for magical hardware offering fast synchronization.

hutch--

c0d1f1ed,

You're trying to pull my leg now; the code you posted ran on my box in 44 seconds where the test on a single core ran in 4.5 seconds. You had an error in your "for" loop that made it run ten times slower. Your published numbers were meaningless, and this is exactly the problem of gabfest versus writing some code.

Then your refusal to supply the build specs simply did not make sense; it appeared that you did not want anyone to build and test the code.

Now come back to the dogma you were trying to inflict about the high-level advantage, magic helper libraries, "this is too difficult to do in assembler" and other related waffle, and see why it has not been taken seriously. The example, when rewritten in MASM, was less than a tenth of the size and it did what it claimed to be able to do, and the funny part is I don't even have a multi-core processor to try it out on.

All you managed to prove, apart from delivering a working example, is that you don't need high-level code or magic libraries, and that the task was trivial in MASM.

Maybe you should stick to the code and spare us all the dogma; at least it was not much work to rewrite it so it worked properly in MASM.

With quads common, 6-core versions in the pipeline and much more powerful stuff in the near future, I will happily use my oldest PIV until the wheels fall off it, as the longer it lasts, the faster and cheaper the multicore stuff will get, and it's not like 1995 Win95 technology in thread manipulation is any big deal to write.

RE: AMD quad latency problems: they are still about a year off delivering competitive performance, although they do have some legs in the FP area.

Reverse Hyper-Threading is not one of my expressions; the closest I have come to "Reverse Hyper-Threading" is turning Hyper-Threading OFF in the BIOS of my 3 PIVs, as it interfered with algorithm timing and nothing went faster with it. Multiple-pipeline out-of-order instruction scheduling has been with us since the early PIV days.

c0d1f1ed

Quote from: hutch-- on June 11, 2008, 08:14:00 AM
You're trying to pull my leg now; the code you posted ran on my box in 44 seconds where the test on a single core ran in 4.5 seconds. You had an error in your "for" loop that made it run ten times slower. Your published numbers were meaningless, and this is exactly the problem of gabfest versus writing some code.

There was no error in the for loop. It simply included the division. The results perfectly proved the 4x speedup and you were free to throw in or leave out any instruction from the loop to get the same result. The total run time was entirely irrelevant and you should get yourself a quad-core to see it divided by four.

Quote
Then your refusal to supply the build specs simply did not make sense; it appeared that you did not want anyone to build and test the code.

That doesn't make any sense. First I provide a trivial piece of code to prove a point and then I don't want you to build it? I even told you early on to use Visual C++. Don't blame me for your incompetence at building these few lines of code.

Quote
Now come back to the dogma you were trying to inflict about the high-level advantage, magic helper libraries, "this is too difficult to do in assembler" and other related waffle, and see why it has not been taken seriously. The example, when rewritten in MASM, was less than a tenth of the size and it did what it claimed to be able to do, and the funny part is I don't even have a multi-core processor to try it out on.

All you managed to prove, apart from delivering a working example, is that you don't need high-level code or magic libraries, and that the task was trivial in MASM.

The need for abstraction is in no way what I claimed to prove with that example. You do need abstraction though to do more complicated things with different tasks that have varying execution time. The only way to prove me wrong is to code something yourself. I have yet to see something complex that scales convincingly to quad-core, written entirely in assembly.

And before you start trying to tell me that anything is possible in assembly, let me add that it should be coded in a timely fashion. FYI, I have worked on industry-quality software that scales up to quad-core, released months ago. Good luck trying to catch up with that.

Quote
With quads common, 6-core versions in the pipeline and much more powerful stuff in the near future, I will happily use my oldest PIV until the wheels fall off it, as the longer it lasts, the faster and cheaper the multicore stuff will get, and it's not like 1995 Win95 technology in thread manipulation is any big deal to write.

Wait as long as you like, but O.S.-level thread synchronization isn't going to give you good scaling for anything except the most trivial code, like that which I posted. You'll need every trick in the book (The Art of Multiprocessor Programming will do), or a framework or language that provides the same functionality, to get good speedups.
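A taste of what those tricks look like: a minimal C++ sketch that claims work with a single Win32 atomic instead of taking an OS lock (ProcessChunk and totalChunks are placeholders):

    #include <windows.h>

    void ProcessChunk(LONG i);            // hypothetical per-chunk work function

    static volatile LONG nextChunk = 0;   // shared claim counter

    DWORD WINAPI Worker(LPVOID param)
    {
        LONG totalChunks = (LONG)(LONG_PTR)param;
        for (;;)
        {
            // atomically claim the next chunk index; no lock, no kernel call
            LONG i = InterlockedIncrement(&nextChunk) - 1;
            if (i >= totalChunks)
                break;                    // all work claimed
            ProcessChunk(i);
        }
        return 0;
    }

The point being that idle threads never block in the kernel waiting for each other; the hardware arbitrates a single locked instruction.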

It's pretty ironic that you're trying to teach me things about multi-core programming while you don't even own one yourself.

Quote
RE: AMD quad latency problems: they are still about a year off delivering competitive performance, although they do have some legs in the FP area.

Who was talking about quad latency? I was talking about InfiniBand, used by Cray to interconnect nodes. Even if you can point me to an interconnect technology that is an order of magnitude faster, that's nowhere near what would be needed for the magical speedups supposedly provided by hardware solutions. The technology for maximizing concurrency is entirely in the software, so there's no reason to wait for mythical hardware advancements; they don't exist. You can start multi-core development today. If performance is the reason you code in assembly, don't leave multi-core aside because you'll easily get beaten by people who do master multi-core development, using high-level tools and languages where necessary.

Quote
Reverse Hyper-Threading is not one of my expressions; the closest I have come to "Reverse Hyper-Threading" is turning Hyper-Threading OFF in the BIOS of my 3 PIVs, as it interfered with algorithm timing and nothing went faster with it. Multiple-pipeline out-of-order instruction scheduling has been with us since the early PIV days.

You referred to a multi-core being able to execute instructions from one thread on multiple cores simultaneously. That's called Reverse Hyper-Threading. But no matter what you want to call it, you still haven't given me any proof of its existence or even its future feasibility, so either provide it or admit you were dead wrong. Superscalar execution has existed since the first Pentium (for x86 at least), yet we still only have four execution ports. There are two reasons for this: 1) it's technically infeasible to have many more execution ports (let alone double their number every silicon node) due to exponentially growing dependencies; 2) there are hardly ever more nearby independent instructions in straight-line code. You have to seek concurrency higher up.

Anyway, I'm going to stop wasting my time in this thread, unless you can actually prove something tangible to me.

hutch--

 :bg

> There was no error in the for loop. It simply included the division.

There was an error in your loop: it contained a division that should not have been there. I dumped your test piece with dumpbin, identified the problem in assembler, then rewrote the proc without the error. Your reported results were not accurate; each test ran 10 times longer than the test piece, and this is normally why you post an example that can be built, so other people can test it.

Posting an example then giving the advice that all you needed to do was download a 938 meg ISO from Microsoft, to install a pile of crap that most people would not want on their machine, says you tried to make it difficult to build. Then you held up providing the build data that should have accompanied your example in the first place, when in fact it built with VCtoolkit2003 and the version of CL and LINK from vc2005.

The problem as I see it is that you were willing to wade into a whole field of people who have been programming for many years, trying to tell them that their choice to program in a particular language was mistaken and that they would be left behind or unable to code multicore applications without high-level junk, additional magic libraries and the like. All you have done there is prove you were wrong.

If you had posted that code about 6 to 8 pages earlier, sparing us all the dogma and waffle about high-level languages and the like, much of the nonsense could have been avoided and a lot of time could have been saved, as many people are in fact interested in this style of programming.

Quote
And before you start trying to tell me that anything is possible in assembly, let me add that it should be coded in a timely fashion. FYI, I have worked on industry-quality software that scales up to quad-core, released months ago. Good luck trying to catch up with that.

There are many things in the dustbin of history that I never felt compelled to catch up on, but just to confuse you further, disassemble ntoskrnl.exe and hal.dll and have a look at the MASM code in them. You tend to find it by the LEAVE mnemonic. This code in fact scales perfectly to quad-core hardware.

Quote
You can start multi-core development today. If performance is the reason you code in assembly, don't leave multi-core aside because you'll easily get beaten by people who do master multi-core development, using high-level tools and languages where necessary.

Here you in fact mean multithreading; welcome to Windows 95.

Quote
You referred to a multi-core being able to execute instructions from one thread on multiple cores simultaneously. That's called Reverse Hyper-Threading.

No, in fact it's called multiprocessing; it is a mistake to assume that the history of computer hardware is contained in x86 desktops. See IBM, SGI and the other large computer manufacturers. Is a 1024-Itanium SGI superbox capable of only 1024 concurrent threads, or can it deliver the computing power in its spec sheets?

The current PC market has shifted to multicore due to clock speed limitations, not any desperate need for parallelism, but the market will keep demanding improvements, which will require higher performance per thread than is current. That will produce either faster cores or faster synchronisation of multiple cores, and probably both over time.

Bill Cravener

Quote from: c0d1f1ed

It's pretty ironic that you're trying to teach me things about multi-core programming while you don't even own one yourself.

That reminds me, so there I was pigging out on a big roast beef sandwich with all the fixens and grabbing handfuls of Snyder's original Bar-B-Q chips between bites while enjoyably reading this message thread when I suddenly felt a sharp pain in my chest. No worry, it was just gas, freaking had me scared though.

Got me to thinking, imagine my family finding me here dead at my computer seat with half a big roast beef sandwich and a Snyder's original Bar-B-Q chips bag almost empty. I mean after all I'm 57 years old and I like to eat and drink.

Anyway they, the family, then look over at my computer screen and there's these folks talking about duo core processor thingies and single-multi-threaded hicky-ma-bobs doing things in itty-bitty seconds.

I don't know, just seemed funny!  :lol

GregL

Cycle Saddles,

:lol  It made me laugh out loud. :lol 


Bill Cravener

Greg,

Laughing's a good thing, buddy. Hutch is my friend, and in my book he's always right. Just thought this was a good spot for a laugh. I have a twisted way of seeing things. :bg

hutch--

Bill,

Apart from all of the extremely serious considerations in this thread, how do you organise the important things of life, like a big roast beef sandwich with all the fixens and grabbing handfuls of Snyder's original Bar-B-Q chips? I suffer the historical programmer's problem of forgetting to eat and wondering why I start to feel seedy after a couple of days.

Since I am the world's lousiest cook, as well as having a few dietary limitations imposed by old age and bad habits, the current indulgence is to boil a dozen eggs at a time until they are like bullets, put them in the refrigerator, and next morning shell 3 of them, dice them with an egg slicer, and add a generous sprinkle of salt and a light dusting of a very mild curry powder.

It tastes like the curried egg sandwiches that old ladies used to make for church fund raisers back in the 50s minus the sliced bread.

Many of the things addressed in this thread have a history that is yet to be written. Quad cores are becoming common, there are 6-core versions in the works, and there are a number of interesting techniques that allow an ever-increasing number of transistors while reducing substrate leakage: narrower tracks, a new wafer doping technique and some metal tracks for lower resistance.

AMD are currently playing catch-up to Intel on the 4 cores but apparently have some good designs in the pipeline and plenty of headroom to wind it up higher, so if my old PIV lasts another year or so there is a good chance that there will be dual quad cores on chip, with the price coming down at the same time.