Multicore theory proposal

Started by johnsa, May 23, 2008, 09:24:41 AM


Bill Cravener

Quote
I suffer the historical programmer's problem of forgetting to eat and wondering why you start to feel seedy after a couple of days.

Steve, I know what you mean; for many years I suffered the same problem. I was often referred to as the bean-pole, stick-man or pencil-necked geek by my peers. Back then food wasn't important, and at 6'2" and 165 pounds I was pretty skinny. Now that I've slowed down I love to eat and drink, and I'm not particular as to what, as long as it's edible and the drink has alcohol in it.  :lol

You stay healthy and keep posting interesting topics. I like to read while feeding my face.  :bg
My MASM32 Examples.

"Prejudice does not arise from low intelligence it arises from conservative ideals to which people of low intelligence are drawn." ~ Isaidthat

NightWare

I've made some modifications to allow me to compile it with the old MASM32 v8 I use... I've added thread priority to obtain better results, I've changed the algo to a fillmem one, and I've made 4 different threads (because you can't pass parameters as you should with the synchronising technique used)... when I test it, I get:
===========================================
Run a single thread on fixed test procedure
===========================================
1280 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
1295 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
2590 MS Four thread timing



Press ENTER to quit...
(here, it looks similar to before...),


but after that, I've given the corresponding size/start address to each thread (uncomment lines 307, 314, 359, 370, 372, 374), for the same amount of work in all cases... and the results:
===========================================
Run a single thread on fixed test procedure
===========================================
1279 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
1170 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
312 MS Four thread timing



Press ENTER to quit...

there is a problem somewhere... (all of them should give similar results...) Now, as I've said somewhere else, I'm not very familiar with the Win32 API, so maybe I've made a mistake somewhere...

[attachment deleted by admin]

zooba

Your code looks fine as far as I can tell. Are you running on a quad-core processor? The results of the first test look like you have a dual-core, but the results of the second look like a quad-core. I get almost identical results to yours and I know I'm only using a dual-core.

Also, you've almost got the parameter passing perfect. Try including the events in each thread's data structure and passing a pointer to the entire structure in lParam. What better way to pass parameters than by using a pointer?

Cheers,

Zooba :U
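A minimal C sketch of the pattern zooba is describing, with hypothetical names (THREAD_DATA, FillThread) rather than anything taken from the attached code: each thread gets one structure that carries both its parameters and the event it signals, and CreateThread is handed a pointer to that structure.

#include <windows.h>
#include <string.h>

typedef struct {
    HANDLE hDone;    /* event this thread signals when its work is finished */
    void  *pStart;   /* start address of the block this thread works on     */
    SIZE_T cbSize;   /* number of bytes this thread is responsible for      */
} THREAD_DATA;

static DWORD WINAPI FillThread(LPVOID lpParam)
{
    THREAD_DATA *td = (THREAD_DATA *)lpParam;  /* recover the whole structure */
    memset(td->pStart, 0xCC, td->cbSize);      /* stand-in for the fill work  */
    SetEvent(td->hDone);                       /* tell the waiting thread     */
    return 0;
}

/* creation side (error checks omitted):
     td.hDone = CreateEvent(NULL, TRUE, FALSE, NULL);
     CreateThread(NULL, 0, FillThread, &td, 0, NULL);                         */

The same layout works unchanged from MASM32 via invoke, since lpParameter is just a pointer-sized value.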

NightWare

hi,
like you I have a Core 2, so the 2nd results are... weird... it looks like only one thread is used...

zooba

I just had another look and noticed that you're passing the thread ID around (to SetThreadPriority, etc.) when it should be the thread handle (the return value of CreateThread). Not that this should be causing such a huge difference in timings.
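A short, self-contained C fragment of that distinction (Worker is a hypothetical thread procedure, not from the attached code): CreateThread returns the handle, while the DWORD written through the last parameter is only the thread ID, and it is the handle that SetThreadPriority and the wait functions expect.

#include <windows.h>

static DWORD WINAPI Worker(LPVOID p) { (void)p; return 0; }

int main(void)
{
    DWORD  dwThreadId;                          /* ID only, not usable as a handle */
    HANDLE hThread = CreateThread(NULL, 0, Worker, NULL, 0, &dwThreadId);

    SetThreadPriority(hThread, THREAD_PRIORITY_ABOVE_NORMAL); /* wants the handle */
    WaitForSingleObject(hThread, INFINITE);                    /* so do the waits  */
    CloseHandle(hThread);
    return 0;
}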

I changed the fill data and set a breakpoint on the return from WaitForMultipleObjects and the memory was filled correctly. Something strange seems to be going on here...

I have attached a slightly modified version of NightWare's original code, fixing the handle problem I mentioned above and enabling the work-division. (There is also an executable for people who can't be bothered building it :bg ) Getting some more validation of these numbers would be great.

Cheers,

Zooba :U

My results:
===========================================
Run a single thread on fixed test procedure
===========================================
1295 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
1139 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
328 MS Four thread timing

[attachment deleted by admin]

sinsi

Q6600, 2.4GHz, XPSP3

===========================================
Run a single thread on fixed test procedure
===========================================
1265 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
203 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
110 MS Four thread timing

Light travels faster than sound, that's why some people seem bright until you hear them.

lingo

Core2 E8500, 4 GHz, Vista64-SP1:
===========================================
Run a single thread on fixed test procedure
===========================================
405 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
156 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
140 MS Four thread timing



Press ENTER to quit...

NightWare

here's my opinion:
the OS (because of protected mode) loops every x amount of time over all the programs (and allocates more or less time to each task depending on its priority). When you add threads, you just add more tasks (the threads) to that loop, so you just double or quadruple the amount of time allocated to your app (and reduce the time for all the other tasks, including the OS)... nothing more... it seems quite logical... we've probably been misdirected by pseudo speed-test results, SetThreadAffinityMask, etc...  :lol

zooba

You're probably right. I can't think of any other explanation. Windows is designed to execute everything fairly, rather than provide all power to any thread/process that asks for it (though it can be coerced).

Clearly though, dividing this sort of work up amongst multiple threads is where multi-core speed gains come from.

Cheers,

Zooba :U
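A minimal C sketch of the work division being discussed (buffer size, thread count and names are illustrative, not from the attached code): the buffer is cut into equal slices, each thread fills only its own slice, and the main thread waits on all the handles. This is the case where the extra cores actually pay off.

#include <windows.h>
#include <stdlib.h>
#include <string.h>

#define NUM_THREADS 4
#define BUF_SIZE    (64u * 1024 * 1024)

typedef struct { char *start; SIZE_T size; } SLICE;

static DWORD WINAPI FillSlice(LPVOID p)
{
    SLICE *s = (SLICE *)p;
    memset(s->start, 0xCC, s->size);            /* each thread touches only its slice */
    return 0;
}

int main(void)
{
    char  *buf = malloc(BUF_SIZE);
    SLICE  slice[NUM_THREADS];
    HANDLE h[NUM_THREADS];
    SIZE_T chunk = BUF_SIZE / NUM_THREADS;
    int    i;

    for (i = 0; i < NUM_THREADS; i++) {
        slice[i].start = buf + i * chunk;       /* per-thread start address  */
        slice[i].size  = chunk;                 /* equal amount of work each */
        h[i] = CreateThread(NULL, 0, FillSlice, &slice[i], 0, NULL);
    }
    WaitForMultipleObjects(NUM_THREADS, h, TRUE, INFINITE);
    for (i = 0; i < NUM_THREADS; i++) CloseHandle(h[i]);
    free(buf);
    return 0;
}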

johnsa

http://www.informationweek.com/shared/printableArticle.jhtml?articleID=208403616

Interesting article about Intel Ct (extensions to the C++ language) to facilitate multi-core development.

hutch--

Thanks for the link John, it looks like this type of code will, in the near future, improve the entry of higher-level languages into multicore programming. One of the commercial offerings I have seen recently (whose name escapes me) got under the threaded model and does this type of synchronisation at a lower level, which apparently improves the performance somewhat.

I suspect that later OS versions may have this type of option built into them, but there also appear to be some fast-moving changes in the hardware area as well, so some of this type of technology may not last all that long. In current multicore development at the hardware level, multiple identical cores are what is happening at the moment, but I have heard talk of different core types arranged as clusters that have diminished capacity but faster, better-integrated performance on serial code.

The holy grail will be improvements in both serial and parallel performance, which will probably take both major and minor cores to deliver. All parallel programming involves running serial code in parallel, so an improvement in both will see far larger performance gains than either individually.

c0d1f1ed

Quote from: hutch-- on June 16, 2008, 12:17:10 AM
Thanks for the link John, it looks like this type of code will, in the near future, improve the entry of higher-level languages into multicore programming. One of the commercial offerings I have seen recently (whose name escapes me) got under the threaded model and does this type of synchronisation at a lower level, which apparently improves the performance somewhat.

Ct already does task synchronization at a lower level. Read the section Tasks in Ct. RapidMind works similarly.

Quote
...but there also appear to be some fast-moving changes in the hardware area as well, so some of this type of technology may not last all that long.

Please name those changes.

As the number of cores increases we'll become more dependent on this type of technology, not less.

Quote
In current multicore development at the hardware level, multiple identical cores are what is happening at the moment, but I have heard talk of different core types arranged as clusters that have diminished capacity but faster, better-integrated performance on serial code. The holy grail will be improvements in both serial and parallel performance, which will probably take both major and minor cores to deliver. All parallel programming involves running serial code in parallel, so an improvement in both will see far larger performance gains than either individually.

Heterogeneous architectures actually have lower per-core instruction throughput. Look at Cell and Larrabee for instance. Their cores can't do out-of-order execution, don't do register renaming, there's no speculative execution with branch prediction, etc. But because they got rid of this 'hardware bloat' which only offers minor IPC improvements they can spend the extra transistors on more of these simple cores. Just look at the Cell die. Instead of having room for only about three PowerPC cores, it has one complex PowerPC core and eight simple SPE cores. Despite somewhat slower sequential code execution, four SPEs can still deliver much higher combined throughput than one complex core. The complexity shifts to the software though, as you have to maximize thread concurrency.

So don't expect sequential code performance to increase significantly any time soon.

hutch--

 :bg

> So don't expect sequential code performance to increase significantly any time soon.

Who was it who said you will never need more memory than 64k?

> Ct already does task synchronization at a lower level. Read the section Tasks in Ct. RapidMind works similarly.

How much faster does it make a serial task like chained encryption?

0.000000000000000%

The problem is when the task has no inherent parallelism to distribute across multiple cores, and there are a massive number of tasks like this.
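A toy C sketch of that chaining dependency (encrypt_block is a made-up stand-in, not any real cipher): every block is XORed with the previous block's ciphertext before it is encrypted, so block i cannot start until block i-1 has finished, no matter how many cores are available.

#include <stddef.h>
#include <string.h>

#define BLOCK 16

/* toy stand-in for a real block cipher; the details don't matter here */
static void encrypt_block(unsigned char *blk)
{
    int j;
    for (j = 0; j < BLOCK; j++) blk[j] ^= 0x5A;
}

static void cbc_encrypt(unsigned char *data, size_t nblocks, const unsigned char iv[BLOCK])
{
    unsigned char prev[BLOCK];
    size_t i;
    int j;

    memcpy(prev, iv, BLOCK);
    for (i = 0; i < nblocks; i++) {
        for (j = 0; j < BLOCK; j++)
            data[i * BLOCK + j] ^= prev[j];      /* mix in previous ciphertext   */
        encrypt_block(&data[i * BLOCK]);         /* block i needs block i-1 done */
        memcpy(prev, &data[i * BLOCK], BLOCK);   /* carry the chain forward      */
    }
}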

Even in the parallel model you have in mind with current technology, it still runs serial code in parallel; the need for fast serial code will always exist, and sooner or later the hardware will address it again. The current hiatus is due to clock-speed limitations imposed by heat.

In a world where you already have multiple pipelines with out of order execution to improve throughput on a single core, it would indeed be a brave man who predicts the end of linear (serial) improvements.

c0d1f1ed

Quote from: hutch-- on June 16, 2008, 12:50:35 PM
Who was it who said you will never need more memory than 64k?

Nobody did. It's a popular misinterpretation. And you're misinterpreting me as well. I didn't say you don't need higher sequential performance, I said don't expect it any time soon. Nor did I say overall performance won't increase rapidly; you're going to get it in the form of extra cores.

Quote
How much faster does it make a serial task like chained encryption?

0.000000000000000%

The problem is when the task has no inherent parallelism to distribute across multiple cores, and there are a massive number of tasks like this.

This is why we have encryption methods other than CBC. There might be a massive number of algorithms with no inherent parallelism, but there's an even more massive number of algorithms suited to parallelisation. And the number of parallel algorithms is still growing, as more developers get the opportunity to develop on multi-core systems. You also have to stop thinking in terms of performing just one kind of task on one block of data. Instead of encrypting one file, encrypt multiple files divided into multiple sections. The opportunity for parallelism is so substantial that it can even be done on the GPU. And instead of having your application do just encryption, allow other things to run concurrently as well. For instance, an image-editing application can compress and encrypt a number of images in the background while allowing the user to keep working on his image in the foreground, while generating previews and updating animated GUI elements... You have to start thinking outside the box.

What you also seem to keep forgetting is that there is no other solution than to start using new, parallel algorithms. There is no way to make a sequential algorithm run faster other than increasing the clock speed, which is hitting physical limits whether you like it or not. Even if they suddenly have a breakthrough in technology and can bump it up tenfold, you'll still have multiple cores. The only way to take advantage of them is with parallel algorithms, so stop talking about sequential algorithms; they belong in the "dustbin of history".

QuoteIn a world where you already have multiple pipelines with out of order execution to improve throughput on a single core, it would indeed be a brave man who predicts the end of linear (serial) improvements.

There is nothing left to do for single-threaded code at the architectural level.  As Moore's Law allowed more and more transistors, they added pipelining, they added branch prediction, they added superscalar execution, they added out-of-order execution, etc. Now they've simply gotten to the point where you can't churn through instructions from a single thread any faster. If we're missing any technology that is yet to be implemented, specify. So far you haven't named any of your magical technology.

hutch--

The 64k remark was from about 10 years earlier, when an IBM PC had an amazing 64k while earlier PCs had 2, 4 or even 16k. I doubt it gets said much any more.

Algos like encryption, compression, searching and data structures (trees, hash tables) are fundamentally sequential (serial) in nature. There may be alternatives, but they tend to be inferior designs in terms of encryption and compression, as both designs are serially chained algos. Easy parallelism belongs to gaming and multimedia code, and some engineering calculations, as long as they are not sequentially dependent. The dustbin of history is full of many things, but serial algorithms are not one of them. Parallel processing is still done with serial processing on each core.

> What you also seem to keep forgetting is that there is no other solution than to start using new, parallel algorithms.

Who cares, if they cannot do the same task?

Quote
There is no way to make a sequential algorithm run faster other than increasing the clock speed, which is hitting physical limits whether you like it or not.

This is claptrap; instruction throughput is what matters, not clock speed, which was just one way to get more instructions through. Physical limits change, as in fact they have over time; it's a brave man who predicts no speed increase in instruction throughput.

Quote
There is nothing left to do for single-threaded code at the architectural level.  As Moore's Law allowed more and more transistors, they added pipelining, they added branch prediction, they added superscalar execution, they added out-of-order execution, etc. Now they've simply gotten to the point where you can't churn through instructions from a single thread any faster. If we're missing any technology that is yet to be implemented, specify. So far you haven't named any of your magical technology.

This IS magical technology, in that it runs single-thread instructions in parallel through multiple pipelines, and YES, it did get faster because of it.

> If we're missing any technology that is yet to be implemented, specify.

Predicting the future is best done with a crystal ball; most have to be satisfied with continuity, and that has been faster machines over time. Multicore technology is still in its infancy in the PC market; it is useful, but it's not universal in its application. Most multicore technology on current PCs is Win95 multithreading technology applied to multicore hardware.

Fine where it's faster, but lousy where it's not.