Multi-Threading as it applies to Multi-Processing

Started by Glenn9999, December 22, 2010, 07:57:04 AM


Glenn9999

Has anyone studied the topic indicated in the subject?  I've done some work thinking about it and implementing a couple of algorithms, but I'm curious whether anyone here has worked on it, especially in light of the multiple cores on most newer CPUs.  For example, at what point does the ability to use the multiple cores outweigh the code it takes to set up and manage them?

Curious more than anything...

oex

1 thread per core is optimal

We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

dedndave

that seems like kind of a terse answer for a complicated subject   :P

you might use the forum search tool
Hutch was playing with some ideas on this a while back

there are some good reasons for creating alternate threads that aren't necessarily on a per-core basis
for example, a message box with a cancel button on one thread while the code runs on another
this allows a user to cancel an operation that has tied up the CPU

as for trying to get better performance by controlling affinity...
the general consensus is that it's usually best to let the OS handle it   :P
there may be cases where selecting a specific core may help on one machine and hinder on another
of course, it depends a lot on your application

however, there are some interesting possibilities
in a manner of speaking, having alternate cores (or hyper-threaded cores) may be used as an extra set of registers
while this idea may sound appealing, i have yet to see any particularly "cool" implementation of it

multi-threading certainly has a place in code-performance profiling

redskull

Quote from: oex on December 22, 2010, 12:41:33 PM
1 thread per core is optimal

1 running thread per core is optimal.

But beware, as this guideline can be misleading.  A running thread is not waiting for I/O, user input, semaphores, mutexes, disk reads, or anything else; it is just crunching numbers.  Most programs do more waiting than running, so the best thread/core setup is highly application-dependent.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

oex

There is some less terse info here :lol

http://www.masm32.com/board/index.php?topic=13374.0

Always good to understand the most important parts first :bg

Like using adjectives in English sentences :lol

dedndave

Quote
There is some less terse info here

Quote
Like using adjectives in English sentences

less terse = more verbose   :P

Glenn9999

There's some good information here.  Thank you for posting it.  It will be helpful to study, since I need to improve the thread handling and control code I already have.

But maybe I should clarify what I was looking to discuss in making this post.  What I'm talking about is utilizing those multiple cores on a job in order to make it go faster.  Given that, the way to use the multiple cores is to have multiple threads.  Of course, the job in question would have to be an algorithm that lends itself to parallelism.

For example, one of the algorithms I've identified is quicksort.  As a divide-and-conquer algorithm, you can partition out sections initially and, instead of recursing on them, hand them to a number of threads equal to the number of processors, and get a performance increase if you have enough data to make it worthwhile.  I've done this and get a clock time anywhere from n down to (n/p), where p is the number of processors on the computer and n is the single-core clock time.  The variation depends on the sizes of the sections the algorithm produces: (n/p) occurs when the sections come out almost evenly divided.  Eventually I'd like to extend the code to keep track of the number of active threads and restart one on a new section when it terminates before the others are done.  I wouldn't expect the speed to reach (n/p) exactly, but I would expect it to be more consistently close to it.

Now, by "ability to use the multiple cores outweighs the code it takes to set up and manage the cores", I mean there is an overhead in setting up and managing all the threads.  On a single processor, the multi-threaded algorithm is slower than the single-threaded version on the same amount of data.  Likewise, even on a multi-core system the single-threaded algorithm is faster when the amount of data is small.

I was wondering if anyone had any observations along these lines.

(I'll probably pick this thought back up as a project as soon as the one I'm working on is over.)

dedndave

i think if you go to all the trouble to set up threads (which isn't that difficult),
then it should be a simple matter to make some pre-determinations about the size of the data and the strategy to be used   :P
you may not be able to easily optimize it for all platforms, but you should be able to achieve "close enough"

oex

Quote from: dedndave on December 22, 2010, 03:35:54 PM
less terse = more verbose   :P

:lol How embarrassing being taught English by Americanz :red

:lol Now I know how my computer feels

dedndave

shooooot
wifee is a Brit
i take lessons every day - and not just in English
if it weren't for her, i wouldn't know how to act  :eek

zemtex

Quote from: oex on December 22, 2010, 12:41:33 PM
1 thread per core is optimal

...and 2 for hyper threading  :bdg

If you have a six-core CPU, you have 6 physical cores and 12 logical cores.
I have been puzzling with lego bricks all my life. I know how to do this. When Peter, at age 6 is competing with me, I find it extremely neccessary to show him that I can puzzle bricks better than him, because he is so damn talented that all that is called rational has gone haywire.

oex

:lol I was talking about logical cores, I was just being less verbose :P

hutch--

Glen,

It's always useful to start at the other end, core count matching thread count where intensive data is being processed.  The issue that will dictate in part what you do with multiple threads is interval granularity versus thread overhead.  When you are performing tasks that take a reasonable amount of time, distributing that load across multiple cores makes sense, but as the intervals get shorter the thread-handling overhead gets larger, and when you get down to low millisecond counts that same overhead will kill any speed advantage of multiple cores running multiple threads.

Unless you know for sure what the core count will be, you need to make the data processing scalable so that it has some chance of adapting to the core count.  The worst scenario is a vast thread count with a low core count, as you pick up all the thread-handling overhead without the processing power to handle it.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

zemtex

The number of threads can be adjusted automatically in the program/game at runtime to find the optimal use of threads, but one thread per core is a good template to begin with.  What works well for you at compile time is not the same result that will show up when your friend runs it (of course  :lol)

Some programs run a test in the background on a black screen to find this optimal setting, and only then launch the interface.

Glenn9999

Quote from: hutch-- on December 23, 2010, 04:30:15 PM
It's always useful to start at the other end, core count matching thread count where intensive data is being processed.

Yes, that's what I did in my test.  Query the number of processors and that becomes the maximum number of threads my multi-processing code spawns.

Quote
the intervals get shorter the thread handling overhead gets larger and when you get down to low millisecond counts that same thread handling overhead will kill any speed advantage of multiple cores running multiple threads.

Exactly.  This number varies among systems, I'm sure, but you can always pick an amount of data below which single-threaded processing needs to kick in.

On this, I actually got to thinking about the MD5 stuff I've been trying to optimize.  Since my testing routine pulls in a much larger block of data than 64 bytes when reading from the hard disk, it might not hurt to try multi-processing on that.  And it'd be much easier to scale than quicksort.  For the middle of the data (always a multiple of 64 bytes), divide it up evenly, then let threads = processor count handle it (provided pointers and correct data sizes for each piece).  It might not net a huge performance advantage, but it just might, too.