
Multi-Threading

Started by Twister, July 30, 2010, 09:48:07 PM


ecube

Quote from: redskull on July 31, 2010, 07:22:38 PM
"One running thread per core" is the optimum, most efficient way; any more, and you gain no more speed, and lose a little bit of task switching overhead for each additional one.  Generally, the logical benefits and "perceived responsiveness" of splitting up your program outweighs the fractions of time you lose, but 1000 is far too many (assuming they all run all the time).  There's no way to escape the fact that if you have 'x' operations to do, and 'y' CPU's to do them on, you'll never get them done any faster than x/y.

-r

how do you gain no more speed? I've already given a simple example of how multiple threads can drastically increase throughput. I feel like you guys are giving him a lot of misinformation, so I'm going to have to write a proof of concept to prove my point. Logically or otherwise, it just doesn't make sense to be fixated on this one-thread-per-core nonsense; experience-wise, I know it isn't accurate.

Geryon

I understand your doubt, E^3. When you write your demonstration, please do not use any I/O, because when your program makes an I/O operation, other mechanisms come into play.
For instance, my download speed is 50 KB/sec for a file with one thread. When I download exactly the same file with two threads, my speed is 100 KB/sec.
Now, can I say multi-threading increased my computational capacity? No. There is no connection between thread count and the CPU in this example.
The point, however, is that there is a crucial relation between TCP/IP stack behaviour and threads.
"Some people have got a mental horizon of radius zero and call it their point of view." --D.Hilbert

redskull

Notice the emphasis I put on *running*.  If you have a single thread that adds up 1 million numbers, and each addition takes one second, then it will take a million seconds.  If you split it into two threads, have each do half the numbers and run both threads on the same CPU, then one thread takes a half million seconds, there's a small fraction of time to switch to the other, and then the other takes another half million seconds.

Using multiple threads to decrease the time spent doing nothing is not the same thing as speeding up the total work that has to get done.  Like you said, "it depends on what you're using them for", and so far the OP hasn't gone into much detail about what those threads would be doing.

-r

Edit - I see now the OP subsequently clarified most would remain blocked, but the same principle still applies.  For example, in your queue example, instead of simply 'blocking' on an I/O call, you could just as easily use a non-blocking version, check the result, and go about doing other tasks until there is I/O available.  It would make a programming nightmare to do it all in a single thread, but you would still save the small task-switching time.  Which, like I originally said, is usually worth it to avoid complexity.
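For illustration, that poll-instead-of-block pattern might look like this (a POSIX-style Python sketch, not Win32 assembly; `io_ready` is a made-up helper name, not anything from this thread):

```python
# Sketch: poll a socket with a zero-timeout select() instead of
# blocking on recv().  The thread stays free to do other work
# between checks -- at the cost of re-checking, as noted later.
import select
import socket

def io_ready(sock):
    """Return True if sock has data pending, without blocking."""
    readable, _, _ = select.select([sock], [], [], 0)  # timeout 0 = poll
    return bool(readable)

a, b = socket.socketpair()
print(io_ready(a))   # False: nothing sent yet
b.send(b"x")
print(io_ready(a))   # True: data pending; recv() would not block
```

Between polls the single thread can do other work, but it burns cycles re-checking, which is exactly the inefficiency ecube raises further down.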
Strange women, lying in ponds, distributing swords, is no basis for a system of government

BogdanOntanu

#18
Quote from: Radio Ga Ga on July 31, 2010, 07:27:45 PM

...
That is exactly what I am using those threads for!  :clap:

So, referring to my previous question: Will there be anything negative?

I could probably write it in C++ using a class; it would be slow, but at least thousands of threads won't be hitting the same procedure.

IF the threads do nothing most of the time, then they are not using any CPU time, and hence there will be no impact.

Well, some impact must exist because the OS has to manage those threads, but I estimate it to be minimal.
Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

Rockoon

The dioxin argument is interesting, but flawed.

To recap his argument, suppose you have a 4-core CPU and 5 dependency-laden tasks to perform, each task taking 1 hour.

A naive implementation with only 4 threads would take 2 hours to complete all the tasks.

The key word is naive. There is no reason that each thread can't also work on 25% of that 5th task. Is it easier to write than the naive implementation? Probably not. But it's absolutely more efficient.
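A rough sketch of that non-naive split, assuming each task divides cleanly into quarters (Python purely to show the shape; the task and worker counts come from the example above):

```python
# Split each of 5 tasks into quarters and let 4 workers pull quarters
# from a shared queue, so no worker sits idle while a 5th whole task waits.
import queue
import threading

TASKS, WORKERS = 5, 4
subtasks = queue.Queue()
for task in range(TASKS):
    for quarter in range(4):          # each task split into 4 equal parts
        subtasks.put((task, quarter))

done = [0] * WORKERS                  # quarter-tasks completed per worker

def worker(i):
    while True:
        try:
            subtasks.get_nowait()
        except queue.Empty:
            return
        done[i] += 1                  # "do" one quarter-task

threads = [threading.Thread(target=worker, args=(i,)) for i in range(WORKERS)]
for t in threads: t.start()
for t in threads: t.join()

print(sum(done))   # 20 quarters: all 5 tasks complete
```

With perfectly even work, each worker handles 5 quarters, i.e. 1.25 tasks' worth: the 1 hour 15 minutes Rockoon describes later, instead of 2 hours.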

E^cube's argument is the same, 'cept based on blocking I/O calls instead of work your own code must accomplish, as in the dioxin argument.


http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

jj2007

Excuse my ignorance but I can't see how multiple threads could speed up a task, *if* the task itself is efficiently coded. "Dependency-laden tasks" sounds nice, but what are these "dependencies"? Will you get more data from the harddisk if you request them from 4 different threads?? Unlikely.

However, you might get more data from a server if you request them from 4 different threads. Browsers do that all the time. But that is cheating, because you explicitly ask for more hardware, on the server side where the bottleneck sits.

ecube

Quote from: jj2007 on August 01, 2010, 12:35:08 AM
Excuse my ignorance but I can't see how multiple threads could speed up a task, *if* the task itself is efficiently coded. "Dependency-laden tasks" sounds nice, but what are these "dependencies"? Will you get more data from the harddisk if you request them from 4 different threads?? Unlikely.

However, you might get more data from a server if you request them from 4 different threads. Browsers do that all the time. But that is cheating, because you explicitly ask for more hardware, on the server side where the bottleneck sits.

I've already said it depends on how you're using them; what you need to do is find your bottlenecks, so I'll give you some more examples. In terms of a download application, there's a nice one called Download Accelerator Plus which can download files up to 300% faster than normal. It works by finding multiple locations hosting the file you want, connecting to each one in a different thread, and downloading different parts of the file at the same time. This
1) spreads the load so no single server gets hammered
2) bypasses any rate limiting (as far as bandwidth) any one server has
3) fully utilizes your own internet bandwidth (which usually isn't the bottleneck)

also, someone mentioned you could just, for instance, make a socket non-blocking and continue to check it... that's actually less efficient than blocking, because your code has to keep checking the socket state instead of just waiting on it, and that checking does you no good.

In most cases the bottleneck is more the processing/handling of the data anyway, as in the compression example I already gave. Imagine you used 1 thread to compress an entire 7 gig file: you'd have to read a part, then compress it, then maybe write it back out somewhere...
that's 3 things the thread has to do before it can handle another part... horrible. It'd be much more efficient for the first thread to read the data, toss it in a queue and continue reading, and have other threads process it. We all know playing around in memory is much faster than disk, so it only makes sense to use that to your advantage...
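That read-then-queue pipeline might be sketched like this, with zlib standing in for whatever compressor is actually used (Python purely for illustration; all names here are invented):

```python
# One producer "reads" chunks and queues them while worker threads
# compress them in parallel, then the pieces are reassembled in order.
import queue
import threading
import zlib

WORKERS = 4
chunks = queue.Queue(maxsize=8)   # bounded: the reader can't run far ahead
results = queue.Queue()

def reader(data, chunk_size=1 << 16):
    for i in range(0, len(data), chunk_size):
        chunks.put((i, data[i:i + chunk_size]))
    for _ in range(WORKERS):
        chunks.put(None)          # one shutdown sentinel per worker

def compressor():
    while True:
        item = chunks.get()
        if item is None:
            return
        offset, raw = item
        results.put((offset, zlib.compress(raw)))

data = b"some highly compressible data " * 100_000
threads = [threading.Thread(target=compressor) for _ in range(WORKERS)]
for t in threads: t.start()
reader(data)                      # producer runs in the main thread
for t in threads: t.join()

# Reassemble in offset order and verify nothing was lost.
parts = sorted(results.queue)
restored = b"".join(zlib.decompress(c) for _, c in parts)
assert restored == data
```

The bounded queue is the "playing around in memory" part: the reader stays ahead of the compressors without ever buffering the whole 7 GB.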

another example: say you need to encrypt 100 random 80mb files in a folder, and the encryption algorithm takes 5 seconds per 80mb. What you could do is have 1 thread FindFirstFile/FindNextFile the directory and add each file name to a queue, and have other threads actually open the files and do the encryption; when they finish their file, they check to see if any more have been added...

so there are countless examples of how multiple threads can drastically increase performance. To jump to the other side of the fence for a second and give an example of where threads are an awful idea: a server program that attempts to create 1 thread per client and can have more than 20 simultaneous connections... a much better remedy for that is an I/O system like IOCP with a thread pool/queue system for processing.

also, here's a link on thread pooling, which is what I've been meaning when I talk about a queuing system, and it rocks  :U. http://msdn.microsoft.com/en-us/library/ms686756%28VS.85%29.aspx
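The thread-pool-plus-queue idea behind that link can be sketched with Python's standard-library pool (a stand-in for illustration only, not the Win32 thread pool API the link describes):

```python
# A fixed set of worker threads drains a queue of jobs, so you never
# pay per-client thread creation or run one thread per connection.
from concurrent.futures import ThreadPoolExecutor

def handle(job):
    return job * job          # stand-in for real per-client work

with ThreadPoolExecutor(max_workers=4) as pool:   # 4 workers, 100 jobs
    results = list(pool.map(handle, range(100)))

print(results[:5])   # [0, 1, 4, 9, 16]
```

The pool owns the queue internally: jobs pile up, workers pull them off, and the worker count stays fixed regardless of how many jobs arrive.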

hutch--

I would have thought that the distinctions are simple: any core in a processor can at best run one thread at a time; with hyperthreading you can run a second thread on that core, with about a 30% improvement over the same core without hyperthreading.

What you cannot get is more processing power than the total core capacity has to offer. If you have 4 cores running at full capacity, then any other intensive thread will slow all of the rest down, and it gets worse with every extra thread started.

High suspended-thread counts, as are common in the OS, are viable only if the machine's full capacity is underused; as soon as the total core capacity is in use, everything starts to slow down, and the more threads you start the worse it gets.

A suspended thread is similar to having a technique to fast-start a process: its startup overhead has already been paid, and a suspended thread can be restarted much faster than a new process can be created every time it is needed.

A typical high thread count can be applied to multiple internet connections, as each connection transfers data far slower than the maximum capacity of a given core; by running multiple threads of this low-data-movement type you make more efficient use of the processing capacity of any given core.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dioxin

Rockoon,
Quote
The dioxin argument is interesting, but flawed.

To recap his argument, suppose you have a 4-core CPU and 5 dependency-laden tasks to perform, each task taking 1 hour.


The argument is not flawed.
Your recap is not a recap of my argument. I stated clearly that my argument applied to cpu intensive threads, not threads which were waiting, doing nothing most of the time.

What I intended to get across is that it is not always the best strategy to limit the number of running cpu intensive threads to 1 per core.

Paul.

Rockoon

Quote from: dioxin on August 01, 2010, 11:30:48 AM
Rockoon,
Quote
The dioxin argument is interesting, but flawed.

To recap his argument, suppose you have a 4-core CPU and 5 dependency-laden tasks to perform, each task taking 1 hour.


The argument is not flawed.
Your recap is not a recap of my argument. I stated clearly that my argument applied to cpu intensive threads, not threads which were waiting, doing nothing most of the time.

What I intended to get across is that it is not always the best strategy to limit the number of running cpu intensive threads to 1 per core.

Paul.


My recap did not presume anything about threads waiting in your argument.

You are just imagining that it doesn't apply by grasping at my side note about E^cube's argument.


Let's take the case of rendering 5 pixels of an extremely deep Mandelbrot zoom. Billions of iterations per pixel, with a full sampling taking 1 hour for each pixel.

You have 5 tasks to accomplish, 4 threads, and no imaginary waiting.

At T = 0:

Thread #1 begins Task #1
Thread #2 begins Task #2
Thread #3 begins Task #3
Thread #4 begins Task #4

At T = 15 minutes:

Thread #1 shifts to Task #2
Thread #2 shifts to Task #3
Thread #3 shifts to Task #4
Thread #4 shifts to Task #5

At T = 30 minutes:

Thread #1 shifts to Task #3
Thread #2 shifts to Task #4
Thread #3 shifts to Task #5
Thread #4 shifts to Task #1

At T = 45 minutes:

Thread #1 shifts to Task #4
Thread #2 shifts to Task #5
Thread #3 shifts to Task #1
Thread #4 shifts to Task #2

At T = 60 minutes:

Thread #1 shifts to Task #5
Thread #2 shifts to Task #1
Thread #3 shifts to Task #2
Thread #4 shifts to Task #3

Each task ran for 1 hour, and the total time was only 1 hour and 15 minutes. Your argument insists that it should take 2 hours: 1 hour for 4 tasks and then 1 more hour for the 5th. That's why your argument is flawed.

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Rockoon

Quote from: jj2007 on August 01, 2010, 12:35:08 AM
but what are these "dependencies"?

Dependency chains. Non-parallelizable tasks. For instance, calculating a single pixel of the Mandelbrot set: you can't set 4 cores off on the same pixel and gain any advantage, because each iteration depends on the previous one. Dependency-laden.


When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

redskull

The dioxin argument is not "flawed", but it misses the entire point of the "one per one" guideline: it doesn't mean simply stop running threads, it means redesign the code.  In his example, to be *most* efficient, instead of 5 threads doing 3 hours of work each, you would want to build the code so you had 4 threads doing 3.75 hours of work each.  That's more efficient than both the example and counterexample (but is, of course, not always practical).

I can't stress enough that the key word is *running*.  As soon as you start in with blocking I/O calls, especially if you have to wait for hard disk platters to spin, you change the entire argument.
 
-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government