Idea for multithread memfill.

Started by hutch--, April 20, 2008, 10:21:19 AM


Rockoon

The current x64 multi-core paradigm really isn't made for this.

This is in contrast to the current GPU paradigm, where there are so many cores that having one sit around waiting for the instant something in particular can be done is no big loss. (For example, the 8800GT sells for ~US$150 these days and has a whopping 112 streaming cores.)

Both Intel and AMD are rapidly moving towards 8 cores, with an additional 32+ smaller streaming cores. This is when things will start to get interesting for tightly coupled threads.

The good news?

At least for Intel's Larrabee, these smaller streaming cores will use a subset of the x86 instruction set, so ASM programmers will be able to dive in quickly. They will begin by offering their streaming cores as an add-in card, but the current official roadmap has a 6x16 (6 CPU x 16 streaming) single chip, codenamed Nehalem, coming out sometime in 2009.

As far as AMD's Fusion goes, these streaming cores will simply be beefed-up ATI GPU cores, which kind of stinks for ASM programmers. It's probably the biggest mistake AMD will ever make, because the ATI GPU design isn't as good as NVIDIA's, meaning it can't compete on current graphics applications, while also being less than general purpose, so it can't compete against Larrabee either. Things don't look good for long-term AMD competitiveness.

Edited to add:

NVIDIA is also playing this game, sort of. Their CUDA could potentially compete against Larrabee in the short term, but they can't integrate with the CPU, so in the end Intel is going to completely monopolize high-performance desktop computing.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

sinsi

Quote from: Rockoon on May 10, 2008, 08:10:58 AM
Edited to add:

NVIDIA is also playing this game, sort of. Their CUDA could potentially compete against Larrabee in the short term, but they can't integrate with the CPU, so in the end Intel is going to completely monopolize high-performance desktop computing.

Intel buying NVIDIA, like the speculation a few months ago, like AMD buying ATI, heh heh.

Let's face it, a GPU can kill a CPU (for certain things, fair enough), but look at GDDR5 vs DDR3 - why can't Intel/AMD keep up with this?
Light travels faster than sound, that's why some people seem bright until you hear them.

codewarp

Quote from: Kernel_Gaddafi on May 10, 2008, 05:13:23 AM
codewarp, WaitForSingleObject() on the handle returned by CreateThread() ?

where does it say use WaitForSingleObject inside fill_thread()?

if i were gonna implement such a routine, i'd distribute the total size of the buffer between N threads,
otherwise there is no point really.. 1 routine is enough.
That's not the point.  The point is that every time you wait for a thread to do anything, you are adding tens of thousands of clock cycles to your own thread, giving them away to another process that IS ready to run now.  There is no benefit at all for memfills < 64 KB, when it takes 5-10 microseconds of latency before your "helper" gets up to speed.  It's like hiring someone to do a job who takes a two-hour coffee break at the beginning of every task.

I can't believe anyone would even debate this point.  Multi-threading a memory fill to achieve better performance is simply not a well-formed concept.  No one in their right mind would do this.  Rather, this example is an attempt to answer the larger question:  How can we take advantage of additional CPUs to enhance the performance of synchronous subtasks?  This is a good question to ask, but the answer is not a satisfying or resounding yes.  Effective solutions come from understanding and working with the reality of multiple CPUs, not the fantasy.  Ultimately, real multi-CPU performance within an application can only come from free-running, fully occupied threads that interact only minimally.  With a good lock-free queuing mechanism, you can benefit from giving other threads tiny jobs to do.  But the latency toll will still have to be paid, either through the WaitForStuff( ) calls or through the jobs remaining active on the queue at any given time.  And you still have to synchronize with the data state changes from other threads before you can touch their results.

The resistance to this concept from the assembler point of view is understandable--queuing and related support mechanisms are complicated and substantial.  It's the kind of complexity that pushes software toward higher-level solutions and away from assembler.  Die-hard assembler programmers don't like to hear that (I used to be one, so I know).  Work with it and you can achieve, fight it and you won't.

c0d1f1ed

Quote from: Rockoon on May 10, 2008, 08:10:58 AM
For example, the 8800GT sells for ~US$150 these days and has a whopping 112 streaming cores.
That's 112 scalar units, not 112 cores. In fact it's 7 cores with SIMD units that are 16 elements wide.
Quote from: Rockoon
Both Intel and AMD are rapidly moving towards 8 cores...
Plus the SIMD units are getting wider and deeper. AVX extends the registers to 256-bit and SSE5 adds instructions that perform a multiplication and an addition in one. You do the math. :8)

What's still lacking though is parallel scatter/gather instructions. Writing and reading elements from different memory locations is quickly becoming the biggest bottleneck with arithmetic power going up. I'm curious when and how Intel/AMD will address this...

Rockoon

Quote from: sinsi on May 10, 2008, 08:50:05 AM
Let's face it, a GPU can kill a CPU (for certain things, fair enough), but look at gddr5 vs ddr3 - why can't intel/amd keep up with this?

This stuff has little benefit on CPUs.

Memory isn't accessed directly; instead it is accessed through caches that already come close to keeping up with the CPU.

3-cycle latency or less from L1; DDRx doesn't impact this.
10-cycle latency or less from L2; DDRx doesn't impact this.

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

bozo

Quote from: codewarp
The point is that every time you wait for a thread to do anything, you are adding tens of thousands of clock cycles to your own thread, giving them away to another process that IS ready to run now.

so how else do you suggest waiting for a thread to finish execution in these circumstances? or at least knowing, in a multi-threaded process, when it finished? normally i'd use WaitForSingleObject()/WaitForMultipleObjects() on the thread handle(s).. inside my thread(s) will be ExitThread() or just a RET to tell the operating system "i'm done here", which in turn notifies the main process.

it would be more efficient to do this than to run a loop every N seconds checking a variable to see if the thread has terminated, because the operating system doesn't have to keep switching contexts.

Quote from: codewarp
There is no benefit at all for memfills < 64 KB, when it takes 5-10 microseconds of latency before your "helper" gets up to speed

i don't recall ever saying that there was, or that the buffer being filled in hutch's code was < 64 KB - as you can see, it's 100 MB. the threads could be performing any number of operations on the same buffer, just at different offsets in different threads, not necessarily accessing the same data.

Quote from: codewarp
Die-hard assembler programmers don't like to hear that (I used to be one, so I know)

i have a feeling you haven't programmed in assembly since the old MS-DOS days.

u

Quote from: codewarp on May 10, 2008, 05:39:08 PM
The resistance to this concept from the assembler point of view is understandable--queuing and related support mechanisms are complicated and substantial.  It's the kind of complexity that pushes software toward higher-level solutions and away from assembler.  Die-hard assembler programmers don't like to hear that (I used to be one, so I know).  Work with it and you can achieve, fight it and you won't.
This just in: ASM programmers have been using MASM for the last decade - it stands for "Macro Assembler". OOP, vectors, HLL constructs, garbage collection, automated memory management, automated code generation, thinking from ALL points of view/levels of the software (lowest to highest), being able to easily and reliably check performance and thus choose the best route, and so on - WE HAVE IT. Read up, try, and THINK!

And from the start, you want us to present you with THE ultimate and only universal way to solve ALL types of multicore problems? This stinks of too little programming experience. Prove me wrong.
Please use a smaller graphic in your signature.

c0d1f1ed

Quote from: Kernel_Gaddafi on May 11, 2008, 09:47:50 PM
so how else do you suggest waiting for a thread to finish execution in these circumstances? or at least knowing, in a multi-threaded process, when it finished? normally i'd use WaitForSingleObject()/WaitForMultipleObjects() on the thread handle(s).. inside my thread(s) will be ExitThread() or just a RET to tell the operating system "i'm done here", which in turn notifies the main process.

it would be more efficient to do this than to run a loop every N seconds checking a variable to see if the thread has terminated, because the operating system doesn't have to keep switching contexts.
I think codewarp's suggestion is to not wait at all. Design your algorithms in such a way that each thread always has something to do, and use lock-free synchronization to avoid the huge overhead of OS thread synchronization.

codewarp

Quote from: c0d1f1ed on May 12, 2008, 07:18:27 PM
I think codewarp's suggestion is to not wait at all. Design your algorithms in such a way that each thread always has something to do, and use lock-free synchronization to avoid the huge overhead of OS thread synchronization.

Exactly.

codewarp

Quote from: u on May 12, 2008, 12:31:55 AM
And from the start, you want us to present you with THE ultimate and only universal way to solve ALL types of problems of multicore? This stinks of too little programming experience. Prove me wrong.
See my thread "Multiple CPUs and the limits to assembler"

Mark Jones

Michael, I got this from your thread-test app:


2 us, mean waiting for end of time slice
2 us, mean using Sleep, 0
2 us, mean using event wait
Press any key to exit...


New box, AMD Athlon X2 (dual-core) 64-bit @4GHz, Win XP Pro x32
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Mark Jones

And here's Nightware's test:

Results of the tests :
Test 1 (default hardware), done in 7953689 cycles
Test 2 (parallelism attempt), done in 5320118 cycles
Test 3 (attempt with another algo), done in 5322003 cycles
Test 4, done in 11 cycles
Test 5, done in 11 cycles
Test 6, done in 10 cycles
Test 7, done in 11 cycles
Test 8, done in 11 cycles
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

NightWare

hi,
read http://www.masm32.com/board/index.php?topic=9255.0 - results with several threads are not representative, mine included...  :red