Multicore theory proposal

Started by johnsa, May 23, 2008, 09:24:41 AM


c0d1f1ed

Replace /TC by /TP.

The difference is due to the division. Without it I get:

Quote
Milliseconds for 1 threads: 5055, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 2543, multi-thread speedup: 1.987810
Milliseconds for 3 threads: 1700, multi-thread speedup: 2.973529
Milliseconds for 4 threads: 1326, multi-thread speedup: 3.812217
Press any key to continue . . .

Interestingly it gets even closer to a 4x speedup, which was the whole point of the exercise anyway...

By the way, where's that scientific Reverse Hyper-Threading paper?

hutch--

This is fine but we still don't have a buildable source. Is there some reason why you won't post your build information? What happened to ANSI portable C?

askm

How many total instructions are being executed in either of the tests you are all running?

c0d1f1ed, are you using Express 2005, Express 2008 or something in between? Does the 2008 version do a better job, assuming (all else being equal) that its base compiler is better?

Does anyone on the forum use the latest Intel compiler (as an option in the latest Visual Studio)? I read it's supposed to be of great use for multicore.

(I can only multidream, as my hardware+software+experience "is not there yet".)

c0d1f1ed

Quote from: hutch-- on June 09, 2008, 03:37:13 PM
This is fine but we still don't have a buildable source. Is there some reason why you won't post your build information? What happened to ANSI portable C?

What errors do you get now? I use Visual C++ and avoid all the hassle of command line options. Anyway, if it helps you: '/Ox /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /Fo"Release\\" /Fd"Release\vc80.pdb" /W3 /nologo /c /Wp64 /Zi /TP /errorReport:prompt'. It's not C, it's C++.

hutch--

Thanks,

It builds with "/TP". What I don't understand is why it takes 44 seconds (timed watching the system clock) to run a single thread on this box before the additional threads are run, when the asm single-thread test piece runs in about 4.5 seconds.

I tracked down why your test code was so slow: VC had made a mess of the timings with the "for" loop.

Replace the code as follows.


// void hutchTask()
// {
//     int var = 12345678;
//
//     for(unsigned int i = 0; i < 4000000000 / n; i++)
//     {
//         __asm
//         {
//             mov eax, var
//             mov ecx, var
//             mov edx, var
//         }
//     }
// }

void hutchTask()
{
    int var = 12345678;
    __asm
    {
        push esi                    ; preserve ESI
        mov esi, 4000000000         ; fixed iteration count, no division by the thread count
      lbl0:
        mov eax, var
        mov ecx, var
        mov edx, var
        sub esi, 1
        jnz lbl0                    ; loop until the counter reaches zero
        pop esi                     ; restore ESI
    }
}


This yields the following timings on my single-core PIV, which are predictable.


Milliseconds for 1 threads: 4532, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 9062, multi-thread speedup: 0.500110
Milliseconds for 3 threads: 13531, multi-thread speedup: 0.334935
Milliseconds for 4 threads: 18032, multi-thread speedup: 0.251331


You are using a tail-end synchronisation technique:


SetEvent(done[*(int*)parameter]);    // each thread on exit
....
WaitForMultipleObjects(n, done, true, INFINITE);  // wait for all to finish
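
Spelled out in full, the pattern is roughly this (a sketch only -- the thread count N and workerProc are assumed names, error checking and handle cleanup are omitted, and it is not the exact source from the attachment):

#include <windows.h>
#include <stdio.h>

#define N 4                                     // number of worker threads (assumed)

HANDLE done[N];                                 // one event per thread

DWORD WINAPI workerProc(LPVOID parameter)
{
    int index = *(int*)parameter;               // which slot this thread owns
    // ... the timed work goes here ...
    SetEvent(done[index]);                      // signal completion on exit
    return 0;
}

int main(void)
{
    int ids[N];
    for (int i = 0; i < N; i++)
    {
        ids[i] = i;
        done[i] = CreateEvent(NULL, TRUE, FALSE, NULL);     // manual reset, unsignalled
        CreateThread(NULL, 0, workerProc, &ids[i], 0, NULL);
    }
    WaitForMultipleObjects(N, done, TRUE, INFINITE);        // tail end: wait for all to finish
    printf("all %d threads finished\n", N);
    return 0;
}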


I have attached the fixed version of your test piece with a working binary so other people can test your code on either a dual core or a quad core.

[attachment deleted by admin]

c0d1f1ed

Quote from: hutch-- on June 09, 2008, 04:30:22 PM
I tracked down why your test code was so slow: VC had made a mess of the timings with the "for" loop.

Like I said, it's the division. In Visual C++, place a breakpoint at the loop and press Alt+F8 to see the disassembly during debugging. Do the division outside of the loop and on a Q6600 you get the last results I posted. Your version doesn't do the division at all so the speedup factor is wrong.
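
That is, keep the loop but compute the bound once, something like this (a quick sketch; n is the thread count global from the test program):

extern unsigned int n;                          // thread count global from the test program (type assumed)

void hutchTask()
{
    int var = 12345678;
    const unsigned int limit = 4000000000u / n; // divide once, outside the loop

    for (unsigned int i = 0; i < limit; i++)
    {
        __asm
        {
            mov eax, var
            mov ecx, var
            mov edx, var
        }
    }
}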

hutch--

At least it gets the timings right. The quad-core results should show very close to the same timing for all 4 tests.

Built as ANSI code with CL from the VCTOOLKIT as a C file. Disassembled with DUMPBIN from VC2005 with no magic libraries, abstraction or any other high-level claptrap.

Now I wonder how well this open thread startup with a synchronised tail end to display the results scales to a task like repeatedly filling a buffer in one thread while writing to the buffer in a calling thread? For 50 frames a second this needs to be done in 20 ms, the more so if the two operations do not take the same time.
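
To be concrete about the shape of the task I mean, in the same C++ style as the test piece (a sketch only; FillBuffer and WriteBuffer are placeholders for the real work):

#include <windows.h>
#include <string.h>

#define BUFSIZE 65536
#define FRAMES  500                             // ~10 seconds at 50 frames a second

char   buffers[2][BUFSIZE];
HANDLE bufFree[2];                              // buffer may be (re)filled
HANDLE bufFull[2];                              // buffer is ready to be written out

void FillBuffer(char *buf)  { memset(buf, 0x55, BUFSIZE); }     // placeholder work
void WriteBuffer(char *buf) { (void)buf; }                      // placeholder work

DWORD WINAPI fillerThread(LPVOID)
{
    for (int frame = 0; frame < FRAMES; frame++)
    {
        int i = frame & 1;
        WaitForSingleObject(bufFree[i], INFINITE);  // don't overwrite an unconsumed buffer
        FillBuffer(buffers[i]);
        SetEvent(bufFull[i]);                       // hand it to the calling thread
    }
    return 0;
}

int main(void)
{
    for (int i = 0; i < 2; i++)
    {
        bufFree[i] = CreateEvent(NULL, FALSE, TRUE,  NULL);     // auto reset, starts free
        bufFull[i] = CreateEvent(NULL, FALSE, FALSE, NULL);
    }
    CreateThread(NULL, 0, fillerThread, NULL, 0, NULL);

    for (int frame = 0; frame < FRAMES; frame++)
    {
        int i = frame & 1;
        WaitForSingleObject(bufFull[i], INFINITE);  // fill and write overlap across the two buffers,
        WriteBuffer(buffers[i]);                    // but each frame must still fit in 20 ms
        SetEvent(bufFree[i]);
    }
    return 0;
}

If the two operations take very different times, one side simply ends up waiting on the event, which is the scaling problem I am asking about.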

c0d1f1ed

Quote from: hutch-- on June 09, 2008, 04:57:45 PM
Now I wonder how well this open thread startup with a synchronised tail end to display the results scales to a task like repeatedly filling a buffer in one thread while writing to the buffer in a calling thread? For 50 frames a second this needs to be done in 20 ms, the more so if the two operations do not take the same time.

As I've been saying all along, O.S.-level synchronization is terribly slow and we shouldn't expect any revolutionary ring0 multi-threading solution. It just happens to work OK for this code because the task can be split into equal-sized subtasks of several seconds. As soon as you do something more interesting this fails. Have you finally read chapter 1.1 of The Art of Multiprocessor Programming yet? The solution, which I've also been repeating over and over again, is to keep the threads running and schedule the tasks with lock-free or, better yet, wait-free synchronization.
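
A bare-bones sketch of what I mean (not my actual scheduler, just the shape): the worker threads are created once, and each task hand-off costs a single interlocked increment rather than a kernel call. In a real engine the workers would stay alive across frames instead of exiting when the queue drains.

#include <windows.h>
#include <stdio.h>

#define NUM_THREADS 4
#define NUM_TASKS   1024

volatile LONG nextTask  = 0;                    // shared task index
volatile LONG tasksDone = 0;

void doTask(LONG index)                         // placeholder work item
{
    (void)index;
}

DWORD WINAPI workerLoop(LPVOID)
{
    for (;;)
    {
        LONG i = InterlockedIncrement(&nextTask) - 1;   // wait-free fetch of the next task
        if (i >= NUM_TASKS)
            return 0;                                   // queue drained
        doTask(i);
        InterlockedIncrement(&tasksDone);
    }
}

int main(void)
{
    HANDLE threads[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        threads[t] = CreateThread(NULL, 0, workerLoop, NULL, 0, NULL);

    WaitForMultipleObjects(NUM_THREADS, threads, TRUE, INFINITE);   // one O.S. wait at the very end
    printf("%ld tasks done\n", tasksDone);
    return 0;
}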

Reading and writing buffers that are larger than the cache may seem to defeat it, but the trick is to subdivide the buffers, treat the processing of the sections as separate tasks, and perform as many tasks as possible on a given section before you move on to the next. Dataflow programming paradigms are very useful here.
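
For example, with two hypothetical processing stages (shown single-threaded for clarity; each chunk could just as well be handed to the worker pool above as a task):

#include <stddef.h>

#define BUFFER_BYTES (16 * 1024 * 1024)         // much larger than the cache
#define CHUNK_BYTES  (64 * 1024)                // fits comfortably in cache

float buffer[BUFFER_BYTES / sizeof(float)];

void stageA(float *p, size_t count) { for (size_t i = 0; i < count; i++) p[i] += 1.0f; }
void stageB(float *p, size_t count) { for (size_t i = 0; i < count; i++) p[i] *= 2.0f; }

void processBuffer(void)
{
    const size_t total = BUFFER_BYTES / sizeof(float);
    const size_t chunk = CHUNK_BYTES  / sizeof(float);

    // Run all stages on one cache-sized section before moving on, instead of
    // streaming the whole buffer through stage A and then again through stage B.
    for (size_t offset = 0; offset < total; offset += chunk)
    {
        stageA(buffer + offset, chunk);
        stageB(buffer + offset, chunk);
    }
}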

So can we finally come to a consensus that multi-core is very useful, even though it takes some programming effort to maximize efficiency?

hutch--

My problem is not with multiple processors or even multicore processors, they have been around for a very long time; it's with how useful they are for the vast range of code types that get written on a daily basis, given the level of OS control in current OS versions. Non-synched threads have a very limited range of tasks that they can perform, and a vast range of applications do not suit that type of separation.

Multiple-pipeline hardware already runs instructions in parallel when they are scheduled correctly, and this makes each thread faster on a normal PC, but I will make the point again that vastly larger high-processor-count hardware uses dedicated hardware synchronisation to run up to 1024 Itaniums in parallel, and it can produce massive throughput, many times faster than a single Itanium. I have not read all of the tech data for the x86-64 cheapies from SGI, but with the 8-core dual-quad option I have no doubt the throughput is competitive for the core count.

What is missing in your example is the need for abstraction, magic libraries and the pile of claptrap that comes with bloated high-level tools; it's straight API code using tail-end synchronisation to display the results.

I will be interested to see the results of other members with dual or quad core hardware.

sinsi

From the EXE in your attachment, Hutch:

Milliseconds for 1 threads: 5000, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 5016, multi-thread speedup: 0.996810
Milliseconds for 3 threads: 5000, multi-thread speedup: 1.000000
Milliseconds for 4 threads: 5000, multi-thread speedup: 1.000000

????????

GregL

Here's what I'm seeing with hutch's code (Pentium D 940, Vista SP1):


Milliseconds for 1 threads: 3938, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 4203, multi-thread speedup: 0.936950
Milliseconds for 3 threads: 6250, multi-thread speedup: 0.630080
Milliseconds for 4 threads: 8203, multi-thread speedup: 0.480068


They're all distributed to both cores. Graph attached.



[attachment deleted by admin]

MichaelW

Quote from: c0d1f1ed on June 09, 2008, 03:24:07 PM
Replace /TC by /TP.

The difference is due to the division. Without it I get:

Quote
Milliseconds for 1 threads: 5055, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 2543, multi-thread speedup: 1.987810
Milliseconds for 3 threads: 1700, multi-thread speedup: 2.973529
Milliseconds for 4 threads: 1326, multi-thread speedup: 3.812217
Press any key to continue . . .

How can this be? Without the division each thread will do 4 billion iterations, so if each thread is running on a different core they should all complete in approximately the same time, independent of the number of threads.

And why exactly did you not post an EXE, and/or why did you make it difficult for us to create our own EXE from your source? And while I'm asking questions, why not a source in the preferred language of this forum?

hutch--

OK,

Here is my awake version in masm: abstraction free, magic-library free and bloat free. First I must thank c0d1f1ed for providing some working code that broke the gabfest and other related waffle. I think I have the tail-end synch working correctly, and the times on my PIV reflect the increased workload from 1 to 2 to 4 threads.

These are the timings I get on a PIV, which are predictable with a single core.


===========================================
Run a single thread on fixed test procedure
===========================================
4516 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
9000 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
17969 MS Four thread timing

Press any key to continue ...


The dual-thread code should produce the same timing as the single thread on a 2-core processor, and the 4-thread code should produce the same timing as the single-thread version on a quad core.

[attachment deleted by admin]

GregL

I seem to do best with a single thread. ??

(Pentium D 940, Vista SP1).

===========================================
Run a single thread on fixed test procedure
===========================================
3891 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
4578 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
8531 MS Four thread timing



Same deal on the graph.


NightWare

Quote from: Greg on June 10, 2008, 01:12:46 AM
I seem to do best with a single thread. ??
Of course... hardware is always faster than software  :bg

My results on a Core 2 Duo T7300 @ 2 GHz:
===========================================
Run a single thread on fixed test procedure
===========================================
5990 MS Single thread timing

=======================================
Run two threads on fixed test procedure
=======================================
6100 MS Two thread timing

========================================
Run four threads on fixed test procedure
========================================
12152 MS Four thread timing

Press any key to continue ...