Multicore theory proposal

Started by johnsa, May 23, 2008, 09:24:41 AM

codewarp

Quote from: johnsa on June 08, 2008, 11:45:23 AM
That being said, perhaps we should take Hutch's example code and possibly a few other similar ones which try to create load in different areas, processing, memory etc and try to multi-thread multi-core enable them and see what comes of it. Perhaps we can find some hybrid solutions to common programming tasks (after all there should be some good brains in this forum). Perhaps even start working towards putting together an additional library for MASMv11/v12 that provides some multi-core "helper" routines.

Hutch is right, his code cannot be helped by using MPs without changing the code, but so what?  Nobody is trying to do that with this technology, except for Hutch.  I'm afraid Hutch's example does not address any issues at all.  He insists on treating threads as synchronous tools for him to switch on and off, like calling a subroutine.  Since that doesn't work, and never will, he declares the technology worthless.  The truth is that he just doesn't like having to change the code; he states as much in the "rules" governing his example.  Too bad, you are just going to have to learn new programming models, that's just the way it is.  We did it in the structured programming days.  We did it in the object-oriented days.  We did it with networks, and with the Internet, and with multiple threads and other things, and now with widespread MPs.

The marketplace wants multiple processors, multiple processors require fundamental changes in the programming models at the application level, and the marketplace knows what to do with code and programmers who refuse to change.  Hutch's rule, "You can't change this", is his own stumbling block, a limitation of his own that he is going to have to struggle with by himself.  Sorry Hutch, when the stakes are this high, we will be changing the code; rules are made to be broken.

Those that try to hold back the dawn are doomed to failure.  

c0d1f1ed

hutch,

I have done objective testing. It's four times faster on my quad-core.

Wait-free synchronization doesn't stop anything, that's why it's called wait-free. It takes a dozen cycles to do the atomic increment and that's it. Any approach in ring0 is going to take at least this long (or should I say short).

By the way, I'm not a gamer, I'm an engineer.

codewarp

Quote from: c0d1f1ed on June 08, 2008, 06:00:25 PM
hutch,

I have done objective testing. It's four times faster on my quad-core.

Wait-free synchronization doesn't stop anything, that's why it's called wait-free. It takes a dozen cycles to do the atomic increment and that's it. Any approach in ring0 is going to take at least this long (or should I say short).

By the way, I'm not a gamer, I'm an engineer.

If I might be so bold, I don't think that is what Hutch is after.  He wants to see that one loop sped up with the application of MPs (multi-processors), without changing the code.  Of course, the whole challenge is a contrived setup.  It absurdly tries to hold MPs to a ridiculous and laughable synchronous standard, like task switches in one cycle.  Besides, if nobody can change the code, then the code can't be written in the first place without violating its own rules.

Until you completely surrender to the reality that threads are asynchronous and out of time with one another, you will remain ineffective with MPs.  Once this fundamental truth sinks in, you can start to compute on-the-fly, using programming models designed to tolerate asynchronous computation.  Those of you expecting synchronous behavior from MPs have seriously bet on the wrong horse, and you have to work extra hard to overcome your erroneous expectation.  No matter how hard and fast the clock speeds get, MPs and all the same issues are present all the way up--resistance is futile.

Holding MPs to a synchronous standard is like that old joke: "Stop staring at your radio".  Hutch, if you want synchronous parallelism, learn about SSE, and stop staring at MPs.

hutch--

c0d1f1ed,

I labelled the approach as the same as a gamer brain for a reason: to synchronise your parallel threads you either use the OS, which is far too slow, or you run a spinlock wait loop. There is no sub-millisecond method available to yield wasted time, so you lock up the processor for the synchronisation wait time. Your notion of "wait free" is a myth here, it does not happen by magic.

Now I have heard you state that you have tested intensive code and it goes faster, but we have yet to see it from you. Any processor-intensive process that addresses memory will do: try a memory fill algo, which is commonly used in the graphics/games area, and see if you can get an identical algorithm to go two or four times faster than its single-thread timings using multiple cores. A ratio of 1.9 would be fine.

codewarp,

> The marketplace wants multiple processors

It already has them, you just don't like the starting price. What the market in fact wants is cheap high performance processors but PC processors are only in their infancy in multiple processors. One of the few things Itaniums are good for is multiple parallel processing if they have high performance dedicated hardware to support them.

Quote
We did it in the structured programming days.  We did it in the object-oriented days.  We did it with networks, and with the Internet, and with multiple threads and other things, and now with widespread MPs.

This tells me much more than the preceding anecdotal waffle. Flitting from one trend to another, complete with the bloat, performance degradation and free lunch assumptions that go with it, and now, with the current hiatus in clock speeds, the free lunch movement is trying to flog the coming multicore hardware as the next free lunch for lousy slow bloated code. How many cores will you need to make the first terabyte "Hello World" run fast ?

> Those that try to hold back the dawn are doomed to failure.

Those who want the water to run back uphill to where it was in the past will be disappointed.

LATER: Here is an SGI cheapie built with x86-64 hardware.

http://www.sgi.com/products/servers/altix/xe/

Have a look at the spec sheet of different configurations, clustering and the like. The stuff you have in mind is kids' stuff.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

codewarp

Quote from: hutch-- on June 08, 2008, 07:00:56 PM
This tells me much more than the preceding anecdotal waffle. Flitting from one trend to another, complete with the bloat, performance degradation and free lunch assumptions that go with it, and now, with the current hiatus in clock speeds, the free lunch movement is trying to flog the coming multicore hardware as the next free lunch for lousy slow bloated code. How many cores will you need to make the first terabyte "Hello World" run fast ?

This is a technical topic, deserving of a serious discussion.  Hutch, you are obviously emotionally invested in being right about this.  The rest of the world disagrees with you.  Since we have heard nothing from you on the serious aspects of this discussion, and hear the same "can't get there from here" message over and over again, I would like you to excuse yourself from this thread until you have something more constructive to contribute.  The quote above makes this painfully plain and obvious for anyone to see.

hutch--

Any more wisecracks and I will excuse you from the thread. It's a case of put up or shut up, and I have yet to see any working code from you.

sinsi

Quote from: c0d1f1ed on June 08, 2008, 06:00:25 PM
I have done objective testing. It's four times faster on my quad-core.

Post some code, then I can test it on my quad...
Light travels faster than sound, that's why some people seem bright until you hear them.

c0d1f1ed

Quote from: hutch-- on June 08, 2008, 07:00:56 PM
I labelled the approach as the same as a gamer brain for a reason: to synchronise your parallel threads you either use the OS, which is far too slow, or you run a spinlock wait loop. There is no sub-millisecond method available to yield wasted time, so you lock up the processor for the synchronisation wait time. Your notion of "wait free" is a myth here, it does not happen by magic.

Now I have heard you state that you have tested intensive code and it goes faster, but we have yet to see it from you. Any processor-intensive process that addresses memory will do: try a memory fill algo, which is commonly used in the graphics/games area, and see if you can get an identical algorithm to go two or four times faster than its single-thread timings using multiple cores. A ratio of 1.9 would be fine.

No, I do not use the O.S. for synchronization, nor do I use a spinlock wait loop. Read chapter 1.1 of The Art of Multiprocessor Programming. The atomic increment is the only form of synchronization and it takes nanoseconds, not milliseconds. Wait-free synchronization is not a myth, it's backed by peer-reviewed research.

Didn't I tell you exactly how to trivially turn your code into multi-threaded code? You don't even need to bother with spin locks or lock-free or wait-free approaches. Anyway, for the lazy:

#include <windows.h>
#include <stdio.h>

int n;

HANDLE done[4];

void hutchTask()
{
    int var = 12345678;

    for(unsigned int i = 0; i < 4000000000 / n; i++)
    {
        __asm
        {
            mov eax, var
            mov ecx, var
            mov edx, var
        }
    }
}

unsigned long __stdcall threadRoutine(void *parameter)
{
    hutchTask();

    SetEvent(done[*(int*)parameter]);

    return 0;
}

int main()
{
    DWORD elapsedMilliseconds[4];

    for(int threads = 1; threads <= 4; threads++)
    {
        n = threads;

        for(int i = 0; i < n; i++)
        {
            done[i] = CreateEvent(0, FALSE, FALSE, 0);
        }

        HANDLE threadHandle[4];
        int parameter[4] = {0, 1, 2, 3};

        DWORD startTime = GetTickCount();

        for(int i = 0; i < n; i++)
        {
            threadHandle[i] = CreateThread(0, 0, threadRoutine, &parameter[i], 0, 0);
        }

        WaitForMultipleObjects(n, done, true, INFINITE);

        elapsedMilliseconds[n - 1] = GetTickCount() - startTime;

        for(int i = 0; i < n; i++)
        {
            CloseHandle(done[i]);
            CloseHandle(threadHandle[i]);
        }

        printf("Milliseconds for %d threads: %d, multi-thread speedup: %f\n", n, elapsedMilliseconds[n - 1], (float)elapsedMilliseconds[0] / elapsedMilliseconds[n - 1]);
    }
}


Running this on my Q6600 gives me:

Quote
Milliseconds for 1 threads: 61090, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 30904, multi-thread speedup: 1.976767
Milliseconds for 3 threads: 21247, multi-thread speedup: 2.875229
Milliseconds for 4 threads: 17800, multi-thread speedup: 3.432022

Quote from: hutch--
show us how you can run the identical code on 2 or 4 cores faster than this runs on a single core

Now you please show me a peer-reviewed article that shows Reversed Hyper-Threading is real. If you can't, stating that you might be wrong will do.

hutch--

Can you post your build info, as I get this when trying to build your example in the vctoolkit? Alternatively, post the complete project with its makefile or project file so someone else can build it.


@echo off

set lib=h:\vctoolkit\lib\
set include=h:\vctoolkit\include\

if exist mproc.exe del mproc.exe
if exist mproc.obj del mproc.obj

h:\vctoolkit\bin\cl /c /G7 /O2 /Ot /GA /TC /W3 /FA mproc.c
h:\vctoolkit\bin\Link /SUBSYSTEM:WINDOWS /libpath:h:\vctoolkit\lib gdi32.lib kernel32.lib user32.lib mproc.obj

dir mproc.*

pause




Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3052 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

mproc.c
mproc.c(12) : error C2143: syntax error : missing ';' before 'type'
mproc.c(12) : error C2143: syntax error : missing ';' before 'type'
mproc.c(12) : error C2143: syntax error : missing ')' before 'type'
mproc.c(12) : error C2143: syntax error : missing ';' before 'type'
mproc.c(12) : error C2065: 'i' : undeclared identifier
mproc.c(12) : warning C4018: '<' : signed/unsigned mismatch
mproc.c(12) : warning C4552: '<' : operator has no effect; expected operator with side-effect
mproc.c(12) : error C2059: syntax error : ')'
mproc.c(13) : error C2143: syntax error : missing ';' before '{'
mproc.c(36) : error C2143: syntax error : missing ';' before 'type'
mproc.c(36) : error C2143: syntax error : missing ';' before 'type'
mproc.c(36) : error C2143: syntax error : missing ')' before 'type'
mproc.c(36) : error C2143: syntax error : missing ';' before 'type'
mproc.c(36) : error C2065: 'threads' : undeclared identifier
mproc.c(36) : warning C4552: '<=' : operator has no effect; expected operator with side-effect
mproc.c(36) : error C2059: syntax error : ')'
mproc.c(37) : error C2143: syntax error : missing ';' before '{'
mproc.c(40) : error C2143: syntax error : missing ';' before 'type'
mproc.c(40) : error C2143: syntax error : missing ';' before 'type'
mproc.c(40) : error C2143: syntax error : missing ')' before 'type'
mproc.c(40) : error C2143: syntax error : missing ';' before 'type'
mproc.c(40) : warning C4552: '<' : operator has no effect; expected operator with side-effect
mproc.c(40) : error C2059: syntax error : ')'
mproc.c(41) : error C2143: syntax error : missing ';' before '{'
mproc.c(45) : error C2275: 'HANDLE' : illegal use of this type as an expression
        h:\vctoolkit\include\WinNT.h(342) : see declaration of 'HANDLE'
mproc.c(45) : error C2146: syntax error : missing ';' before identifier 'threadHandle'
mproc.c(45) : error C2144: syntax error : '<Unknown>' should be preceded by '<Unknown>'
mproc.c(45) : error C2144: syntax error : '<Unknown>' should be preceded by '<Unknown>'
mproc.c(45) : error C2143: syntax error : missing ';' before 'identifier'
mproc.c(45) : error C2065: 'threadHandle' : undeclared identifier
mproc.c(45) : error C2109: subscript requires array or pointer type
mproc.c(46) : error C2143: syntax error : missing ';' before 'type'
mproc.c(48) : error C2275: 'DWORD' : illegal use of this type as an expression
        h:\vctoolkit\include\WinDef.h(141) : see declaration of 'DWORD'
mproc.c(48) : error C2146: syntax error : missing ';' before identifier 'startTime'
mproc.c(48) : error C2144: syntax error : '<Unknown>' should be preceded by '<Unknown>'
mproc.c(48) : error C2144: syntax error : '<Unknown>' should be preceded by '<Unknown>'
mproc.c(48) : error C2143: syntax error : missing ';' before 'identifier'
mproc.c(48) : error C2065: 'startTime' : undeclared identifier
mproc.c(50) : error C2143: syntax error : missing ';' before 'type'
mproc.c(50) : error C2143: syntax error : missing ';' before 'type'
mproc.c(50) : error C2143: syntax error : missing ')' before 'type'
mproc.c(50) : error C2143: syntax error : missing ';' before 'type'
mproc.c(50) : warning C4552: '<' : operator has no effect; expected operator with side-effect
mproc.c(50) : error C2059: syntax error : ')'
mproc.c(51) : error C2143: syntax error : missing ';' before '{'
mproc.c(52) : error C2109: subscript requires array or pointer type
mproc.c(52) : error C2065: 'parameter' : undeclared identifier
mproc.c(52) : error C2109: subscript requires array or pointer type
mproc.c(52) : error C2198: 'CreateThread' : too few arguments for call through pointer-to-function
mproc.c(55) : error C2065: 'true' : undeclared identifier
mproc.c(59) : error C2143: syntax error : missing ';' before 'type'
mproc.c(59) : error C2143: syntax error : missing ';' before 'type'
mproc.c(59) : error C2143: syntax error : missing ')' before 'type'
mproc.c(59) : error C2143: syntax error : missing ';' before 'type'
mproc.c(59) : warning C4552: '<' : operator has no effect; expected operator with side-effect
mproc.c(59) : error C2059: syntax error : ')'
mproc.c(60) : error C2143: syntax error : missing ';' before '{'
mproc.c(62) : error C2109: subscript requires array or pointer type
mproc.c(62) : error C2198: 'CloseHandle' : too few arguments for call through pointer-to-function
Microsoft (R) Incremental Linker Version 7.10.3052
Copyright (C) Microsoft Corporation.  All rights reserved.

LINK : fatal error LNK1181: cannot open input file 'mproc.obj'
Volume in drive H is WIN2K_H
Volume Serial Number is 20E8-3719

Directory of H:\vctoolkit\multiproc

06/09/2008  11:46p               1,413 mproc.c
06/09/2008  11:59p               1,413 mproc.cpp
               2 File(s)          2,826 bytes
               0 Dir(s)  17,639,866,368 bytes free
Press any key to continue . . .


hutch--

I didn't ask for a link, I asked for your build information.

c0d1f1ed


hutch--

Spare us the smartarse wisecracks and just post your build information. This code looks like it will work, why the kiddies games ?

c0d1f1ed

Download Visual C++ Express, copy/paste the code and hit F5. Where's the problem?

hutch--

While we are waiting for your buildable source, here is a partial conversion to MASM.

Timing

4515 MS Single thread timing
Press any key to continue ...

5 year old PIV 2.8 gig Northwood.


Your timings on quad core Intel.

Milliseconds for 1 threads: 61090, multi-thread speedup: 1.000000
Milliseconds for 2 threads: 30904, multi-thread speedup: 1.976767
Milliseconds for 3 threads: 21247, multi-thread speedup: 2.875229
Milliseconds for 4 threads: 17800, multi-thread speedup: 3.432022


Why is there such a major timing difference between a 5-year-old PIV and your quad core?

Source

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    threadRoutine PROTO :DWORD

    .data
      elapsedMilliseconds dd 0,1,2,3
      done dd 0,0,0,0

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc

  ; --------------------
  ; single thread timing
  ; --------------------
    invoke GetTickCount
    push eax

    invoke threadRoutine,0

    invoke GetTickCount
    pop ecx
    sub eax, ecx

    print str$(eax)," MS Single thread timing",13,10

    ret

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

threadRoutine proc parameter:DWORD

    LOCAL var   :DWORD

    mov var, 12345678

    push esi
    mov esi, 4000000000         ; loop count, same as the C version with one thread

  align 16
  @@:
    mov eax, var
    mov ecx, var
    mov edx, var
    sub esi, 1
    jnz @B

    pop esi

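    invoke SetEvent,parameter   ; parameter is 0 in this single-thread test, so no event is signalled
    mov done, eax               ; keep SetEvent's return value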

    ret

threadRoutine endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start