Core Blimey

Started by oex, February 13, 2010, 01:46:36 AM


qWord

Quote from: oex on February 13, 2010, 09:08:50 PM
.... it should be possible to write a macro that replaces the invoke call with something like invokethread proc,args....

Something like this one (not much tested):
include masm32rt.inc

.code

ThreadJob macro FncName:req,args:VARARG

IFNDEF ThreadJob_proc
ThreadJob_proc proto pVoid:DWORD
ENDIF
cntr = argcount(args)
REPEAT argcount(args)
IF @InStr(1,<%reparg(getarg(cntr,args))>,<ADDR >)
lea eax, @SubStr(<%reparg(getarg(cntr,args))>,5)
push eax
ELSE
push reparg(getarg(cntr,args))
ENDIF
cntr = cntr - 1
ENDM
push argcount(args)
push OFFSET FncName
invoke GlobalAlloc,GPTR,argcount(args)*4+4+4
push eax
lea edx,[esp+4]
invoke MemCopy,edx,eax,argcount(args)*4+4+4
pop edx
invoke CreateThread,0,0,OFFSET ThreadJob_proc,edx,0,0
add esp,4*argcount(args) + 8

endm

start:

ThreadJob MessageBox,0,"Hallo World","Hi",0
inkey "Press any key to continue ..."
ret

ThreadJob_proc proc pVoid:DWORD
    mov esi,pVoid           ; esi -> heap block: [fn ptr][arg count][args...]
    mov edx,esi             ; keep a pointer to the function address
    mov ecx,[esi+4]         ; ecx = number of arguments
    lea eax,[ecx*4]
    sub esp,eax             ; make room on the stack for the argument copy
    add esi,8               ; esi -> first argument in the block
    mov edi,esp
    cld
    rep movsd               ; copy the arguments onto the stack
    call DWORD ptr [edx]    ; call the target proc (stdcall callee cleans up)
    push eax                ; keep the proc's return value as the exit code
    invoke GlobalFree,pVoid ; release the argument block
    call ExitThread         ; exit code (eax) is already on the stack
ThreadJob_proc endp
end start
FPU in a trice: SmplMath
It's that simple!

oex

:) nice 1.... I was about half way through :lol, not much of a macro person.... You may need something like this in there:
      IF issize(var, 1)
         IF isregister(var)
            mov [esi], var
            inc esi
            inc edi
         ELSE
            mov   al, var
            mov [esi], al
            inc esi
            inc edi
         ENDIF
      ENDIF


EDIT: Yeah it's turning into a large macro but not a bad one methinks
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

oex

You have some quirky code in there like

mov edi,esp
cld
rep movsd

I like it :bg

EDIT: Just noticed that PUSH and POP work only with WORD/DWORD values not BYTE....

In that case when you have
Blah PROC Blah2:DWORD, Blah3:BYTE
is a WORD pushed for second arg?

qWord

You should keep in mind that the thread handles need to be closed (CloseHandle). This isn't done by my macro.

Quote from: oex on February 14, 2010, 12:31:13 AM
In that case when you have
Blah PROC Blah2:DWORD, Blah3:BYTE
is a WORD pushed for second arg?
no, a DWORD is pushed - the stack must stay aligned on 4 bytes.


dedndave

you guys are gonna make it too easy - lol
with my method, you had to understand more stuff to use it   :red

qWord

The QueueUserWorkItem function came to mind - replacing CreateThread with it solves the problem of closing the thread handle:
replace invoke CreateThread,0,0,OFFSET ThreadJob_proc,edx,0,0
with
invoke QueueUserWorkItem,OFFSET ThreadJob_proc,edx,WT_EXECUTEDEFAULT

and the new thread proc:
ThreadJob_proc proc pVoid:DWORD
mov esi,pVoid
mov edx,esi
mov ecx,[esi+4]
lea eax,[ecx*4]
sub esp,eax
add esi,8
mov edi,esp
cld
rep movsd
call DWORD ptr [edx]
invoke GlobalFree,pVoid
ret
ThreadJob_proc endp

qWord

dedndave

i thought the handle was closed when the thread terminated
when you use CloseHandle, test the result to see if there is an error

sinsi

Quote from: MSDN CreateThread
The thread object remains in the system until the thread has terminated and all handles to it have been closed through a call to CloseHandle.
In their example they wait until the thread finishes then use CloseHandle on it.
Light travels faster than sound, that's why some people seem bright until you hear them.

dedndave

thanks Sinsi   :U
that's one document i don't have to read - lol
i tell ya, after researching CPUID, i feel i have read my share

Hagrid

Hello everyone.

Just wanted to throw out some general information about cores and threads and hyperthreading and affinity and so on.

Windows has had proper time-sliced multi-threading since Windows NT version 3.1 (yes, it did exist) and has supported multiple processors since that time.  We are talking 1994 vintage here, and symmetric multiprocessor ("SMP") machines were not entirely common.  I assembled my first personal SMP machine in about 1995 with a pair of 90 MHz Pentiums.

The thread scheduling model for Windows is based on the concept of "executing the highest priority runnable thread".  Of course, in an SMP or multi-core system, this becomes plural.  A thread that is blocked because it is waiting on something (e.g. I/O) is not in a runnable state and, therefore, is not a candidate for time scheduling.  Of the remaining threads, the highest priority thread is the one that gets the CPU.

Threads that are waiting in the background for time will have their priority incremented from time to time until they get a CPU time slice, after which their priority returns to normal.  Don't be tempted to muck with thread priorities directly by increasing your thread's priority, as this will have consequences that can be devastating to performance.  Increasing thread priority should be reserved for threads that require a fast response to an external event - the task of such a high priority thread is to gather whatever information is needed to deal with the event and queue this to be handled by a lower priority thread outside of "real time".  Heavy computational work should *never* be done in a high priority thread.  Sermon over.

As you know, each CPU core has a cache or two to help avoid doing unnecessary trips out to main memory.  Over a period of time, the cache fills with information relevant to the thread that is running on that core.  In a multi-core system, you want to try to keep a particular thread running on the same core if possible.  If a thread needs to jump to a different core, the cache on the new core needs to fill from memory (as required) so there is a performance hit.  This is where thread affinity fits in.  Once a thread starts to run on a particular core, it has an affinity with that core (thanks to the cache) and will return to the same core (if possible) for the next time slice.

Switching between threads on the same core is an expensive exercise as it involves a switch out of a large register set.  Intel introduced hyperthreading as a kind of duplicate register set to speed up switching between two particular threads.  In order to handle this as a new scheduling option, Windows shows a hyperthreaded CPU as two cores.  This allows Windows to distinguish between a context switch that involves hyperthreading and one that needs to do the full register swap-out.  The fast switching of threads in hyperthreading helps reduce overheads when you have two competing priority threads, but isn't going to help much when you have a lot of threads that are competing.  I often encountered "thread madness" in my work where programmers were kicking off threads all over the place to do the tiniest tasks.  Server software was often written as "one thread per client" until programmers learnt that this was not a free lunch.

So, for a high compute load worker thread, there is no point in running parallel threads on a single-core HT processor as you will only incur thread-switching penalties.

Running an SMP system or a proper dual-core/quad-core/etc system is an entirely different cup of coffee.

When designing compute-intensive algorithms for such systems, you basically want to keep the threads from stepping on each other's toes.  Concurrent access to disk drives has been mentioned, but this is actually less of a problem than you might think - especially with current generation HDDs.  Where the hardware/drivers permitted, Windows NT (through to current) has always supported methods such as elevator seeking, where IO requests are resequenced to minimise seek times.  This used to be functional only on high-end SCSI drives, but I believe that the current generation of SATA drives supporting native command queueing allow this as well.

Any single resource required by multiple threads will become a performance choke point - and these need to be avoided.

For high performance compute threads, it is much more important to keep them out of each other's memory.  Multiple cores accessing the same memory location invalidates the local cache for the cores and you lose the performance of the cache.  This might be less of an issue for multi-core CPUs (I haven't checked), but for old-school SMP boxes it was a big issue.  I don't think the principle has changed much.

Keep the number of compute threads equal to the number of cores you have.  An i7 has four cores (not 8 as shown in task manager - the HT thing happening there), so four compute threads is exactly right.  Give each thread its own big block of data to play with and let it get on with it.  As each thread finishes processing its block, give it another one (recycle the thread - don't keep starting and stopping threads).

FWIW,
Hagrid

hutch--

Hi Hagrid,

Welcome on board and thanks for an interesting post full of useful data.  :U
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

oex

Thank you that was most useful, I learnt some things :bg

BlackVortex

Interesting, I learned some things as well. From what I gather, the moral of the story is to NOT get aggressive with the threads, the priorities or the HDD IO. Trust in Windows  :toothy

Hagrid

Quote from: BlackVortex on February 19, 2010, 08:00:00 AM
Interesting, I learned some things as well. From what I gather, the moral of the story is to NOT get aggressive with the threads, the priorities or the HDD IO. Trust in Windows  :toothy
Short answer is yes.  It's more a matter of understanding the rules of the game so that you can exploit them to your advantage.  I cannot be sure that all of my information is still correct as I have been out of the systems programming game for quite a few years.  Vista and Windows 7, along with 64-bitness and the multicore processors, have all come into existence since I was consulting in this area, so there may have been some alterations to the scheduling strategy since then - although I suspect not.

It used to be that the key difference between the scheduling models of Windows NT (and subsequent versions) and the *nixes was that Windows was scheduling for UI responsiveness whereas the *nixes were scheduling for fairness.  The unix derivatives grew from multi-user roots where the system had to ensure that all users got a fair go.

This also affects Windows and, in particular, the "server" class versions.  You can expect that Windows 2000 Advanced Server, Server 2003, and Server 2008 will behave differently in the scheduling model, as these are multi-user platforms.
Hagrid