The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: oex on February 13, 2010, 01:46:36 AM

Title: Core Blimey
Post by: oex on February 13, 2010, 01:46:36 AM
Hey guys,

I was just thinking about multicore (something I don't have but many now do).... I'm assuming if I create a thread it will be automatically executed on a free core by Windows? Or do I somehow have to specify which core to use?.... Also is there a quick and dirty way of finding out if a system is multicore? I did see a way in dedndave's CPUID prog but I didn't get what affinity was? I'm assuming threads only have a minor overhead but I'd rather not make an app multithreaded if it will only have a negative impact on execution.... Finally, assuming the above assumptions are correct on Windows' management of threads, will threads be executed on a single core or cross core? If I have an app with 2 threads, core 0 executes main app code and cores 1 and 2 execute the 2 threads, so is core 3 just sitting there twiddling its bits waiting for a kick up the arse from the main app, or will it help out cores 0, 1 and 2?
Title: Re: Core Blimey
Post by: dedndave on February 13, 2010, 03:01:50 AM
in the version i am working on, i use GetProcAddress of SetProcessAffinityMask to test if the OS supports multiple cores
it is more reliable than getting the OS version - supposedly, NT4 and up support multiple cores
the tricky one is windows CE, for embedded systems
i found different answers to "does CE support multiple cores" - i suspect newer versions may - older ones may not
some versions of CE are "buildable" and the OEM may be able to eliminate the API's

once you see that the OS supports it, you can use GetProcessAffinityMask to get 2 masks; 1 for the process and 1 for the system
if you want to know how many (enabled) logical cores are in the system, just count the bits in the system affinity mask
they could be hyper-thread cores, physical cores in the same package and/or multiple packages

as for the thread assignment, the OS will assign a core for a new thread
there is no documented guarantee that it won't switch cores but if it does, i doubt it happens very often
for single-core or multi-core machines, the thread is given time-slices, just like any other process
unless you bind a thread to a specific core, it can operate on whichever core the OS sees fit
for threads, there are also GetThreadAffinityMask and SetThreadAffinityMask API's

remember there are other processes running all the time, so a "free core" probably doesn't happen very often
if the OS is busy or if you have multiple programs running - who knows how it will assign threads to cores
you may have 3 threads all running on the same core
but, they get separate registers and stacks, etc (i.e. context)
Title: Re: Core Blimey
Post by: oex on February 13, 2010, 03:39:25 AM
ty that helps solidify my understanding, my apps are rather intensive so being able to use multiple cores if available is a big plus
Title: Re: Core Blimey
Post by: jj2007 on February 13, 2010, 10:29:40 AM
Make sure to avoid multiple threads fighting for the reading head of your hard disk...
Title: Re: Core Blimey
Post by: dedndave on February 13, 2010, 11:19:14 AM
yah - it probably makes sense to let the OS finish its work on files one at a time for large reads or writes
although, if you have an app that sparsely reads or writes small sections of different files, threading might make sense
Title: Re: Core Blimey
Post by: hutch-- on February 13, 2010, 12:52:34 PM
If you run an app like this you will see that the OS tends to distribute the load across the different cores anyway. JJ is right that you should not let different threads assault a disk at the same time; try and do that from one thread alone, as it will be faster if the disk is not being thrashed between two or more threads. The thing that spreads the load around is normal OS time slicing, so depending on the thread duration(s), as one thread finishes the core will be re-used in the next time slice to carry the load of the other threads that are still running.

Something worth understanding is that the more threads you start, the harder each core must work, so you should keep the number of non suspended threads down to the core count if possible.
Title: Re: Core Blimey
Post by: oex on February 13, 2010, 02:18:15 PM
The multithreading is mainly for in memory compression/decompression so this works for me, ty for the input guys
Title: Re: Core Blimey
Post by: brethren on February 13, 2010, 03:45:20 PM
maybe the code in examples/exampl10/threads/mprocasm will be helpful

here's what it says in the comment
Quote
        The original design for this example was written by
        "c0d1f1ed" in Microsoft C++.

        It has been ported to MASM with a number of corrections
        and has been simplified to test on 1, 2 and 4 core
        processors. It is also one tenth of the size as is
        consistent with pure assembler programming.

        The design is to sequentially start 1, 2 and 4 threads
        without using leading or interactive operating system
        thread synchronisation methods, which removes a major
        timing delay, and it uses an operating system
        synchronisation method on thread exit so the results
        can be displayed when all threads have terminated.

        On a single core machine the results of the two and four
        thread tests should be two and four times longer.

        On a dual core machine the two thread test should run in
        much the same time as the single thread test and the four
        thread test should be two times longer.

        On a quad core machine all three tests should have a
        similar timing.
Title: Re: Core Blimey
Post by: dedndave on February 13, 2010, 04:43:51 PM
i do find that interesting
i have a dual core (hyperthreaded - not two separate cores)
the results from that test indicate that i have a single core
i would not want to use the timing method to map cores to affinity mask bits,
but i may be able to adapt something like it to verify whatever method i do come up with
Title: Re: Core Blimey
Post by: oex on February 13, 2010, 07:02:39 PM
Quote from: dedndave on February 13, 2010, 04:43:51 PM
i do find that interesting
i have a dual core (hyperthreaded - not two separate cores)
the results from that test indicate that i have a single core

That sounds bad, sounds like there is no way to force windows to use an idle core for a thread? Any ideas on the logic used for multicore tasking? Maybe this example is just not compatible with hyperthreaded multicore for some reason?
Title: Re: Core Blimey
Post by: jj2007 on February 13, 2010, 07:11:31 PM
Quote from: oex on February 13, 2010, 07:02:39 PM
That sounds bad, sounds like there is no way to force windows to use an idle core for a thread?

No, it seems there is no idle core with HT because there is only one core - it is just a bit more efficiently used, about 30% or so. Windows will choose an idle core automatically for you.
Title: Re: Core Blimey
Post by: dedndave on February 13, 2010, 07:14:44 PM
no - that sounds like it ought to sound
my hyper-threaded core is really a single core and the test reveals that
a hyper-threaded core is essentially an additional set of registers and context - 2 "logical" cores sharing a single "physical" core

as for thread scheduling, the best you can do is divide your thread requirements equally amongst the cores that are present
you can use the SetThreadAffinityMask API for that
in reality, you can probably ignore affinity altogether and let the OS schedule them for you
it will probably do as well or better than you can

EDIT - i can see a case where i might want to manually control scheduling
let's say i have one thread that is extremely processor intensive
and a few other threads that are more-or-less "background" threads
i might bind the intensive thread to one core and the others to a different core
the only reason taking control makes sense is because i know in advance that one thread is intensive
the OS cannot make that kind of prediction
Title: Re: Core Blimey
Post by: oex on February 13, 2010, 07:17:21 PM
ok ty for that info I thought I'd wasted an evening :D
Title: Re: Core Blimey
Post by: oex on February 13, 2010, 09:08:50 PM
hmm just had a thought.... it should be possible to write a macro that replaces the invoke call with something like invokethread proc,args.... This could be invaluable for multicore machines.... I'm rather busy atm, got a deadline next week, but I know some of you enjoy a challenge.... Not thought it all through and not sure how to get the size of macro args so I got stopped at the first hurdle, but passing the data something like:


Invoke Macro:

invokeThread MACRO FuncName:REQ, args:VARARG

    mov edi, alloc(32)              ;; edi = start of the argument block
    mov esi, edi                    ;; esi walks the block

    mov dword ptr [esi], offset FuncName
    add esi, 4

    FOR var,<args>
        IF issize(var, 1)
            mov byte ptr [esi], var
            inc esi
        ENDIF
        IF issize(var, 2)
            mov word ptr [esi], var
            add esi, 2
        ENDIF
        IF issize(var, 4)
            mov dword ptr [esi], var
            add esi, 4
        ENDIF
    ENDM

    ;; pass the start of the block (edi), not the end pointer
    invoke CreateThread, 0, 0, offset MyThread, edi, 0, offset ThreadID
ENDM


And then reading them back off in MyThread should be quite easy if you can get (and pass) arg sizes

I dont like to waste these little sparks of inspiration ;) far better to share them
Title: Re: Core Blimey
Post by: hutch-- on February 13, 2010, 11:11:28 PM
The late PIVs were reasonably sophisticated for a single core; hyperthreading worked OK on a late enough OS, but the availability of multiple core processors produced far better threaded performance. The next generation i7 series do both, hyperthreading AND multiple cores, and from the testing I have seen it makes a big difference again with multithreaded code. The real problem with later PIVs was the pipeline length; if you coded carefully for them you could get them to perform OK, but tangle your instruction sequence and you took some big performance hits for it.
Title: Re: Core Blimey
Post by: qWord on February 13, 2010, 11:45:45 PM
Quote from: oex on February 13, 2010, 09:08:50 PM.... it should be possible to write a macro that replaces the invoke call with something like invokethread proc,args....

something like this one (not much tested):
include masm32rt.inc

.code

ThreadJob macro FncName:req,args:VARARG

    IFNDEF ThreadJob_proc
        ThreadJob_proc proto pVoid:DWORD
    ENDIF
    ;; push the arguments in reverse order
    cntr = argcount(args)
    REPEAT argcount(args)
        IF @InStr(1,<%reparg(getarg(cntr,args))>,<ADDR >)
            lea eax, @SubStr(<%reparg(getarg(cntr,args))>,5)
            push eax
        ELSE
            push reparg(getarg(cntr,args))
        ENDIF
        cntr = cntr - 1
    ENDM
    ;; block layout: [0] = proc address, [4] = arg count, [8..] = args
    push argcount(args)
    push OFFSET FncName
    invoke GlobalAlloc,GPTR,argcount(args)*4+4+4
    push eax
    lea edx,[esp+4]
    invoke MemCopy,edx,eax,argcount(args)*4+4+4
    pop edx
    invoke CreateThread,0,0,OFFSET ThreadJob_proc,edx,0,0
    add esp,4*argcount(args) + 8

endm

start:

ThreadJob MessageBox,0,"Hallo World","Hi",0
inkey "Press any key to continue ..."
ret

ThreadJob_proc proc pVoid:DWORD
    mov esi,pVoid
    mov edx,esi
    mov ecx,[esi+4]         ;; argument count
    lea eax,[ecx*4]
    sub esp,eax             ;; make room for the arguments
    add esi,8               ;; esi -> packed arguments
    mov edi,esp
    cld
    rep movsd               ;; copy the arguments onto the new thread's stack
    call DWORD ptr [edx]    ;; call the target proc (stdcall pops the args)
    push eax                ;; target's return value becomes the exit code
    invoke GlobalFree,pVoid
    call ExitThread
ThreadJob_proc endp
end start
Title: Re: Core Blimey
Post by: oex on February 13, 2010, 11:55:21 PM
:) nice 1.... I was about half way through :lol, not much of a macro person.... You may need something like this in there:
      IF issize(var, 1)
         IF isregister(var)
            mov [esi], var
            inc esi
            inc edi
         ELSE
            mov   al, var
            mov [esi], al
            inc esi
            inc edi
         ENDIF
      ENDIF


EDIT: Yeah it's turning into a large macro but not a bad one methinks
Title: Re: Core Blimey
Post by: oex on February 14, 2010, 12:31:13 AM
You have some quirky code in there like

mov edi,esp
cld
rep movsd

I like it :bg

EDIT: Just noticed that PUSH and POP work only with WORD/DWORD values not BYTE....

In that case when you have
Blah PROC Blah2:DWORD, Blah3:BYTE
is a WORD pushed for second arg?
Title: Re: Core Blimey
Post by: qWord on February 14, 2010, 12:55:19 AM
You should keep in mind that the thread handles need to be closed (CloseHandle). This isn't done by my macro.

Quote from: oex on February 14, 2010, 12:31:13 AM
In that case when you have
Blah PROC Blah2:DWORD, Blah3:BYTE
is a WORD pushed for second arg?
no, a DWORD is pushed - the stack must be aligned on a 4 byte boundary.
Title: Re: Core Blimey
Post by: oex on February 14, 2010, 01:48:26 AM
ah kk nice 1 ty
Title: Re: Core Blimey
Post by: dedndave on February 14, 2010, 02:07:47 AM
you guys are gonna make it too easy - lol
with my method, you had to understand more stuff to use it   :red
Title: Re: Core Blimey
Post by: qWord on February 14, 2010, 02:18:26 AM
the QueueUserWorkItem function came to mind - replacing CreateThread with it solves the problem of closing the thread handle:
replace invoke CreateThread,0,0,OFFSET ThreadJob_proc,edx,0,0
with
invoke QueueUserWorkItem,OFFSET ThreadJob_proc,edx,WT_EXECUTEDEFAULT

and the new thread proc:
ThreadJob_proc proc pVoid:DWORD
    mov esi,pVoid
    mov edx,esi
    mov ecx,[esi+4]         ;; argument count
    lea eax,[ecx*4]
    sub esp,eax             ;; make room for the arguments
    add esi,8               ;; esi -> packed arguments
    mov edi,esp
    cld
    rep movsd               ;; copy the arguments onto the worker's stack
    call DWORD ptr [edx]    ;; call the target proc (stdcall pops the args)
    invoke GlobalFree,pVoid
    ret
ThreadJob_proc endp

qWord
Title: Re: Core Blimey
Post by: dedndave on February 14, 2010, 02:22:32 AM
i thought the handle was closed when the thread terminated
when you use CloseHandle, test the result to see if there is an error
Title: Re: Core Blimey
Post by: sinsi on February 14, 2010, 02:39:37 AM
Quote from: MSDN CreateThread
The thread object remains in the system until the thread has terminated and all handles to it have been closed through a call to CloseHandle.
In their example they wait until the thread finishes then use CloseHandle on it.
Title: Re: Core Blimey
Post by: dedndave on February 14, 2010, 02:55:56 AM
thanks Sinsi   :U
that's one document i don't have to read - lol
i tell ya, after researching CPUID, i feel i have read my share
Title: Re: Core Blimey
Post by: Hagrid on February 18, 2010, 10:16:56 PM
Hello everyone.

Just wanted to throw out some general information about cores and threads and hyperthreading and affinity and so on.

Windows has had proper time-sliced multi-threading since Windows NT version 3.1 (yes, it did exist) and has supported multiple processors since that time.  We are talking 1994 vintage here, and symmetric multiprocessor ("SMP") machines were not entirely common.  I assembled my first personal SMP machine in about 1995 with a pair of 90MHz Pentiums.

The thread scheduling model for Windows is based on the concept of "executing the highest priority runnable thread".  Of course, in an SMP or multi-core system, this becomes plural.  A thread that is blocked because it is waiting on something (e.g. I/O) is not in a runnable state and, therefore, is not a candidate for time scheduling.  Of the remaining threads, the highest priority thread is the one that gets the CPU.

Threads that are waiting in the background for time will have their priority incremented from time to time until they get a CPU time slice, after which their priority returns to normal.  Don't be tempted to muck with thread priorities by increasing your thread's priority, as this can have consequences that are devastating to performance.  Increasing thread priority should be reserved for threads that require a fast response to an external event - the task of such a high priority thread is to gather whatever information is needed to deal with the event and queue it to be handled by a lower priority thread outside of "real time".  Heavy computational work should *never* be done in a high priority thread.  Sermon over.

As you know, each CPU core has a cache or two to help avoid doing unnecessary trips out to main memory.  Over a period of time, the cache fills with information relevant to the thread that is running on that core.  In a multi-core system, you want to try to keep a particular thread running on the same core if possible.  If a thread needs to jump to a different core, the cache on the new core needs to fill from memory (as required) so there is a performance hit.  This is where thread affinity fits in.  Once a thread starts to run on a particular core, it has an affinity with that core (thanks to the cache) and will return to the same core (if possible) for the next time slice.

Switching between threads on the same core is an expensive exercise as it involves a switch out of a large register set.  Intel introduced hyperthreading as a kind of duplicate register set to speed up switching between two particular threads.  In order to handle this as a new scheduling option, Windows shows a hyperthreaded CPU as two cores.  This allows Windows to distinguish between a context switch that involves hyperthreading and one that needs to do the full register swap out switch.  The fast switching of threads in Hyperthreading helps reduce overheads when you have two competing priority threads, but isn't going to help much when you have a lot of threads that are competing.  I often encountered "thread madness" in my work where programmers were kicking off threads all over the place to do the tiniest tasks.  Server software was often written as "one thread per client" until programmers learnt that this was not a free lunch.

So, for a high compute load worker thread, there is no point in running parallel threads on a single-core HT processor as you will only incur thread-switching penalties.

Running an SMP system or a proper dual-core/quad-core/etc system is an entirely different cup of coffee.

When designing compute-intensive algorithms for such systems, you basically want to keep the threads from stepping on each others toes.  Concurrent access to disk drives has been mentioned, but this is actually less of a problem than you might think - especially with current generation HDD's.  Where the hardware/drivers permitted, Windows NT (through to current) has always supported methods such as elevator seeking where IO requests are resequenced to minimise seek times.  This used only to be functional on high-end SCSI drives, but I believe that the current generation of SATA drives supporting native command queueing allow this as well.

Any single resource required by multiple threads will become a performance choke point - and these need to be avoided.

For high performance compute threads, it is much more important to keep them out of each others memory.  Multiple cores accessing the same memory location invalidates the local cache for the cores and you lose the performance of the cache.  This might be less of an issue for multi-core CPUs (I haven't checked), but for old-school SMP boxes, it was a big issue.  I don't think the principle has changed much.

Keep the number of compute threads equal to the number of cores you have.  An i7 has four cores (not 8 as shown in task manager - the HT thing happening there), so four compute threads is exactly right.  Give each thread its own big block of data to play with and let it get on with it.  As each thread finishes processing its block, give it another one (recycle the thread - don't keep starting and stopping threads).

FWIW,
Hagrid
Title: Re: Core Blimey
Post by: hutch-- on February 18, 2010, 10:31:57 PM
Hi Hagrid,

Welcome on board and thanks for an interesting post full of useful data.  :U
Title: Re: Core Blimey
Post by: oex on February 18, 2010, 10:39:09 PM
Thank you that was most useful, I learnt some things :bg
Title: Re: Core Blimey
Post by: BlackVortex on February 19, 2010, 08:00:00 AM
Interesting, I learned some things as well. From what I gather, the moral of the story is to NOT get aggressive with the threads, the priorities or the HDD IO. Trust in Windows  :toothy
Title: Re: Core Blimey
Post by: Hagrid on February 19, 2010, 08:29:57 AM
Quote from: BlackVortex on February 19, 2010, 08:00:00 AM
Interesting, I learned some things as well. From what I gather, the moral of the story is to NOT get aggressive with the threads, the priorities or the HDD IO. Trust in Windows  :toothy
Short answer is yes.  It's more a matter of understanding the rules of the game so that you can exploit them to your advantage.  I cannot be sure that all of my information is still correct as I have been out of the systems programming game for quite a few years.  Vista and Windows 7, along with 64-bitness and multicore processors, have all come into existence since I was consulting in this area, so there may have been some alterations to the scheduling strategy since then - although I suspect not.

It used to be that the key differences between scheduling models of Windows NT (and subsequent versions) and the *nixes was that Windows was scheduling for UI responsiveness whereas the *nixes were scheduling for fairness.  The unix derivatives grew from multi-user roots where the system had to ensure that all users got a fair go.

This also affects Windows and, in particular, the "server" class versions.  You can expect that Windows 2000 Advanced Server, Server 2003, and Server 2008 will behave differently in the scheduling model, as these are multi-user platforms.
Hagrid
Title: Re: Core Blimey
Post by: sinsi on February 19, 2010, 08:48:13 AM
I think that you need 2 threads for your GUI program - the main one with the user aspect ("why can't I click cancel") and the one that does the work.

As far as "as many threads as the number of cpu's" goes, this doesn't matter (my fresh install of XP has 400+ threads going now). A lot of them are blocking, but I think a blind "4 cores = 4 threads" is misleading. FWIW, I have been looking around at threads/async IO and still don't have a clear idea (:

>Trust in Windows
:bdg
Title: Re: Core Blimey
Post by: Hagrid on February 21, 2010, 03:52:13 AM
Quote from: sinsi on February 19, 2010, 08:48:13 AM
As far as "as many threads as the number of cpu's" goes, this doesn't matter (my fresh install of XP has 400+ threads going now). A lot of them are blocking, but I think a blind "4 cores = 4 threads" is misleading. FWIW, I have been looking around at threads/async IO and still don't have a clear idea (:

You're right about "4 cores = 4 threads" being misleading.  That isn't actually what I said.  I suggested "four compute threads".  You can have as many threads as you want in your application (each has its own overhead).  Threads that are blocked on IO are fine (although there are likely better ways) as such threads are not contending for CPU time.  Having more runnable compute threads than CPU cores is a different deal and this is where unnecessary thread switching will eat into efficiency.

If your app does a lot of IO (network, disk, etc.) then you should be considering asynchronous IO combined with an IOCompletionPort.  The IOCP acts as a worker thread dispatch point.  You can create as many worker threads as you want and as an IO completes, the last thread to wait on the IOCP will be released to process the response.  IOCPs are designed to restrict the number of running threads to equal the CPUs automatically.  As you can post your own completion notifications to an IOCP, it also is a nifty method of queueing work for compute threads.

The LIFO management of worker threads with an IOCP is also intended to prevent unnecessary thread switches - a worker thread that calls into the IOCP will return immediately if a completion notification is ready.  A FIFO strategy would guarantee a thread switch on every IO.

Hagrid
Title: Re: Core Blimey
Post by: hutch-- on February 22, 2010, 02:00:20 AM
This has been a very interesting discussion. Like everyone else I have seen a mountain of software over the last 10 or so years that pelted threads around all over the place, and the apps were characteristically laggy and slow for exactly the reason that massive active thread counts added far too much overhead to many applications. Multicore processors have relieved this problem by a long way, and the later i7 series with hyperthreading appears to have relieved it further, but the fundamentals of the problem are the same: the more active threads you have competing for processor time, the higher the overhead to task switch them will be.

Quads are now common and we are entering the era of many core processors, which opens up some exciting possibilities if the rest of the package is developed as well. Asynchronous parallel processing has been with us since Win95, and many core processors will make this type of code faster simply by spreading the load across more processors, but the more interesting stuff will be when x86 catches up to the Itanium capacity of running cores in sync. Synchronous parallel processing will see big gains in processing power as the increase in core count can be used in different ways.
Title: Re: Core Blimey
Post by: redskull on February 22, 2010, 02:40:47 AM
Quote from: sinsi on February 19, 2010, 08:48:13 AM
FWIW, I have been looking around at threads/async IO and still don't have a clear idea (:

An IOCP is a way that async I/O can be distributed amongst several worker threads, while controlling the number which are running at once.  A good analogy for an IOCP is an IT office: having a dedicated IT employee (a worker thread) assigned to help each non-IT employee (an I/O operation) would be easy, but wasteful and inefficient; since you would probably have more IT staff than desks and computers (CPUs), much of the staff would sit around waiting for one to become free.  However, having a single IT employee to serve everybody is no good, because the work just piles up behind him, and the other desks and computers go to waste (in this analogy, an IT member can't work on other projects while waiting on something to finish for the current one).  The most efficient method is to have one manager, and one IT member for each desk; requests go to the manager, who distributes the jobs to the employees one at a time.  That way, no desks go unused, and no employees sit around waiting.  When an employee finishes a task, he goes back to the manager to get another one.

Each IT employee is a worker thread, and the manager is the IOCP.  Your multithreaded program is MOST efficient when you have a single thread per CPU (all the other threads in the system are out of your control, so you can only worry about how efficient *your* program is).  The general idea is that a worker thread starts an I/O task, and then "checks back in" to the IOCP.  Eventually, when the I/O finishes, the IOCP receives the notification and starts up another worker thread to handle whatever needs to be done.  By having enough worker threads 'queued up', waiting for I/O operations to complete, you get the maximum efficiency from your program.  The trick is that you can configure the IOCP to 'throttle' the number of worker threads it starts at once; normally, 1 per CPU.

For example, imagine you have lots of disk accesses to do; reading the info from the user is an I/O operation, so whenever the read completes, the IOCP will wake up a thread, which will parse the input and start the disk read, and then go back to waiting.  Whenever the disk read completes, the IOCP wakes up the next worker thread, which deals with the results of the disk read (sending them back to the user, etc).  Obviously, the more pending I/O operations you have, the bigger the payoffs.  The problem with programming with IOCPs is that since each worker thread is identical, they all have to be 'smart' enough to deal with any of the results, instead of each thread being dedicated to one task.

Obviously, the more I/O intensive your app is and the more CPUs in the computer you intend to run on, the greater the benefits will be; such a set up is ideal for something like an SQL server, which runs on servers with dozens of CPUs, whose sole purpose is to read from a network, read from a disk, write to a network, and repeat on a staggering scale.

-r

Title: Re: Core Blimey
Post by: dedndave on February 22, 2010, 03:01:12 AM
in a way, it makes US work harder as programmers
if we want high performance software, we may have to carefully design it to take advantage of what the machine has to offer
i.e., our code has to adapt, which could make it a bit complicated
this is something that may be more prevalent in the near future, as more and more machines have multiple cores
soon, we will be able to say "most machines"
but, that could be anywhere from 2 to lord knows how many cores
we may want to run a few baseline tests in the laboratory
let's wait til one of the members gets a dual package i9   :bg
Title: Re: Core Blimey
Post by: oex on February 23, 2010, 06:14:20 PM
AMD has been reading my posts, they have a section on getting the core count with CPUID in this month's newsletter :lol