I remember a thread recently producing a faster memory fill algo, and at the time I made the suggestion that separating the memory fill from the work done on the memory may be faster if one could be done in a separate thread from the other.
Below is the rough idea of what I was suggesting. It is a long way from being fast enough due to creating a new thread for each memory fill, but there may be a way to keep the fill thread running and signal it by inter-thread communication when to fill a buffer.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
comment * -----------------------------------------------------
Build this template with
"CONSOLE ASSEMBLE AND LINK"
----------------------------------------------------- *
tdata STRUCT
pbuffer dd ? ; buffer address
bufsize dd ? ; buffer size
tdata ENDS
start_fill_thread PROTO :DWORD
fill_thread PROTO :DWORD
.data?
signal dd ?
loaded dd ?
.code
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
call main
inkey
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
memcnt equ <1024*1024*100>
main proc
LOCAL pmem1 :DWORD
LOCAL pmem2 :DWORD
LOCAL tdat :tdata
mov pmem1, alloc(memcnt) ; allocate 2 blocks of memory
mov pmem2, alloc(memcnt)
; ||||||||||||||||||||||||||||||||||||||||||||||||
; fill pmem1 while working on pmem2
; ||||||||||||||||||||||||||||||||||||||||||||||||
mov eax, pmem1
mov DWORD PTR tdat.pbuffer, eax
mov DWORD PTR tdat.bufsize, memcnt ; 100 meg buffer
invoke start_fill_thread,ADDR tdat ; start new thread
; ************************************
; do your own processing here on pmem2
; ************************************
spinlock0:
invoke SleepEx,1,0
cmp signal, 0 ; test if thread is finished yet
jne spinlock0
print "Thread finished",13,10
; ||||||||||||||||||||||||||||||||||||||||||||||||
; fill pmem2 while working on pmem1
; ||||||||||||||||||||||||||||||||||||||||||||||||
mov eax, pmem2
mov DWORD PTR tdat.pbuffer, eax
mov DWORD PTR tdat.bufsize, memcnt ; 100 meg buffer
invoke start_fill_thread,ADDR tdat ; start new thread
; ************************************
; do your own processing here on pmem1
; ************************************
spinlock1:
invoke SleepEx,1,0
cmp signal, 0 ; test if thread is finished yet
jne spinlock1
print "Thread finished",13,10
; ||||||||||||||||||||||||||||||||||||||||||||||||
free pmem2
free pmem1
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
start_fill_thread proc pstruct:DWORD
LOCAL tID :DWORD
mov loaded, 0 ; clear the loaded flag
invoke CreateThread,0,0,fill_thread,pstruct,0,ADDR tID
spinlock:
invoke SleepEx,1,0
cmp loaded, 1
jne spinlock
print "Thread started",13,10
mov eax, tID
ret
start_fill_thread endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
fill_thread proc pstruct:DWORD
LOCAL pbuf :DWORD
LOCAL sbuf :DWORD
push ebx
push esi
push edi
mov signal, 1 ; set the signal flag
mov edi, pstruct
lea ecx, (tdata PTR [edi]).pbuffer ; copy structure data to LOCAL
mov ecx, [ecx]
mov pbuf, ecx
lea ecx, (tdata PTR [edi]).bufsize ; copy structure data to LOCAL
mov ecx, [ecx]
mov sbuf, ecx
mov loaded, 1 ; set the loaded flag for the calling proc
invoke memfill,pbuf,sbuf,0
pop edi
pop esi
pop ebx
mov signal, 0 ; clear the signal flag
ret
fill_thread endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
By the time the new thread even starts, the first thread could already have filled dozens of megabytes itself.
Triple-buffering, with FIFO control of a persistent second thread (only on multicore systems), is the way.
By the way, why are we zeroing so much memory again and again? I see too few cases where zeroing megabytes at a time, 100 times a second, is relevant :P. Usually, just overwriting memory is enough.
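A rough sketch of the rotation Ultrano describes might look like this (hypothetical only: pbuf, fills_done and work_done are invented names, and each counter is written by exactly one thread, so no locked instructions are needed):

; pbuf dd 3 dup(?) holds the three buffer pointers
; fills_done dd 0 and work_done dd 0 are the FIFO counters

; persistent fill thread: keeps up to three slots filled ahead
fill_ahead:
    mov eax, fills_done
    sub eax, work_done
    cmp eax, 3
    jae fill_ahead              ; all three slots full, wait here
    mov eax, fills_done
    xor edx, edx
    mov ecx, 3
    div ecx                     ; edx = slot to fill next
    invoke memfill, pbuf[edx*4], memcnt, 0
    inc fills_done              ; publish the filled slot
    jmp fill_ahead

; main thread: consume filled slots in FIFO order
work_loop:
    mov eax, fills_done
    cmp eax, work_done
    je work_loop                ; nothing filled ahead of us yet
    mov eax, work_done
    xor edx, edx
    mov ecx, 3
    div ecx                     ; edx = slot to work on
    ; ... do the real work on pbuf[edx*4] here ...
    inc work_done               ; hand the slot back for refilling
    jmp work_loop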
:bg
In ancient times (10 years ago) I remember sprites starting as blanks with progressive blitting onto the sprite and finally blitting the sprite to screen. The general idea with the example is to utilise multicore processors if a way can be found that is fast enough to communicate with a slave thread.
Quote from: Ultrano on April 20, 2008, 01:36:36 PM
Usually, just overwriting memory is enough.
yep, it's always interesting to see another approach, but i think assigning a specific task to a cpu will certainly produce better results... but i'm going to try it... just in case... :wink
EDIT :
hutch,
lea ecx,(tdata PTR [edi]).pbuffer ; copy structure data to LOCAL
mov ecx,[ecx]
can be reduced to :
mov ecx,DWORD PTR (tdata PTR [edi]).pbuffer
or (shorter) :
mov ecx,(tdata PTR [edi]).pbuffer
1 instruction less, + 1 stall avoided. and you can also directly use :
invoke memfill,(tdata PTR [edi]).pbuffer,(tdata PTR [edi]).bufsize,0
now, concerning multithreading, i think it's not a good solution... i mean, not in this form... creating a thread is much too slow... so it has to be created beforehand, and just one thread (or one extra thread), where all the "background" tasks are defined... (not fill only) and you jump to those tasks when needed, so you should have something like :
ThreadBackgroundTasks PROC
; some init
BackgroundTasksLoop:
NotYet:
cmp BackgroundThread,TRUE
jne NotYet
;here your tasks :
; mov eax,TasksFlags ; 32 bits so 32 tasks possible, it's certainly enough
; and eax,TaskNumberX ; keep the corresponding bit
; jz AvoidTaskNumberX ; if the bit isn't here, avoid the task
;TaskX:
;
;
;AvoidTaskNumberX:
cmp BackgroundThreadEnd,TRUE ; end of the loop
jne BackgroundTasksLoop
; here, to be sure the thread is finished before the main thread/program
invoke SendMessage,WinMainCopy,WM_CLOSE,NULL,NULL
; ret not needed
ThreadBackgroundTasks ENDP
this runs until your program ends; here BackgroundThread (and BackgroundThreadEnd) and TasksFlags have to be set by your program or your main thread (and they must be global variables to be constantly accessible). unfortunately, the loop spins until the flag is set (so there is useless cpu usage... unless someone knows a solution to avoid this...)
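One way to tighten that handshake (a hedged sketch of my own, not NightWare's code): the lockable bit-test instructions let the main thread post a task bit and the worker consume it atomically, so a request can't be lost between the test and the clear.

; main thread requests task number 3:
    lock bts TasksFlags, 3          ; set bit 3 atomically
    mov BackgroundThread, TRUE      ; open the gate in the loop above

; inside BackgroundTasksLoop, consume the bit atomically:
    lock btr TasksFlags, 3          ; test bit 3 and clear it in one step
    jnc AvoidTaskNumber3            ; bit was clear, skip the task
Task3:
    ; ... the task itself ...
AvoidTaskNumber3: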
You have two APIs, SuspendThread() and ResumeThread(), that would be useful to turn a thread on and off, but I have not yet seen a fast signalling process by which a master process can signal a slave thread to do something.
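A rough sketch of those two APIs in use (hWorker and tID are hypothetical locals, and fill_thread as posted runs once, so a loop-forever variant is assumed):

    invoke CreateThread,0,0,fill_thread,ADDR tdat,CREATE_SUSPENDED,ADDR tID
    mov hWorker, eax                ; thread exists but is not yet running
    ; ...
    invoke ResumeThread,hWorker     ; wake it for the next fill
    ; ...
    invoke SuspendThread,hWorker    ; park it again

The catch is that SuspendThread is only safe once you know the worker is idle at a known point; suspend it blind and it can freeze mid-fill.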
The attachment is an attempt to determine how long it takes for a thread to respond to a global flag. Typical results on my Windows 2000 P3 system:
61139 us, mean waiting for end of time slice
20 us, mean using Sleep, 0
6 us, mean using event wait
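For reference, the event wait that wins above would look something like this (a minimal sketch, not MichaelW's attachment; hEvent is an auto-reset event from CreateEvent and g_task is a global tdata the master fills in before each request):

; hEvent created once with: invoke CreateEvent,0,FALSE,FALSE,0
event_fill_thread proc param:DWORD
  @@:
    invoke WaitForSingleObject,hEvent,INFINITE  ; park until signalled
    invoke memfill,g_task.pbuffer,g_task.bufsize,0
    mov signal, 0                               ; done, as in the original code
    jmp @B
event_fill_thread endp

; master side, per request: fill in g_task, then
    mov signal, 1
    invoke SetEvent,hEvent                      ; wakes the worker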
here is an attempt at parallelism... since you can't use 2 threads with the same data (or you have to use CreateProcess+CreateRemoteThread),
i've just used one thread to work alongside the main program; in all cases the cpus are 100% used.
Results of the tests :
Test 1 (default hardware), done in 4373819 cycles
Test 2 (parallelism attempt), done in 4584532 cycles
Test 3 (attempt with another algo), done in 4583756 cycles
Test 4, done in 12 cycles
Test 5, done in 12 cycles
Test 6, done in 12 cycles
Test 7, done in 12 cycles
Test 8, done in 12 cycles
Press ENTER to quit...
Hutch,
First time back to this board after several years. During those years, I have been doing a great deal of work to get maximum performance out of multi-core CPUs. The conclusion, over and over again, is to minimize interaction between CPUs by keeping them fully occupied in separate memory to the greatest extent possible. Synchronizing threads is necessary for logic, but a killer on performance. So initializing data in a separate thread could really only benefit performance if you really have something else to do in the meantime, and often enough to make a difference. Minimizing the rate of bus-locked instructions by making them unnecessary with better logic--this is the goal.
I am glad to see this site is still active.
Quote from: hutch-- on April 21, 2008, 10:21:31 AM
You have two APIs SuspendThread() and resumeThread() that would be useful to turn a thread on and off but I have not yet seen a fast signalling process so a master process can signal a slave thread to do something.
using events might do the trick, or a mutex?... either way, the delay caused in hopping to the other thread probably kills any possible benefit you might gain from having a separate proc doing the memfill... even using SuspendThread/ResumeThread...
From what I have seen so far, the only place where multithreading seems to offer an advantage is with multiple processes that are reasonably sparse in terms of processor usage. Doing multiple downloads in separate threads is typical of this capacity.
Now with a memory intensive operation being performed on one core, I get the impression that the other core may not have completely free access at different memory addresses, even if they are far enough apart. I know on a larger scale, between two separate machines, if you use very large data sets you don't get any real problem with TCP/IP, but mechanisms like this are far too slow to be useful at close range with processor intensive work.
Ideas of this type may have to wait for different hardware with different capacity before they can be realised, as the current signalling methods between threads are too clunky and too slow to be of any real use.
Hutch,
To use dual CPUs for performance you must (1) set them up to work independently, and (2) balance their loads so that they finish at the same time. Of course (1) makes (2) difficult and elusive. If they don't finish at the same time, you're back to single-processor performance for the remainder of the task--we don't want that.
My solution to this gets the performance equivalent of 1.9 CPUs :dance: (Opteron 170/2800). I pair up two threads and give them a queue of actions to perform. The threads start from opposite ends of the queue, working their way toward the middle, so they both finish within one logical action of each other, no matter how uneven the workload is. The beauty of this is that the entire thing is managed by a single, lock-free counter (of remaining actions). The first thread to finish gets a count=0, the 2nd thread finishes with count=-1. The only processor contention is on the counter, and none on the queue.
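A minimal sketch of that counter scheme as described (my reading of it, not codewarp's code; remaining is a global dd initialized to N and queue is the shared action array):

; thread A works from the front of the queue
    xor esi, esi                ; front index
a_next:
    mov eax, -1
    lock xadd remaining, eax    ; eax = count before our decrement
    test eax, eax
    jle a_done                  ; 0 or less: the other thread claimed the last one
    ; ... perform queue[esi] ...
    inc esi
    jmp a_next
a_done:

; thread B is identical but starts at esi = N-1 and steps with dec esi.

Exactly N decrements come back positive, so the front portion taken by A and the back portion taken by B can never overlap, and the finish states match the description: the first thread to finish leaves the counter at 0, the second drives it to -1.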
Forget about calling system thread scheduling functions for managing anything resembling fine-grained computation, because they consume far too many clock cycles to be viable. You must observe (1) and (2) if you want dual-processor performance from your own code. The way you are doing it works fine, as long as you can tolerate the initial 6-8 microsecs of thread startup latency :eek.
There is a fundamental principle at work with broad relevance here: giving your synchronous duties away to asynchronous entities buys you a lot of waiting around.
Quote from: hutch-- on May 08, 2008, 11:14:19 AM
Ideas of this type may have to wait for different hardware with different capacity before they can be realised, as the current signalling methods between threads are too clunky and too slow to be of any real use.
hmm... no and yes... no, the ideas don't have to wait, i explain why below... and yes, signalling methods are deadly slow... now, reading you made me try a different approach... the use of different memory locations slows down the process because of the perpetual l2 shared-cache exchanges... so i tried to do one task (a 3mb fillmem) but with the work shared between the main program and a thread... (with an offset of 4 dwords), and i obtained a speedup of approximately 10% in a speed test against the same work done by the full equivalent algo... of course, here i just avoid the l2 cache exchanges (so there is benefit only in some cases...), but it's a start... unfortunately, instructions like rep stosd/stosd/movsd/movsb can't be used in this case...
Instead of using signalling, couldn't you use WaitForSingleObject() on the handle returned by CreateThread() ?
Or add ExitThread() to the fill_thread() routine? Then use WaitForMultipleObjects(), which would ensure all threads finish before continuing execution.
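A minimal sketch of that wait on the thread handle (hThread is a hypothetical DWORD local; the handle becomes signalled when fill_thread returns):

    invoke CreateThread,0,0,fill_thread,ADDR tdat,0,ADDR tID
    mov hThread, eax
    ; ... work on the other buffer here ...
    invoke WaitForSingleObject,hThread,INFINITE
    invoke CloseHandle,hThread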
Quote from: Kernel_Gaddafi on May 09, 2008, 02:09:33 AM
Instead of using signalling, couldn't you use WaitForSingleObject() on the handle returned by CreateThread() ?
Or add ExitThread() to the fill_thread() routine? Then use WaitForMultipleObjects(), which would ensure all threads finish before continuing execution.
WaitForAnyPrimitiveYouCanSyncOn( ) reduces your computational transaction speed to the speed of the Windows thread scheduler. To get faster, you need to be running already and use lock-free methods. Organize your methods so that the only time you wait is when you have nothing to do.
codewarp, WaitForSingleObject() on the handle returned by CreateThread() ?
where does it say to use WaitForSingleObject() inside fill_thread()?
if i were gonna implement such a routine, i'd distribute the total size of the buffer between N threads
otherwise there is no point really.. 1 routine is enough.
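A hedged sketch of that N-way split (split_fill and NTHREADS are invented here; it reuses hutch's tdata and fill_thread, ignoring the loaded/signal handshake flags, which would need to be made per-thread):

NTHREADS equ 4

split_fill proc uses ebx esi edi pbuffer:DWORD, nbytes:DWORD
    LOCAL slices[NTHREADS] :tdata
    LOCAL handles[NTHREADS]:DWORD
    LOCAL slicelen :DWORD
    LOCAL tID :DWORD
    mov eax, nbytes
    xor edx, edx
    mov ecx, NTHREADS
    div ecx
    mov slicelen, eax               ; bytes per slice (nbytes assumed divisible)
    mov esi, pbuffer                ; esi = start of the current slice
    xor ebx, ebx                    ; ebx = slice index
next_slice:
    lea edi, slices[ebx*8]          ; SIZEOF tdata = 8
    mov (tdata PTR [edi]).pbuffer, esi
    mov eax, slicelen
    mov (tdata PTR [edi]).bufsize, eax
    invoke CreateThread,0,0,fill_thread,edi,0,ADDR tID
    mov handles[ebx*4], eax         ; keep the handle to wait on
    add esi, slicelen
    inc ebx
    cmp ebx, NTHREADS
    jne next_slice
    invoke WaitForMultipleObjects,NTHREADS,ADDR handles,TRUE,INFINITE
    xor ebx, ebx
close_handles:
    invoke CloseHandle,handles[ebx*4]
    inc ebx
    cmp ebx, NTHREADS
    jne close_handles
    ret
split_fill endp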
The current x64 multi-core paradigm really isn't made for this.
This is in contrast to the current GPU paradigm, where there are so many cores that having one sitting around waiting for the instant something in particular can be done is no big loss (for example, the 8800GT sells for ~US$150 these days and has a whopping 112 streaming cores).
Both Intel and AMD are rapidly moving towards 8 cores, with an additional 32+ smaller streaming cores. This is when things will start to get interesting for tightly coupled threads.
The good news?
At least for Intel's Larrabee, these smaller streaming cores will use a subset of the x86 instruction set, so ASM programmers will be able to dive in quickly. They will begin by offering their streaming cores as an add-in card, but the current official roadmap has a 6x16 (6 CPU x 16 streaming) single chip codenamed Nehalem coming out sometime in 2009.
As for AMD's Fusion, these streaming cores will simply be beefed-up ATI GPU cores, which kind of stinks for ASM programmers. Probably the biggest mistake AMD will ever make, because the ATI GPU design isn't as good as NVidia's, meaning it can't compete on current graphics applications, while also being less than general purpose, so it cannot compete against Larrabee either. Things don't look good for long-term AMD competitiveness.
Edited to add:
NVidia is also playing this game, sort of. Their CUDA could potentially compete against Larrabee in the short term, but they can't integrate with the CPU, so in the end Intel is going to completely monopolize high performance desktop computing.
Quote from: Rockoon on May 10, 2008, 08:10:58 AM
Edited to add:
NVidia is also playing this game, sort of. Their CUDA could potentially compete against Larrabee in the short term, but they can't integrate with the CPU, so in the end Intel is going to completely monopolize high performance desktop computing.
Intel buying nvidia, like the speculation a few months ago, like AMD buying ATI heh heh.
Let's face it, a GPU can kill a CPU (for certain things, fair enough), but look at gddr5 vs ddr3 - why can't intel/amd keep up with this?
Quote from: Kernel_Gaddafi on May 10, 2008, 05:13:23 AM
codewarp, WaitForSingleObject() on the handle returned by CreateThread() ?
where does it say use WaitForSingleObject inside fill_thread()?
if i were gonna implement such a routine, i'd distribute the total size of buffer between N threads
otherwise there is no point really..1 routine is enough.
That's not the point. The point is that every time you wait for a thread to do anything, you are adding tens of thousands of clock cycles to your own thread, giving them away to another process that IS ready to run now. There is no benefit at all to memfills < 64kb, when it takes 5-10 microsecs of latency before your "helper" gets up to speed. It's like hiring someone to do a job who takes a two-hour coffee break at the beginning of every task.
I can't believe anyone would even debate this point. Multi-threading a memory fill to achieve better performance is simply not a well-formed concept. No one in their right mind would do this. Rather, this example is an attempt to answer the larger question: how can we take advantage of additional CPUs to enhance the performance of synchronous subtasks? This is a good question to ask, but the answer is not a very satisfying or resounding yes. Effective solutions come from understanding and working with the reality of multiple CPUs, not the fantasy. Ultimately, real multi-CPU performance within an application can only come from free-running, fully occupied threads that only minimally interact. With a good lock-free queuing mechanism, you can benefit from giving other threads tiny jobs to do. But the latency toll will still have to be paid--either from the WaitForStuff( ) calls, or from the jobs active on the queue remaining to be done at any given time. And you still have to synchronize with the data state changes from other threads before you can touch their results.
The resistance to this concept from the assembler point of view is understandable--queuing and related support mechanisms are complicated and substantial. It's the kind of complexity that pushes software toward higher-level solutions and away from assembler. Die-hard assembler programmers don't like to hear that (I used to be one, so I know). Work with it and you can achieve, fight it and you won't.
Quote from: Rockoon on May 10, 2008, 08:10:58 AM
For example, the 8800GT sells for ~US$150 these days and has a whopping 112 streaming cores.
That's 112 scalar units, not 112 cores. In fact it's 7 cores with SIMD units that are 16 elements wide.
Quote from: Rockoon on May 10, 2008, 08:10:58 AM
Both Intel and AMD are rapidly moving towards 8 cores...
Plus the SIMD units are getting wider and deeper. AVX (http://softwareprojects.intel.com/avx/) extends the registers to 256-bit and SSE5 (http://developer.amd.com/cpu/SSE5/Pages/default.aspx) adds instructions that perform a multiplication and an addition in one. You do the math. :8)
What's still lacking, though, is parallel scatter/gather instructions. Writing and reading elements from different memory locations is quickly becoming the biggest bottleneck as arithmetic power goes up. I'm curious when and how Intel/AMD will address this...
Quote from: sinsi on May 10, 2008, 08:50:05 AM
Let's face it, a GPU can kill a CPU (for certain things, fair enough), but look at gddr5 vs ddr3 - why can't intel/amd keep up with this?
This stuff has little benefit for CPUs.
Memory isn't accessed directly; instead, memory is accessed through caches that already come close to keeping up with the CPU.
3-cycle latency or less from L1. DDRx doesn't impact this.
10-cycle latency or less from L2. DDRx doesn't impact this.
Quote from: c0dewarp
The point is that every time you wait for a thread to do anything, you are adding tens of thousands of clock cycles to your own thread, giving them away to another process that IS ready to run now.
so how else do you suggest waiting for a thread to finish execution in these circumstances? or at least knowing, in a multi-threaded process, when it finished? - normally i'd use WaitForSingleObject()/WaitForMultipleObjects() on the thread handle(s).. inside my thread(s) there will be ExitThread() or just a RET to tell the operating system "i'm done here", which in turn notifies the main process.
it would be more efficient to do this than run a loop every N seconds checking a variable to see if the thread has terminated, because the operating system doesn't have to keep switching contexts.
Quote from: c0dewarp
There is no benefit at all to memfills < 64kb, when it takes 5-10 microsecs of latency before your "helper" gets up to speed
i don't recall ever saying that there was, or that the buffer being filled in hutch's code was < 64kb - as you can see, it's 100MB - the threads could be performing any number of operations on the same buffer, just at different offsets in different threads, not necessarily accessing the same data.
Quote from: c0dewarp
Die-hard assembler programmers don't like to hear that (I used to be one, so I know)
i have a feeling you haven't programmed assembly since the old MS-DOS days.
Quote from: codewarp on May 10, 2008, 05:39:08 PM
The resistance to this concept from the assembler point of view is understandable--queuing and related support mechanisms are complicated and substantial. It's the kind of complexity that pushes software toward higher-level solutions and away from assembler. Die-hard assembler programmers don't like to hear that (I used to be one, so I know). Work with it and you can achieve, fight it and you won't.
This just in: ASM programmers have been using MASM for the last decade, which stands for "Macro Assembler". OOP, vectors, HLL constructs, garbage collection, automated memory management, automated code generation, thinking from ALL points of view/levels of the software (lowest to highest), being able to easily and reliably check performance and thus choose the best route, and so on - WE HAVE IT. Read up, try, and THINK!
And from the start, you want us to present you with THE ultimate and only universal way to solve ALL types of multicore problems? This stinks of too little programming experience. Prove me wrong.
Quote from: Kernel_Gaddafi on May 11, 2008, 09:47:50 PM
so how else do you suggest waiting for a thread to finish execution in these circumstances? or at least knowing, in a multi-threaded process, when it finished? - normally i'd use WaitForSingleObject()/WaitForMultipleObjects() on the thread handle(s).. inside my thread(s) there will be ExitThread() or just a RET to tell the operating system "i'm done here", which in turn notifies the main process.
it would be more efficient to do this than run a loop every N seconds checking a variable to see if the thread has terminated, because the operating system doesn't have to keep switching contexts.
I think c0dewarp's suggestion is to not wait at all. Design your algorithms in such a way that each thread always has something to do, and use lock-free synchronization (http://en.wikipedia.org/wiki/Lock-free_and_wait-free_algorithms) to avoid the huge overhead of O.S. thread synchronization.
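The smallest useful example of that style is probably atomic job claiming (a sketch under assumed names: g_next is the global index of the next unclaimed job, g_total the job count; neither appears in the thread above):

claim_job proc
  @@:
    mov eax, g_next             ; snapshot the next-job index
    cmp eax, g_total
    jae no_more                 ; every job already claimed
    lea edx, [eax+1]
    lock cmpxchg g_next, edx    ; claim slot eax only if g_next is unchanged
    jne @B                      ; another thread got there first, retry
    ret                         ; eax = the job index we now own
no_more:
    mov eax, -1
    ret
claim_job endp

No kernel call, no wait: a losing thread just retries, which costs nanoseconds instead of a scheduler round-trip.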
Quote from: c0d1f1ed on May 12, 2008, 07:18:27 PM
I think c0dewarp's suggestion is to not wait at all. Design your algorithms in such a way that each thread always has something to do, and use lock-free synchronization (http://en.wikipedia.org/wiki/Lock-free_and_wait-free_algorithms) to avoid the huge overhead of O.S. thread synchronization.
Exactly.
Quote from: u on May 12, 2008, 12:31:55 AM
And from the start, you want us to present you with THE ultimate and only universal way to solve ALL types of problems of multicore? This stinks of too little programming experience. Prove me wrong.
See my thread
"Multiple CPUs and the limits to assembler"
Michael, I got this from your thread-test app:
2 us, mean waiting for end of time slice
2 us, mean using Sleep, 0
2 us, mean using event wait
Press any key to exit...
New box, AMD Athlon X2 (dual-core) 64-bit @4GHz, Win XP Pro x32
And here's Nightware's test:
Results of the tests :
Test 1 (default hardware), done in 7953689 cycles
Test 2 (parallelism attempt), done in 5320118 cycles
Test 3 (attempt with another algo), done in 5322003 cycles
Test 4, done in 11 cycles
Test 5, done in 11 cycles
Test 6, done in 10 cycles
Test 7, done in 11 cycles
Test 8, done in 11 cycles
hi,
read http://www.masm32.com/board/index.php?topic=9255.0 - results with several threads are not representative. mine included... :red