Idea for multithread memfill.

Started by hutch--, April 20, 2008, 10:21:19 AM


hutch--

I remember a recent thread producing a faster memory fill algo, and at the time I suggested that separating the memory fill from the work done on the memory may be faster if one could be done in a separate thread from the other.

Below is the rough idea of what I was suggesting. It is a long way from being fast enough, due to creating a new thread for each memory fill, but there may be a way to keep the fill thread running and signal it by inter-thread communication when to fill a buffer.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    tdata STRUCT
      pbuffer dd ?      ; buffer address
      bufsize dd ?      ; buffer size
    tdata ENDS

    start_fill_thread PROTO :DWORD
    fill_thread  PROTO :DWORD

    .data?
      signal dd ?
      loaded dd ?

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    call main
    inkey
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    memcnt equ <1024*1024*100>

main proc

    LOCAL pmem1 :DWORD
    LOCAL pmem2 :DWORD
    LOCAL tdat  :tdata

    mov pmem1, alloc(memcnt)                        ; allocate 2 blocks of memory
    mov pmem2, alloc(memcnt)

  ; ||||||||||||||||||||||||||||||||||||||||||||||||
  ; fill pmem1 while working on pmem2
  ; ||||||||||||||||||||||||||||||||||||||||||||||||

    mov eax, pmem1
    mov DWORD PTR tdat.pbuffer, eax
    mov DWORD PTR tdat.bufsize, memcnt              ; 100 meg buffer

    invoke start_fill_thread,ADDR tdat              ; start new thread

  ; ************************************
  ; do your own processing here on pmem2
  ; ************************************

  spinlock0:
    invoke SleepEx,1,0
    cmp signal, 0                                   ; test if thread is finished yet
    jne spinlock0
    print "Thread finished",13,10

  ; ||||||||||||||||||||||||||||||||||||||||||||||||
  ; fill pmem2 while working on pmem1
  ; ||||||||||||||||||||||||||||||||||||||||||||||||

    mov eax, pmem2
    mov DWORD PTR tdat.pbuffer, eax
    mov DWORD PTR tdat.bufsize, memcnt              ; 100 meg buffer

    invoke start_fill_thread,ADDR tdat              ; start new thread

  ; ************************************
  ; do your own processing here on pmem1
  ; ************************************

  spinlock1:
    invoke SleepEx,1,0
    cmp signal, 0                                   ; test if thread is finished yet
    jne spinlock1
    print "Thread finished",13,10

  ; ||||||||||||||||||||||||||||||||||||||||||||||||

    free pmem2
    free pmem1

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

start_fill_thread proc pstruct:DWORD

    LOCAL tID   :DWORD

    mov loaded, 0                       ; clear the loaded flag

    invoke CreateThread,0,0,fill_thread,pstruct,0,ADDR tID

  spinlock:
    invoke SleepEx,1,0
    cmp loaded, 1
    jne spinlock

    print "Thread started",13,10

    mov eax, tID

    ret

start_fill_thread endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

fill_thread proc pstruct:DWORD

    LOCAL pbuf  :DWORD
    LOCAL sbuf  :DWORD

    push ebx
    push esi
    push edi

    mov signal, 1                       ; set the signal flag
    mov edi, pstruct

    lea ecx, (tdata PTR [edi]).pbuffer  ; copy structure data to LOCAL
    mov ecx, [ecx]
    mov pbuf, ecx

    lea ecx, (tdata PTR [edi]).bufsize  ; copy structure data to LOCAL
    mov ecx, [ecx]
    mov sbuf, ecx

    mov loaded, 1                       ; set the loaded flag for the calling proc

    invoke memfill,pbuf,sbuf,0

    pop edi
    pop esi
    pop ebx

    mov signal, 0                       ; clear the signal flag

    ret

fill_thread endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Ultrano

By the time the thread starts, the first thread would have filled dozens of megabytes.
Triple-buffering, with FIFO control of a persistent second thread (only on multicore systems), is the way.

By the way, why are we zeroing so much memory again and again? I see too few cases where zeroing MBs at a time, 100 times/second, is relevant :P. Usually, just overwriting memory is enough.
Please use a smaller graphic in your signature.

hutch--

 :bg

In ancient times (10 years ago) I remember sprites starting as blanks, with progressive blitting onto the sprite and finally blitting the sprite to screen. The general idea with the example is to utilise multicore processors, if a way can be found that is fast enough to communicate with a slave thread.

NightWare

#3
Quote from: Ultrano on April 20, 2008, 01:36:36 PM
Usually, just overwriting memory is enough.
yep, it's always interesting to see another approach, but i think attributing a specific task to a cpu will certainly produce a better result... but i'm going to try it... just in case...   :wink

EDIT :

hutch,
    lea ecx,(tdata PTR [edi]).pbuffer  ; copy structure data to LOCAL
    mov ecx,[ecx]

can be reduced to :
    mov ecx,DWORD PTR (tdata PTR [edi]).pbuffer
or (shorter) :
    mov ecx,(tdata PTR [edi]).pbuffer
1 instruction less, + 1 stall avoided. and you can also directly use :
    invoke memfill,(tdata PTR [edi]).pbuffer,(tdata PTR [edi]).bufsize,0

now, concerning multithreading, i think it's not a good solution... i mean, not in this form... creating a thread is far too slow... so it has to be created beforehand, but just one thread (or one extra thread), where all the "background" tasks are defined... (not fill only) and you jump to those tasks when needed, so you should have something like :
ThreadBackgroundTasks PROC
; some init

BackgroundTasksLoop:

NotYet:
cmp BackgroundThread,TRUE
jne NotYet

;here your tasks :

; mov eax,TasksFlags ; 32 bits so 32 tasks possible, it's certainly enough
; and eax,TaskNumberX ; keep the corresponding bit
; jz AvoidTaskNumberX ; if the bit isn't here, avoid the task
;TaskX:
;
;
;AvoidTaskNumberX:


cmp BackgroundThreadEnd,TRUE ; end of the loop
jne BackgroundTasksLoop

; here, to be sure the thread is finished before the main thread/program
invoke SendMessage,WinMainCopy,WM_CLOSE,NULL,NULL
; ret not needed
ThreadBackgroundTasks ENDP

while your program runs, BackgroundThread (and BackgroundThreadEnd) and TasksFlags have to be set by your program or your main thread (and they must be global variables to be constantly accessible); unfortunately, the loop spins until the flag is set (so there is useless cpu usage... unless someone knows a solution to avoid this...)


hutch--

You have two APIs, SuspendThread() and ResumeThread(), that would be useful to turn a thread on and off, but I have not yet seen a fast signalling mechanism by which a master thread can signal a slave thread to do something.

MichaelW

The attachment is an attempt to determine how long it takes for a thread to respond to a global flag. Typical results on my Windows 2000 P3 system:

61139 us, mean waiting for end of time slice
20 us, mean using Sleep, 0
6 us, mean using event wait



[attachment deleted by admin]
eschew obfuscation

NightWare

here is an attempt at parallelism... since you can't use 2 threads with the same data (or you have to use CreateProcess+CreateRemoteThread),
i've just used one thread to work with the main program; in all cases the cpus are 100% used.

Results of the tests :


Test 1 (default hardware), done in 4373819 cycles
Test 2 (parallelism attempt), done in 4584532 cycles
Test 3 (attempt with another algo), done in 4583756 cycles
Test 4, done in 12 cycles
Test 5, done in 12 cycles
Test 6, done in 12 cycles
Test 7, done in 12 cycles
Test 8, done in 12 cycles

Press ENTER to quit...




[attachment deleted by admin]

codewarp

Hutch,

First time back to this board after several years. During those years, I have been doing a great deal of work to get maximum performance out of multi-core CPUs. The conclusion, over and over again, is to minimize interaction between CPUs by keeping them fully occupied in separate memory to the greatest extent possible. Synchronizing threads is necessary for logic, but a killer on performance. So initializing data in a separate thread can really only benefit performance if you genuinely have something else to do in the meantime, and often enough to make a difference. Minimizing the rate of bus-locked instructions by making them unnecessary with better logic: this is the goal.

I am glad to see this site is still active.

evlncrn8

Quote from: hutch-- on April 21, 2008, 10:21:31 AM
You have two APIs, SuspendThread() and ResumeThread(), that would be useful to turn a thread on and off, but I have not yet seen a fast signalling mechanism by which a master thread can signal a slave thread to do something.

using events might do the trick, or a mutex?... either way, the delay caused in hopping to the other thread probably kills any possible benefit you might gain from having a separate proc doing the memfill... even using suspendthread/resumethread...

hutch--

From what I have seen so far, the only place where multithreading seems to offer an advantage is with multiple processes that are reasonably sparse in terms of processor usage. Doing multiple downloads in separate threads is typical of this capacity.

Now with a memory intensive operation being performed on one core, I get the impression that the other core may not have completely free access at different memory addresses, even if they are far enough apart. I know on a larger scale, between two separate machines, if you use very large data sets you don't get any real problem with TCP/IP, but mechanisms like this are far too slow to be useful at close range with processor intensive work.

Ideas of this type may have to wait for different hardware with different capacity before they can be realised, as the current signalling methods between threads are too clunky and too slow to be of any real use.

codewarp

#10
Hutch,

To use dual CPUs for performance you must (1) set them up to work independently, and (2) balance their loads so that they finish at the same time.  Of course (1) makes (2) difficult and elusive.   If they don't finish at the same time, you're back to single-processor performance for the remainder of the task--we don't want that.

My solution to this gets the performance equivalent of 1.9 CPUs :dance: (Opteron 170/2800).  I pair up two threads, and give them a queue of actions to perform.  The threads start from opposite ends of the queue, working their way toward the middle, so they both finish within one logical action of each other, no matter how uneven the workload is.  The beauty of this is that the entire thing is managed by single, lock-free counter (of remaining actions).  The first thread to finish gets a count=0, the 2nd thread finishes with count=-1.  The only processor contention is on the counter, and none on the queue.

Forget about calling system thread scheduling functions for managing anything resembling fine-grained computation, because they consume far too many clock cycles to be viable.  You must observe (1) and (2) if you want dual-processor performance from your own code.  The way you are doing it works fine, as long as you can tolerate the initial 6-8 microsecs of thread startup latency :eek.

There is a fundamental principle at work with broad relevance here: giving your synchronous duties away to asynchronous entities buys you a lot of waiting around.

NightWare

Quote from: hutch-- on May 08, 2008, 11:14:19 AM
Ideas of this type may have to wait for different hardware with diferent capacity before they can be realised as the current signalling methods between threads is too clunky and too slow to be of any real use.
hmm... no and yes... no, the ideas don't have to wait, i explain why after... and yes, signalling methods are deadly slow... now, reading you made me try a different approach... the use of different memory locations slows down the process because of the perpetual l2 shared cache exchanges... so i tried to do one task (3mb fillmem) but with the work shared by the main program and a thread... (with 4 dwords of decay), and i obtained a speedup of approximately 10% in a speed test against the same work made by the full equivalent algo... of course, here i just avoid the l2 cache exchanges (so there is benefit only in some cases...), but it's a start... unfortunately, instructions like rep stosd/stosd/movsd/movsb can't be used in this case...

bozo

Instead of using signalling, couldn't you use WaitForSingleObject() on the handle returned by CreateThread()?
Or add ExitThread() to the fill_thread() routine, then use WaitForMultipleObjects(), which would ensure all threads finish before continuing execution.

codewarp

Quote from: Kernel_Gaddafi on May 09, 2008, 02:09:33 AM
Instead of using signalling, couldn't you use WaitForSingleObject() on the handle returned by CreateThread()?
Or add ExitThread() to the fill_thread() routine, then use WaitForMultipleObjects(), which would ensure all threads finish before continuing execution.

WaitForAnyPrimitiveYouCanSyncOn() reduces your computational transaction speed to the speed of the Windows thread scheduler. To get faster, you need to be running already and use lock-free methods. Organize your methods so that the only time you wait is when you have nothing to do.

bozo

codewarp, WaitForSingleObject() on the handle returned by CreateThread() ?

where does it say use WaitForSingleObject inside fill_thread()?

if i were gonna implement such a routine, i'd distribute the total size of the buffer between N threads,
otherwise there is no point really... 1 routine is enough.