Multicore theory proposal

Started by johnsa, May 23, 2008, 09:24:41 AM


c0d1f1ed

Quote from: hutch-- on May 27, 2008, 11:23:02 AM
Having bothered to read the Intel data on the LOCK prefix, the range of instructions that it can be used as a prefix to in an increasing number of cases with later hardware are already locked for the duration of the instruction which excludes close range parallel memory operations.

Could you rephrase that? Anyway, you might also want to read up about CAS.

Quote: The notion that there is such a thing as LOCK free operation is simply incorrect.

Are you referring to lock-free algorithms? As has been said a dozen times before, it has nothing to do with the x86 lock prefix. It's about not using mutual exclusion, i.e. not having to acquire and release synchronization locks.

Quote: Until the hardware supports close range parallel computing in a single thread algorithm, the vast majority of computer programs are not going to get much faster.

You're dead wrong. The vast majority of software has a lot of opportunity to execute multiple tasks in parallel. A typical function contains multiple calls to other functions, and those that are independent of each other can execute in parallel. Also as soon as you have a loop in which you process independent elements, multiple iterations could execute simultaneously. And for algorithms that at first seem serial there are more often than not ways to rewrite them as a parallel algorithm.

Quote: Currently the only real gain in single thread performance is faster memory which does reduce one of the major bottlenecks in performance.

Faster memory isn't going to help one bit if you're bottlenecked by arithmetic work. Think about encoding a movie. Several GB/s of RAM bandwidth is plenty to load in uncompressed frames and output the compressed stream, and the cache is large enough to hold the entire working set. Cache is crazy fast so no problems there. The problem is the sheer number of block compares you need to do for motion prediction. With X cores you can simply work on X regions of the frame simultaneously. No increase in clock speed can keep up with that.

Quote: There are much faster substrates than silicon that have been around for years but cost is the major factor in not going in this direction but this would allow running higher clock speeds than the current heat based limits.

Please define "much faster". Do you mean signal speed, switching speed, or some other parameter? For your information, switching speeds of transistors at 45 nm are 0.5 picoseconds or less. That's 2 terahertz. So that's not the limiting factor, and faster-switching substrates won't help us at all.

Quote: you appear to have missed the context of the statement

No I haven't. You were referring to NetBurst. Core 2 lacks not a single feature offered by NetBurst that could make it faster at running serial code.

Quote: Anything else is trivial, it is already being done with dual processor machines and later multicore machines.

The free lunch is over, hutch. Your software is not going to get significantly faster on future architectures unless you take full advantage of multi-core by rethinking your software design.

johnsa

So far I agree with both c0d1f1ed and hutch.
If multi-threading is approached from the traditional angle, c0d1f1ed is spot on.

I agree that in a lot of cases it is possible to design an algorithm which doesn't require any sort of lock at all.
In the few remaining cases where a lock is unavoidable, it would pay to have a selection of the fastest and most effective locking constructs possible.

I suggest we put together some task that can be parallelized, try various approaches (some group coding), and see what we can get out of it. Ideally it would have one section that can run in parallel without locking, and one component that requires a lock.

Anyone keen?

To start with, I've been looking at spin-lock implementations (for x86), and here is the quickest basic version I've been able to come up with so far. In a real-world scenario this could be improved by using MAX_SPINS to decide when to either go on with other work, sleep (put the waiting thread in a queue), or switch to some sort of signalling mechanism. We can also assume that certain code can make use of a read/write lock where only write operations require locking.



spinlock_t STRUCT
_lock dd 0                  ;0 = free, 1 = held
spinlock_t ENDS
mylock spinlock_t <0>       ;start unlocked, otherwise the first acquire spins out
MAX_SPINS equ 1000000

spinlock PROTO :DWORD
spinunlock PROTO :DWORD

align 4
spinlock PROC lockPtr:DWORD
push edi
push ebx                    ;ebx must be preserved under the Win32 ABI
mov edi,lockPtr
xor ebx,ebx                 ;spin counter (outside the loop so it isn't reset)
acquirelock:
lock bts dword ptr [edi],0  ;atomically test bit 0 and set it
jc short spinloop
pop ebx
pop edi
ret
align 4
spinloop:
dw 0f390h              ;use cpu PAUSE instruction as a spin loop hint
inc ebx
cmp ebx,MAX_SPINS           ;simply stop trying to acquire the lock at MAX_SPINS
jge short endlock
test dword ptr [edi],1      ;spin on a plain read until the lock looks free
jne short spinloop
jmp short acquirelock
endlock:
pop ebx
pop edi
ret
spinlock ENDP

align 4
spinunlock PROC lockPtr:DWORD
mov ecx,lockPtr   ;use a volatile register so edi is preserved
xor eax,eax
mov [ecx],eax     ;MESI protocol: it's safe to clear the lock without an atomic operation
ret
spinunlock ENDP



In the main thread code:
invoke spinlock, ADDR mylock
invoke spinunlock,ADDR mylock





hutch--

Reference from the PIV manual LOCK instruction.
Quote
Note that in later IA-32 processors (including the Pentium 4, Intel Xeon, and P6 family processors),
locking may occur without the LOCK# signal being asserted. See IA-32 Architecture
Compatibility below.

This means directly that at least some of the instructions that LOCK can be prefixed to ALREADY lock for their operation. Unless you are targeting legacy Intel hardware, the use of LOCK may simply be redundant. By using the instructions WITHOUT a LOCK prefix, you already have a lock on the address for the duration of the instruction.

Now RE John's code,


lock bts dword ptr [edi],0


If the hardware you are running is not legacy Intel, it may be worth a try, if you have a testbed set up to test the speed of the spinlock, to simply remove the LOCK prefix and see if it's any faster. You may also be able to substitute any of the lockable instructions for BTS to see if they are faster, something like NEG / NEG for non-destructive results.

RE: The free lunch. The slogan has been around for years, as has repeated change driven by hardware, so change in itself is nothing new; it's one of the few guarantees of any developing field. My comment on the changes being proposed is that they will not matter much as multicore hardware develops out of its infancy, but this much is clear: the solution to true parallelism will be in hardware, not in trying to emulate Win3.x software multitasking.

I have to go somewhere, will be back later.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

hutch--

Thinking about it, there is another option that I have not given much consideration to for a long time, and that is the INT instruction. It can only be used at ring 0, but it will start and stop a core, and from memory it can be done quickly at ring 0 access level. I know Linux still uses interrupts, so I would imagine the technique is viable at least in some contexts.

My problem with such methods, including spinlocks, is that they knock the core they are pointed at stone dead, so it is not yielding to any other process that may need processor time. A high-level ring 3 spinlock that yields to other processes while waiting is very easy to write, and they are very efficient in terms of processor usage, but such high-level notions are close to useless in close-range algorithm processing as they are far too slow.

SUBSTRATES: A long time ago the base for transistors was germanium: lower resistance but higher leakage. Silicon has higher resistance but much lower leakage; its limiting factor on current hardware is heat from the resistance. There is no problem with winding up the clock speed on silicon except for the heat, and this is why current processors are stuck at just under 4 GHz.

The military and specialised instruments have had ICs based on other substrates for many years, the sort of thing used in parts of guidance systems for missiles and the like, but the factor so far against using substrates with both lower resistance and lower leakage is cost, not performance. Sooner or later silicon will run out of puff; it already has in terms of heat, and while smaller tracks help reduce this problem in conjunction with lower voltage, this too has its limits, which are not far off.

sinsi

Quote from: johnsa on May 27, 2008, 06:46:48 PM
    dw 0f390h              ;use cpu PAUSE instruction as a spin loop hint
That should be    db 0f3h,90h ;or dw 90f3h
Light travels faster than sound, that's why some people seem bright until you hear them.

johnsa

Good spot! My bad... db 0f3h, 90h ... wrong endianness :) haha. In any event I've just replaced it with the "pause" mnemonic, which assembles under ML9; not sure about 8 or lower.
In any event, with the pause instruction in the spin loop the code takes 3x as long; in my test it went from 2ms to 6ms.

c0d1f1ed

Quote from: hutch-- on May 28, 2008, 12:58:45 AM
Reference from the PIV manual LOCK instruction.
Quote: Note that in later IA-32 processors (including the Pentium 4, Intel Xeon, and P6 family processors), locking may occur without the LOCK# signal being asserted.

This means directly that at least some of the instructions that LOCK can be prefixed to ALREADY lock for their operation. Unless you are targeting legacy Intel hardware, the use of LOCK may simply be redundant. By using the instructions WITHOUT a LOCK prefix, you already have a lock on the address for the duration of the instruction.

You've got it all wrong. The manual merely refers to a multi-processor system's ability to perform memory operations atomically without locking the bus. In particular, this is possible when the addressed memory is in the cache and in certain states of the MESI coherency protocol. Please read section 7.1.4 of the System Programming Guide.

Note that this is a very effective optimization that ensures there is no unnecessary blocking. Today's multi-core CPUs can access the same memory at full speed and even perform atomic operations at optimal efficiency.

Quote: RE: The free lunch. The slogan has been around for years, as has repeated change driven by hardware, so change in itself is nothing new; it's one of the few guarantees of any developing field. My comment on the changes being proposed is that they will not matter much as multicore hardware develops out of its infancy, but this much is clear: the solution to true parallelism will be in hardware, not in trying to emulate Win3.x software multitasking.

You're totally blind to (or in denial about) the revolutionary change that mainstream multi-core CPUs bring. Previously, you were guaranteed a steady increase in your software's performance from one CPU generation to the next. Nowadays, single-threaded software runs no faster on a Q6600 than on an E6600! However, by properly redesigning your software you can make it up to 90% faster for each doubling of the cores. So, contrary to the past, it takes effort, but you do get a lot in return.

c0d1f1ed

Quote from: johnsa on May 27, 2008, 06:46:48 PM
use cpu PAUSE instruction as a spin loop hint

It's best to avoid it. Reasons here: Multicore processor code.

johnsa

I've tried using xchg and cmpxchg in place of the bts and they all perform the same. From what I read in the Intel manuals, only xchg automatically asserts a lock signal; all other instructions must have a lock prefix. The only real thing to bear in mind is that if the locked address is in the cache, the CPU avoids asserting the bus lock signal, which makes it considerably faster.

The pause instruction definitely seems to be pretty useless.

hutch--

John,

Somewhere along the line I remember that you can use a normal db 90h (nop), or a number of them, instead of pause. I have often found that 2 or 3 nops in the middle of an algo did not slow it down any, so it may be worth a try.

johnsa

Tried using nops... while MUCH faster than a pause instruction, it does add a fair amount of overhead to the spin loop.
The question is: does it really matter? Tight loop with no nop/pause, or with one in there, either way the result is the same; it will still take as much time as necessary for the lock to become available. Assuming this thread is running on a different core from the one which owns the lock, that raises another question: do we care that that specific core jams up running the loop at full tilt while waiting for the lock?

And if it does matter, why, and what would be a good option? Keep the MAX_SPINS option as I have it currently and do a thread sleep, or a thread yield?

johnsa

Going back to my original post, I still hold that threads are pretty pointless for what we want to achieve with them unless each thread runs on its own core.
I can't imagine any algorithm split into two threads (for example) on a single processor being any more efficient (probably less) than a single thread (assuming no Hyper-Threading either).

So, given the overhead of creating threads on a per-task basis, it makes far more sense to me to allocate a thread per core, and then in each thread routine implement some sort of workload designator which calls other routines as needed.

I do understand that this approach doesn't really scale well to a variable number of cores, but unless you're creating a task which is essentially going to act as a template, with all threads being instances of that same code (i.e. like a socket server, web server etc.), implementing algorithms which are not only optimal but can handle a variable number of cores is almost impossible.

I would imagine my approach might work reasonably well in some cases: say you have 4 cores, create 6 or 8 initial threads (possibly 2 on a particular core) if you can determine at a high level that there are additional modules which could run in those threads, and then move those threads off the shared cores onto their own cores should more cores become available (perhaps on a dual quad-core machine, or in the future).

johnsa

http://cache-www.intel.com/cd/00/00/01/76/17689_w_spinlock.pdf

Very good read, highly recommended!

Following this, I've made a few changes to my basic algorithm, including re-introducing the pause (based on their explanation) and addressing alignment / cache-line issues around the synchronizing variable / memory address.

johnsa

OK, next question. I'm presuming that any data specified in the .data or .data? sections would be considered/allocated by the OS as Write-Back and NOT Write-Combining.

1) It turns out that no atomic/locked operations should operate on variables stored in Write-Combining memory.

2) If that is the case, what would be the best way to align a variable to a cache line, assuming a 128-byte cache-line size?
It's easy to accomplish if you dynamically allocate the memory, by increasing the allocation size and ANDing off the low 7 bits,
as in:



.data

spinlock_t STRUCT
_lock dd 0
spinlock_t ENDS

; how to align this on a 128-byte boundary ?
mylock spinlock_t <0>



Next question: when and why would you use VirtualAlloc to allocate a Write-Combining memory block? (Would it improve the performance of writes, including streaming stores, to data stored in that area?) Assume that no data in this area would act as a synchronization lock. And do the streaming store instructions automatically override the memory type? (Assuming you stream-store to a memory block which is Write-Back, does it become Write-Combining for the store?)

So many questions... so few answers :)