I've asked this question in other places and got no adequate answers. Hope someone can help me here ;)
I want to perform an atomic 'fetch-and-and' operation with respect to other processors on IA-32.
; processor 0
lea edx, var
mov ecx, mask
mov eax, [edx]
lock and [edx], ecx
; processor 1
lea edx, var
mov eax, 0xff
xchg [edx], eax
I'm not sure whether the store to 'var' by processor 1 (mov eax, 0xff / xchg [edx], eax) can occur between
the load (mov eax, [edx]) and the store (lock and [edx], ecx) to 'var' by processor 0. So, is this working or do I need to spin lock like this:
; processor 0
push ebx
lea edx, var
mov ecx, mask
@@loop:
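mov eax, [edx] ; eax = old value (cmpxchg compares [edx] against eax)
mov ebx, eax
and ebx, ecx ; ebx = old value AND mask
lock cmpxchg [edx], ebx ; store ebx only if [edx] still equals eax
jnz @@loop ; ZF clear: 'var' changed in between, retry
; here eax holds the value the 'and' was performed on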
pop ebx
As a side note: I'm using the preloaded value (mov eax, [edx]) in processor 0 for further processing and need to be sure that it's the value the 'and' operation was performed on.
Thanks for any help and best regards from Germany ;)
I would be surprised if you could do what you are after without multi-threading so that each core had a separate thread to work with. Then you need to synchronise the two or more threads. There is an API to specify which thread works with which core, something you will probably need to do for performance reasons.
You have to assume the instructions of the two processors will become interleaved, in any permutation (if they are both executing at the same time.)
; processor 0
lea edx, var
mov ecx, mask ;ecx = 7
mov eax, [edx] ;eax = [var] = 123
; processor 1
lea edx, var
mov eax, 0xff ;eax = 255
xchg [edx], eax ;[var] = 255, eax = 123
; processor 0
lock and [edx], ecx ;[var] = 255 & 7 = 7 (should've been: 123 & 7 = 3)
LOCK only ensures the execution of that one instruction is atomic - that no other bus reads/writes occur during read-modify-write for the instruction.
Yes, you do need some kind of synchronisation, so you can get the old value and update the new value without being interrupted.
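To make the interleaving above concrete, here is a minimal C11 sketch (not from the thread; the function names are made up for illustration). It replays the same three steps on a single thread and compares the result with atomic_fetch_and, which returns the value the AND was actually applied to. Since 'lock and' does not hand back the old value, an x86 compiler typically has to build such a fetch-and out of a 'lock cmpxchg' retry loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Replays the interleaving from the example on one thread:
   P0 loads, P1 exchanges, P0 ANDs. Returns the (stale) value P0 loaded.
   Hypothetical helper, for illustration only. */
uint32_t racy_fetch_and(_Atomic uint32_t *var, uint32_t mask, uint32_t p1_store)
{
    uint32_t seen = atomic_load(var);   /* P0: mov eax, [edx]      */
    atomic_exchange(var, p1_store);     /* P1: xchg [edx], eax     */
    atomic_fetch_and(var, mask);        /* P0: lock and [edx], ecx */
    return seen;                        /* not the value that was ANDed */
}

/* Load and AND as one indivisible step: the return value is exactly
   the value the AND was performed on. */
uint32_t safe_fetch_and(_Atomic uint32_t *var, uint32_t mask)
{
    return atomic_fetch_and(var, mask);
}
```

With var = 123, mask = 7 and a store of 0xff in between, the racy version reports 123 while the memory ends up as 255 & 7 = 7; the second version reports 123 and leaves 3.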
Quote from: hutch-- on February 05, 2011, 11:51:10 AM
I would be surprised if you could do what you are after without multi-threading
Oh sorry, I thought it would be clear. The snippets above are just some kind of pseudo code illustrating my issue ;)
I'll try to explain this in other words:
Assuming an arbitrary processor is executing 'processor 0's' code and another processor is executing 'processor 1's' code, I'm not sure if it's guaranteed that the value I'm storing to EAX (P0, line 3) is the same I'm performing the conjunction on (line 4). I need to be sure that this is true. If it's not, I need to check it myself and loop (as you can see in the last code snippet).
Quote from: Tedd on February 05, 2011, 01:19:19 PM
LOCK only ensures the execution of that one instruction is atomic - that no other bus reads/writes occur during read-modify-write for the instruction.
Yes, you do need some kind of synchronisation, so you can get the old value and update the new value without being interrupted.
Thanks for your answer. Yes, I know what LOCK does, but its presence has an effect on the ordering.
Quote from: FrEEzE2046 on February 05, 2011, 01:21:53 PM
I'm not sure if it's guaranteed that the value I'm storing to EAX (P0, line 3) is the same I'm performing the conjunction on (line 4).
It's not guaranteed - see my example.
Quote
Yes, I know what LOCK does, but its presence has an effect on the ordering.
No, it doesn't change the ordering of the instructions. It only means that single instruction will not be disrupted by another memory access.
Quote from: Tedd on February 05, 2011, 01:39:16 PM
No, it doesn't change the ordering of the instructions. It only means that single instruction will not be disrupted by another memory access.
Believe me, it does. If you don't trust me, please refer to the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A System Programming Guide (Section 8.2.3.8).
Quote from: Tedd on February 05, 2011, 01:39:16 PM
Quote
Yes, I know what LOCK does, but its presence has an effect on the ordering.
No, it doesn't change the ordering of the instructions. It only means that single instruction will not be disrupted by another memory access.
It's going to have an effect on write combining buffers, caching, memory fencing and synchronization issues.
It doesn't change the order in which instructions are functionally executed; if it did, the program wouldn't make sense. Obviously micro-ops mean instructions can be executed 'out of order' as long as the semantic result is the same. And caching and other memory effects are (mostly) automatically handled to keep things coherent.
With regard to this problem, the lock isn't going to affect the overall execution order of the instructions, except for memory access around the AND instruction.
The point still remains: it's not safe and you need proper synchronisation.
Quote from: FrEEzE2046 on February 05, 2011, 08:35:57 AM
So, is this working or do I need to spin lock like this:
AFAIK spinlock is required:
...
.data?
lkflg dd ?
.code
...
@@: ; spinlock
lock bts [lkflg],0
jc @B
; processor 0
lea edx, var
mov ecx, mask
mov eax, [edx]
lock and [edx], ecx
and [lkflg],0 ; quit spinlock
...
@@: ; spinlock
lock bts [lkflg],0
jc @B
; processor 1
lea edx, var
mov eax, 0xff
xchg [edx], eax
and [lkflg],0 ; quit spinlock
You normally only use a spinlock for very short durations to synchronise two or more threads. For any real duration you pass it to the OS to suspend then reactivate the thread when needed.
For non-critical spinlock-style synchronisation the SleepEx() API is useful, whereas for more time-critical applications the PAUSE instruction (from memory) is better suited to the task. Although it is slower than a non-delayed spinlock, you trade speed for core usage.
Thanks for your answers. Is spinlock really required? I thought something like this would be faster:
_TEXT SEGMENT
_addr$ = 04h ; size = 4
_mask$ = 08h ; size = 4
_ATOMIC_AND32 PROC NEAR
mov edx, _addr$[esp]
mov ecx, _mask$[esp]
push ebx
__loop:
mov ebx, [edx]
mov eax, ebx
push ebx
and ebx, ecx
lock cmpxchg [edx], ebx
pop ebx
cmp eax, ebx
jne __loop
pop ebx
ret
_ATOMIC_AND32 ENDP
_TEXT ENDS
That's mostly what WinAPI is doing:
FORCEINLINE
LONGLONG
InterlockedAnd64 (
__inout LONGLONG volatile *Destination,
__in LONGLONG Value
)
{
LONGLONG Old;
do {
Old = *Destination;
} while (InterlockedCompareExchange64(Destination,
Old & Value,
Old) != Old);
return Old;
}
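For comparison, the same retry loop can be sketched portably with C11 atomics. This is only an illustration under my own assumptions: the name atomic_and32_sketch is made up, and it is not how WinAPI or TBB are actually implemented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 32-bit counterpart of the InterlockedAnd64 loop above.
   Returns the value the AND was performed on. */
uint32_t atomic_and32_sketch(_Atomic uint32_t *dest, uint32_t mask)
{
    uint32_t old = atomic_load(dest);   /* first snapshot of *dest */

    /* The exchange succeeds only if *dest still equals 'old'. On failure
       it writes the current value of *dest back into 'old', so the next
       iteration retries with a fresh snapshot. */
    while (!atomic_compare_exchange_weak(dest, &old, old & mask)) {
        /* another processor changed *dest in between: retry */
    }
    return old;
}
```

On x86 the compare-exchange maps to 'lock cmpxchg', with 'old' playing the role of EAX: it is both the comparand and the place where the current value lands on failure.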
i thought spinlocks should use repz nop, which 'tells' the os its a spinlock and reduces its cpu usage
Quote from: evlncrn8 on February 06, 2011, 11:52:04 AM
i thought spinlocks should use repz nop, which 'tells' the os its a spinlock and reduces its cpu usage
rep nop produces the same opcode as pause (F3 90), so it stays backward compatible with all IA-32 processors prior to the Pentium 4, which simply execute it as a nop.
Quote from: FrEEzE2046 on February 06, 2011, 05:12:37 AM
Thanks for your answers. Is spinlock really required? I thought something like this would be faster:
_TEXT SEGMENT
_addr$ = 04h ; size = 4
_mask$ = 08h ; size = 4
_ATOMIC_AND32 PROC NEAR
mov edx, _addr$[esp]
mov ecx, _mask$[esp]
push ebx
__loop:
mov ebx, [edx]
mov eax, ebx
push ebx
and ebx, ecx
lock cmpxchg [edx], ebx
pop ebx
cmp eax, ebx
jne __loop
pop ebx
ret
_ATOMIC_AND32 ENDP
_TEXT ENDS
This may look like paranoia, but this "atomic and" may hang the thread using it.
If other threads constantly change the value of the variable being ANDed atomically, this piece of code may simply go into an infinite loop.
Maybe in a user-mode multitasking environment this hang will not be permanent, but this is not generic solution code, especially in truly highly multithreaded code, which will (or may) change the variable constantly. In real-time driver threads this may hang a CPU/core.
Anyway, this is not efficient.
"BTS-like" spinlocks are very cheap in terms of atomic operations, since they guarantee that only one code path gains control at a time. The code above does not gain control; it just changes the value, changes it again if it was changed somewhere in between, and changes it again, and ... :bg
That's only possible if the other modifying processor doesn't execute any code other than storing to that memory location. Something like:
lea edi, var
mov ecx, -1
__loop:
lock xchg [edi], ecx
dec ecx
jnz __loop
Any other situation wouldn't be a problem. My code just loops if the memory changed between the load and store:
- mov ebx, [edx]
| mov eax, ebx
| push ebx
| and ebx, ecx
- lock cmpxchg [edx], ebx
And, like I already said, it's just what WinAPI and Intel's TBB do. But I know you're right. It could be an infinite loop ;)
i didn't think you needed to use LOCK with XCHG reg,mem - that it was implied
well - i have been doing it that way with semaphores and have had no trouble
in this case, if the semaphore = 0, it means that it is being queried or is locked
xor eax,eax
xchg eax,Semaphore
or eax,eax
jz busy
push eax
;do stuff
pop eax
xchg eax,Semaphore ;restore original value
Quote from: dedndave on February 07, 2011, 12:41:03 PM
i didn't think you needed to use LOCK with XCHG reg,mem - that it was implied
Yes, I know; the prefix is redundant there. But I like to prefix xchg too, so that everyone who reads my code understands what happens.
Quote from: dedndave on February 07, 2011, 12:41:03 PM
i have been doing it that way with semaphores and have had no trouble
I don't know what this is related to. The question is: do we need a 'lock' (not the LOCK signal) or 'mutex' to ensure that 'atomic_and' is a) atomic and b) wait-free (i.e. doesn't block other threads)? I think my loop solution is justifiable, because I only promise that this function is 'atomic', not 'wait-free'. If the user of this function wants to ensure that the access to that memory location is also 'wait-free', he can lock the access himself.
Quote from: FrEEzE2046 on February 07, 2011, 05:27:54 AM
And, like I already said, it's just what WinAPI and Intel's TBB do.
Yes, WinAPI is obviously user-mode code, which simply cannot execute pieces of code without being interrupted.
Quote from: Antariy on February 07, 2011, 03:39:31 PM
Quote from: FrEEzE2046 on February 07, 2011, 05:27:54 AM
And, like I already said, it's just what WinAPI and Intel's TBB do.
Yes, WinAPI is obviously user-mode code, which simply cannot execute pieces of code without being interrupted
I think it is ;) Thanks for your answer.
you could just do something like this, which is a spinlock.
locked: ; The lock variable. 1 = locked, 0 = unlocked.
dd 0 ; (named 'locked' because LOCK is a reserved word)
spin_lock:
mov eax, 1 ; Set the EAX register to 1.
spin: ; ('spin' because LOOP is an instruction mnemonic)
xchg eax, [locked] ; Atomically swap the EAX register with
; the lock variable.
; This will always store 1 to the lock, leaving
; previous value in the EAX register.
test eax, eax ; Test EAX with itself. Among other things, this will
; set the processor's Zero Flag if EAX is 0.
; If EAX is 0, then the lock was unlocked and
; we just locked it.
; Otherwise, EAX is 1 and we didn't acquire the lock.
jnz spin ; Jump back to the XCHG instruction if the Zero Flag is
; not set, the lock was locked, and we need to spin.
; do your code here.........
ret ; The lock has been acquired, return to the calling
; function.
spin_unlock:
mov eax, 0 ; Set the EAX register to 0.
xchg eax, [locked] ; Atomically swap the EAX register with
; the lock variable.
ret
spinlock code taken from wikipedia. the protected code as a whole is not atomic, but the lock deals with interleaving/multithreading properly
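For anyone who prefers C, a rough C11 equivalent of this xchg-based spinlock could look like the sketch below. The guarded_var struct and all function names are invented for illustration; atomic_flag_test_and_set plays the role of the xchg. There is no PAUSE hint or back-off, so treat it as a sketch and not as production code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical wrapper: 'var' is a plain variable that must only be
   touched while the flag is held. */
typedef struct {
    atomic_flag locked;   /* set = locked, clear = unlocked */
    uint32_t    var;
} guarded_var;

void spin_lock(guarded_var *g)
{
    /* test_and_set always stores "set" and returns the previous state,
       just like xchg with 1: spin while it was already set. */
    while (atomic_flag_test_and_set_explicit(&g->locked, memory_order_acquire)) {
        /* busy-wait */
    }
}

void spin_unlock(guarded_var *g)
{
    atomic_flag_clear_explicit(&g->locked, memory_order_release);
}

/* Processor 0's job: returns the value the AND was performed on. */
uint32_t locked_fetch_and(guarded_var *g, uint32_t mask)
{
    spin_lock(g);
    uint32_t old = g->var;
    g->var = old & mask;
    spin_unlock(g);
    return old;
}

/* Processor 1's job: store a new value and return the previous one. */
uint32_t locked_exchange(guarded_var *g, uint32_t value)
{
    spin_lock(g);
    uint32_t old = g->var;
    g->var = value;
    spin_unlock(g);
    return old;
}
```

Every access to 'var' has to go through the lock for this to work, which is the main price compared to a single lock-prefixed instruction or a cmpxchg loop.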
Which code should take precedence? The '0xff' or the 'and'?
Wrap both in a spinlock but let one control 'var'.