The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: FrEEzE2046 on February 05, 2011, 08:35:57 AM

Title: IA-32 | reordering issue in a multiprocessor environment
Post by: FrEEzE2046 on February 05, 2011, 08:35:57 AM
I've asked this question in other places and got no adequate answers. Hope someone can help me here ;)

I want to perform an atomic 'fetch-and-and' operation with respect to other processors on IA-32.

; processor 0
lea  edx, var
mov  ecx, mask
mov  eax, [edx]
lock and [edx], ecx

; processor 1
lea  edx, var
mov  eax, 0xff
xchg [edx], eax

I'm not sure whether the store to 'var' by processor 1 (mov eax, 0xff / xchg [edx], eax) can occur between
the load (mov  eax, [edx]) and the store (lock and [edx], ecx) to 'var' by processor 0. So, does this work or do I need to spin lock like this:

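; processor 0
push ebx
lea  edx, var
mov  ecx, mask
@@loop:
mov  eax, [edx]          ; eax = expected old value (cmpxchg compares against eax)
mov  ebx, eax
and  ebx, ecx            ; ebx = old & mask
lock cmpxchg [edx], ebx  ; store ebx only if [edx] still equals eax
jne  @@loop              ; ZF clear: var changed in between, retry
pop  ebx                 ; eax = the value the 'and' was performed on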


As a side note: I'm using the preloaded value (mov eax, [edx]) on processor 0 for further processing and need to be sure that it's the value the 'and' operation was performed on.


Thanks for any help and best regards from Germany ;)
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: hutch-- on February 05, 2011, 11:51:10 AM
I would be surprised if you could do what you are after without multi-threading, so that each core has a separate thread to work with. Then you need to synchronise the two or more threads. There is an API to specify which thread works with which core, something you will probably need to do for performance reasons.
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: Tedd on February 05, 2011, 01:19:19 PM
You have to assume the instructions of the two processors will become interleaved, in any permutation (if they are both executing at the same time.)


; processor 0
lea  edx, var
mov  ecx, mask         ;ecx = 7
mov  eax, [edx]        ;eax = [var] = 123

; processor 1
lea  edx, var
mov  eax, 0xff         ;eax = 255
xchg [edx], eax        ;[var] = 255, eax = 123

; processor 0
lock and [edx], ecx    ;[var] = 255 & 7 = 7 (should've been: 123 & 7 = 3)


LOCK only ensures the execution of that one instruction is atomic - that no other bus reads/writes occur during read-modify-write for the instruction.

Yes, you do need some kind of synchronisation, so you can get the old value and update the new value without being interrupted.
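
The same point can be put in C terms. This is only a minimal sketch, assuming a C11 compiler with <stdatomic.h>; the function names plain_load_then_and and fetch_and_old are made up for illustration and do not come from the thread. The first function mirrors the processor 0 snippet (a separate load followed by an atomic AND) and has the same window. The second asks for the old value and the AND as one read-modify-write.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Mirrors the processor 0 snippet: the load and the AND are two separate
   memory operations, so another processor can store to *var in between. */
uint32_t plain_load_then_and(_Atomic uint32_t *var, uint32_t mask)
{
    uint32_t old = atomic_load(var);   /* roughly: mov eax, [edx]      */
    atomic_fetch_and(var, mask);       /* roughly: lock and [edx], ecx */
    return old;                        /* may NOT be the value that was ANDed */
}

/* One indivisible read-modify-write: the returned value is exactly the
   value the AND was applied to. */
uint32_t fetch_and_old(_Atomic uint32_t *var, uint32_t mask)
{
    return atomic_fetch_and(var, mask);
}
```

Since 'lock and' does not hand back the old value, x86 compilers typically implement the second function with a 'lock cmpxchg' retry loop when the result is used.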
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: FrEEzE2046 on February 05, 2011, 01:21:53 PM
Quote from: hutch-- on February 05, 2011, 11:51:10 AM
I would be surprised if you could do what you are after without multi-threading

Oh sorry, I thought it would be clear. The snippets above are just pseudo code illustrating my issue ;)

I'll try to explain this in other words:
Assuming an arbitrary processor is executing 'processor 0's' code and another processor is executing 'processor 1's' code, I'm not sure if it's guaranteed that the value I'm loading into EAX (P0, line 3) is the same one I'm performing the conjunction on (line 4). I need to be sure that this is true. If it's not, I need to check it myself and loop (as you can see in the last code snippet).


Quote from: Tedd on February 05, 2011, 01:19:19 PM
LOCK only ensures the execution of that one instruction is atomic - that no other bus reads/writes occur during read-modify-write for the instruction.
Yes, you do need some kind of synchronisation, so you can get the old value and update the new value without being interrupted.

Thanks for your answer. Yes, I know what LOCK does, but its presence has an effect on the ordering.
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: Tedd on February 05, 2011, 01:39:16 PM
Quote from: FrEEzE2046 on February 05, 2011, 01:21:53 PM
I'm not sure if it's guaranteed that the value I'm storing to EAX (P0, line 3) is the same I'm performing the conjunction on (line 4).
It's not guaranteed - see my example.

Quote
Yes, I know what LOCK does, but its presence has an effect on the ordering.
No, it doesn't change the ordering of the instructions. It only means that single instruction will not be disrupted by another memory access.
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: FrEEzE2046 on February 05, 2011, 02:19:58 PM
Quote from: Tedd on February 05, 2011, 01:39:16 PM
No, it doesn't change the ordering of the instructions. It only means that single instruction will not be disrupted by another memory access.

Believe me, it does. If you don't trust me, please refer to the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A System Programming Guide (Section 8.2.3.8).
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: clive on February 05, 2011, 02:27:39 PM
Quote from: Tedd on February 05, 2011, 01:39:16 PM
Quote
Yes, I know what LOCK does, but its presence has an effect on the ordering.
No, it doesn't change the ordering of the instructions. It only means that single instruction will not be disrupted by another memory access.

It's going to have an effect on write combining buffers, caching, memory fencing and synchronization issues.
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: Tedd on February 05, 2011, 03:23:56 PM
It doesn't change the order in which instructions are functionally executed; if it did, the program wouldn't make sense. Obviously micro-ops mean instructions can be executed 'out of order' as long as the semantic result is the same. And caching and other memory effects are (mostly) automatically handled to keep things coherent.

With regard to this problem, the lock isn't going to affect the overall execution order of the instructions, except for memory access around the AND instruction.

The point still remains: it's not safe and you need proper synchronisation.
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: Antariy on February 05, 2011, 10:55:56 PM
Quote from: FrEEzE2046 on February 05, 2011, 08:35:57 AM
So, is this working or do I need to spin lock like this:

AFAIK spinlock is required:

...
.data?
lkflg            dd   ?
.code
...
@@: ; spinlock
lock bts [lkflg],0
jc @B

; processor 0
lea  edx, var
mov  ecx, mask
mov  eax, [edx]
lock and [edx], ecx

and [lkflg],0 ; quit spinlock

...

@@: ; spinlock
lock bts [lkflg],0
jc @B


; processor 1
lea  edx, var
mov  eax, 0xff
xchg [edx], eax

and [lkflg],0 ; quit spinlock
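
For comparison, the same locking pattern can be sketched in C. This assumes a C11 compiler with <stdatomic.h>; shared_var and the function names are hypothetical, and lkflg only echoes the flag name used in the post. atomic_flag_test_and_set plays the role of 'lock bts', and both writers go through the same flag, so the old value each one reads is the value it actually modified.

```c
#include <stdatomic.h>

/* Hypothetical names; 'lkflg' echoes the flag in the post above. */
static atomic_flag lkflg = ATOMIC_FLAG_INIT;
static unsigned shared_var = 123;

void spin_acquire(void)
{
    /* test_and_set returns the previous state: true means the flag is held */
    while (atomic_flag_test_and_set_explicit(&lkflg, memory_order_acquire)) {
        /* spin; an x86 build might place a PAUSE hint here */
    }
}

void spin_release(void)
{
    atomic_flag_clear_explicit(&lkflg, memory_order_release);
}

/* Processor 0's job: AND the mask in, return the value it was applied to. */
unsigned locked_fetch_and(unsigned mask)
{
    spin_acquire();
    unsigned old = shared_var;
    shared_var = old & mask;
    spin_release();
    return old;
}

/* Processor 1's job: store a new value, return the previous one. */
unsigned locked_exchange(unsigned value)
{
    spin_acquire();
    unsigned old = shared_var;
    shared_var = value;
    spin_release();
    return old;
}
```

Once the flag is held, the accesses to shared_var inside the critical section no longer need to be atomic themselves, as long as every access to it goes through the same flag.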

Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: hutch-- on February 06, 2011, 04:14:14 AM
You normally only use a spinlock for very short durations to synchronise two or more threads. For any real duration you pass it to the OS to suspend then reactivate the thread when needed.

For non-critical spinlock-style synchronisation the SleepEx() API is useful, whereas for more time-critical applications the PAUSE instruction (from memory) is better suited to the task. Although it is slower than a non-delayed spinlock, you trade speed for core usage.
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: FrEEzE2046 on February 06, 2011, 05:12:37 AM
Thanks for your answers. Is spinlock really required? I thought something like this would be faster:

_TEXT SEGMENT
_addr$ = 04h ; size = 4
_mask$ = 08h ; size = 4
_ATOMIC_AND32 PROC NEAR
mov edx, _addr$[esp]
mov  ecx, _mask$[esp]
push ebx

__loop:
mov  ebx, [edx]
mov  eax, ebx
push ebx

and  ebx, ecx
lock cmpxchg [edx], ebx

pop  ebx
cmp  eax, ebx
jne  __loop

pop  ebx
ret
_ATOMIC_AND32 ENDP
_TEXT ENDS


That's mostly what the WinAPI is doing:
FORCEINLINE
LONGLONG
InterlockedAnd64 (
    __inout LONGLONG volatile *Destination,
    __in    LONGLONG Value
    )
{
    LONGLONG Old;

    do {
        Old = *Destination;
    } while (InterlockedCompareExchange64(Destination,
                                          Old & Value,
                                          Old) != Old);

    return Old;
}
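
A portable counterpart of that loop, as a sketch only: it assumes a C11 compiler with <stdatomic.h>, and cas_fetch_and64 is a made-up name, not a WinAPI function. atomic_compare_exchange_weak stands in for InterlockedCompareExchange64.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: atomic AND built from compare-and-swap,
   returning the value the AND was applied to. */
int64_t cas_fetch_and64(_Atomic int64_t *dest, int64_t value)
{
    int64_t old = atomic_load(dest);
    /* On failure, atomic_compare_exchange_weak writes the current contents
       of *dest into 'old', so each retry works with fresh data. */
    while (!atomic_compare_exchange_weak(dest, &old, old & value)) {
        /* retry */
    }
    return old;
}
```

The loop only exits after a compare-and-swap has succeeded, so the returned value is guaranteed to be the one the mask was applied to.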
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: evlncrn8 on February 06, 2011, 11:52:04 AM
i thought spinlocks should use repz nop, which 'tells' the os its a spinlock and reduces its cpu usage
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: FrEEzE2046 on February 06, 2011, 01:14:45 PM
Quote from: evlncrn8 on February 06, 2011, 11:52:04 AM
i thought spinlocks should use repz nop, which 'tells' the os its a spinlock and reduces its cpu usage

rep nop produces the same opcode as pause (in order to be backward compatible with all IA-32 processors prior to the Pentium 4).
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: Antariy on February 07, 2011, 02:40:12 AM
Quote from: FrEEzE2046 on February 06, 2011, 05:12:37 AM
Thanks for your answers. Is spinlock really required? I thought something like this would be faster:

_TEXT SEGMENT
_addr$ = 04h ; size = 4
_mask$ = 08h ; size = 4
_ATOMIC_AND32 PROC NEAR
mov edx, _addr$[esp]
mov  ecx, _mask$[esp]
push ebx

__loop:
mov  ebx, [edx]
mov  eax, ebx
push ebx

and  ebx, ecx
lock cmpxchg [edx], ebx

pop  ebx
cmp  eax, ebx
jne  __loop

pop  ebx
ret
_ATOMIC_AND32 ENDP
_TEXT ENDS


This may look like paranoia, but this "atomic and" may cause the thread using it to hang.
If other threads constantly change the value of the variable being ANDed, then this piece may just go into an infinite loop.
Maybe in a user-mode multitasking environment this hang will not be permanent, but this is not a generic solution, especially in truly highly multithreaded code, which will (or may) change the variable constantly. In realtime driver threads this may hang a CPU/core.
Anyway, this is not efficient.

"BTS-like" spinlocks are very cheap in terms of atomic operations, since they guarantee that only one code path gains control at a time. The code above does not gain control; it just changes the value, and changes it again if it has been changed somewhere in between, and changes it again, and ... :bg
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: FrEEzE2046 on February 07, 2011, 05:27:54 AM
That's only possible if the other modifying processor doesn't execute any code other than storing to that memory location. Something like:

lea  edi, var
mov  ecx, -1
__loop:
lock xchg [edi], ecx
dec  ecx
jnz  __loop


Any other situation wouldn't be a problem. My code just loops if the memory changed between the load and store:

- mov  ebx, [edx]
| mov  eax, ebx
| push ebx

| and  ebx, ecx
- lock cmpxchg [edx], ebx


And, like I already said, it's just what WinAPI and Intel's TBB do. But I know you're right: it could be an infinite loop ;)
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: dedndave on February 07, 2011, 12:41:03 PM
i didn't think you needed to use LOCK with XCHG reg,mem - that it was implied
well - i have been doing it that way with semaphores and have had no trouble

in this case, if the semaphore = 0, it means that it is being queried or is locked
        xor     eax,eax
        xchg    eax,Semaphore
        or      eax,eax
        jz      busy

        push    eax
;do stuff
        pop     eax

        xchg    eax,Semaphore   ;restore original value
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: FrEEzE2046 on February 07, 2011, 12:53:49 PM
Quote from: dedndave on February 07, 2011, 12:41:03 PM
i didn't think you needed to use LOCK with XCHG reg,mem - that it was implied

Yes, I know that, it's redundant. But I like to prefix xchg too, in order to be sure that everyone who reads my code will understand what happens.

Quote from: dedndave on February 07, 2011, 12:41:03 PM
i have been doing it that way with semaphores and have had no trouble

I don't know what this relates to. The question is: do we need a 'lock' (not the LOCK signal) or 'mutex' to ensure that 'atomic_and' is a) atomic and b) wait-free (i.e. doesn't block other threads)? I think my loop solution is defensible, because I only promise that this function is 'atomic', not 'wait-free'. If the user of this function wants to ensure that access to that memory location is also 'wait-free', he can lock the access himself.
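
One way to make that trade-off explicit is to bound the number of retries and let the caller decide what to do under contention. The following is only a sketch under assumptions: it needs a C11 compiler with <stdatomic.h>, and try_fetch_and, max_tries and old_out are hypothetical names that do not appear anywhere in the thread or in any library.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical variant: gives up after max_tries failed compare-and-swap
   attempts instead of looping without bound. */
bool try_fetch_and(_Atomic uint32_t *var, uint32_t mask,
                   unsigned max_tries, uint32_t *old_out)
{
    uint32_t old = atomic_load(var);
    for (unsigned i = 0; i < max_tries; ++i) {
        /* strong CAS: fails only if *var really differs from 'old';
           on failure 'old' is refreshed with the current value */
        if (atomic_compare_exchange_strong(var, &old, old & mask)) {
            *old_out = old;   /* the value the AND was applied to */
            return true;
        }
    }
    return false;   /* contention: caller decides whether to retry or lock */
}
```

A caller that gets false back could fall back to a lock around the variable, which is the "lock the access himself" option mentioned above.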
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: Antariy on February 07, 2011, 03:39:31 PM
Quote from: FrEEzE2046 on February 07, 2011, 05:27:54 AM
And, like I already said, it's just what WinAPI and Intels TBB does.

Yes, WinAPI is obviously user-mode code, which just cannot execute pieces without being interrupted.
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: FrEEzE2046 on February 07, 2011, 04:14:15 PM
Quote from: Antariy on February 07, 2011, 03:39:31 PM
Quote from: FrEEzE2046 on February 07, 2011, 05:27:54 AM
And, like I already said, it's just what WinAPI and Intels TBB does.
Yes, WinAPI is obviously user-mode code, which just cannot execute pieces without being interrupted.

I think it is ;) Thanks for your answer.
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: Slugsnack on February 08, 2011, 12:02:34 PM
you could just do something like this, which is a spinlock.

lock:                        ; The lock variable. 1 = locked, 0 = unlocked.
     dd      0

spin_lock:
     mov     eax, 1          ; Set the EAX register to 1.

loop:
     xchg    eax, [lock]     ; Atomically swap the EAX register with
                             ;  the lock variable.
                             ; This will always store 1 to the lock, leaving
                             ;  previous value in the EAX register.

     test    eax, eax        ; Test EAX with itself. Among other things, this will
                             ;  set the processor's Zero Flag if EAX is 0.
                             ; If EAX is 0, then the lock was unlocked and
                             ;  we just locked it.
                             ; Otherwise, EAX is 1 and we didn't acquire the lock.

     jnz     loop            ; Jump back to the XCHG instruction if the Zero Flag is
                             ;  not set, the lock was locked, and we need to spin.

     do your code here.........

     ret                     ; The lock has been acquired, return to the calling
                             ;  function.

spin_unlock:
     mov     eax, 0          ; Set the EAX register to 0.

     xchg    eax, [lock]     ; Atomically swap the EAX register with
                             ;  the lock variable.

     ret


spinlock code taken from wiki. it is not atomic but deals with interleaving/multithreading properly
Title: Re: IA-32 | reordering issue in a multiprocessor environment
Post by: sinsi on February 08, 2011, 12:39:22 PM
Which code should take precedence? The '0xff' or the 'and'?
Wrap both in a spinlock but let one control 'var'.