IA-32 | reordering issue in a multiprocessor environment

Started by FrEEzE2046, February 05, 2011, 08:35:57 AM


FrEEzE2046

I've asked this question in other places and got no adequate answers. Hope someone can help me here ;)

I want to perform an atomic 'fetch-and-and' operation with respect to other processors on IA-32.

; processor 0
lea  edx, var
mov  ecx, mask
mov  eax, [edx]
lock and [edx], ecx

; processor 1
lea  edx, var
mov  eax, 0xff
xchg [edx], eax

I'm not sure whether the store to 'var' by processor 1 (mov eax, 0xff / xchg [edx], eax) can occur between
the load (mov  eax, [edx]) and the store (lock and [edx], ecx) to 'var' by processor 0. So, is this working or do I need to spin lock like this:

; processor 0
push ebx
lea  edx, var
mov  ecx, mask
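@@loop:
mov  eax, [edx]         ; eax = value we expect to find in 'var'
mov  ebx, eax
and  ebx, ecx           ; ebx = new value (old & mask)
lock cmpxchg [edx], ebx ; if [edx] == eax: [edx] = ebx, ZF = 1; else: eax = [edx], ZF = 0
jnz  @@loop             ; 'var' changed in between: retry (on exit eax = old value)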
pop  ebx


As a side note: I'm using the preloaded value (mov eax, [edx]) on processor 0 for further processing and need to be sure that it's the value the 'and' operation was performed on.


Thanks for any help and best regards from Germany ;)

hutch--

I would be surprised if you could do what you are after without multi-threading so that each core had a separate thread to work with. Then you need to synchronise the two or more threads. There is an API to specify which thread works with which core, something you will probably need to do for performance reasons.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Tedd

You have to assume the instructions of the two processors will become interleaved, in any permutation (if they are both executing at the same time.)


; processor 0
lea  edx, var
mov  ecx, mask         ;ecx = 7
mov  eax, [edx]        ;eax = [var] = 123

; processor 1
lea  edx, var
mov  eax, 0xff         ;eax = 255
xchg [edx], eax        ;[var] = 255, eax = 123

; processor 0
lock and [edx], ecx    ;[var] = 255 & 7 = 7 (should've been: 123 & 7 = 3)


LOCK only ensures the execution of that one instruction is atomic - that no other bus reads/writes occur during read-modify-write for the instruction.

Yes, you do need some kind of synchronisation, so you can get the old value and update the new value without being interrupted.
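
The interleaving above can be replayed deterministically in C11. This is only a sketch: it runs on a single thread and simply executes the three memory operations in the order shown, so it illustrates the outcome rather than a real race. The names `replay` and `replay_interleaving` are made up for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* What processor 0 loaded, and what ended up in 'var'. */
struct replay {
    unsigned seen_by_p0;
    unsigned final_var;
};

/* Hypothetical helper: executes the three memory operations of the
   example above in the interleaved order, on one thread. */
static struct replay replay_interleaving(unsigned initial, unsigned mask)
{
    atomic_uint var;
    struct replay r;

    atomic_init(&var, initial);

    r.seen_by_p0 = atomic_load(&var);    /* P0: mov eax, [edx]       */
    (void)atomic_exchange(&var, 0xffu);  /* P1: xchg [edx], eax      */
    (void)atomic_fetch_and(&var, mask);  /* P0: lock and [edx], ecx  */

    r.final_var = atomic_load(&var);

    /* The AND was applied to P1's 0xff, not to the value P0 loaded. */
    assert(r.final_var == (0xffu & mask));
    return r;
}
```

With an initial value of 123 and a mask of 7, processor 0 sees 123 but 'var' ends up as 7 rather than 123 & 7 = 3, which matches the comments in the example.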
No snowflake in an avalanche feels responsible.

FrEEzE2046

Quote from: hutch-- on February 05, 2011, 11:51:10 AM
I would be surprised if you could do what you are after without multi-threading

Oh sorry, I thought it would be clear. The snippets above are just some kind of pseudo code illustrating my issue ;)

I'll try to explain this with other words:
Assuming an arbitrary processor is executing 'processor0's' code and another processor is executing 'processor1's' code, I'm not sure if it's guaranteed that the value I'm storing to EAX (P0, line 3) is the same I'm performing the conjunction on (line 4). I need to be sure that this is true. If it's not, I need to check it myself and loop (as you can see in the last code snippet).


Quote from: Tedd on February 05, 2011, 01:19:19 PM
LOCK only ensures the execution of that one instruction is atomic - that no other bus reads/writes occur during read-modify-write for the instruction.
Yes, you do need some kind of synchronisation, so you can get the old value and update the new value without being interrupted.

Thanks for your answer. Yes, I know what LOCK does, but its presence has an effect on the ordering.

Tedd

Quote from: FrEEzE2046 on February 05, 2011, 01:21:53 PM
I'm not sure if it's guaranteed that the value I'm storing to EAX (P0, line 3) is the same I'm performing the conjunction on (line 4).
It's not guaranteed - see my example.

Quote
Yes, I know what LOCK does, but its presence has an effect on the ordering.
No, it doesn't change the ordering of the instructions. It only means that single instruction will not be disrupted by another memory access.

FrEEzE2046

Quote from: Tedd on February 05, 2011, 01:39:16 PM
No, it doesn't change the ordering of the instructions. It only means that single instruction will not be disrupted by another memory access.

Believe me, it does. If you don't trust me, please refer to the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A System Programming Guide (Section 8.2.3.8).

clive

Quote from: Tedd on February 05, 2011, 01:39:16 PM
Quote
Yes, I know what LOCK does, but its presence has an effect on the ordering.
No, it doesn't change the ordering of the instructions. It only means that single instruction will not be disrupted by another memory access.

It's going to have an effect on write combining buffers, caching, memory fencing and synchronization issues.
It could be a random act of randomness. Those happen a lot as well.

Tedd

It doesn't change the order in which instructions are functionally executed, if it did, the program wouldn't make sense. Obviously micro-ops mean instructions can be executed 'out of order' as long as the semantic result is the same. And caching and other memory effects are (mostly) automatically handled to keep things coherent.

With regard to this problem, the lock isn't going to affect the overall execution order of the instructions, except for memory access around the AND instruction.

The point still remains: it's not safe and you need proper synchronisation.

Antariy

Quote from: FrEEzE2046 on February 05, 2011, 08:35:57 AM
So, is this working or do I need to spin lock like this:

AFAIK spinlock is required:

...
.data?
lkflg            dd   ?
.code
...
@@: ; spinlock
lock bts [lkflg],0
jc @B

; processor 0
lea  edx, var
mov  ecx, mask
mov  eax, [edx]
lock and [edx], ecx

and [lkflg],0 ; quit spinlock

...

@@: ; spinlock
lock bts [lkflg],0
jc @B


; processor 1
lea  edx, var
mov  eax, 0xff
xchg [edx], eax

and [lkflg],0 ; quit spinlock


hutch--

You normally only use a spinlock for very short durations to synchronise two or more threads. For any real duration you pass it to the OS to suspend then reactivate the thread when needed.

For non-critical spinlock-style synchronisation the SleepEx() API is useful. For more time-critical applications the PAUSE instruction (from memory) is better suited to the task; although it is slower than a non-delayed spinlock, you trade speed for core usage.
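
A rough C11 sketch of a short-duration test-and-set spinlock with a PAUSE hint in the wait loop. The names `demo_spinlock`, `cpu_relax`, `spin_trylock`, `spin_lock` and `spin_unlock` are invented for this example, and `__builtin_ia32_pause()` is a GCC/Clang builtin, so this assumes one of those compilers on x86; elsewhere the hint compiles to nothing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock type for illustration; clear flag = unlocked. */
typedef struct {
    atomic_flag held;
} demo_spinlock;

/* Spin-wait hint. On x86 with GCC/Clang this emits PAUSE. */
static inline void cpu_relax(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_ia32_pause();
#endif
}

/* One atomic test-and-set attempt; true if the lock was acquired. */
static bool spin_trylock(demo_spinlock *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Busy-wait until the flag was clear at the moment we set it. */
static void spin_lock(demo_spinlock *l)
{
    while (!spin_trylock(l)) {
        cpu_relax();
    }
}

/* Release the lock so another thread's test-and-set can succeed. */
static void spin_unlock(demo_spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The critical section between spin_lock() and spin_unlock() should stay very short; for longer waits, handing the thread to the OS is the better trade.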

FrEEzE2046

Thanks for your answers. Is spinlock really required? I thought something like this would be faster:

_TEXT SEGMENT
_addr$ = 04h ; size = 4
_mask$ = 08h ; size = 4
_ATOMIC_AND32 PROC NEAR
mov edx, _addr$[esp]
mov  ecx, _mask$[esp]
push ebx

__loop:
mov  ebx, [edx]
mov  eax, ebx
push ebx

and  ebx, ecx
lock cmpxchg [edx], ebx

pop  ebx
cmp  eax, ebx
jne  __loop

pop  ebx
ret
_ATOMIC_AND32 ENDP
_TEXT ENDS


That's mostly what the WinAPI does:
FORCEINLINE
LONGLONG
InterlockedAnd64 (
    __inout LONGLONG volatile *Destination,
    __in    LONGLONG Value
    )
{
    LONGLONG Old;

    do {
        Old = *Destination;
    } while (InterlockedCompareExchange64(Destination,
                                          Old & Value,
                                          Old) != Old);

    return Old;
}
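
For comparison, the same compare-and-swap loop can be sketched with portable C11 atomics. The function name `fetch_and_u32` is made up for this example. The standard `atomic_fetch_and` already returns the previous value, so in practice the hand-written loop is only needed when the update is something the library has no primitive for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: atomically performs *dest &= mask and returns the
   value that was in *dest immediately before the AND was applied. */
static uint32_t fetch_and_u32(_Atomic uint32_t *dest, uint32_t mask)
{
    assert(dest != NULL);

    uint32_t old = atomic_load(dest);

    /* If *dest still equals 'old', store old & mask and stop.
       Otherwise the call refreshes 'old' with the current value
       and the loop retries with the new mask result. */
    while (!atomic_compare_exchange_weak(dest, &old, old & mask)) {
        /* retry */
    }
    return old;
}
```

The returned value is guaranteed to be the one the AND was performed on, which is the property asked for in the first post.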

evlncrn8

i thought spinlocks should use repz nop, which 'tells' the os its a spinlock and reduces its cpu usage

FrEEzE2046

Quote from: evlncrn8 on February 06, 2011, 11:52:04 AM
i thought spinlocks should use repz nop, which 'tells' the os its a spinlock and reduces its cpu usage

rep nop produces the same opcode as pause (so that it stays backward compatible with all IA-32 processors prior to the Pentium 4).

Antariy

Quote from: FrEEzE2046 on February 06, 2011, 05:12:37 AM
Thanks for your answers. Is spinlock really required? I thought something like this would be faster:

_TEXT SEGMENT
_addr$ = 04h ; size = 4
_mask$ = 08h ; size = 4
_ATOMIC_AND32 PROC NEAR
mov edx, _addr$[esp]
mov  ecx, _mask$[esp]
push ebx

__loop:
mov  ebx, [edx]
mov  eax, ebx
push ebx

and  ebx, ecx
lock cmpxchg [edx], ebx

pop  ebx
cmp  eax, ebx
jne  __loop

pop  ebx
ret
_ATOMIC_AND32 ENDP
_TEXT ENDS


This may look like paranoia, but this "atomic and" can hang the thread that uses it.
If other threads constantly change the variable being ANDed, this code may simply loop forever.
In a user-mode multitasking environment the hang probably won't be permanent, but this is not a generic solution, especially in truly highly multithreaded code that will (or may) change the variable constantly. In real-time driver threads it could hang a CPU/core.
In any case, it is not efficient.

"BTS-like" spinlocks are very cheap in terms of atomic operations, since they guarantee that only one code path gains control at a time. The code above doesn't gain control; it just changes the value, changes it again if it was changed somewhere in between, and changes it again, and ... :bg

FrEEzE2046

That's only possible if the other modifying processor executes no other code than stores to that memory location. Something like:

lea  edi, var
mov  ecx, -1
__loop:
mov  eax, ecx
xchg [edi], eax   ; store the counter to 'var' (xchg with memory is implicitly locked)
dec  ecx
jnz  __loop


Any other situation wouldn't be a problem. My code just loops if the memory changed between the load and store:

- mov  ebx, [edx]
| mov  eax, ebx
| push ebx

| and  ebx, ecx
- lock cmpxchg [edx], ebx


And, like I already said, it's just what the WinAPI and Intel's TBB do. But I know you're right: it could be an infinite loop ;)