How do mutex lock and unlock functions prevent CPU reordering?

As far as I know, a function call acts as a compiler barrier, but not as a CPU barrier.

This tutorial says the following:

acquiring a lock implies acquire semantics, while releasing a lock implies release semantics! All the memory operations in between are contained inside a nice little barrier sandwich, preventing any undesirable memory reordering across the boundaries.

I assume that the above quote is talking about CPU reordering and not about compiler reordering.

But I don't understand how a mutex lock and unlock cause the CPU to give these functions acquire and release semantics.

For example, if we have the following C code:

pthread_mutex_lock(&lock);
i = 10;
j = 20;
pthread_mutex_unlock(&lock);

The above C code is translated into the following (pseudo) assembly instructions:

push the address of lock onto the stack
call pthread_mutex_lock()
mov 10 into i
mov 20 into j
push the address of lock onto the stack
call pthread_mutex_unlock()

Now what prevents the CPU from reordering mov 10 into i and mov 20 into j to before call pthread_mutex_lock() or to after call pthread_mutex_unlock()?

If it is the call instruction that prevents the CPU from doing the reordering, then why does the tutorial I quoted make it seem like it is the mutex lock and unlock functions that prevent the CPU reordering? Why didn't the tutorial say that any function call will prevent CPU reordering?

My question is about the x86 architecture.


Solution 1:

The short answer is that the body of the pthread_mutex_lock and pthread_mutex_unlock calls will include the necessary platform-specific memory barriers which will prevent the CPU from moving memory accesses within the critical section outside of it. The instruction flow will move from the calling code into the lock and unlock functions via a call instruction, and it is this dynamic instruction trace you have to consider for the purposes of reordering - not the static sequence you see in an assembly listing.

On x86 specifically, you probably won't find explicit, standalone memory barriers inside those functions, since you'll already have lock-prefixed instructions in order to perform the actual locking and unlocking atomically, and these instructions imply a full memory barrier, which prevents the CPU reordering you are concerned about.

For example, on my Ubuntu 16.04 system with glibc 2.23, pthread_mutex_lock is implemented using a lock cmpxchg (compare-and-exchange) and pthread_mutex_unlock is implemented using lock dec (decrement), both of which have full barrier semantics.

Solution 2:

If i and j are local variables, nothing. The compiler can keep them in registers across the function call if it can prove that nothing outside the current function has their address.

But any global variables, or locals whose address might be stored in a global, do have to be "in sync" in memory for a non-inline function call. The compiler has to assume that any function call it can't inline modifies any / every variable it can possibly have a reference to.

So for example, if int i; is a local variable, after sscanf("0", "%d", &i); its address will have escaped the function and the compiler will then have to spill/reload it around function calls instead of keeping it in a call-preserved register.

See my answer on Understanding volatile asm vs volatile variable, with an example of asm volatile("":::"memory") being a barrier for a local variable whose address escaped the function (sscanf("0", "%d", &i);), but not for locals that are still purely local. It's exactly the same behaviour for exactly the same reason.


I assume that the above quote is talking about CPU reordering and not about compiler reordering.

It's talking about both, because both are necessary for correctness.

This is why the compiler can't reorder updates to shared variables with any function call. (This is very important: the weak C11 memory model allows lots of compile-time reordering. The strong x86 memory model only allows StoreLoad reordering, and local store-forwarding.)

pthread_mutex_lock being a non-inline function call takes care of compile-time reordering, and the fact that it does a locked operation, an atomic RMW, also means it includes a full runtime memory barrier on x86. (Not the call instruction itself, though, just the code in the function body.) This gives it acquire semantics.

Unlocking a spinlock only needs a release-store, not an RMW, so depending on the implementation details the unlock function might not be a StoreLoad barrier. (This is still OK: it keeps everything in the critical section from getting out; it's not necessary to stop later operations from appearing before the unlock. See Jeff Preshing's article explaining Acquire and Release semantics.)

On a weakly-ordered ISA, those mutex functions would run barrier instructions, like ARM dmb (data memory barrier). Normal functions wouldn't, so the author of that guide is correct to point out that those functions are special.


Now what prevents the CPU from reordering mov 10 into i and mov 20 into j to above call pthread_mutex_lock()

This isn't the important reason (because on a weakly-ordered ISA pthread_mutex_unlock would run a barrier instruction), but it is actually true on x86 that the stores can't even be reordered with the call instruction, let alone with the actual locking/unlocking of the mutex done by the function body before it returns.

x86 has strong memory-ordering semantics (stores don't reorder with other stores), and call is a store (pushing the return address).

So mov [i], 10 must appear in the global store order between the stores done by the two call instructions.

Of course in a normal program, nobody is observing the call stack of other threads, just the xchg to take the mutex or the release-store to release it in pthread_mutex_unlock.