How do memory_order_seq_cst and memory_order_acq_rel differ?
http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially, memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.
The example boils down to this:
bool x = false;
bool y = false;
int z = 0;
a() { x = true; }
b() { y = true; }
c() { while (!x); if (y) z++; }
d() { while (!y); if (x) z++; }
// kick off a, b, c, d, join all threads
assert(z != 0);
Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.
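For concreteness, here's a minimal runnable version of that sketch, assuming x, y and z are std::atomic (as in the cppreference example); the names a/b/c/d are just the pseudocode's:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x{false}, y{false};
std::atomic<int> z{0};

void a() { x.store(true, std::memory_order_seq_cst); }
void b() { y.store(true, std::memory_order_seq_cst); }
void c() { while (!x.load(std::memory_order_seq_cst)); if (y.load(std::memory_order_seq_cst)) z++; }
void d() { while (!y.load(std::memory_order_seq_cst)); if (x.load(std::memory_order_seq_cst)) z++; }

int main() {
    std::thread ta(a), tb(b), tc(c), td(d);
    ta.join(); tb.join(); tc.join(); td.join();
    assert(z.load() != 0); // can never fire with seq_cst everywhere
}

With every operation memory_order_seq_cst the assert can never fire; weaken the stores to release and the loads to acquire, and z == 0 becomes a permitted outcome.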
On ISAs like x86 where atomics map to barriers, and the actual machine model includes a store buffer:

- seq_cst stores require flushing the store buffer, so this thread's later reads are delayed until after the store is globally visible.
- acquire or release do not have to flush the store buffer. Normal x86 loads and stores have essentially acq and rel semantics (seq_cst plus a store buffer with store forwarding).

But x86 atomic RMW operations always get promoted to seq_cst because the x86 asm lock prefix is a full memory barrier. Other ISAs can do relaxed or acq_rel RMWs in asm, with the store side being able to do limited reordering with later stores. (But not in ways that would make the RMW appear non-atomic: For purposes of ordering, is atomic read-modify-write one operation or two?)
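A quick way to see the RMW promotion from C++ (the asm in the comment is what mainstream x86 compilers typically emit, e.g. on Compiler Explorer, not something the standard mandates):

#include <atomic>

std::atomic<int> counter{0};

int bump() {
    // Requested relaxed, but on x86 this still compiles to a lock-prefixed
    // RMW such as `lock xadd`, which is a full barrier, i.e. effectively seq_cst.
    return counter.fetch_add(1, std::memory_order_relaxed);
}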
https://preshing.com/20120515/memory-reordering-caught-in-the-act is an instructive example of the difference between a seq_cst store and a plain release store. (It's actually mov + mfence vs. plain mov in x86 asm. In practice xchg is a more efficient way to do a seq_cst store on most x86 CPUs, but GCC does use mov + mfence.)
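The same comparison from the C++ side (again typical codegen, not a guarantee; it varies by compiler and version):

#include <atomic>

std::atomic<int> g{0};

void store_release(int v) {
    g.store(v, std::memory_order_release); // x86: plain mov
}

void store_seq_cst(int v) {
    g.store(v, std::memory_order_seq_cst); // x86: xchg, or mov + mfence with some GCC versions
}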
Fun fact: AArch64's LDAR acquire-load instruction is actually a sequential-acquire, having a special interaction with STLR. Not until ARMv8.3 LDAPR can arm64 do plain acquire operations that can reorder with earlier release and seq_cst stores (STLR). (seq_cst loads still use LDAR because they need that interaction with STLR to recover sequential consistency; seq_cst and release stores both use STLR.)
With STLR / LDAR you get sequential consistency while only having to drain the store buffer before the next LDAR, not right away after each seq_cst store before other operations. I think real AArch64 hardware does implement it this way, rather than simply draining the store buffer before committing an STLR.
Strengthening rel or acq_rel to seq_cst by using LDAR / STLR doesn't need to be expensive, unless you seq_cst store something, and then seq_cst load something else. Then it's just as bad as x86.
Some other ISAs (like PowerPC) have more choices of barriers and can strengthen up to mo_rel or mo_acq_rel more cheaply than mo_seq_cst, but their seq_cst can't be as cheap as AArch64's: seq_cst stores need a full barrier.
So AArch64 is an exception to the rule that seq_cst stores drain the store buffer on the spot, either with a special instruction or a barrier instruction afterwards. It's not a coincidence that ARMv8 was designed after C++11 / Java / etc. had basically settled on seq_cst as the default for lockless atomic operations, so making them efficient was important, and after CPU architects had a few years to think about alternatives to providing barrier instructions or just acquire/release vs. relaxed load/store instructions.
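The instruction choices described above are easy to check from C++ (typical GCC/Clang codegen for AArch64; ldapr additionally requires targeting ARMv8.3 or later):

#include <atomic>

std::atomic<int> g{0};

void store_rel(int v) { g.store(v, std::memory_order_release); } // stlr
void store_sc(int v)  { g.store(v, std::memory_order_seq_cst); } // also just stlr
int load_acq() { return g.load(std::memory_order_acquire); }     // ldar (ldapr on ARMv8.3+)
int load_sc()  { return g.load(std::memory_order_seq_cst); }     // ldar, for the STLR interaction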
Try to build Dekker's or Peterson's algorithm with just acquire/release semantics. That won't work, because acquire/release semantics don't provide a [StoreLoad] fence. In the case of Dekker's algorithm:
flag[self] = 1;                // STORE
while (true) {
    if (flag[other] == 0) {    // LOAD
        break;
    }
    flag[self] = 0;
    while (turn == other);
    flag[self] = 1;
}
Without a [StoreLoad] fence the load can effectively be reordered ahead of the store, and then the algorithm breaks: two threads can simultaneously see that the other's flag is still 0, set their own flag, and continue. And now you have two threads within the critical section.
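A sketch of that entry protocol in C++ (the lock/unlock wrappers and names are mine, not part of the original pseudocode): every store and load below uses the default memory_order_seq_cst, which supplies the [StoreLoad] ordering; weakening the flag store and the flag load to release/acquire reintroduces the bug.

#include <atomic>

std::atomic<int> flag[2] = {{0}, {0}};
std::atomic<int> turn{0};

void lock(int self) {
    int other = 1 - self;
    flag[self].store(1);              // STORE (must stay ahead of the LOAD below)
    while (true) {
        if (flag[other].load() == 0)  // LOAD
            break;
        flag[self].store(0);          // back off and wait for our turn
        while (turn.load() == other) {}
        flag[self].store(1);
    }
}

void unlock(int self) {
    turn.store(1 - self);  // hand the turn to the other thread
    flag[self].store(0);
}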
Still use the definition and example from memory_order, but replace memory_order_seq_cst with memory_order_release in the stores and memory_order_acquire in the loads. Release-Acquire ordering guarantees that everything that happened before a store in one thread becomes a visible side effect in the thread that did the load. But in our example, nothing happens before the stores in thread0 and thread1.
x.store(true, std::memory_order_release); // thread0
y.store(true, std::memory_order_release); // thread1
Furthermore, without memory_order_seq_cst, the sequential ordering of thread2 and thread3 is not guaranteed. You can imagine they become:
if (y.load(std::memory_order_acquire)) { ++z; } // thread2, load y first
while (!x.load(std::memory_order_acquire)); // and then, load x
if (x.load(std::memory_order_acquire)) { ++z; } // thread3, load x first
while (!y.load(std::memory_order_acquire)); // and then, load y
So, if thread2 and thread3 are executed before thread0 and thread1, both x and y stay false; thus ++z is never reached, z stays 0, and the assert fires. (Strictly speaking an acquire load can't be reordered with a later load on the same thread, so the swapped code above is only a mental model; the real mechanism is that without seq_cst the two independent stores may become visible to thread2 and thread3 in different orders, with the same effect.)
However, if memory_order_seq_cst enters the picture, it establishes a single total order over all atomic operations that are so tagged, and every thread observes those operations in that one order. Thus in thread2, x.load then y.load, and in thread3, y.load then x.load, are sure things, and at least one of the two reader threads must see both stores and increment z.
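For reference, here are the reader loops as actually written with acquire (read_c/read_d are hypothetical names; x, y, z are the atomics from the example):

// With release stores in thread0/thread1, the model allows thread2 to
// observe {x == true, y == false} while thread3 observes {y == true,
// x == false}: the two independent stores need not become visible to all
// threads in one agreed order, so both increments can be skipped and z == 0.
void read_c() { // thread2
    while (!x.load(std::memory_order_acquire));
    if (y.load(std::memory_order_acquire)) ++z;
}
void read_d() { // thread3
    while (!y.load(std::memory_order_acquire));
    if (x.load(std::memory_order_acquire)) ++z;
}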