Atomic operations, std::atomic<> and ordering of writes
GCC compiles this:
#include <atomic>
std::atomic<int> a;
int b(0);
void func()
{
    b = 2;
    a = 1;
}
to this:
func():
mov DWORD PTR b[rip], 2
mov DWORD PTR a[rip], 1
mfence
ret
So, to clarify things for me:
- Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2?
- Why does the MFENCE happen after the write to ‘a’, not before?
- Is the write to ‘a’ guaranteed to be an atomic (in the narrow, non-C++ sense) operation anyway, and does that apply for all Intel processors? I assume so from this output code.
Also, clang (v3.5.1 -O3) does this:
mov dword ptr [rip + b], 2
mov eax, 1
xchg dword ptr [rip + a], eax
ret
Which appears more straightforward to my little mind, but why the different approach, what’s the advantage of each?
I put your example on the Godbolt compiler explorer, and added some functions to read, increment, or combine (a+=b) two atomic variables. I also used a.store(1, memory_order_release); instead of a = 1; to avoid getting more ordering than needed, so it's just a simple store on x86.
See below for (hopefully correct) explanations. update: I had "release" semantics confused with just a StoreStore barrier. I think I fixed all the mistakes, but may have left some.
The easy question first:
Is the write to ‘a’ guaranteed to be an atomic?
Yes, any thread reading a will get either the old or the new value, not some half-written value. This happens for free on x86 and most other architectures with any aligned type that fits in a register (so not, e.g., int64_t on 32-bit). Thus, on many systems, this happens to be true for b as well, the way most compilers would generate code.
There are some types of stores that may not be atomic on an x86, including unaligned stores that cross a cache line boundary. But std::atomic of course guarantees whatever alignment is necessary.
Read-modify-write operations are where this gets interesting. 1000 evaluations of a+=3 done in multiple threads at once will always produce a += 3000. You'd potentially get fewer if a wasn't atomic.
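To make that concrete, here's a minimal sketch (the helper name add_three_concurrently and the thread/iteration counts are made up for illustration, they're not from the question). Each thread does a += 3 on a shared std::atomic<int>, and the total always comes out exact:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: every thread adds 3 to a shared atomic counter
// `iterations` times. Returns the final value of the counter.
int add_three_concurrently(int num_threads, int iterations)
{
    std::atomic<int> a(0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&a, iterations] {
            for (int i = 0; i < iterations; ++i)
                a += 3;   // atomic read-modify-write: no update can be lost
        });
    }
    for (auto& w : workers)
        w.join();
    return a.load();
}
```

With 4 threads doing 250 iterations each, that's 1000 evaluations of a += 3, so the result is 3000 every time. Swap the std::atomic<int> for a plain int and you have a data race (undefined behaviour), which in practice tends to show up as lost updates.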
Fun fact: signed atomic types guarantee two's complement wraparound, unlike normal signed types. C and C++ still cling to the idea of leaving signed integer overflow undefined in other cases. Some CPUs don't have arithmetic right shift, so leaving right-shift of negative numbers implementation-defined makes some sense, but otherwise it just feels like a ridiculous hoop to jump through now that all CPUs use 2's complement and 8-bit bytes. </rant>
Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2?
Yes, because of the guarantees provided by std::atomic.
Now we're getting into the memory model of the language, and the hardware it runs on.
C11 and C++11 have a very weak memory ordering model, which means the compiler is allowed to reorder memory operations unless you tell it not to. (source: Jeff Preshing's Weak vs. Strong Memory Models). Even if x86 is your target machine, you have to stop the compiler from re-ordering stores at compile time. (e.g. normally you'd want the compiler to hoist a = 1 out of a loop that also writes to b.)
Using C++11 atomic types gives you full sequential-consistency ordering of operations on them with respect to the rest of the program, by default. This means they're a lot more than just atomic. See below for relaxing the ordering to just what's needed, which avoids expensive fence operations.
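Here's a rough sketch of what that default ordering buys you for the code in the question. func() is the question's function; wait_and_read_b and run_once are hypothetical names I'm adding just to have a reader thread:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> a(0);  // the flag from the question
int b = 0;              // plain, non-atomic payload

void func()             // producer, same body as in the question
{
    b = 2;
    a = 1;              // seq_cst store by default: b = 2 can't move after it
}

// Hypothetical consumer: spin until the flag is set, then read the payload.
int wait_and_read_b()
{
    while (a.load() != 1)           // seq_cst load by default
        std::this_thread::yield();
    return b;                       // we saw a == 1, so we must see b == 2
}

// Hypothetical driver: runs one producer and one consumer.
int run_once()
{
    a = 0;
    b = 0;
    int seen = -1;
    std::thread consumer([&seen] { seen = wait_and_read_b(); });
    std::thread producer(func);
    producer.join();
    consumer.join();
    return seen;
}
```

The read of b isn't a data race here, because it only happens after the load of a has observed the producer's store.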
Why does the MFENCE happen after the write to ‘a’, not before?
StoreStore fences are a no-op with x86's strong memory model, so the compiler just has to put the store to b before the store to a to implement the source code ordering.
Full sequential consistency also requires that the store be globally ordered / globally visible before any later loads in program order.
x86 can re-order stores after loads. In practice, what happens is that out-of-order execution sees an independent load in the instruction stream, and executes it ahead of a store that was still waiting on the data to be ready. Anyway, sequential-consistency forbids this, so gcc uses MFENCE, which is a full barrier, including StoreLoad (the only kind x86 doesn't have for free). LFENCE/SFENCE are only useful for weakly-ordered operations like movnt.
Another way to put this is the way the C++ docs use: sequential consistency guarantees that all threads see all changes in the same order. The MFENCE after every atomic store makes sure our store is globally visible before any of our later loads execute. Otherwise, our loads could see our own stores before other threads' loads saw them. A StoreLoad barrier (MFENCE) delays our loads until after the stores that need to happen first.
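The classic litmus test for StoreLoad ordering is two threads that each store to one variable and then load the other. A minimal sketch (count_both_zero, x, y, r1 and r2 are illustrative names, not from the question):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical litmus test. Each round, thread 1 stores x then loads y,
// and thread 2 stores y then loads x, all with the default seq_cst ordering.
// Returns how many rounds ended with BOTH threads having read 0.
int count_both_zero(int rounds)
{
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x(0), y(0);
        int r1 = -1, r2 = -1;
        std::thread t1([&] { x.store(1); r1 = y.load(); });
        std::thread t2([&] { y.store(1); r2 = x.load(); });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0)
            ++both_zero;    // impossible under sequential consistency
    }
    return both_zero;
}
```

With seq_cst there is a single total order of the four operations, so at least one thread has to see the other's store. With weaker orderings (release stores and acquire loads, say) the language allows both loads to return 0, and x86 hardware can actually produce that result, because each load may run while the thread's own store is still waiting to become globally visible.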
The ARM32 asm for b=2; a=1; is:
# get pointers and constants into registers
str r1, [r3] # store b=2
dmb sy # Data Memory Barrier: full memory barrier to order the stores.
# I think just a StoreStore barrier here (dmb st) would be sufficient, but gcc doesn't do that. Maybe later versions have that optimization, or maybe I'm wrong.
str r2, [r3, #4] # store a=1 (a is 4 bytes after b)
dmb sy # full memory barrier to order this store wrt. all following loads and stores.
I don't know ARM asm, but what I've figured out so far is that normally it's op dest, src1 [,src2], but loads and stores always have the register operand first and the memory operand 2nd. This is really weird if you're used to x86, where a memory operand can be the source or dest for most non-vector instructions. Loading immediate constants also takes a lot of instructions, because the fixed instruction length only leaves room for 16 bits of payload for movw (move word) / movt (move top).
Release / Acquire
The release and acquire naming for one-way memory barriers comes from locks:
- One thread modifies a shared data structure, then releases a lock. The unlock has to be globally visible after all the loads/stores to data it's protecting. (StoreStore + LoadStore)
- Another thread acquires the lock (with a load, or an RMW whose load half is an acquire-load), and must do all loads/stores to the shared data structure after the acquire becomes globally visible. (LoadLoad + LoadStore)
Note that std::atomic uses these names even for standalone fences, which are slightly different from load-acquire or store-release operations. (See atomic_thread_fence, below.)
Release/Acquire semantics are stronger than what producer-consumer requires. That just requires one-way StoreStore (producer) and one-way LoadLoad (consumer), without LoadStore ordering.
A shared hash table protected by a readers/writers lock (for example) requires an acquire-load / release-store atomic read-modify-write operation to acquire the lock. x86 lock xadd is a full barrier (including StoreLoad), but ARM64 has load-acquire/store-release versions of load-linked/store-conditional for doing atomic read-modify-writes. As I understand it, this avoids the need for a StoreLoad barrier even for locking.
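To tie the lock analogy to actual code, here's a minimal spinlock sketch using only acquire and release orderings. The SpinLock class and the count_with_lock driver are hypothetical, written for illustration, and a real lock would want backoff and fairness that this one doesn't have:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Minimal spinlock sketch (hypothetical, not production quality).
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock()
    {
        // RMW with acquire ordering: accesses inside the critical section
        // can't be reordered above taking the lock.
        while (locked.exchange(true, std::memory_order_acquire))
            std::this_thread::yield();
    }
    void unlock()
    {
        // Release store: accesses inside the critical section can't be
        // reordered below dropping the lock.
        locked.store(false, std::memory_order_release);
    }
};

// Hypothetical driver: several threads bump a plain (non-atomic) counter,
// always while holding the lock.
long count_with_lock(int num_threads, int iterations)
{
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers)
        w.join();
    return counter;
}
```

Nothing here asks for seq_cst, so on an ISA with load-acquire / store-release instructions the compiler is free to avoid a full barrier.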
Using weaker but still sufficient ordering
Writes to std::atomic types are ordered with respect to every other memory access in source code (both loads and stores), by default. You can control what ordering is imposed with std::memory_order.
In your case, you only need your producer to make sure stores become globally visible in the correct order, i.e. a StoreStore barrier before the store to a. store(memory_order_release) includes this and more. std::atomic_thread_fence(memory_order_release) is a standalone fence that keeps all earlier loads and stores ahead of all later stores (LoadStore + StoreStore), rather than ordering things around one particular store. x86 does StoreStore for free, so all the compiler has to do is put the stores in source order.
Release instead of seq_cst will be a big performance win, esp. on architectures like x86 where release is cheap/free. This is even more true if the no-contention case is common.
Reading atomic variables also imposes full sequential consistency of the load with respect to all other loads and stores. On x86, this is free. LoadLoad and LoadStore barriers are no-ops and implicit in every memory op. You can make your code more efficient on weakly-ordered ISAs by using a.load(std::memory_order_acquire).
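Putting the release store and the acquire load together, a minimal sketch of the flag-plus-payload pattern could look like this (producer, consumer and run_release_acquire are names I made up for the example):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> a(0);  // flag
int b = 0;              // non-atomic payload

void producer()
{
    b = 2;
    // Release store: the write to b can't be reordered after this.
    // On x86 this is a plain mov, with no mfence.
    a.store(1, std::memory_order_release);
}

int consumer()
{
    // Acquire load: the read of b below can't be reordered before this.
    while (a.load(std::memory_order_acquire) != 1)
        std::this_thread::yield();
    return b;   // must be 2 once we've seen a == 1
}

// Hypothetical driver: one producer, one consumer.
int run_release_acquire()
{
    a.store(0);
    b = 0;
    int seen = -1;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

This gives the same guarantee for b as the default seq_cst version, minus the StoreLoad barrier that nothing in this pattern needs.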
Note that the std::atomic standalone fence functions confusingly reuse the "acquire" and "release" names for fences that aren't tied to one atomic operation: a release fence orders all earlier loads and stores before all later stores, and an acquire fence orders all earlier loads before all later loads and stores. In practice, they will usually emit HW instructions that are 2-way barriers. This doc is the proposal for what became the current standard. You can see how memory_order_release maps to a #LoadStore | #StoreStore on SPARC RMO, which I assume was included partly because it has all the barrier types separately. (hmm, the cppref web page only mentions ordering stores, not the LoadStore component. It's not the C++ standard, though, so maybe the full standard says more.)
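For comparison, here's a sketch of the same flag-plus-payload handoff written with standalone fences and relaxed atomic operations. The names flag, payload, producer_with_fence, consumer_with_fence and run_fences are all hypothetical:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> flag(0);
int payload = 0;        // non-atomic data being handed over

void producer_with_fence()
{
    payload = 2;
    // Release fence: everything above it stays ahead of the stores below it.
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}

int consumer_with_fence()
{
    while (flag.load(std::memory_order_relaxed) != 1)
        std::this_thread::yield();
    // Acquire fence: everything below it stays behind the loads above it.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;     // must be 2
}

// Hypothetical driver: one producer, one consumer.
int run_fences()
{
    flag.store(0);
    payload = 0;
    int seen = -1;
    std::thread c([&seen] { seen = consumer_with_fence(); });
    std::thread p(producer_with_fence);
    p.join();
    c.join();
    return seen;
}
```

The fence version orders all of the thread's stores (or loads), not just those around one atomic operation, so it's at least as strong as store(release) / load(acquire) and can cost more on weakly-ordered hardware.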
memory_order_consume isn't strong enough for this use-case. This post talks about your case of using a flag to indicate that other data is ready, and talks about memory_order_consume.
consume would be enough if your flag was a pointer to b, or even a pointer to a struct or array. However, no compiler knows how to do the dependency tracking to make sure it puts things in the proper order in the asm, so current implementations always treat consume as acquire. This is too bad, because every architecture except DEC Alpha (and C++11's software model) provides this ordering for free. According to Linus Torvalds, only a few Alpha hardware implementations actually could have this kind of reordering, so the expensive barrier instructions needed all over the place were pure downside for most Alphas.
The producer still needs to use release semantics (a StoreStore barrier), to make sure the new payload is visible when the pointer is updated.
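A sketch of what the pointer-publishing version might look like. The Payload struct and the publish, read_when_published and run_consume functions are hypothetical, and as noted above, compilers currently just treat the consume load as acquire:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Payload { int value; };          // hypothetical payload type

std::atomic<Payload*> ptr(nullptr);     // the "flag" is now a pointer to the data

void publish(Payload* p)
{
    p->value = 2;
    // Release store: the payload must be visible before the pointer is.
    ptr.store(p, std::memory_order_release);
}

int read_when_published()
{
    Payload* p;
    while ((p = ptr.load(std::memory_order_consume)) == nullptr)
        std::this_thread::yield();
    // This load is data-dependent on p, which is all consume promises to order.
    return p->value;
}

// Hypothetical driver: one publisher, one reader.
int run_consume()
{
    Payload data{0};
    ptr.store(nullptr);
    int seen = -1;
    std::thread reader([&seen] { seen = read_when_published(); });
    std::thread writer([&data] { publish(&data); });
    writer.join();
    reader.join();
    return seen;
}
```

Note that only accesses reached through p are ordered. Reading some unrelated global after the consume load would not be covered.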
It's not a bad idea to write code using consume, if you're sure you understand the implications and don't depend on anything that consume doesn't guarantee. In the future, once compilers are smarter, your code will compile without barrier instructions even on ARM/PPC. The actual data movement still has to happen between caches on different CPUs, but on weak memory model machines, you can avoid waiting for any unrelated writes to be visible (e.g. scratch buffers in the producer).
Just keep in mind that you can't actually test memory_order_consume code experimentally, because current compilers are giving you stronger ordering than the code requests.
It's really hard to test any of this experimentally anyway, because it's timing-sensitive. Also, unless the compiler re-orders operations (because you failed to tell it not to), producer-consumer threads will never have a problem on x86. You'd need to test on an ARM or PowerPC or something to even try to look for ordering problems happening in practice.
references:
- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458: I reported the gcc bug I found with b=2; a.store(1, MO_release); b=3; producing a=1; b=3 on x86, rather than b=3; a=1;
- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461: I also reported the fact that ARM gcc uses two dmb sy in a row for a=1; a=1;, and x86 gcc could maybe do with fewer mfence operations. I'm not sure if an mfence between each store is needed to protect a signal handler from making wrong assumptions, or if it's just a missing optimization.
- The Purpose of memory_order_consume in C++11 (already linked above) covers exactly this case of using a flag to pass a non-atomic payload between threads.
- What StoreLoad barriers (x86 mfence) are for: a working sample program that demonstrates the need: http://preshing.com/20120515/memory-reordering-caught-in-the-act/
- Data-dependency barriers (only Alpha needs explicit barriers of this type, but C++ potentially needs them to prevent the compiler doing speculative loads): http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#360
- Control-dependency barriers: http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#592
- Doug Lea says x86 only needs LFENCE for data that was written with "streaming" writes like movntdqa or movnti. (NT = non-temporal). Besides bypassing the cache, x86 NT loads/stores have weakly-ordered semantics.
- http://preshing.com/20120913/acquire-and-release-semantics/
- http://preshing.com/20120612/an-introduction-to-lock-free-programming/ (pointers to books and other stuff he recommends).
- Interesting thread on realworldtech about whether barriers everywhere or strong memory models are better, including the point that data-dependency is nearly free in HW, so it's dumb to skip it and put a large burden on software. (The thing Alpha (and C++) doesn't have, but everything else does). Go back a few posts from that to see Linus Torvalds' amusing insults, before he got around to explaining more detailed / technical reasons for his arguments.