What exactly is std::atomic?
I understand that std::atomic<>
is an atomic object. But atomic to what extent? To my understanding an operation can be atomic. What exactly is meant by making an object atomic? For example if there are two threads concurrently executing the following code:
a = a + 12;
Then is the entire operation (say add_twelve_to(int)
) atomic? Or are changes made to the variable atomic (so operator=()
)?
Solution 1:
Each instantiation and full specialization of std::atomic<> represents a type that different threads can simultaneously operate on (their instances), without raising undefined behavior:
Objects of atomic types are the only C++ objects that are free from data races; that is, if one thread writes to an atomic object while another thread reads from it, the behavior is well-defined.
In addition, accesses to atomic objects may establish inter-thread synchronization and order non-atomic memory accesses as specified by
std::memory_order
.
std::atomic<>
wraps operations that, in pre-C++ 11 times, had to be performed using (for example) interlocked functions with MSVC or atomic bultins in case of GCC.
Also, std::atomic<>
gives you more control by allowing various memory orders that specify synchronization and ordering constraints. If you want to read more about C++ 11 atomics and memory model, these links may be useful:
- C++ atomics and memory ordering
- Comparison: Lockless programming with atomics in C++ 11 vs. mutex and RW-locks
- C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?
- Concurrency in C++11
Note that, for typical use cases, you would probably use overloaded arithmetic operators or another set of them:
std::atomic<long> value(0);
value++; //This is an atomic op
value += 5; //And so is this
Because operator syntax does not allow you to specify the memory order, these operations will be performed with std::memory_order_seq_cst
, as this is the default order for all atomic operations in C++ 11. It guarantees sequential consistency (total global ordering) between all atomic operations.
In some cases, however, this may not be required (and nothing comes for free), so you may want to use more explicit form:
std::atomic<long> value {0};
value.fetch_add(1, std::memory_order_relaxed); // Atomic, but there are no synchronization or ordering constraints
value.fetch_add(5, std::memory_order_release); // Atomic, performs 'release' operation
Now, your example:
a = a + 12;
will not evaluate to a single atomic op: it will result in a.load()
(which is atomic itself), then addition between this value and 12
and a.store()
(also atomic) of final result. As I noted earlier, std::memory_order_seq_cst
will be used here.
However, if you write a += 12
, it will be an atomic operation (as I noted before) and is roughly equivalent to a.fetch_add(12, std::memory_order_seq_cst)
.
As for your comment:
A regular
int
has atomic loads and stores. Whats the point of wrapping it withatomic<>
?
Your statement is only true for architectures that provide such guarantee of atomicity for stores and/or loads. There are architectures that do not do this. Also, it is usually required that operations must be performed on word-/dword-aligned address to be atomic std::atomic<>
is something that is guaranteed to be atomic on every platform, without additional requirements. Moreover, it allows you to write code like this:
void* sharedData = nullptr;
std::atomic<int> ready_flag = 0;
// Thread 1
void produce()
{
sharedData = generateData();
ready_flag.store(1, std::memory_order_release);
}
// Thread 2
void consume()
{
while (ready_flag.load(std::memory_order_acquire) == 0)
{
std::this_thread::yield();
}
assert(sharedData != nullptr); // will never trigger
processData(sharedData);
}
Note that assertion condition will always be true (and thus, will never trigger), so you can always be sure that data is ready after while
loop exits. That is because:
-
store()
to the flag is performed aftersharedData
is set (we assume thatgenerateData()
always returns something useful, in particular, never returnsNULL
) and usesstd::memory_order_release
order:
memory_order_release
A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable
-
sharedData
is used afterwhile
loop exits, and thus afterload()
from flag will return a non-zero value.load()
usesstd::memory_order_acquire
order:
std::memory_order_acquire
A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread.
This gives you precise control over the synchronization and allows you to explicitly specify how your code may/may not/will/will not behave. This would not be possible if only guarantee was the atomicity itself. Especially when it comes to very interesting sync models like the release-consume ordering.
Solution 2:
std::atomic
exists because many ISAs have direct hardware support for it
What the C++ standard says about std::atomic
has been analyzed in other answers.
So now let's see what std::atomic
compiles to to get a different kind of insight.
The main takeaway from this experiment is that modern CPUs have direct support for atomic integer operations, for example the LOCK prefix in x86, and std::atomic
basically exists as a portable interface to those intructions: What does the "lock" instruction mean in x86 assembly? In aarch64, LDADD would be used.
This support allows for faster alternatives to more general methods such as std::mutex
, which can make more complex multi-instruction sections atomic, at the cost of being slower than std::atomic
because std::mutex
it makes futex
system calls in Linux, which is way slower than the userland instructions emitted by std::atomic
, see also: Does std::mutex create a fence?
Let's consider the following multi-threaded program which increments a global variable across multiple threads, with different synchronization mechanisms depending on which preprocessor define is used.
main.cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>
size_t niters;
#if STD_ATOMIC
std::atomic_ulong global(0);
#else
uint64_t global = 0;
#endif
void threadMain() {
for (size_t i = 0; i < niters; ++i) {
#if LOCK
__asm__ __volatile__ (
"lock incq %0;"
: "+m" (global),
"+g" (i) // to prevent loop unrolling
:
:
);
#else
__asm__ __volatile__ (
""
: "+g" (i) // to prevent he loop from being optimized to a single add
: "g" (global)
:
);
global++;
#endif
}
}
int main(int argc, char **argv) {
size_t nthreads;
if (argc > 1) {
nthreads = std::stoull(argv[1], NULL, 0);
} else {
nthreads = 2;
}
if (argc > 2) {
niters = std::stoull(argv[2], NULL, 0);
} else {
niters = 10;
}
std::vector<std::thread> threads(nthreads);
for (size_t i = 0; i < nthreads; ++i)
threads[i] = std::thread(threadMain);
for (size_t i = 0; i < nthreads; ++i)
threads[i].join();
uint64_t expect = nthreads * niters;
std::cout << "expect " << expect << std::endl;
std::cout << "global " << global << std::endl;
}
GitHub upstream.
Compile, run and disassemble:
comon="-ggdb3 -O3 -std=c++11 -Wall -Wextra -pedantic main.cpp -pthread"
g++ -o main_fail.out $common
g++ -o main_std_atomic.out -DSTD_ATOMIC $common
g++ -o main_lock.out -DLOCK $common
./main_fail.out 4 100000
./main_std_atomic.out 4 100000
./main_lock.out 4 100000
gdb -batch -ex "disassemble threadMain" main_fail.out
gdb -batch -ex "disassemble threadMain" main_std_atomic.out
gdb -batch -ex "disassemble threadMain" main_lock.out
Extremely likely "wrong" race condition output for main_fail.out
:
expect 400000
global 100000
and deterministic "correct" output of the others:
expect 400000
global 400000
Disassembly of main_fail.out
:
0x0000000000002780 <+0>: endbr64
0x0000000000002784 <+4>: mov 0x29b5(%rip),%rcx # 0x5140 <niters>
0x000000000000278b <+11>: test %rcx,%rcx
0x000000000000278e <+14>: je 0x27b4 <threadMain()+52>
0x0000000000002790 <+16>: mov 0x29a1(%rip),%rdx # 0x5138 <global>
0x0000000000002797 <+23>: xor %eax,%eax
0x0000000000002799 <+25>: nopl 0x0(%rax)
0x00000000000027a0 <+32>: add $0x1,%rax
0x00000000000027a4 <+36>: add $0x1,%rdx
0x00000000000027a8 <+40>: cmp %rcx,%rax
0x00000000000027ab <+43>: jb 0x27a0 <threadMain()+32>
0x00000000000027ad <+45>: mov %rdx,0x2984(%rip) # 0x5138 <global>
0x00000000000027b4 <+52>: retq
Disassembly of main_std_atomic.out
:
0x0000000000002780 <+0>: endbr64
0x0000000000002784 <+4>: cmpq $0x0,0x29b4(%rip) # 0x5140 <niters>
0x000000000000278c <+12>: je 0x27a6 <threadMain()+38>
0x000000000000278e <+14>: xor %eax,%eax
0x0000000000002790 <+16>: lock addq $0x1,0x299f(%rip) # 0x5138 <global>
0x0000000000002799 <+25>: add $0x1,%rax
0x000000000000279d <+29>: cmp %rax,0x299c(%rip) # 0x5140 <niters>
0x00000000000027a4 <+36>: ja 0x2790 <threadMain()+16>
0x00000000000027a6 <+38>: retq
Disassembly of main_lock.out
:
Dump of assembler code for function threadMain():
0x0000000000002780 <+0>: endbr64
0x0000000000002784 <+4>: cmpq $0x0,0x29b4(%rip) # 0x5140 <niters>
0x000000000000278c <+12>: je 0x27a5 <threadMain()+37>
0x000000000000278e <+14>: xor %eax,%eax
0x0000000000002790 <+16>: lock incq 0x29a0(%rip) # 0x5138 <global>
0x0000000000002798 <+24>: add $0x1,%rax
0x000000000000279c <+28>: cmp %rax,0x299d(%rip) # 0x5140 <niters>
0x00000000000027a3 <+35>: ja 0x2790 <threadMain()+16>
0x00000000000027a5 <+37>: retq
Conclusions:
-
the non-atomic version saves the global to a register, and increments the register.
Therefore, at the end, very likely four writes happen back to global with the same "wrong" value of
100000
. -
std::atomic
compiles tolock addq
. The LOCK prefix makes the followinginc
fetch, modify and update memory atomically. -
our explicit inline assembly LOCK prefix compiles to almost the same thing as
std::atomic
, except that ourinc
is used instead ofadd
. Not sure why GCC choseadd
, considering that our INC generated a decoding 1 byte smaller.
ARMv8 could use either LDAXR + STLXR or LDADD in newer CPUs: How do I start threads in plain C?
Tested in Ubuntu 19.10 AMD64, GCC 9.2.1, Lenovo ThinkPad P51.
Solution 3:
I understand that
std::atomic<>
makes an object atomic.
That's a matter of perspective... you can't apply it to arbitrary objects and have their operations become atomic, but the provided specialisations for (most) integral types and pointers can be used.
a = a + 12;
std::atomic<>
does not (use template expressions to) simplify this to a single atomic operation, instead the operator T() const volatile noexcept
member does an atomic load()
of a
, then twelve is added, and operator=(T t) noexcept
does a store(t)
.