The GCC Wiki gives a very thorough and easy to understand explanation with code examples.

(excerpt edited, and emphasis added)

IMPORTANT:

Upon re-reading the below quote copied from the GCC Wiki in the process of adding my own wording to the answer, I noticed that the quote is actually wrong. They got acquire and consume exactly the wrong way around. A release-consume operation only provides an ordering guarantee on dependent data whereas a release-acquire operation provides that guarantee regardless of data being dependent on the atomic value or not.

The first model is "sequentially consistent". This is the default mode used when none is specified, and it is the most restrictive. It can also be explicitly specified via memory_order_seq_cst. It provides the same restrictions and limitation to moving loads around that sequential programmers are inherently familiar with, except it applies across threads.
[...]
From a practical point of view, this amounts to all atomic operations acting as optimization barriers. It's OK to re-order things between atomic operations, but not across the operation. Thread local stuff is also unaffected since there is no visibility to other threads. [...] This mode also provides consistency across all threads.

The opposite approach is memory_order_relaxed. This model allows for much less synchronization by removing the happens-before restrictions. These types of atomic operations can also have various optimizations performed on them, such as dead store removal and commoning. [...] Without any happens-before edges, no thread can count on a specific ordering from another thread.
The relaxed mode is most commonly used when the programmer simply wants a variable to be atomic in nature rather than using it to synchronize threads for other shared memory data.

The third mode (memory_order_acquire / memory_order_release) is a hybrid between the other two. The acquire/release mode is similar to the sequentially consistent mode, except it only applies a happens-before relationship to dependent variables. This allows for a relaxing of the synchronization required between independent reads of independent writes.

memory_order_consume is a further subtle refinement in the release/acquire memory model that relaxes the requirements slightly by removing the happens before ordering on non-dependent shared variables as well.
[...]
The real difference boils down to how much state the hardware has to flush in order to synchronize. Since a consume operation may therefore execute faster, someone who knows what they are doing can use it for performance critical applications.

Here follows my own attempt at a more mundane explanation:

A different approach to look at it is to look at the problem from the point of view of reordering reads and writes, both atomic and ordinary:

All atomic operations are guaranteed to be atomic within themselves (the combination of two atomic operations is not atomic as a whole!) and to be visible in the total order in which they appear on the timeline of the execution stream. That means no atomic operation can, under any circumstances, be reordered, but other memory operations might very well be. Compilers (and CPUs) routinely do such reordering as an optimization.
It also means the compiler must use whatever instructions are necessary to guarantee that an atomic operation executing at any time will see the results of each and every other atomic operation, possibly on another processor core (but not necessarily other operations), that were executed before.

Now, a relaxed is just that, the bare minimum. It does nothing in addition and provides no other guarantees. It is the cheapest possible operation. For non-read-modify-write operations on strongly ordered processor architectures (e.g. x86/amd64) this boils down to a plain normal, ordinary move.

The sequentially consistent operation is the exact opposite, it enforces strict ordering not only for atomic operations, but also for other memory operations that happen before or after. Neither one can cross the barrier imposed by the atomic operation. Practically, this means lost optimization opportunities, and possibly fence instructions may have to be inserted. This is the most expensive model.

A release operation prevents ordinary loads and stores from being reordered after the atomic operation, whereas an acquire operation prevents ordinary loads and stores from being reordered before the atomic operation. Everything else can still be moved around.
The combination of preventing stores being moved after, and loads being moved before the respective atomic operation makes sure that whatever the acquiring thread gets to see is consistent, with only a small amount of optimization opportunity lost.
One may think of that as something like a non-existent lock that is being released (by the writer) and acquired (by the reader). Except... there is no lock.

In practice, release/acquire usually means the compiler needs not use any particularly expensive special instructions, but it cannot freely reorder loads and stores to its liking, which may miss out some (small) optimization opportuntities.

Finally, consume is the same operation as acquire, only with the exception that the ordering guarantees only apply to dependent data. Dependent data would e.g. be data that is pointed-to by an atomically modified pointer.
Arguably, that may provide for a couple of optimization opportunities that are not present with acquire operations (since fewer data is subject to restrictions), however this happens at the expense of more complex and more error-prone code, and the non-trivial task of getting dependency chains correct.

It is currently discouraged to use consume ordering while the specification is being revised.


This is a quite complex subject. Try to read http://en.cppreference.com/w/cpp/atomic/memory_order several times, try to read other resources, etc.

Here's a simplified description:

The compiler and CPU can reorder memory accesses. That is, they can happen in different order than what's specified in the code. That's fine most of the time, the problem arises when different thread try to communicate and may see such order of memory accesses that breaks the invariants of the code.

Usually you can use locks for synchronization. The problem is that they're slow. Atomic operations are much faster, because the synchronization happens at CPU level (i.e. CPU ensures that no other thread, even on another CPU, modifies some variable, etc.).

So, the one single problem we're facing is reordering of memory accesses. The memory_order enum specifies what types of reorderings compiler must forbid.

relaxed - no constraints.

consume - no loads that are dependent on the newly loaded value can be reordered wrt. the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.

acquire - no loads can be reordered wrt. the atomic load. I.e. if they are after the atomic load in the source code, they will happen after the atomic load too.

release - no stores can be reordered wrt. the atomic store. I.e. if they are before the atomic store in the source code, they will happen before the atomic store too.

acq_rel - acquire and release combined.

seq_cst - it is more difficult to understand why this ordering is required. Basically, all other orderings only ensure that specific disallowed reorderings don't happen only for the threads that consume/release the same atomic variable. Memory accesses can still propagate to other threads in any order. This ordering ensures that this doesn't happen (thus sequential consistency). For a case where this is needed see the example at the end of the linked page.


The other answers explain what operations can or can't be reordered relative to various kinds of atomic operations, but I want to provide an alternative, more high-level explanation: what various memory orders can be actually used for.


Things to ignore:

memory_order_consume - apparently no major compiler implements it, and they silently replace it with a stronger memory_order_acquire. Even the standard itself says to avoid it.

A big part of the cppreference article on memory orders deals with 'consume', so dropping it simplifies things a lot.

It also lets you ignore related features like [[carries_dependency]] and std::kill_dependency.


Data races: Writing to a non-atomic variable from one thread, and simultaneously reading/writing to it from a different thread is called a data race, and causes undefined behavior.


memory_order_relaxed is the weakest and supposedly the fastest memory order.

Any reads/writes to atomics can't cause data races (and subsequent UB). relaxed provides just this minimal guarantee, for a single variable. It doesn't provide any guarantees for other variables (atomic or not).

Relaxed operations can propagate between threads in weird order, and with varying unpredictable (yet tiny) delays, relaxed changes to different variables can become visible to different threads in different order, and so on.

The only rules are:

  • Each thread will access each separate variable in the exact order you tell it to. E.g. a.store(1, relaxed); a.store(2, relaxed); will write 1, then 2, never in the opposite order. But accesses to different variables in the same thread can still be reordered relative to each other.
  • If a thread A writes to a variable several times, then thread B reads seveal times, it will get the values in the same order (but of course it can read some values several times, or skip some, if you don't synchronize the threads in other ways).

Example uses: Anything that doesn't try to use an atomic variable to synchronize access to non-atomic data: various counters (that exist for informational purposes only), or 'stop flags' to signal other threads to stop. Another example: operations on shared_ptrs that increment the reference count internally use relaxed.


Fences: atomic_thread_fence(relaxed); does nothing.


memory_order_release, memory_order_acquire do everything relaxed does, and more (so it's supposedly slower or equivalent).

Only stores (writes) can use release. Only loads (reads) can use acquire. Read-modify-write operations such as fetch_add can be both (memory_order_acq_rel), but they don't have to.

Those let you synchronize threads:

  • Let's say thread 1 reads/writes to some memory M (any non-atomic or atomic variables, doesn't matter).

  • Then thread 1 performs a release store to a variable A. Then it stops touching that memory.

  • If thread 2 then performs an acquire load of the same variable A, this load is said to synchronize with the corresponding store in thread 1.

  • Now thread 2 can safely read/write to that memory M.

You only synchronize with the latest writer, not preceding writers.

You can chain synchronizations across multiple threads.

There's a special rule that synchronization propagates across any number of read-modify-write operations regardless of their memory order. E.g. if thread 1 does a.store(1, release);, then thread 2 does a.fetch_add(2, relaxed);, then thread 3 does a.load(acquire), then thread 1 successfully synchronizes with thread 3, even though there's a relaxed operation in the middle.

In the above rule, a release operation X, and any subsequent read-modify-write operations on the same variable X (stopping at the next non-read-modify-write operation) are called a release sequence headed by X. (So if an acquire reads from any operation in a release sequence, it synchronizes with the head of the sequence.)

If read-modify-write operations are involved, nothing stops you from synchronizing with more than one operation. In the example above, if fetch_add was using acquire or acq_rel, it too would synchronize with thread 1, and conversely, if it used release or acq_rel, the thread 3 would synchonize with 2 in addition to 1.


Example use: shared_ptr decrements its reference counter using something like fetch_sub(1, acq_rel).

Here's why: imagine that thread 1 reads/writes to ptr->..., then destroys its copy of ptr, decrementing the ref count. Then thread 2 destroys the last remaining pointer, also decrementing the ref count, and then runs the destructor.

Since the destructor in thread 2 is going to access the memory previously accessed by thread 1, the acq_rel synchronization in fetch_sub is necessary. Otherwise you'd have a data race and UB.


Fences: Using atomic_thread_fence, you can essentially turn relaxed atomic operations into release/acquire operations. A single fence can apply to more than one operation, and/or can be performed conditionally.

If you do a relaxed read (or with any other order) from one or more variables, then do atomic_thread_fence(acquire) in the same thread, then all those reads count as acquire operations.

Conversely, if you do atomic_thread_fence(release), followed by any number of (possibly relaxed) writes, those writes count as release operations.

An acq_rel fence combines the effect of acquire and release fences.


Similarity with other standard library features:

Several standard library features also cause a similar synchronizes with relation ship. E.g. locking a mutex synchronizes with the latest unlock, as if locking was an acquire operation, and unlocking was a release operation.


memory_order_seq_cst does everything acquire/release do, and more. This is supposedly the slowest order, but also the safest.

seq_cst reads count as acquire operations. seq_cst writes count as release operations. seq_cst read-modify-write operations count as both.

seq_cst operations can synchronize with each other, and with acquire/release operations. Beware of special effects of mixing them (see below).

seq_cst is the default order, e.g. given atomic_int x;, x = 1; does x.store(1, seq_cst);.

seq_cst has an extra property compared to acquire/release: all seq_cst reads and writes in the whole program happen in a single global order.

Among other things, this order respects the synchronizes with relationship described for acquire/release above, unless (and this is weird) that synchronization comes from mixing a seq-cst operation with an acquire/release operation (release syncing with seq-cst, or seq-cst synching with acquire). Such mix essentially demotes the affected seq-cst operation to an acquire/release (it maybe retains some of the seq-cst properties, but you better not count on it).


Example use:

atomic_bool x = true;
atomic_bool y = true;
// Thread 1:
x.store(false, seq_cst);
if (y.load(seq_cst)) {...}
// Thread 2:
y.store(false, seq_cst);
if (x.load(seq_cst)) {...}

Lets say you want only one thread to be able to enter the if body. seq_cst allows you to do it. Acquire/release or weaker orders wouldn't be enough here.


Fences: atomic_thread_fence(seq_cst); does everything an acq_rel fence does, plus an additional feature.

Let's say:

  • Thread 1 accesses variable X using using seq_cst order, then
  • thread 2 thread accesses the same variable X using any order (not necessarily causing a synchronization), then
  • thread 2 does atomic_thread_fence(seq_cst).

Then any operation in thread 2 after the fence will happen after the seq-cst access in thread 1 (but non-seq-cst operations before that seq-cst operation in thread 1 won't necessarily happen before the fence in thread 2).

The converse also works:

  • Thread 1 does atomic_thread_fence(seq_cst), then
  • thread 1 accesses a variable X using any order, then
  • thread 2 accesses the same variable X using seq_cst order.

Then any operations before the fence in thread 1 will happen before that seq-cst operation in thread 2, but not necessarily before subsequent non-seq-cst operations in thread 2.

You can have fences at both sides as well:

  • Thread 1 does atomic_thread_fence(seq_cst), then
  • thread 1 accesses a variable X using any order, then
  • thread 2 accesses the same variable X using any order, then
  • thread 2 does atomic_thread_fence(seq_cst)

Then anything in thread 1 before the fence will happen before anything in thread 2 after the fence.


Interoperation between different orders

Summarizing the above:

relaxed write release write seq-cst write
relaxed load - - -
acquire load - synchronizes with synchronizes with*
seq-cst load - synchronizes with* synchronizes with

* = The participating seq-cst operation gets a messed up seq-cst order, effectively being demoted to an acquire/release operation. This is explained above.


Does using a stronger memory order makes data transfer between threads faster?

No, it seems not.


Sequental consistency for data-race-free programs

The standard explains that if your program only uses seq_cst accesses (and mutexes), and has no data races (which cause UB), then you don't need to think about all the fancy operation reorderings. The program will behave as if only one thread executed at a time, with the threads being unpredictably interleaved.