How are atomic operations implemented at a hardware level?

I get that at the assembly language level instruction set architectures provide compare and swap and similar operations. However, I don't understand how the chip is able to provide these guarantees.

As I imagine it, the execution of the instruction must

  1. Fetch a value from memory
  2. Compare the value
  3. Depending on the comparison, possibly store another value in memory

What prevents another core from accessing the memory address after the first has fetched it but before it sets the new value? Does the memory controller manage this?

edit: If the x86 implementation is secret, I'd be happy to hear how any processor family implements it.


Solution 1:

Here is an article over at software.intel.com on that sheds little light on user level locks:

User level locks involve utilizing the atomic instructions of processor to atomically update a memory space. The atomic instructions involve utilizing a lock prefix on the instruction and having the destination operand assigned to a memory address. The following instructions can run atomically with a lock prefix on current Intel processors: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. [...] On most instructions a lock prefix must be explicitly used except for the xchg instruction where the lock prefix is implied if the instruction involves a memory address.

In the days of Intel 486 processors, the lock prefix used to assert a lock on the bus along with a large hit in performance. Starting with the Intel Pentium Pro architecture, the bus lock is transformed into a cache lock. A lock will still be asserted on the bus in the most modern architectures if the lock resides in uncacheable memory or if the lock extends beyond a cache line boundary splitting cache lines. Both of these scenarios are unlikely, so most lock prefixes will be transformed into a cache lock which is much less expensive.

So what prevents another core from accessing the memory address? The cache coherency protocol already manages access rights for cache lines. So if a core has (temporal) exclusive access rights to a cache line, no other core can access that cache line. To access that cache line the other core has to obtain access rights first, and the protocol to obtain those rights involves the current owner. In effect, the cache coherency protocol prevents other cores from accessing the cache line silently.

If the locked access is not bound to a single cache line things get more complicated. There are all kinds of nasty corner cases, like locked accesses over page boundaries, etc. Intel does not tell details and they probably use all kinds of tricks to make locks faster.

Solution 2:

An example implementation of this is LL/SC where a processor will actually have extra instructions that are used to complete atomic operations. On the memory side of it is cache coherency. One of the most popular cache coherency protocols is the MESI Protocol. .

Solution 3:

Cache coherency protocol by itself is not sufficient to implement atomic operations. Lets say you want to implement an atomic increment. Below are the steps involved

  1. Load the value from the cache into a register
  2. Increment the value loaded into the register
  3. Store the updated value back to the cache

So in order to implement the above 3 instructions in an atomic fashion, we should first get exclusive access to the cacheline which contains the required value. Once we get exclusive access, we should not relinquish exclusive access on this cacheline until the "store" operation is completed. This means the CPU executing the atomic instructions should not respond to any cache coherency protocol messages for this cacheline in the mean time. While the devil is in the details of how this is implemented, at-least it gives us a mental model

Below is what linus torvalds mentioned about atomic instructions

Atomic instructions bypass the store buffer or at least they act as if they do - they likely actually use the store buffer, but they flush it and the instruction pipeline before the load and wait for it to drain after, and have a lock on the cacheline that they take as part o the load, and release as part of the store - all to make sure that the cacheline doesn't go away in between and that nobody else can see the store buffer contents while this is going on.