How does a compare and swap loop achieve atomicity?

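If another thread changes the target value before our compare-and-swap runs, the CAS fails, and we need to read oldValue again before retrying; otherwise the comparison can never succeed and we will spin forever.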
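
A minimal sketch of that retry loop, assuming the goal is an atomic fetch_multiply on a std::atomic<uint32_t>. The function name comes from the question; the choice of compare_exchange_weak is an assumption, not the only option. On failure, compare_exchange_weak itself writes the current value back into oldValue, which is the re-read that keeps the loop from spinning forever:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of a CAS-based fetch_multiply; returns the value seen just before the update.
std::uint32_t fetch_multiply(std::atomic<std::uint32_t>& shared, std::uint32_t multiplier) {
    std::uint32_t oldValue = shared.load();
    // If shared no longer equals oldValue, the CAS fails and stores the
    // current value of shared into oldValue, so the next attempt recomputes
    // oldValue * multiplier from a fresh value.
    while (!shared.compare_exchange_weak(oldValue, oldValue * multiplier)) {
        // Nothing to do here: oldValue was refreshed by the failed CAS.
    }
    return oldValue;
}
```

compare_exchange_weak may fail spuriously, which is harmless inside a loop; compare_exchange_strong would work here as well.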

However, the point of the CAS construction is that you can never observe an intermediate value in the shared location. A torn read is impossible: shared.load() is itself atomic, so it always returns a complete value that some thread stored. This is implemented in hardware.

"What happens if we just write newValue to memory?" Then you don't have atomic access: the update is no longer a single atomic read-modify-write, and a write that another thread made after you read oldValue can be lost. Always follow the pattern.

"non-aligned value": if shared is not aligned, you have already introduced undefined behavior into your code, even before std::atomic comes into it. Misaligned pointers cannot be safely dereferenced. With a plain pointer you have merely taken a dependency on an architecture that tolerates unaligned access, but this is a std::atomic. If it is not aligned, you can fault even on x86.
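
As a rough illustration of the alignment requirement, here is a sketch of a check one could run before treating raw storage as a std::atomic<T>. The helper name is_aligned_for_atomic is hypothetical, and the pointer-to-integer conversion assumes a flat address space, which holds on mainstream platforms but is implementation-defined:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: true if p meets the alignment that std::atomic<T> requires.
// Assumes converting a pointer to std::uintptr_t yields its numeric address.
template <typename T>
bool is_aligned_for_atomic(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(std::atomic<T>) == 0;
}
```

An object declared as std::atomic<uint32_t> is always suitably aligned by the compiler; a check like this only matters when the address comes from a cast or a packed buffer.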


Why would this not work?

uint32_t fetch_multiply(std::atomic<uint32_t>& shared, uint32_t multiplier){
    uint32_t oldValue = shared.load();
    uint32_t newValue = oldValue * multiplier;
    shared.store(newValue);
    return oldValue;
}

Because between load and store, another thread may modify the value of shared.

Consider the problem:

std::atomic<uint32_t> shared{1};
std::thread t1{ fetch_multiply, std::ref(shared), 2 };
std::thread t2{ fetch_multiply, std::ref(shared), 2 };
t1.join();
t2.join();
std::cout << shared;

With the above implementation, this program can print 2, whereas the correct output (assuming fetch_multiply is meant to be atomic) is 4. The problem occurs when both threads load the initial value 1 before either of them stores. Each then stores its own local result, 2, and one of the two multiplications is lost.
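
To make the interleaving concrete, here is a sketch that replays it deterministically in a single thread. The function names replay_lost_update and replay_with_cas are hypothetical; the second one shows how a compare-and-swap would detect the conflicting write instead of overwriting it:

```cpp
#include <atomic>
#include <cstdint>

// Replays the bad interleaving: both "threads" load before either stores.
std::uint32_t replay_lost_update() {
    std::atomic<std::uint32_t> shared{1};
    std::uint32_t old1 = shared.load();   // t1 reads 1
    std::uint32_t old2 = shared.load();   // t2 reads 1
    shared.store(old1 * 2);               // t1 writes 2
    shared.store(old2 * 2);               // t2 writes 2, overwriting t1's update
    return shared.load();
}

// Same interleaving, but each write is a CAS that succeeds only if shared
// still holds the value that thread read.
std::uint32_t replay_with_cas() {
    std::atomic<std::uint32_t> shared{1};
    std::uint32_t old1 = shared.load();   // t1 reads 1
    std::uint32_t old2 = shared.load();   // t2 reads 1
    shared.compare_exchange_strong(old1, old1 * 2);  // t1: 1 -> 2 succeeds
    // t2 expects 1 but finds 2: the first CAS fails and sets old2 to 2,
    // then the retry changes 2 -> 4.
    while (!shared.compare_exchange_strong(old2, old2 * 2)) {}
    return shared.load();
}
```

Under these assumptions replay_lost_update() returns 2 and replay_with_cas() returns 4.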