Atomic double floating point or SSE/AVX vector load/store on x86_64
> C++ doesn't support something like lock-free `std::atomic<double>`

Actually, C++11 `std::atomic<double>` is lock-free on typical C++ implementations, and does expose nearly everything you can do in asm for lock-free programming with `float`/`double` on x86 (e.g. load, store, and CAS are enough to implement anything: Why isn't atomic double fully implemented). Current compilers don't always compile `atomic<double>` efficiently, though.
C++11 std::atomic doesn't have an API for Intel's transactional-memory extensions (TSX) (for FP or integer). TSX could be a game-changer especially for FP / SIMD, since it would remove all overhead of bouncing data between xmm and integer registers. If the transaction doesn't abort, whatever you just did with double or vector loads/stores happens atomically.
Some non-x86 hardware supports atomic add for float/double, and C++ p0020 is a proposal to add `fetch_add` and `operator+=` / `-=` template specializations to C++'s `std::atomic<float>` / `<double>`.
Hardware with LL/SC atomics instead of x86-style memory-destination instructions, such as ARM and most other RISC CPUs, can do atomic RMW operations on `double` and `float` without a CAS, but you still have to get the data from FP to integer registers because LL/SC is usually only available for integer regs, like x86's `cmpxchg`. However, if the hardware arbitrates LL/SC pairs to avoid/reduce livelock, it would be significantly more efficient than a CAS loop in very-high-contention situations. If you've designed your algorithms so contention is rare, there's maybe only a small code-size difference between an LL/add/SC retry loop for fetch_add vs. a load + add + LL/SC CAS retry loop.
x86 naturally-aligned loads and stores are atomic up to 8 bytes, even x87 or SSE. (For example `movsd xmm0, [some_variable]` is atomic, even in 32-bit mode.) In fact, gcc uses x87 `fild`/`fistp` or SSE 8B loads/stores to implement `std::atomic<int64_t>` load and store in 32-bit code.
Ironically, compilers (gcc7.1, clang4.0, ICC17, MSVC CL19) do a bad job in 64-bit code (or 32-bit with SSE2 available), and bounce data through integer registers instead of just doing `movsd` loads/stores directly to/from xmm regs (see it on Godbolt):
```cpp
#include <atomic>
std::atomic<double> ad;

void store(double x){
    ad.store(x, std::memory_order_release);
}
//    gcc7.1 -O3 -mtune=intel:
//    movq    rax, xmm0               # ALU xmm->integer
//    mov     QWORD PTR ad[rip], rax
//    ret

double load(){
    return ad.load(std::memory_order_acquire);
}
//    mov     rax, QWORD PTR ad[rip]
//    movq    xmm0, rax
//    ret
```
Without `-mtune=intel`, gcc likes to store/reload for integer->xmm. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 and related bugs I reported. This is a poor choice even for `-mtune=generic`. AMD has high latency for `movq` between integer and vector regs, but it also has high latency for a store/reload. With the default `-mtune=generic`, `load()` compiles to:
```asm
mov     rax, QWORD PTR ad[rip]
mov     QWORD PTR [rsp-8], rax     # store/reload integer->xmm
movsd   xmm0, QWORD PTR [rsp-8]
ret
```
Moving data between xmm and integer register brings us to the next topic:
Atomic read-modify-write (like `fetch_add`) is another story: there is direct support for integers with stuff like `lock xadd [mem], eax` (see Can num++ be atomic for 'int num'? for more details). For other things, like `atomic<struct>` or `atomic<double>`, the only option on x86 is a retry loop with `cmpxchg` (or TSX).
Atomic compare-and-swap (CAS) is usable as a lock-free building-block for any atomic RMW operation, up to the max hardware-supported CAS width. On x86-64, that's 16 bytes with `cmpxchg16b` (not available on some first-gen AMD K8, so for gcc you have to use `-mcx16` or `-march=whatever` to enable it).
gcc makes the best asm possible for `exchange()`:
```cpp
double exchange(double x) {
    return ad.exchange(x); // seq_cst
}
```

```asm
movq    rax, xmm0
xchg    rax, QWORD PTR ad[rip]
movq    xmm0, rax
ret
# in 32-bit code, compiles to a cmpxchg8b retry loop
```
```cpp
void atomic_add1() {
    // ad += 1.0;           // not supported
    // ad.fetch_or(-0.0);   // not supported
    // have to implement the CAS loop ourselves:
    double desired, expected = ad.load(std::memory_order_relaxed);
    do {
        desired = expected + 1.0;
    } while( !ad.compare_exchange_weak(expected, desired) );  // seq_cst
}
```
```asm
mov     rax, QWORD PTR ad[rip]
movsd   xmm1, QWORD PTR .LC0[rip]
mov     QWORD PTR [rsp-8], rax     # useless store
movq    xmm0, rax
mov     rax, QWORD PTR [rsp-8]     # and reload
.L8:
addsd   xmm0, xmm1
movq    rdx, xmm0
lock cmpxchg    QWORD PTR ad[rip], rdx
je      .L5
mov     QWORD PTR [rsp-8], rax
movsd   xmm0, QWORD PTR [rsp-8]
jmp     .L8
.L5:
ret
```
`compare_exchange` always does a bitwise comparison, so you don't need to worry about the fact that negative zero (`-0.0`) compares equal to `+0.0` in IEEE semantics, or that NaN is unordered. This could be an issue if you try to check that `desired == expected` and skip the CAS operation, though. For new enough compilers, `memcmp(&expected, &desired, sizeof(double)) == 0` might be a good way to express a bitwise comparison of FP values in C++. Just make sure you avoid false positives; false negatives will just lead to an unneeded CAS.
Hardware-arbitrated `lock or [mem], 1` is definitely better than having multiple threads spinning on `lock cmpxchg` retry loops. Every time a core gains access to the cache line but fails its `cmpxchg`, that's wasted throughput compared to integer memory-destination operations, which always succeed once they get their hands on a cache line.
Some special cases for IEEE floats can be implemented with integer operations. e.g. absolute value of an `atomic<double>` could be done with `lock and [mem], rax` (where RAX has all bits except the sign bit set). Or force a float/double to be negative by ORing a 1 into the sign bit. Or toggle its sign with XOR. You could even atomically increase its magnitude by 1 ulp with `lock add [mem], 1`. (But only if you can be sure it wasn't infinity to start with... `nextafter()` is an interesting function, thanks to the very cool design of IEEE754 with biased exponents that makes carry from mantissa into exponent actually work.)
There's probably no way to express this in C++ that will let compilers do it for you on targets that use IEEE FP. So if you want it, you might have to do it yourself with type-punning to `atomic<uint64_t>` or something, and check that FP endianness matches integer endianness, etc. etc. (Or just do it only for x86. Most other targets have LL/SC instead of memory-destination locked operations anyway.)
> can't yet support something like atomic AVX/SSE vector because it's CPU-dependent
Correct. There's no way to detect when a 128b or 256b store or load is atomic all the way through the cache-coherency system. (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490). Even a system with atomic transfers between L1D and execution units can get tearing between 8B chunks when transferring cache-lines between caches over a narrow protocol. Real example: a multi-socket Opteron K10 with HyperTransport interconnects appears to have atomic 16B loads/stores within a single socket, but threads on different sockets can observe tearing.
But if you have a shared array of aligned `double`s, you should be able to use vector loads/stores on them without risk of "tearing" inside any given `double`.
Per-element atomicity of vector load/store and gather/scatter?
I think it's safe to assume that an aligned 32B load/store is done with non-overlapping 8B or wider loads/stores, although Intel doesn't guarantee that. For unaligned ops, it's probably not safe to assume anything.
If you need a 16B atomic load, your only option is `lock cmpxchg16b`, with `desired = expected`. If it succeeds, it replaces the existing value with itself. If it fails, you get the old contents. (Corner-case: this "load" faults on read-only memory, so be careful what pointers you pass to a function that does this.) Also, the performance is of course horrible compared to actual read-only loads, which can leave the cache line in Shared state and aren't full memory barriers.
16B atomic store and RMW can both use `lock cmpxchg16b` the obvious way. This makes pure stores much more expensive than regular vector stores, especially if the `cmpxchg16b` has to retry multiple times, but atomic RMW is already expensive.
The extra instructions to move vector data to/from integer regs are not free, but also not expensive compared to `lock cmpxchg16b`:
```asm
# xmm0 -> rdx:rax, using SSE4
movq    rax, xmm0
pextrq  rdx, xmm0, 1

# rdx:rax -> xmm0, again using SSE4
movq    xmm0, rax
pinsrq  xmm0, rdx, 1
```
In C++11 terms:
`atomic<__m128d>` would be slow even for read-only or write-only operations (using `cmpxchg16b`), even if implemented optimally. `atomic<__m256d>` can't even be lock-free.
`alignas(64) atomic<double> shared_buffer[1024];` would in theory still allow auto-vectorization for code that reads or writes it, only needing a `movq rax, xmm0` and then `xchg` or `cmpxchg` for atomic RMW on a `double`. (In 32-bit mode, `cmpxchg8b` would work.) You would almost certainly not get good asm from a compiler for this, though!
You can atomically update a 16B object, but atomically read the 8B halves separately. (I think this is safe with respect to memory-ordering on x86: see my reasoning at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835).
However, compilers don't provide any clean way to express this. I hacked up a union type-punning thing that works for gcc/clang: How can I implement ABA counter with c++11 CAS?. But gcc7 and later won't inline `cmpxchg16b`, because they're re-considering whether 16B objects should really present themselves as "lock-free". (https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html)
On x86-64, atomic operations are implemented via the LOCK prefix. The Intel Software Developer's Manual (Volume 2, Instruction Set Reference) states:

> The LOCK prefix can be prepended only to the following instructions and only to those forms of the instructions where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG.
None of those instructions operates on floating-point registers (like the XMM, YMM or x87 FPU registers). This means that there is no natural way to implement atomic float/double operations on x86-64. While most of those operations could be implemented by moving the bit representation of the floating-point value through a general-purpose (i.e. integer) register, doing so would severely degrade performance, so the compiler authors opted not to implement it.
As pointed out by Peter Cordes in the comments, the LOCK prefix is not required for loads and stores, as those are always atomic on x86-64. However, the Intel SDM (Volume 3, System Programming Guide) only guarantees that the following loads/stores are atomic:
- Instructions that read or write a single byte.
- Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
- Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
- Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
In particular, atomicity of loads/stores from/to the larger XMM and YMM vector registers is not guaranteed.