What does the C++ compiler do to ensure that different but adjacent memory locations are safe to be used on different threads?

Lets say I have a struct:

struct Foo {
  char a;  // read and written to by thread 1 only
  char b;  // read and written to by thread 2 only
};

Now from what I understand, the C++ standard guarantees the safety of the above when two threads operate on the two different memory locations.

I would think though that, since char a and char b, fall within the same cache line, that the compiler has to do extra syncing.

What exactly happens here?


Solution 1:

This is hardware-dependent. On hardware I am familiar with, C++ doesn't have to do anything special, because from hardware perspective accessing different bytes even on a cached line is handled 'transparently'. From the hardware, this situation is not really different from

char a[2];
// or
char a, b;

In the cases above, we are talking about two adjacent objects, which are guaranteed to be independently accessible.

However, I've put 'transparently' in quotes for a reason. When you really have a case like that, you could be suffering (performance-wise) from a 'false sharing' - which happens when two (or more) threads access adjacent memory simultaneously and it ends up being cached in several CPU's caches. This leads to constant cache invalidation. In the real life, care should be taken to prevent this from happening when possible.

Solution 2:

As others have explained, nothing in particular on common hardware. However, there is a catch: The compiler must refrain from performing certain optimizations, unless it can prove that other threads don't access the memory locations in question, e.g.:

std::array<std::uint8_t, 8u> c;

void f()
{
    c[0] ^= 0xfa;
    c[3] ^= 0x10;
    c[6] ^= 0x8b;
    c[7] ^= 0x92;
}

Here, in a single-threaded memory model, the compiler could emit code like the following (pseudo-assembly; assumes little-endian hardware):

load r0, *(std::uint64_t *) &c[0]
xor r0, 0x928b0000100000fa
store r0, *(std::uint64_t *) &c[0]

This is likely to be faster on common hardware than xor'ing the individual bytes. However, it reads and writes the unaffected (and unmentioned) elements of c at indices 1, 2, 4 and 5. If other threads are writing to these memory locations concurrently, these changes could be overwritten.

For this reason, optimizations like these are often unusable in a multi-threaded memory model. As long as the compiler performs only loads and stores of matching length, or merges accesses only when there is no gap (e.g. the accesses to c[6] and c[7] can still be merged), the hardware commonly already provides the necessary guarantees for correct execution.

(That said, there are/have been some architectures with weak and counterintuitive memory order guarantees, e.g. DEC Alpha does not track pointers as a data dependency in the way that other architectures do, so it is necessary to introduce an explicit memory barrier in some cases, in low level code. There is a somewhat well-known little rant by Linus Torvalds on this issue. However, a conforming C++ implementation is expected to shield you from such issues.)