How do I Understand Read Memory Barriers and Volatile
Some languages provide a volatile modifier that is described as performing a "read memory barrier" prior to reading the memory that backs a variable.
A read memory barrier is commonly described as a way to ensure that the CPU has performed the reads requested before the barrier before it performs a read requested after the barrier. However, using this definition, it would seem that a stale value could still be read. In other words, performing reads in a certain order does not seem to mean that the main memory or other CPUs must be consulted to ensure that subsequent values read actually reflect the latest in the system at the time of the read barrier or written subsequently after the read barrier.
So, does volatile really guarantee that an up-to-date value is read or just (gasp!) that the values that are read are at least as up-to-date as the reads before the barrier? Or some other interpretation? What are the practical implications of this answer?
Solution 1:
There are read barriers and write barriers; acquire barriers and release barriers. And more (io vs memory, etc).
The barriers are not there to control "latest" value or "freshness" of the values. They are there to control the relative ordering of memory accesses.
Write barriers control the order of writes. Because writes to memory are slow (compared to the speed of the CPU), there is usually a write-request queue where writes are posted before they 'really happen'. Although they are queued in order, the writes may be reordered while inside the queue (so maybe 'queue' isn't the best name...), unless you use write barriers to prevent the reordering.
Read barriers control the order of reads. Reads may happen out of order for two reasons: speculative execution (the CPU looks ahead and loads from memory early), and the existence of the write buffer (the CPU will read a value from the write buffer instead of memory if it is there - ie if the CPU thinks it just wrote X = 5, then why read it back from memory, when it can see the 5 still waiting in the write buffer).
This is true regardless of what the compiler tries to do with respect to the order of the generated code. ie 'volatile' in C++ won't help here, because it only tells the compiler to output code to re-read the value from "memory"; it does NOT tell the CPU how/where to read it from (ie "memory" is many things at the CPU level).
So read/write barriers put up blocks to prevent reordering in the read/write queues (the read isn't usually so much of a queue, but the reordering effects are the same).
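To make the write-buffer effect concrete, here is a minimal sketch of the classic "store buffer" test in C++11 atomics. The function name count_both_zero is made up for this illustration. Each thread writes its own variable and then reads the other one. If a write can still be sitting in the write buffer while the later read executes, both threads can read 0. With std::memory_order_seq_cst (which acts as a full barrier here) the C++ memory model forbids that outcome; with weaker orderings it is allowed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-buffer test `iterations` times and
// returns how many runs ended with BOTH threads reading 0.
int count_both_zero(int iterations) {
    assert(iterations > 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);   // write x ...
            r1 = y.load(std::memory_order_seq_cst);  // ... then read y
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);   // write y ...
            r2 = x.load(std::memory_order_seq_cst);  // ... then read x
        });
        t1.join();
        t2.join();
        // Under seq_cst at least one thread must observe the other's write.
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Swapping the orderings for std::memory_order_relaxed (or even release stores and acquire loads) makes the both-zero result legal, because neither of those stops a later read from passing an earlier write.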
What kinds of blocks? - acquire and/or release blocks.
Acquire - eg read-acquire(x) will add the read of x into the read-queue and flush the queue (not really flush the queue, but add a marker saying don't reorder anything before this read, which is as if the queue was flushed). So later (in code order) reads can be reordered, but not before the read of x.
Release - eg write-release(x, 5) will flush (or marker) the queue first, then add the write-request to the write-queue. So earlier writes won't become reordered to happen after x = 5, but note that later writes can be reordered before x = 5.
Note that I paired the read with acquire and write with release because this is typical, but different combinations are possible.
Acquire and Release are considered 'half-barriers' or 'half-fences' because they only stop the reordering from going one way.
A full barrier (or full fence) applies both an acquire and a release - ie no reordering.
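As a rough illustration of these three kinds of fence, C++11 exposes them as std::atomic_thread_fence with std::memory_order_acquire, std::memory_order_release, or std::memory_order_seq_cst (the full fence). The sketch below uses standalone release and acquire fences around relaxed accesses to a flag. The names fence_payload, fence_flag and run_fence_demo are invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int fence_payload = 0;                 // plain, non-atomic data
std::atomic<bool> fence_flag{false};   // flag that publishes the data

void producer() {
    fence_payload = 42;
    // Release fence: the write above cannot be reordered after the flag store below.
    std::atomic_thread_fence(std::memory_order_release);
    fence_flag.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!fence_flag.load(std::memory_order_relaxed)) {
        std::this_thread::yield();     // wait until the flag is seen
    }
    // Acquire fence: the read below cannot be reordered before the flag load above.
    std::atomic_thread_fence(std::memory_order_acquire);
    return fence_payload;
}

// Hypothetical driver: runs one producer and one consumer, returns what the consumer saw.
int run_fence_demo() {
    fence_payload = 0;
    fence_flag.store(false);
    int seen = -1;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

Replacing both fences with std::atomic_thread_fence(std::memory_order_seq_cst) would give the full-fence version, which is stronger than this example needs.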
Typically for lock-free programming, or C# or Java 'volatile', what you want/need is read-acquire and write-release.
For example:

    void threadA()
    {
        foo->x = 10;
        foo->y = 11;
        foo->z = 12;
        write_release(foo->ready, true);
        bar = 13;
    }

    void threadB()
    {
        w = some_global;
        ready = read_acquire(foo->ready);
        if (ready)
        {
            q = w * foo->x * foo->y * foo->z;
        }
        else
            calculate_pi();
    }
So, first of all, this is a bad way to program threads. Locks would be safer. But just to illustrate barriers...
After threadA() is done writing foo, it needs to write foo->ready LAST, really last, else other threads might see foo->ready early and get the wrong values of x/y/z. So we use a write_release on foo->ready, which, as mentioned above, effectively 'flushes' the write queue (ensuring x,y,z are committed) then adds the ready=true request to the queue. And then adds the bar=13 request. Note that since we just used a release barrier (not a full) bar=13 may get written before ready. But we don't care! ie we are assuming bar is not changing shared data.
Now threadB() needs to know that when we say 'ready' we really mean ready. So we do a read_acquire(foo->ready). This read is added to the read queue, THEN the queue is flushed. Note that w = some_global may also still be in the queue. So foo->ready may be read before some_global. But again, we don't care, as it is not part of the important data that we are being so careful about.
What we do care about is foo->x/y/z. So they are added to the read queue after the acquire flush/marker, guaranteeing that they are read only after reading foo->ready.
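For what it's worth, the pseudocode above maps fairly directly onto C++11 atomics. This is only a sketch under some assumptions of mine: foo is a global object rather than a pointer, some_global is given the value 2, threadB returns -1 instead of calling calculate_pi(), and run_publish_demo is a made-up driver. write_release becomes a store with std::memory_order_release and read_acquire becomes a load with std::memory_order_acquire.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Foo {
    int x = 0, y = 0, z = 0;          // plain data being published
    std::atomic<bool> ready{false};   // the flag that guards it
};

Foo foo;
int some_global = 2;   // assumed value, only read by threadB
int bar = 0;           // only touched by threadA

void threadA() {
    foo.x = 10;
    foo.y = 11;
    foo.z = 12;
    // write-release: x, y, z cannot be reordered after this store.
    foo.ready.store(true, std::memory_order_release);
    bar = 13;          // may become visible before `ready`; we don't care
}

// Returns q when the data is ready, or -1 otherwise
// (a stand-in for calculate_pi() in the pseudocode).
int threadB() {
    int w = some_global;
    // read-acquire: the reads of x, y, z cannot be reordered before this load.
    bool ready = foo.ready.load(std::memory_order_acquire);
    if (ready) {
        return w * foo.x * foo.y * foo.z;
    }
    return -1;
}

// Hypothetical driver: polls threadB until threadA has published.
int run_publish_demo() {
    std::thread a(threadA);
    int q;
    while ((q = threadB()) == -1) {
        std::this_thread::yield();
    }
    a.join();
    return q;
}
```

Once threadB sees ready == true it is guaranteed to see 10, 11 and 12, so q comes out as 2 * 10 * 11 * 12 = 2640.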
Note also that these are typically the exact same barriers used for locking and unlocking a mutex/CriticalSection/etc. (ie acquire on lock(), release on unlock() ).
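To show the lock()/unlock() pairing, here is a toy spinlock built on std::atomic_flag. It is a sketch for illustration only (a real program should use std::mutex); the SpinLock class and count_with_spinlock function are names I made up. The acquire on lock() keeps the critical section from moving above it, and the release on unlock() keeps it from moving below.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class SpinLock {
    std::atomic_flag locked_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Acquire: nothing in the critical section can be reordered before this.
        while (locked_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release: nothing in the critical section can be reordered after this.
        locked_.clear(std::memory_order_release);
    }
};

// Hypothetical test harness: several threads increment a plain int under the lock.
int count_with_spinlock(int threads, int increments) {
    assert(threads > 0 && increments >= 0);
    SpinLock lock;
    int counter = 0;   // non-atomic, protected only by the spinlock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter;
}
```

With 4 threads doing 10000 increments each, the counter ends at 40000 because every increment happens entirely between an acquire and the matching release.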
So, I'm pretty sure this (ie acquire/release) is exactly what MS docs say happens for read/writes of 'volatile' variables in C# (and optionally for MS C++, but this is non-standard). See http://msdn.microsoft.com/en-us/library/aa645755(VS.71).aspx including "A volatile read has 'acquire semantics'; that is, it is guaranteed to occur prior to any references to memory that occur after it..."
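I think Java is much the same, although I'm not as familiar: a volatile read acts as an acquire and a volatile write as a release, and Java additionally keeps volatile accesses sequentially consistent with each other. You just don't typically need more guarantees than read-acquire/write-release.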
In your question you were on the right track when thinking that it is really all about relative order - you just had the orderings backwards (ie "the values that are read are at least as up-to-date as the reads before the barrier?" - no, reads before the barrier are unimportant, it's reads AFTER the barrier that are guaranteed to be AFTER, vice versa for writes).
And please note, as mentioned, reordering happens on both reads and writes, so only using a barrier on one thread and not the other WILL NOT WORK. ie a write-release isn't enough without the read-acquire. ie even if you write it in the right order, it could be read in the wrong order if you didn't use the read barriers to go with the write barriers.
And lastly, note that lock-free programming and CPU memory architectures can actually be much more complicated than that, but sticking with acquire/release will get you pretty far.
Solution 2:
volatile in languages such as C and C++ does not imply a real CPU read memory barrier; it is an order to the compiler not to optimize the reads by caching the value in a register. This means that the reading process/thread will get the value "eventually". A common technique is to declare a boolean volatile flag to be set in a signal handler and checked in the main program loop.
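A minimal sketch of that signal-handler technique in C++ might look like the following. The names stop_requested, on_signal and run_until_signalled are invented for this example, and the signal is raised from inside the loop with std::raise purely to simulate an external SIGINT. The flag uses volatile std::sig_atomic_t, which is the type the standard allows a handler to write.

```cpp
#include <cassert>
#include <csignal>

// Flag written by the handler and polled by the main loop.
volatile std::sig_atomic_t stop_requested = 0;

extern "C" void on_signal(int) {
    stop_requested = 1;   // the only thing the handler does
}

// Hypothetical main loop: runs until the flag is set or max_iterations is reached,
// and returns how many iterations were executed.
int run_until_signalled(int max_iterations) {
    assert(max_iterations > 0);
    std::signal(SIGINT, on_signal);
    int iterations = 0;
    while (!stop_requested && iterations < max_iterations) {
        ++iterations;
        if (iterations == 3) {
            std::raise(SIGINT);   // simulate the signal arriving mid-loop
        }
    }
    return iterations;
}
```

Here volatile only stops the compiler from keeping stop_requested in a register across loop iterations. That is enough for a handler running on the same thread, but it is not a substitute for atomics or barriers when the flag is shared between threads.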
In contrast, CPU memory barriers are directly provided either via CPU instructions or implied with certain assembler mnemonics (such as the lock prefix in x86). They are used, for example, when talking to hardware devices where the order of reads and writes to memory-mapped IO registers is important, or when synchronizing memory access in a multi-processing environment.
To answer your question - no, a memory barrier does not guarantee the "latest" value, but it does guarantee the order of memory access operations. This is crucial, for example, in lock-free programming.
Here is one of the primers on CPU memory barriers.