Synchronizing caches for JIT/self-modifying code on ARM
(Disclaimer: this answer is based on reading specs and some tests, but not on previous experience.)
First of all, there is an explanation and example code for this exact case (one core writes code for another core to execute) in B2.2.5 of the Architecture Reference Manual (version G.b). The only difference from the examples you've shown is that the final `isb` needs to be executed in the thread that will execute the new code (which I guess is your "consumer"), after the cache invalidation has finished.
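For concreteness, here is roughly what that sequence looks like wrapped in C with inline assembly. This is a minimal sketch, not the manual's exact code: the function names are my own, and it assumes a 64-byte cache line (a more careful version would read the line sizes from CTR_EL0, as sketched further down).

```c
#include <stddef.h>
#include <stdint.h>

// Writer side: make the newly written code in [addr, addr+len) visible
// to instruction fetch on all cores.
static void sync_icache_writer(void *addr, size_t len)
{
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)63;   // assume 64-byte lines
    uintptr_t end   = (uintptr_t)addr + len;

    for (uintptr_t a = start; a < end; a += 64)
        __asm__ volatile("dc cvau, %0" :: "r"(a) : "memory"); // clean D-cache to PoU
    __asm__ volatile("dsb ish" ::: "memory");                 // wait for the cleans

    for (uintptr_t a = start; a < end; a += 64)
        __asm__ volatile("ic ivau, %0" :: "r"(a) : "memory"); // invalidate I-cache to PoU
    __asm__ volatile("dsb ish" ::: "memory");                 // wait for the invalidates
}

// Runner side: discard any stale instructions this core has already fetched.
static inline void sync_icache_runner(void)
{
    __asm__ volatile("isb" ::: "memory");
}
```

The two `dsb ish` waits and the placement of the `isb` in the runner are exactly the points the rest of this answer is about.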
I found it helpful to try to understand the abstract constructs like "inner shareable domain" and "point of unification" from the architecture reference in more concrete terms.
Let's think about a system with several cores. Their L1d caches are coherent, but their L1i caches need not be unified with L1d, nor coherent with each other. However, the L2 cache is unified.
The system does not have any way for L1d and L1i to talk to each other directly; the only path between them is through L2. So once we have written our new code to L1d, we have to write it back to L2 (`dc cvau`), then invalidate L1i (`ic ivau`) so that it repopulates from the new code in L2.
In this setting, PoU is the L2 cache, and that's exactly where we want to clean / invalidate to.
There's some explanation of these terms on page D4-2646. In particular:
The PoU for an Inner Shareable shareability domain is the point by which the instruction and data caches and the translation table walks of all the PEs in that Inner Shareable shareability domain are guaranteed to see the same copy of a memory location.
Here, the Inner Shareable domain is going to contain all the cores
that could run the threads of our program; indeed, it is supposed to
contain all the cores running the same kernel as us (page B2-166).
And because the memory we are `dc cvau`ing is presumably marked with the Inner Shareable attribute or better, as any reasonable OS should do for us, it cleans to the PoU of the domain, not merely the PoU of our core (PE). So that's just what we want: a cache level that all instruction cache fills from all cores would see.
The Point of Coherency is further down; it is the level that everything on the system sees, including DMA hardware and such. Most likely this is main memory, below all the caches. We don't need to get down to that level; it would just slow everything down for no benefit.
Hopefully that helps with your question 1.
Note that the cache clean and invalidate instructions run "in the background", as it were, so that you can execute a long string of them (like a loop over all affected cache lines) without waiting for them to complete one by one. `dsb ish` is used once at the end to wait for them all to finish.
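To size such a loop, the stride should be the smallest line that the maintenance instructions operate on, which CTR_EL0 reports. A sketch of reading it instead of hard-coding 64 bytes as above (this assumes userspace reads of CTR_EL0 are permitted, which mainline Linux arranges):

```c
#include <stdint.h>

// Smallest D-cache/unified and I-cache line sizes, in bytes.
static void cache_line_sizes(unsigned *dline, unsigned *iline)
{
    uint64_t ctr;
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    *dline = 4u << ((ctr >> 16) & 0xF); // DminLine: log2(words) of smallest D/unified line
    *iline = 4u << (ctr & 0xF);         // IminLine: log2(words) of smallest I-cache line
}
```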
Some commentary about `dsb`, towards your questions #2 and #3. Its main purpose is as a barrier; it makes sure that all the pending data accesses within our core (in store buffers, etc.) get flushed out to L1d cache, so that all other cores can see them. This is the kind of barrier you need for general inter-thread memory ordering. (Or for most purposes, the weaker `dmb` suffices; it enforces ordering but doesn't actually wait for everything to be flushed.) But it doesn't do anything else to the caches themselves, nor say anything about what should happen to that data beyond L1d. So by itself, it would not be anywhere near strong enough for what we need here.
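To illustrate the distinction: ordinary data publication between threads needs only an ordering barrier (a `dmb ish`, or equivalently a C11 release store), with no cache maintenance at all. A sketch with hypothetical `payload`/`ready` variables:

```c
#include <stdatomic.h>

int payload;
atomic_int ready;

// Producer: make `payload` visible before the flag. On AArch64 this compiles
// to a store-release (or a dmb ish followed by a plain store); either way it
// is purely an ordering guarantee, and no dc/ic instructions are involved.
void publish(int v)
{
    payload = v;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

// Consumer: the acquire load pairs with the release store above.
int consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    return payload;
}
```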
As far as I can tell, the "wait for cache maintenance to complete" effect is a sort of bonus feature of `dsb ish`. It seems orthogonal to the instruction's main purpose, and I'm not sure why they didn't provide a separate `wcm` instruction instead. But anyway, it is only `dsb ish` that has this bonus functionality; `dsb ishst` does not.
D4-2658: "In all cases, where the text in this section refers to a DMB
or a DSB, this means a DMB or DSB whose required access type is
both loads and stores".
I ran some tests of this on a Cortex-A72. Omitting either the `dc cvau` or the `ic ivau` usually results in the stale code being executed, even if `dsb ish` is done instead. On the other hand, doing `dc cvau; ic ivau` without any `dsb ish`, I didn't observe any failures; but that could be luck or a quirk of this implementation.
To your #4, the sequence we've been discussing (`dc cvau; dsb ish; ic ivau; dsb ish; isb`) is intended for the case when you will run the code on the same core that wrote it. But it actually shouldn't matter which thread does the `dc cvau; dsb ish; ic ivau; dsb ish` sequence, since the cache maintenance instructions cause all the cores to clean / invalidate as instructed, not just this one. See table D4-6. (But if the `dc cvau` is in a different thread than the writer, maybe the writer has to have completed a `dsb ish` beforehand, so that the written data really is in L1d and not still in the writer's store buffer? I'm not sure about that.)
The part that does matter is the `isb`. After the `ic ivau` is complete, the L1i caches are cleared of stale code, and further instruction fetches by any core will see the new code. However, the runner core might previously have fetched the old code from L1i, and still be holding it internally (decoded and in the pipeline, in a uop cache, in speculative execution, etc.). `isb` flushes these CPU-internal mechanisms, ensuring that all further instructions to be executed have actually been fetched from the L1i cache after it was invalidated.

Thus, the `isb` needs to be executed in the thread that is going to run the newly written code. Moreover, you need to make sure that it is done after all the cache maintenance has fully completed, perhaps by having the writer thread notify it via a condition variable or the like.
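A sketch of that handoff, reusing the hypothetical `sync_icache_writer` helper from the first sketch and an atomic flag in place of a condition variable:

```c
#include <stdatomic.h>
#include <stddef.h>

extern void sync_icache_writer(void *addr, size_t len); // from the earlier sketch

static void (*entry)(void);
static atomic_int code_ready;

// Writer thread: finish all cache maintenance, then signal.
void writer_thread(void *code, size_t len)
{
    /* ... store the new instructions into `code` ... */
    sync_icache_writer(code, len);        // dc cvau / dsb ish / ic ivau / dsb ish
    entry = (void (*)(void))code;
    atomic_store_explicit(&code_ready, 1, memory_order_release);
}

// Runner thread: wait for the signal, then isb before jumping.
void runner_thread(void)
{
    while (!atomic_load_explicit(&code_ready, memory_order_acquire))
        ;                                 // spin until the writer has finished
    __asm__ volatile("isb" ::: "memory"); // discard stale fetched/decoded instructions
    entry();                              // now safe to execute the new code
}
```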
I tested this too. If all the cache maintenance instructions, plus an `isb`, are done by the writer, but the runner doesn't `isb`, then once again it can execute the stale code. I was only able to reproduce this in a test where the writer patches an instruction in a loop that the runner is executing concurrently, which probably ensures that the runner had already fetched it. This is legal provided that the old and new instruction are, say, a branch and a nop respectively (see B2.2.5), which is what I did. (But it is not guaranteed to work for arbitrary old and new instructions.)
I tried some other tests that arranged for the instruction not to be executed until after it was patched, while making it the target of a branch that should have been predicted taken, in hopes that this would get it prefetched; but I couldn't get the stale version to execute in that case.
One thing I wasn't quite sure about is this. A typical modern OS may well have W^X, where no virtual page can be simultaneously writable and executable. If after writing the code you call the equivalent of `mprotect` to make the page executable, then most likely the OS is going to take care of all the cache maintenance and synchronization for you (but I guess it doesn't hurt to do it yourself too).
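A sketch of that route on Linux (my own wrapper; `__builtin___clear_cache` is the GCC/Clang built-in that performs the clean/invalidate/`isb` dance for the calling thread, so using it before `mprotect` is belt and braces rather than known to be required):

```c
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

// Copy `len` bytes of machine code and return an executable pointer to it,
// or NULL on failure. Assumes a page-multiple `len` for simplicity.
void *emit_code(const void *insns, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    memcpy(p, insns, len);

    // dc cvau / dsb ish / ic ivau / dsb ish / isb for this thread;
    // another core that may already have fetched from this region
    // still needs its own isb, as discussed above.
    __builtin___clear_cache((char *)p, (char *)p + len);

    if (mprotect(p, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```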
But another way to do it would be with an alias: you map the memory writable at one virtual address, and executable at another. The writer writes at the former address, and the runner jumps to the latter. In that case, I think you would simply `dc cvau` the writable address and `ic ivau` the executable one, but I couldn't find confirmation of that. But I tested it, and it worked no matter which alias was passed to which cache maintenance instruction, while it failed if either instruction was omitted altogether. So it appears that the cache maintenance is done by physical address underneath.
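A sketch of setting up such a pair of aliases on Linux (`memfd_create` needs glibc 2.27+; the struct and function names are mine):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct jit_region { void *rw; void *rx; size_t len; };

// Map the same physical pages twice: once writable, once executable.
static int jit_region_create(struct jit_region *r, size_t len)
{
    int fd = memfd_create("jit", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    r->rw  = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    r->rx  = mmap(NULL, len, PROT_READ | PROT_EXEC,  MAP_SHARED, fd, 0);
    r->len = len;
    close(fd);
    return (r->rw == MAP_FAILED || r->rx == MAP_FAILED) ? -1 : 0;
}

// Usage: the writer stores code through r->rw, runs the dc cvau loop over the
// r->rw addresses and the ic ivau loop over the r->rx addresses (with the two
// dsb ish as before), and the runner then does isb and jumps to r->rx.
```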