How to flush the CPU cache for a region of address space in Linux?

Solution 1:

Check this page for the list of available flushing methods in the Linux kernel: https://www.kernel.org/doc/Documentation/cachetlb.txt

Cache and TLB Flushing Under Linux. David S. Miller

There is a set of range-flushing functions:

2) flush_cache_range(vma, start, end);
   change_range_of_page_tables(mm, start, end);
   flush_tlb_range(vma, start, end);

3) void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)

Here we are flushing a specific range of (user) virtual
addresses from the cache.  After running, there will be no
entries in the cache for 'vma->vm_mm' for virtual addresses in
the range 'start' to 'end-1'.

You can also check implementation of the function - http://lxr.free-electrons.com/ident?a=sh;i=flush_cache_range

For example, in arm - http://lxr.free-electrons.com/source/arch/arm/mm/flush.c?a=sh&v=3.13#L67

    void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
    {
            if (cache_is_vivt()) {
                    vivt_flush_cache_range(vma, start, end);
                    return;
            }

            if (cache_is_vipt_aliasing()) {
                    asm(    "mcr    p15, 0, %0, c7, c14, 0\n"
                    "       mcr     p15, 0, %0, c7, c10, 4"
                        :
                        : "r" (0)
                        : "cc");
            }

            if (vma->vm_flags & VM_EXEC)
                    __flush_icache_all();
    }

Solution 2:

This is for ARM.

GCC provides __builtin___clear_cache, which should end up invoking the cacheflush system call. However, it may have its caveats.

The important thing here is that Linux provides a system call (ARM-specific) to flush caches. You can check Android/Bionic's cacheflush for how to use this system call. However, I'm not sure what guarantees Linux gives when you call it, or how it is implemented internally.

This blog post Caches and Self-Modifying Code may help further.

Solution 3:

In the x86 version of Linux you can also find a function void clflush_cache_range(void *vaddr, unsigned int size), which is used to flush a cache range. This function relies on the CLFLUSH or CLFLUSHOPT instructions. I would recommend checking that your processor actually supports them, because in theory they are optional.

CLFLUSHOPT is weakly ordered. CLFLUSH was originally specified as ordered only by MFENCE, but all CPUs that implement it do so with strong ordering with respect to writes and other CLFLUSH instructions. Intel decided to add a new instruction (CLFLUSHOPT) instead of changing the behaviour of CLFLUSH, and to update the manual to guarantee that future CPUs will implement CLFLUSH as strongly ordered. For this use case, you should MFENCE after using either, to make sure the flushing is done before any loads from your benchmark (not just stores).

Actually, x86 provides one more instruction that could be useful: CLWB. CLWB writes data back from cache to memory without (necessarily) evicting it, leaving it clean but still cached. Note, though, that clwb on SKX does evict, just like clflushopt.

Note also that these instructions are cache-coherent: their execution affects all caches of all processors (processor cores) in the system.

All three of these instructions are available in user mode. Thus, you can use assembler (or intrinsics like _mm_clflushopt) and create your own void clflush_cache_range(void *vaddr, unsigned int size) in your user-space application (but do not forget to check their availability before actual use).


If I understand correctly, it is much more difficult to reason about ARM in this regard. The family of ARM processors is much less consistent than the family of IA-32 processors. You can have one ARM with full-featured caches, and another completely without caches. Furthermore, many manufacturers can use customized MMUs and MPUs. So it is better to reason about a particular ARM processor model.

Unfortunately, it looks almost impossible to make any reasonable estimate of the time required to flush some data. This time is affected by too many factors, including the number of cache lines flushed, out-of-order execution of instructions, the state of the TLB (because the instruction takes a virtual address as an argument, but caches use physical addresses), the number of CPUs in the system, the actual load in terms of memory operations on the other processors, how many lines from the range are actually cached, and finally the performance of the CPU, memory, memory controller, and memory bus. As a result, execution time will vary significantly across environments and loads. The only reasonable way is to measure the flush time on a system, and under a load, similar to the target.


And a final note: do not confuse memory caches and the TLB. Both are caches, but they are organized in different ways and serve different purposes. The TLB caches only the most recently used translations between virtual and physical addresses, not the data those addresses point to.

And the TLB is not coherent, in contrast to memory caches. Be careful: flushing TLB entries does not flush the corresponding data from the memory cache.

Solution 4:

Several people have expressed misgivings about clear_cache. Below is a manual process to evict the cache, which is inefficient but possible from any user-space task (in any OS).


PLD/LDR

It is possible to evict caches by mis-using the pld instruction. pld will fetch a cache line. In order to evict a specific memory address, you need to know the structure of your caches. For instance, a Cortex-A9 has a 4-way data cache with 8 words per line. The cache size is configurable to 16 KB, 32 KB, or 64 KB, i.e. 512, 1024, or 2048 lines. The set index always comes from the lower address bits (so sequential addresses don't conflict); you land in the same set, and thus fill a new way, by accessing memory at offset + cache size / ways. So that is every 4 KB, 8 KB, or 16 KB for a Cortex-A9.

Using ldr from C or C++ is simple: you just need to size an array appropriately and access it.

See: Programmatically get the cache line size?

For example, if you want to evict 0x12345, the line starts at 0x12340, and for a 16 KB round-robin cache a pld on 0x13340, 0x14340, 0x15340, and 0x16340 would evict any value from that set. The same principle can be applied to evict the L2 (which is often unified). Iterating over the whole cache size will evict the entire cache; you need to allocate unused memory the size of the cache to do so, which may be quite large for the L2. pld doesn't have to be used; any full memory access (ldr/ldm) works. For multiple CPUs (threaded cache eviction) you need to run the eviction on each CPU. Usually the L2 is shared by all CPUs, so it only needs to be run once.

NB: This method only works with LRU (least recently used) or round-robin caches. For pseudo-random replacement, you will have to read/write more data to ensure eviction, with the exact amount being highly CPU-specific. The ARM random replacement is based on an LFSR of 8 to 33 bits, depending on the CPU. For some CPUs it defaults to round-robin, and others default to pseudo-random mode. For a few CPUs, a Linux kernel configuration selects the mode (ref: CPU_CACHE_ROUND_ROBIN). However, for newer CPUs, Linux uses the default from the boot loader and/or silicon. In other words, it is worth the effort to get the clear_cache OS calls to work (see other answers) if you need to be completely generic, or you will have to spend a lot of time clearing the caches reliably.

Context switch

It is possible to circumvent the cache by fooling the OS via the MMU, on some ARM CPUs and particular OSes. On a *nix system, you need multiple processes: you switch between processes, and the OS should flush the caches. Typically this only works on older ARM CPUs (ones not supporting pld), where the OS must flush the caches to ensure no information leaks between processes. It is not portable and requires that you understand a lot about your OS.

Most explicit cache-flushing registers are restricted to system mode to prevent denial-of-service attacks between processes. Some exploits try to gain information by seeing which lines have been evicted by another process (this can reveal which addresses that process is accessing). These attacks are more difficult with pseudo-random replacement.

Solution 5:

On x86, to flush the entire cache hierarchy you can use

native_wbinvd()

which is defined in arch/x86/include/asm/special_insns.h. If you look at its implementation, it simply executes the WBINVD instruction:

static inline void native_wbinvd(void)
{
        asm volatile("wbinvd": : :"memory");
}

Note that you need to be in privileged mode to execute the WBINVD x86 instruction. This is in contrast to the CLFLUSH x86 instruction, which flushes a single cache line and doesn't require the caller to be in privileged mode.

If you look at the x86 Linux kernel code, you will see only a handful of uses of this instruction (6 places as I write this). This is because it slows down every entity running on the system. Imagine running it on a server with a 100 MB LLC: this instruction would mean moving the entire 100+ MB from cache to RAM. Further, it was brought to my notice that this instruction is not interruptible, so its use could significantly impact the determinism of a real-time system, for example.

(Though the original question asks about how to clear a specific address range, I thought info on clearing the entire cache hierarchy would also be useful for some readers)