Intel's CLWB instruction invalidating cache lines
I am trying to find a configuration or memory access pattern for Intel's clwb instruction that would not invalidate the cache line. I am testing on an Intel Xeon Gold 5218 processor with NVDIMMs. The Linux version is 5.4.0-3-amd64. I tried using Device-DAX mode and directly mapping the char device into the address space. I also tried adding the non-volatile memory as a new NUMA node and using the numactl --membind command to bind memory to it. In both cases, when I use clwb on a cached address, the line is evicted. I am observing the eviction with PAPI hardware counters, with the prefetchers disabled.
This is the simple loop that I am testing. Both array and tmp are declared volatile, so the loads are really executed.
for (int i = 0; i < arr_size; i++) {
    tmp = array[i];
    _mm_clwb(&array[i]);
    _mm_mfence();
    tmp = array[i];
}
Both reads result in cache misses.
Has anyone else tried to determine whether some configuration or memory access pattern would leave the cache line in the cache?
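As a cross-check on the PAPI counters, the reload after clwb can also be timed directly. The following is only a minimal sketch, assuming GCC or Clang on x86-64. The helper names timed_load and reload_latency_after_clwb are hypothetical and not part of the original test, and the tick counts are only meaningful relative to a known hit and a known miss measured on the same machine.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Time a single load of *p in TSC ticks.
 * The lfence instructions keep the load from drifting out of the timed
 * window, and rdtscp waits for the load to complete before reading the TSC. */
static inline uint64_t timed_load(const volatile int *p)
{
    unsigned aux;
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();
    (void)*p;                      /* the load being timed (volatile read) */
    uint64_t end = __rdtscp(&aux);
    _mm_lfence();
    return end - start;
}

/* Touch the line, write it back with clwb, then time the reload.
 * The target attribute lets this compile without passing -mclwb;
 * it must only be called on a CPU that actually supports clwb. */
__attribute__((target("clwb")))
uint64_t reload_latency_after_clwb(volatile int *p)
{
    (void)*p;                      /* bring the line into the cache */
    _mm_clwb((void *)p);           /* write the line back */
    _mm_mfence();                  /* same fence as in the loop above */
    return timed_load(p);
}
```

A reload that takes about as long as a back-to-back second timed_load of the same address suggests the line was retained. A reload that costs about as much as a load from a line that was never touched suggests it was evicted. A single sample is noisy, so in practice the measurement would be repeated over many lines and the distributions compared.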
Solution 1:
clwb behaves like clflushopt on SKX and CSL (Skylake and Cascade Lake server processors), so it evicts the line after writing it back. However, programs that use clwb on these processors will automatically benefit when run on a future processor that supports an optimized implementation of clwb.
clwb retains the cache line on ICL.
Note that the cpuid leaf 0x7 information from InstLatx64 says that ICL doesn't support clwb, which is incorrect.
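Support can also be checked on the machine itself instead of relying on a published dump. CLWB is reported in CPUID leaf 0x7, subleaf 0, EBX bit 24 (CLFLUSHOPT is bit 23). The sketch below assumes GCC or Clang, whose cpuid.h provides __get_cpuid_count. The function names ebx_has_clwb and cpu_supports_clwb are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>              /* GCC/Clang helper for the cpuid instruction */
#endif

#define CPUID7_EBX_CLFLUSHOPT (1u << 23)
#define CPUID7_EBX_CLWB       (1u << 24)

/* Decode the CLWB feature bit from an EBX value returned by leaf 0x7. */
static bool ebx_has_clwb(uint32_t ebx)
{
    return (ebx & CPUID7_EBX_CLWB) != 0;
}

/* Query CPUID.(EAX=7, ECX=0) and report whether CLWB is advertised.
 * Returns false if leaf 7 is unavailable or the target is not x86. */
static bool cpu_supports_clwb(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ebx_has_clwb(ebx);
#else
    return false;
#endif
}
```

Calling cpu_supports_clwb() from main and printing the result shows what the processor itself advertises. This only tells you that the instruction is available, not whether the implementation retains or evicts the line.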
clwb is also supported on Zen 2, but I don't know how it behaves on this microarchitecture.