How do I manage last level cache among virtual machines?

Solution 1:

Test your workload on the hardware and software environments you are considering. Ideally run actual production work; short of that, simulate what clients would be doing: transactions, file transfers, media encoding, whatever applies. Look to technical blogs for inspiration. For example, Cloudflare's "Gen X" hardware selection is lacking in scientific discipline, but they are right to consider caches in their Intel Xeon Platinum 6162 vs AMD EPYC 7642 competition.

Several vendors can deliver 20+ core CPUs, and they differ on how L3 cache is laid out. AMD EPYC (in Zen 2 parts such as the 7642) has an L3 in each 4-core complex. IBM POWER9 has L3 dedicated per 2 cores. Intel Xeon Cascade Lake has L3 shared across all cores. Locating caches as close to the core as possible shaves off cycles of latency, possibly making a difference to performance.
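To see how L3 is actually laid out on a given Linux host, you can read the cache topology the kernel exposes in sysfs. The following is a minimal sketch, assuming a Linux host that populates `/sys/devices/system/cpu/cpuN/cache/indexM/`; the function names are my own. Inside a guest you will see whatever topology the hypervisor chose to present, which may not match the physical layout.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def l3_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct sets of logical CPUs that share one L3 cache."""
    groups = set()
    # Each cpuN/cache/indexM directory describes one cache visible to that CPU.
    for level_file in Path(sysfs_root).glob("cpu[0-9]*/cache/index*/level"):
        if level_file.read_text().strip() != "3":
            continue
        shared = (level_file.parent / "shared_cpu_list").read_text()
        groups.add(tuple(parse_cpu_list(shared)))
    return sorted(groups)


if __name__ == "__main__":
    for cpus in l3_groups():
        print("L3 shared by CPUs:", cpus)
```

Many small groups suggest a per-complex L3; one large group per socket suggests a single shared L3.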

CPU caches tend to be less shared system-wide than memory. In designs where the cache is dedicated to a core complex, a VM running on one 4-core complex is not going to steal cache from cores in another complex. Yes, something churning the L3 cache will hurt the hit ratio and force more trips to memory, but how much that matters depends on the workload.
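To get a feel for why the hit ratio matters, here is a back-of-the-envelope sketch using the usual average-access-time arithmetic. The latency figures are made-up placeholders, not measurements of any CPU mentioned here; substitute numbers from your own hardware.

```python
def average_access_ns(l3_hit_ratio, l3_latency_ns, memory_latency_ns):
    """Average cost of an access that reaches L3.

    Every access pays the L3 lookup; misses additionally pay the trip to memory.
    """
    if not 0.0 <= l3_hit_ratio <= 1.0:
        raise ValueError("hit ratio must be between 0 and 1")
    miss_ratio = 1.0 - l3_hit_ratio
    return l3_latency_ns + miss_ratio * memory_latency_ns


if __name__ == "__main__":
    # Hypothetical latencies, for illustration only.
    L3_NS = 10.0
    MEMORY_NS = 90.0
    for hit_ratio in (0.9, 0.6):
        avg = average_access_ns(hit_ratio, L3_NS, MEMORY_NS)
        print(f"hit ratio {hit_ratio:.0%}: about {avg:.0f} ns per access")
```

With these placeholder numbers, dropping from a 90% to a 60% hit ratio takes the average from roughly 19 ns to roughly 46 ns. Whether your workload ever sees a drop like that is exactly what testing should tell you.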

First test with an unaware scheduler and no QoS; that is, don't explicitly manage L3. Even a naïve assignment to cores should give cores a roughly equal share of cache. Find where the bottlenecks truly are. They could equally be in storage, network, or a hundred other places.

Should you need fine control of cache, ask your vendors how to do that on your chosen CPU and hypervisor. Intel has Resource Director Technology (RDT). AMD has Platform Quality of Service (PQoS). Both are showing up in the Linux kernel, AMD more recently. But do this after testing, so you know how hard your workload hits the cache.
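On Linux, both features are exposed through the resctrl filesystem, where each control group has a `schemata` file holding lines such as `L3:0=f;1=f0` (cache ID, then a hexadecimal capacity bitmask). As a rough sketch of what cache partitioning looks like, the helpers below only build such a line; they do not touch the system. The helper names and the 4-way split are my own illustration, and the valid mask width for your CPU has to come from the host (for example `info/L3/cbm_mask` under the resctrl mount).

```python
def contiguous_mask(start_bit, num_ways):
    """Return a bitmask with num_ways consecutive bits set, starting at start_bit.

    Capacity bitmasks are built as one contiguous run here, which is the
    conservative choice since some CPUs reject masks with gaps.
    """
    if start_bit < 0 or num_ways < 1:
        raise ValueError("start_bit must be >= 0 and num_ways >= 1")
    return ((1 << num_ways) - 1) << start_bit


def schemata_line(masks_by_cache_id):
    """Format an L3 line for a resctrl 'schemata' file from {cache_id: mask}."""
    parts = [
        f"{cache_id}={mask:x}"
        for cache_id, mask in sorted(masks_by_cache_id.items())
    ]
    return "L3:" + ";".join(parts)


if __name__ == "__main__":
    # Hypothetical split: 4 ways on cache 0, a different 4 ways on cache 1.
    line = schemata_line({0: contiguous_mask(0, 4), 1: contiguous_mask(4, 4)})
    print(line)
```

Applying it means writing the line, as root, to the `schemata` file of a group you created under the resctrl mount point and then assigning the VM's tasks to that group. How that maps onto your hypervisor is the part to confirm with your vendor.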