Why are multiple levels of caches used in modern CPUs?
I recently read about processors (mainly multi-processor systems) and came across the concept of multi-level caches. These designs use several levels of cache to increase performance.
I could not figure out, however, how additional cache levels increase performance. Why not just increase the size of the existing cache instead?
Solution 1:
The use of multiple cache levels is partially a mechanism to coordinate multi-core processors and partially a compromise between price and performance.
In a processor with multiple cores, each core has its own L1 cache. This allows the core to read from and write to its cache without worrying about interfering with other cores. The cores do need shared storage, though, so that they can exchange certain information easily. The L2 cache is shared by all cores (in many modern CPUs that shared role falls to the L3, with a per-core L2), so it serves as a sort of communal storage space where information is available to all threads.
The difference between the L2 and L3 caches is the compromise part. Caches are made of static RAM (SRAM). This is different from the dynamic RAM (DRAM) that makes up your main memory. DRAM needs to be "refreshed" periodically: over time, DRAM cells lose their value unless they are read and then re-written. Your memory controller does this automatically, but every time it does (many thousands of times a second) it cannot read or write values for the processor until it is done, which slows DRAM down. SRAM does not have this limitation; it retains its value indefinitely as long as it has operating power, making it much faster.

So your caches (both L2 and L3) are made of SRAM. The trouble is that SRAM is very expensive; while 4 GB of high-speed DRAM is a bit pricey but affordable, 4 GB of SRAM is well beyond your price range.
So whenever a processor manufacturer decides to add more SRAM to the design, the processor gets noticeably more expensive. SRAM also comes in multiple speeds, and, as you might expect, faster SRAM costs more than slower SRAM. The manufacturer therefore optimizes for both speed and cost by using both a faster SRAM and a slower SRAM: the processor is designed to put the values it needs most in the faster cache (L2) and the information it needs less often in the slower cache (L3). Because the processor's cache hardware manages this placement automatically, the result is a reasonable compromise: there is more cache in total, and some of it (hopefully the part the processor needs most) is very fast.
So, to sum it up, processors have multi-level caches in order to increase the capacity of the processor cache without also dramatically increasing the price of the processor. This careful mixture allows for processors that are faster and cheaper.
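If you want to see these tiers on your own machine, a rough pointer-chasing test makes them visible: as the working set outgrows each cache level, the average time per access jumps. This is only a sketch; the working-set sizes, the build command, and the numbers it prints are assumptions that depend on your particular CPU.

```c
/*
 * Pointer-chasing latency sketch (illustrative only; sizes, flags and
 * results are assumptions that depend on the machine).
 * Build with something like:  cc -O2 latency.c -o latency
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STEPS (1 << 24)  /* dependent loads per measurement */

/* Average nanoseconds per access for a working set of `bytes` bytes. */
static double chase(size_t bytes)
{
    size_t n = bytes / sizeof(size_t);
    size_t *next = malloc(n * sizeof(size_t));
    size_t *perm = malloc(n * sizeof(size_t));
    if (!next || !perm) return -1.0;

    /* Build one random cycle so each load depends on the previous one
     * and the hardware prefetcher gets little help. */
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        next[perm[i]] = perm[(i + 1) % n];
    free(perm);

    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < STEPS; i++)
        p = next[p];                      /* serial chain of loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile size_t sink = p;             /* keep the loop alive */
    (void)sink;
    free(next);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / STEPS;
}

int main(void)
{
    /* Working sets from 16 KiB (should fit in L1) up to 64 MiB (DRAM). */
    for (size_t kib = 16; kib <= 64 * 1024; kib *= 4)
        printf("%8zu KiB : %6.2f ns/access\n", kib, chase(kib * 1024));
    return 0;
}
```

On a typical desktop you would expect a few distinct plateaus in the output, roughly corresponding to the L1, L2, L3 and main-memory latencies.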
Solution 2:
Multilevel caches are primarily a compromise between capacity and access cost (both latency and energy).
It might help to compare it to buying a tool. Going to the local hardware store (comparable to L1 cache) would be fast and use less energy, but the local hardware store is small and is more likely not to have the specific tool one seeks. Going to the big box hardware store (comparable to L2 cache) will take more time and energy (it is farther away and looking for the tool will take longer), but the tool is more likely to be in stock. If even the big box hardware store does not have the tool, one might go to the manufacturer's warehouse (comparable to main memory) which is almost certain to have the tool. If even the warehouse does not have the tool, then an even longer wait is expected until the manufacturer's factory (comparable to disk) produces more of the tool.
Living next to a big box hardware store (having a very large L1 cache) would save time if the diversity of hardware supplies sought was typically great (some PA-RISC processors targeting commercial workloads did this), but typically only a small variety of supplies is used, so a small local store would be very likely to have the item in stock (a high probability of a cache hit), and finding a commonly used item is faster in a smaller store.
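The same point can be made with a back-of-the-envelope average-access-time calculation. The latencies and hit rates below are made-up but plausible numbers, chosen only to illustrate why a small fast cache in front of a bigger slower one wins even when the overall hit rate is the same.

```c
/*
 * Back-of-the-envelope average memory access time (AMAT).
 * All latencies and hit rates are assumptions for illustration,
 * not measurements of any particular CPU.
 */
#include <stdio.h>

int main(void)
{
    double l1_lat  = 1.0,  l1_hit = 0.95;  /* small, close, fast      */
    double l2_lat  = 5.0,  l2_hit = 0.80;  /* bigger, farther, slower */
    double mem_lat = 80.0;                 /* main memory (DRAM)      */

    /* Two levels: always pay L1, pay L2 only on an L1 miss,
     * pay memory only when both miss.
     * Overall hit rate: 0.95 + 0.05 * 0.80 = 0.99 */
    double hierarchy = l1_lat + (1.0 - l1_hit) * (l2_lat + (1.0 - l2_hit) * mem_lat);

    /* One big cache that is as slow as the L2 but has the same
     * overall 99% hit rate. */
    double single = l2_lat + (1.0 - 0.99) * mem_lat;

    printf("single large cache : %.2f ns average\n", single);
    printf("L1 + L2 hierarchy  : %.2f ns average\n", hierarchy);
    return 0;
}
```

With these assumed numbers the hierarchy averages about 2 ns per access versus nearly 6 ns for the single larger cache, simply because most accesses never have to pay the larger cache's latency.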
As jcrawfordor mentioned, there are some advantages to sharing a level of cache among multiple cores: it can avoid repeated storage of the same memory contents, allow imbalanced use of the storage capacity (e.g., one core could use all of a shared L2's capacity, whereas with per-core L2 caches each core would be limited to its own), and simplify and speed up communication between cores (the same L2 would be accessed anyway on an L1 miss, and there would be no need to check whether other L2 caches had the data).
(Similar sharing advantages can apply with respect to an L2 and separate L1 instruction and data caches, but such content sharing is usually avoided (i.e., a cache line usually has only code or data) and, excluding less common actions like self-modifying code and JIT compilation, there is rarely communication between an instruction cache and a data cache.)
Sharing does have overhead, however. One might compare it to shopping at a department store. The more shoppers using the store, the more likely there will be a line at any given checkout station (comparable to the banks of an L2 cache). In addition, the shared entrance/exit introduces delays (comparable to arbitration delays for cache access). Providing multiple doors can support higher throughput, but it increases the time required to choose a door; the choice overhead may be extremely small (but not non-existent) when no one else is entering or exiting, yet when the store is busy the choice of door becomes more complex.

If one assumes that the store will be busy, some of the decision delay can be avoided, but just using the most convenient door would be faster if the store is not busy. Similarly, a cache might take the extra time to allocate a buffer to hold the memory request information even if such a buffer turns out to be unnecessary because the cache is not busy. Without that optimization, the two steps of determining whether the cache is busy and allocating a buffer entry would occur in sequence when the cache is busy, so the total time would be the sum of the two; when the cache is not busy, the buffer allocation step would simply be skipped.
Sharing can also increase the frequency of conflict misses, given the limited associativity of a cache, and can lead to poor cache replacement choices (e.g., one core using a streaming access pattern with little data reuse will tend to consume capacity that another core with frequent data reuse would benefit from more). There are techniques to reduce such disadvantages, but they add complexity and have other costs.
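A rough way to observe this streaming-pollution effect is to time a loop with heavy data reuse while another thread streams through a large array. The sketch below assumes the two threads land on different cores that share a last-level cache; the array sizes are guesses that you would need to tune to your CPU, and part of the slowdown it shows comes from memory-bandwidth contention rather than cache eviction alone.

```c
/*
 * Sketch of shared-cache interference: one thread re-reads a "hot"
 * array with lots of reuse while another streams through a large array
 * with no reuse. Sizes are assumptions to be tuned per machine, and the
 * OS decides which cores the threads run on, so the magnitude of the
 * slowdown will vary.
 * Build with something like:  cc -O2 -pthread contention.c -o contention
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define HOT_BYTES    (4UL * 1024 * 1024)    /* hopefully fits in shared L3 */
#define STREAM_BYTES (256UL * 1024 * 1024)  /* much larger than any cache  */
#define HOT_PASSES   1000

static long *hot, *stream_buf;
static volatile int streaming;
static volatile long sink;                  /* defeats dead-code elimination */

static void *stream_thread(void *arg)
{
    (void)arg;
    size_t n = STREAM_BYTES / sizeof(long);
    long sum = 0;
    while (streaming)
        for (size_t i = 0; i < n; i++)
            sum += stream_buf[i];           /* touch everything, reuse nothing */
    sink = sum;
    return NULL;
}

static double time_hot_passes(void)
{
    size_t n = HOT_BYTES / sizeof(long);
    long sum = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < HOT_PASSES; p++)
        for (size_t i = 0; i < n; i++)
            sum += hot[i];                  /* same data over and over */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    hot = malloc(HOT_BYTES);
    stream_buf = malloc(STREAM_BYTES);
    if (!hot || !stream_buf) return 1;

    /* Write the buffers once so the pages are really allocated. */
    for (size_t i = 0; i < HOT_BYTES / sizeof(long); i++) hot[i] = 1;
    for (size_t i = 0; i < STREAM_BYTES / sizeof(long); i++) stream_buf[i] = 1;

    printf("hot loop alone         : %.3f s\n", time_hot_passes());

    streaming = 1;
    pthread_t t;
    pthread_create(&t, NULL, stream_thread, NULL);
    printf("hot loop with streamer : %.3f s\n", time_hot_passes());

    streaming = 0;
    pthread_join(t, NULL);
    free(hot);
    free(stream_buf);
    return 0;
}
```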