Why is cache memory so expensive?
Solution 1:
Take a look at: "Processors cache L1, L2 and L3 are all made of SRAM?"
In general they are all implemented with SRAM.
(IBM's POWER and zArchitecture chips use DRAM for the L3. This is called embedded DRAM (eDRAM) because it is implemented in the same type of process technology as logic, allowing fast logic to be integrated on the same chip as the DRAM. For POWER4 the off-chip L3 used eDRAM; POWER7 has the L3 on the same chip as the processing cores.)
Although they all use SRAM, they do not all use the same SRAM design. The SRAM for L2 and L3 is optimized for size (to increase the capacity given a limited manufacturable chip size, or to reduce the cost of a given capacity), while the SRAM for L1 is more likely to be optimized for speed.
More importantly, the access time is related to the physical size of the storage. With a two-dimensional layout, one can expect physical access latency to be roughly proportional to the square root of the capacity. (Non-uniform cache architecture exploits this to provide a subset of the cache at lower latency. The L3 slices of recent Intel processors have a similar effect; a hit in the local slice has significantly lower latency.) This effect can make a DRAM cache faster than an SRAM cache at high capacities because the DRAM is physically smaller.
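As a rough rule of thumb (a sketch only; the real constants depend heavily on the particular design), the wire-dominated component of the access latency scales as

$$ t_{\text{access}}(C) \approx t_0 \sqrt{C / C_0} $$

so quadrupling the capacity of an array built in the same style roughly doubles that component of its latency, which is one reason a small L1 can be made so much faster than a large L3.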
Another factor is that most L2 and L3 caches access tags and data serially, whereas most L1 caches access tags and data in parallel. This is a power optimization: L2 miss rates are higher than L1 miss rates, so a speculative data access is more likely to be wasted work; an L2 data access generally requires more energy (related to the capacity); and L2 caches usually have higher associativity, which means more data entries would have to be read speculatively. Obviously, having to wait for the tag match before accessing the data adds to the time required to retrieve the data. (L2 access also typically only begins after an L1 miss is confirmed, so the latency of L1 miss detection is added to the total L2 access latency.)
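To make the serial-versus-parallel distinction concrete, here is a small software model of the two lookup orders. It is only an illustrative sketch (the structure, way count and line size are invented for the example), not how a real cache controller works:

```c
/* Toy model of parallel vs. serial tag/data lookup.
 * All sizes and names here are invented for illustration only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAYS      8     /* associativity, chosen arbitrarily for the example */
#define LINE_SIZE 64    /* bytes per cache line */

typedef struct {
    uint64_t tag[WAYS];
    bool     valid[WAYS];
    uint8_t  data[WAYS][LINE_SIZE];
} cache_set;

/* Parallel tag/data access (typical for L1): every data way is read while
 * the tags are compared, then the matching way is selected.  Low latency,
 * but WAYS data-array reads are spent on every lookup, hit or miss.      */
bool lookup_parallel(const cache_set *s, uint64_t tag, uint8_t out[LINE_SIZE])
{
    uint8_t speculative[WAYS][LINE_SIZE];
    int hit_way = -1;
    for (int w = 0; w < WAYS; w++) {
        memcpy(speculative[w], s->data[w], LINE_SIZE);  /* speculative data read */
        if (s->valid[w] && s->tag[w] == tag)
            hit_way = w;                                /* tag comparison */
    }
    if (hit_way < 0)
        return false;
    memcpy(out, speculative[hit_way], LINE_SIZE);       /* late way select */
    return true;
}

/* Serial tag-then-data access (typical for L2/L3): tags are compared first
 * and only the one way that hit is read.  The tag-match latency is added to
 * the access time, but at most one data-array read is performed.           */
bool lookup_serial(const cache_set *s, uint64_t tag, uint8_t out[LINE_SIZE])
{
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {          /* tag comparison first */
            memcpy(out, s->data[w], LINE_SIZE);         /* single data read */
            return true;
        }
    }
    return false;
}

int main(void)
{
    cache_set set = {0};
    set.tag[3] = 0x42;
    set.valid[3] = true;
    memset(set.data[3], 0xAB, LINE_SIZE);

    uint8_t line[LINE_SIZE];
    printf("parallel hit: %d\n", lookup_parallel(&set, 0x42, line));
    printf("serial   hit: %d\n", lookup_serial(&set, 0x42, line));
    return 0;
}
```

In hardware the parallel version's extra reads cost energy rather than instructions, but the trade-off it illustrates is the same one described above: the serial version pays tag-match latency to avoid touching data ways that will not be used.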
In addition, L2 cache is physically more distant from the execution engine. Placing the L1 data cache close to the execution engine (so that the common case of L1 hit is fast) generally means that L2 must be placed farther away.
Take a look at: "Why is the capacity of cache memory so limited?"
The total silicon area (maximum chip size) is limited. How to spend it, on more cores, more cache, or more levels of cache hierarchy, is a design trade-off.
Take a look at: "L2 and L3 Cache Difference?"
Typically there are now three levels of cache on modern CPUs:
- L1 cache is very small and very tightly bound to the actual processing units of the CPU; it can typically fulfil data requests within about 3 CPU clock ticks. L1 tends to be around 4-32KB depending on the CPU architecture and is split into separate instruction and data caches.
- L2 cache is generally larger but a bit slower, and is generally tied to a single CPU core. Recent processors tend to have 512KB of L2 per core, and this cache makes no distinction between instructions and data (it is a unified cache). I believe the response time for in-cache data is typically under 20 CPU "ticks".
- L3 cache tends to be shared by all the cores present on the CPU and is much larger and slower again, but it is still a lot faster than going to main memory. L3 tends to be of the order of 4-8MB these days.
- Main memory (~16 GB these days, shared by all cores) sits below all of this; a rough way to observe the latency of each level from software is sketched after this list.
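You can actually see these levels from software with a simple pointer-chasing loop: as the working set grows past each cache level, the time per dependent load jumps. A minimal sketch (POSIX timing; the sizes and iteration count are arbitrary choices, the printed numbers will vary from machine to machine, and this is a demonstration rather than a rigorous benchmark):

```c
/* Pointer-chasing latency sketch.  Compile with e.g. `gcc -O2 chase.c`
 * on a POSIX system.  Sizes are picked to straddle typical L1/L2/L3/DRAM
 * boundaries; adjust for your machine. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t sizes_kb[] = { 16, 256, 2048, 16384, 131072 };
    const size_t iters = 20 * 1000 * 1000;

    for (size_t s = 0; s < sizeof sizes_kb / sizeof sizes_kb[0]; s++) {
        size_t n = sizes_kb[s] * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next)
            return 1;

        /* Sattolo's algorithm: build one big random cycle so every load
         * depends on the previous one and the prefetcher cannot help.   */
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        struct timespec t0, t1;
        size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++)
            p = next[p];                       /* dependent load chain */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                     (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
        printf("%7zu KB working set: %5.1f ns per access (checksum %zu)\n",
               sizes_kb[s], ns, p);            /* print p so the loop is kept */
        free(next);
    }
    return 0;
}
```

On a typical desktop you would expect roughly the step pattern described above: a few nanoseconds per access while the working set fits in L1/L2, somewhere around 10-20 ns in L3, and on the order of 100 ns once it only fits in main memory.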
Take a look at: https://stackoverflow.com/questions/4666728/size-of-l1-cache-and-l2-cache
L1 is closer to the processor and is accessed on every memory access, so its accesses are very frequent. Thus, it needs to return the data really fast (usually within one clock cycle). It also needs lots of read/write ports and high access bandwidth. Building a large cache with these properties is impossible. Thus, designers keep it small, e.g., 32KB in most processors today.
L2 is accessed only on L1 misses, so its accesses are less frequent (usually about 1/20th of the L1's). Thus, L2 can take multiple cycles to access (usually kept under 10) and can have fewer ports. This allows designers to make it bigger.
The two play very different roles. If L1 were made bigger, it would increase the L1 access latency, which would drastically reduce performance because it would make all loads and stores slower. Thus, the L1 size is barely debatable.
If we removed L2, L1 misses would have to go to the next level, say main memory. This means that a lot of accesses would be going to memory, which would imply that we need more memory bandwidth, which is already a bottleneck. Thus, keeping the L2 around is favorable.
Experts often refer to L1 as a latency filter (as it makes the common case of L1 hits faster) and L2 as a bandwidth filter as it reduces memory bandwidth usage.
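A back-of-the-envelope average memory access time (AMAT) calculation shows both filters at work. The numbers below are made up for illustration, not taken from any particular CPU:

$$ \text{AMAT} = t_{L1} + m_{L1}\,(t_{L2} + m_{L2}\,t_{\text{mem}}) $$

With, say, t_L1 = 4 cycles, an L1 miss rate of 5%, t_L2 = 12 cycles, an L2 miss rate of 20% and t_mem = 200 cycles, AMAT = 4 + 0.05 × (12 + 0.2 × 200) = 6.6 cycles, i.e. close to the L1 hit time (the latency filter), while only 0.05 × 0.2 = 1% of all accesses ever reach memory (the bandwidth filter).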
Note: I have assumed a 2-level cache hierarchy in my argument to make it simpler. In almost all of today's multicore chips there also exists an L3. In these chips, L3 is the one that plays the role of memory bandwidth filter. L2 now plays the role of on-chip bandwidth filter, i.e., it reduces accesses to the on-chip interconnect and to the L3 (allowing designers to use a low-bandwidth interconnect like a ring and a slow single-ported L3, which in turn lets them make the L3 bigger).
It is perhaps worth mentioning that the number of ports is a very important design point because it determines how much chip area the cache will consume. Ports add wires to the cache, which consume a lot of chip area and power.
Take a look at: http://en.wikipedia.org/wiki/Static_random-access_memory
Static random-access memory (SRAM or static RAM) is a type of semiconductor memory that uses bistable latching circuitry to store each bit. The term static differentiates it from dynamic RAM (DRAM), which must be periodically refreshed. SRAM exhibits data remanence, but it is still volatile in the conventional sense that data is eventually lost when the memory is not powered.
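Tying this back to the question in the title: the textbook SRAM cell stores one bit in a six-transistor latch, while a DRAM cell stores one bit as charge on a single transistor-plus-capacitor pair,

$$ \underbrace{6\ \text{transistors}}_{\text{one SRAM bit}} \quad \text{vs.} \quad \underbrace{1\ \text{transistor} + 1\ \text{capacitor}}_{\text{one DRAM bit}} $$

so SRAM needs several times the silicon area per bit (the exact ratio depends on the process), and silicon area is ultimately what you pay for.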