From a previous question on this forum, I learned that in most of the memory systems, L1 cache is a subset of the L2 cache means any entry removed from L2 is also removed from L1.

So now my question is how do I determine a corresponding entry in L1 cache for an entry in the L2 cache. The only information stored in the L2 entry is the tag information. Based on this tag information, if I re-create the addr it may span multiple lines in the L1 cache if the line-sizes of L1 and L2 cache are not same.

Does the architecture really bother about flushing both the lines or it just maintains L1 and L2 cache with the same line-size.

I understand that this is a policy decision but I want to know the commonly used technique.


Solution 1:

In core i7 the line sizes in L1 , L2 and L3 are the same: that is 64 Bytes. I guess this simplifies maintaining the inclusive property, and coherence.

See page 10 of: https://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf

Solution 2:

Cache-Lines size is (typically) 64 bytes.

Moreover, take a look at this very interesting article about processors caches: Gallery of Processor Cache Effects

You will find the following chapters:

  1. Memory accesses and performance
  2. Impact of cache lines
  3. L1 and L2 cache sizes
  4. Instruction-level parallelism
  5. Cache associativity
  6. False cache line sharing
  7. Hardware complexities

Solution 3:

The most common technique of handling cache block size in a strictly inclusive cache hierarchy is to use the same size cache blocks for all levels of cache for which the inclusion property is enforced. This results in greater tag overhead than if the higher level cache used larger blocks, which not only uses chip area but can also increase latency since higher level caches generally use phased access (where tags are checked before the data portion is accessed). However, it also simplifies the design somewhat and reduces the wasted capacity from unused portions of the data. It does not take a large fraction of unused 64-byte chunks in 128-byte cache blocks to compensate for the area penalty of an extra 32-bit tag. In addition, the larger cache block effect of exploiting broader spatial locality can be provided by relatively simple prefetching, which has the advantages that no capacity is left unused if the nearby chunk is not loaded (to conserve memory bandwidth or reduce latency on a conflicting memory read) and that the adjacency prefetching need not be limited to a larger aligned chunk.

A less common technique divides the cache block into sectors. Having the sector size the same as the block size for lower level caches avoids the problem of excess back-invalidation since each sector in the higher level cache has its own valid bit. (Providing all the coherence state metadata for each sector rather than just validity can avoid excessive writeback bandwidth use when at least one sector in a block is not dirty/modified and some coherence overhead [e.g., if one sector is in shared state and another is in the exclusive state, a write to the sector in the exclusive state could involve no coherence traffic—if snoopy rather than directory coherence is used].)

The area savings from sectored cache blocks were especially significant when tags were on the processor chip but the data was off-chip. Obviously, if the data storage takes area comparable to the size of the processor chip (which is not unreasonable), then 32-bit tags with 64-byte blocks would take roughly a 16th (~6%) of the processor area while 128-byte blocks would take half as much. (IBM's POWER6+, introduced in 2009, is perhaps the most recent processor to use on-processor-chip tags and off-processor data. Storing data in higher-density embedded DRAM and tags in lower-density SRAM, as IBM did, exaggerates this effect.)

It should be noted that Intel uses "cache line" to refer to the smaller unit and "cache sector" for the larger unit. (This is one reason why I used "cache block" in my explanation.) Using Intel's terminology it would be very unusual for cache lines to vary in size among levels of cache regardless of whether the levels were strictly inclusive, strictly exclusive, or used some other inclusion policy.

(Strict exclusion typically uses the higher level cache as a victim cache where evictions from the lower level cache are inserted into the higher level cache. Obviously, if the block sizes were different and sectoring was not used, then an eviction would require the rest of the larger block to be read from somewhere and invalidated if present in the lower level cache. [Theoretically, strict exclusion could be used with inflexible cache bypassing where an L1 eviction would bypass L2 and go to L3 and L1/L2 cache misses would only be allocated to either L1 or L2, bypassing L1 for certain accesses. The closest to this being implemented that I am aware of is Itanium's bypassing of L1 for floating-point accesses; however, if I recall correctly, the L2 was inclusive of L1.])

Solution 4:

Typically, in one access to the main memory 64 bytes of data and 8 bytes of parity/ECC (I don't remember exactly which) is accessed. And it is rather complicated to maintain different cache line sizes at the various memory levels. You have to note that cache line size would be more correlated to the word alignment size on that architecture than anything else. Based on that, a cache line size is highly unlikely to be different from memory access size. Now, the parity bits are for the use of the memory controller - so cache line size typically is 64 bytes. The processor really controls very little beyond the registers. Everything else going on in the computer is more about getting hardware in to optimize CPU performance. In that sense also, it really would not make any sense to import extra complexity by making cache line sizes different at different levels of memory.