Solution 1:

If the cache line containing the byte or word you're loading is not already present in the cache, your CPU will request the 64 bytes that begin at the cache line boundary (the largest address below the one you need that is multiple of 64).

Modern PC memory modules transfer 64 bits (8 bytes) at a time, in a burst of eight transfers, so one command triggers a read or write of a full cache line from memory. (DDR1/2/3/4 SDRAM burst transfer size is configurable up to 64B; CPUs will select the burst transfer size to match their cache line size, but 64B is common)

As a rule of thumb, if the processor can't forecast a memory access (and prefetch it), the retrieval process can take ~90 nanoseconds, or ~250 clock cycles (from the CPU knowing the address to the CPU receiving data).

By contrast, a hit in L1 cache has a load-use latency of 3 or 4 cycles, and a store-reload has a store-forwarding latency of 4 or 5 cycles on modern x86 CPUs. Things are similar on other architectures.

Further reading: Ulrich Drepper's What Every Programmer Should Know About Memory. The software-prefetch advice is a bit outdated: modern HW prefetchers are smarter, and hyperthreading is way better than in P4 days (so a prefetch thread is typically a waste). Also, the x86 tag wiki has lots of performance links for that architecture.

Solution 2:

First of all a main memory access is very expensive. Currently a 2GHz CPU (the slowest once) has 2G ticks (cycles) per second. A CPU (virtual core nowadays) can fetch a value from its registers once per tick. Since a virtual core consists of multiple processing units (ALU - arithmetic logic unit, FPU etc) it can actually process certain instructions in parallel if possible.

An access of the main memory costs about 70ns to 100ns (DDR4 is slightly faster). This time is basically looking up the L1, L2 and L3 cache and than hit the memory (send command to memory controller, which sends it to the memory banks), wait for the response and done.

100ns means about 200 ticks. So basically if a program would always miss the caches which each memory access, the CPU would spend about 99,5% of its time (if it only reads memory) idle waiting for the memory.

In order to speed things up there is the L1, L2, L3 caches. They use memory being directly placed on the chip and using a different kind of transistor circuits to store the given bits. This takes more room, more energy and is more costly than the main memory since a CPU is usually produced using a more advanced technology and a production failure in the L1, L2, L3 memory has the chance to render the CPU worthless (defect) so large L1, L2, L3 caches increase the error rate which decreases the yield which directly decreases ROI. So there is a huge trade off when it comes to available cache size.

(currently one creates more L1, L2, L3 caches in order to be able to deactivate certain portions to decrease the chance that an actual production defect is the cache memory areas renders the CPU defect as a whole).

To give a timing idea (source: costs to access caches and memory)

  • L1 cache: 1ns to 2ns (2-4 cycles)
  • L2 cache: 3ns to 5ns (6-10 cycles)
  • L3 cache: 12ns to 20ns (24-40 cycles)
  • RAM: 60ns (120 cycles)

Since we mix different CPU types these are just estimates but give a good idea whats really going when a memory value is fetched and we might have a hit or a miss in certain cache layer.

So a cache basically speeds up memory access greatly (60ns vs. 1ns).

Fetching a value, storing it in the cache for the chance of rereading it is good for variables that are often accessed but for memory copy operations it would be still to slow since one just reads a value, writes the value somewhere and never reads the value again... no cache hits, dead slow (beside this can happen in parallel since we have out of order execution).

This memory copy is so important that there are different means to speed it up. In early days memory was often able to copy memory outside of the CPU. It was handled by the memory controller directly, so a memory copy operation did not pollute the caches.

But beside from a plain memory copy other serial access of memory was quite common. An example is analyzing a series of information. Having an array of integers and calculating sum, mean, average or even more simple find a certain value (filter/search) were another very important class of algorithms run every time on any general purpose CPU.

So by analyzing memory access pattern it was apparent that data is read sequentially very often. There was a high probability that if a program reads the value at index i, that the program will also read the value i+1. This probability is slightly higher than the probability that the same program will also read the value i+2 and so on.

So given a memory address it was (and still is) a good idea to read ahead and fetch additional values. This is the reason why there is a boost mode.

Memory access in boost mode means, that an address is send and multiple values are sequentially send. Each additional value send only takes about additional 10ns (or even below).

Another problem was an address. Sending an address takes time. In order to address a large portion of memory large addresses has to be send. In early days it meant that the address bus was not big enough to send the address in a single cycle (tick) and more than one cycle was needed to send the address adding more delay.

A cache line of 64 bytes for instance means that the memory is divided in distinct (non-overlapping) blocks of memory being 64bytes in size. 64bytes mean the start address of each block has the lowest six address bits to be always zeros. So sending these six zero bits each time is not needed increasing the address space 64 times for any number of address bus width (welcome effect).

Another problem the cache line solves (beside reading ahead and saving / freeing six bits on the address bus) is in the way the cache is organized. For example if a cache would be divided in 8 byte (64bit) blocks (cells) one needs to store the address of the memory cell this cache cell holds the value for along with it. If the address would also be 64bit this means that half the cache size is consumed by the address resulting in an overhead of 100%.

Since a cache line is 64bytes and a CPU might use 64bit - 6bit = 58bit (no need to store the zero bits too right) means we can cache 64bytes or 512bits with an overhead of 58bit (11% overhead). In reality the addresses stored are even smaller than this but there are status informations (like is the cache line valid and accurate, dirty and needs to wrote back in ram etc).

Another aspect is that we have set-associative cache. Not every cache cell is able to store a certain address but only a subset of those. This makes the necessary stored address bits even smaller, allows parallel access of the cache (each subset can be accessed once but independent of the other subsets).

There is more especially when it comes to synchronize cache/memory access between the different virtual cores, their independent multiple processing units per core and finally multiple processors on one mainboard (which there are boards housing as much as 48 processors and more).

This is basically the current idea why we have cache lines. The benefit from reading ahead is very high and the worst case of reading a single byte out of a cache-line and never read the rest again is very slim since the probability is very slim.

The size of the cache-line (64) is a wise chosen trade-off between larger cache-lines makes it unlikely for the last byte of it to be read also in the near future, the duration it takes to fetch the complete cache line from memory (and to write it back) and also the overhead in cache organization and the parallelization of cache and memory access.

Solution 3:

If cache lines are 64 bytes wide, then they correspond to blocks of memory which start on addresses that are divisible by 64. The least significant 6 bits of any address are an offset into the cache line.

So for any given byte, the cache line which has to be fetched can be found by clearing the least signficant six bits of the address, which corresponds to rounding down to the nearest address that is divisible by 64.

Though this is done by hardware, we can show the calculations using some reference C macro definitions:

#define CACHE_BLOCK_BITS 6
#define CACHE_BLOCK_SIZE (1U << CACHE_BLOCK_BITS)  /* 64 */
#define CACHE_BLOCK_MASK (CACHE_BLOCK_SIZE - 1)    /* 63, 0x3F */

/* Which byte offset in its cache block does this address reference? */
#define CACHE_BLOCK_OFFSET(ADDR) ((ADDR) & CACHE_BLOCK_MASK)

/* Address of 64 byte block brought into the cache when ADDR accessed */
#define CACHE_BLOCK_ALIGNED_ADDR(ADDR) ((ADDR) & ~CACHE_BLOCK_MASK)

Solution 4:

Processors may have multi-level caches (L1, L2, L3), and these differ on size and speed.

Yet, to understand what exactly goes into each cache you'll have to study the branch predictor used by that specific processor, and how the instructions/data of your program behave against it.

Read about branch predictor, CPU cache and replacement policies.

This is not an easy task. If at the end of the day all you want is a performance test, you can use a tool like Cachegrind. However, as this is a simulation, its result may differ at some degree.

Solution 5:

I can't say for certain as every hardware is different, but it is typically "64 bytes start at the closest 64 bytes boundary below" as that is a very fast and simple operation for the CPU.