Memory latency measurement with time stamp counter

Solution 1:

First, note that the two calls to printf after measuring diff1 and diff2 may perturb the state of the L1D and even the L2. On my system, with printf, the reported values for diff3-ov range from 4 to 48 cycles (I've configured my system so that the TSC frequency is about equal to the core frequency). The most common values are those of the L2 and L3 latencies. If the reported value is 8, then we've got our L1D cache hit. If it is larger than 8, then most probably the preceding call to printf has evicted the target cache line from the L1D and possibly the L2 (and in some rare cases, the L3!), which would explain the measured latencies that are higher than 8. @PeterCordes has suggested using (void) *((volatile int*)array + i) instead of temp = array[i]; printf(temp). After making this change, my experiments show that most reported measurements for diff3-ov are exactly 8 cycles (which suggests that the measurement error is about 4 cycles), and the only other values that get reported are 0, 4, and 12. So Peter's approach is strongly recommended.
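A minimal sketch of that suggestion, assuming GCC or Clang on x86-64 with <x86intrin.h>: the timed region contains only the volatile load, and the samples are stored in a buffer so that any printf happens after all measurements are done. The helper names touch, timed_load, and sample_loads are hypothetical, there are no extra fences, and the returned values still include the rdtscp overhead (the ov term above).

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp; GCC/Clang on x86 only */

/* Read *p exactly once and discard the value. The volatile qualifier
   keeps the compiler from optimizing the load away. */
static inline void touch(const int *p) {
    (void)*(const volatile int *)p;
}

/* Hypothetical helper: TSC ticks elapsed around one load of *p.
   The result includes the rdtscp overhead, which has to be measured
   separately and subtracted by the caller. */
uint64_t timed_load(const int *p) {
    unsigned aux;
    uint64_t t1 = __rdtscp(&aux);
    touch(p);
    uint64_t t2 = __rdtscp(&aux);
    return t2 - t1;
}

/* Collect n samples into out[] without printing anything, so that no
   printf call runs between two measurements. */
void sample_loads(const int *p, uint64_t *out, int n) {
    assert(p != NULL && out != NULL);
    for (int i = 0; i < n; i++)
        out[i] = timed_load(p);
}
```

Print the buffer only after sample_loads returns, and subtract the separately measured overhead from each sample at that point.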

In general, the main memory access latency depends on many factors, including the state of the MMU caches and the impact of the page table walkers on the data caches, the core frequency, the uncore frequency, the state and configuration of the memory controller and the memory chips with respect to the target physical address, uncore contention, and on-core contention due to hyperthreading. array[70] might be in a different virtual page (and physical page) than array[30], and the IPs of the load instructions and the addresses of the target memory locations may interact with the prefetchers in complex ways. So there can be many reasons why cache miss1 is different from cache miss2. A thorough investigation is possible, but it would require a lot of effort, as you might imagine. Generally, if your core frequency is higher than 1.5 GHz (which is lower than the TSC frequency on high-perf Intel processors), then an L3 load miss will take at least 60 core cycles. In your case, both miss latencies are over 100 cycles, so these are most likely L3 misses. In some extremely rare cases though, cache miss2 seems to be close to the L3 or L2 latency ranges, which would be due to prefetching.
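Since rdtscp counts TSC ticks rather than core cycles, measurements taken while the core runs at a frequency different from the TSC frequency need rescaling before they can be compared with latencies quoted in core cycles. A small sketch of that conversion, assuming an invariant TSC and a core frequency that stays constant during the measurement; the function name and the frequencies in the comment are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a TSC tick count into core cycles:
       core_cycles = ticks * core_hz / tsc_hz
   Example with made-up numbers: 100 ticks on a 3.0 GHz TSC while the
   core runs at 1.5 GHz correspond to 50 core cycles. */
double tsc_to_core_cycles(uint64_t ticks, double core_hz, double tsc_hz) {
    assert(tsc_hz > 0.0);
    return (double)ticks * core_hz / tsc_hz;
}
```

When the two frequencies are about equal, as in the setup described earlier, the conversion factor is close to 1 and the tick counts can be read directly as core cycles.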


I've determined that the following code gives a statistically more accurate measurement on Haswell:

t1 = __rdtscp(&dummy);
tmp = *((volatile int*)array + 30);
/* 11 dependent adds on the loaded value; %0 is the only operand (tmp) */
asm volatile ("add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
              "add $1, %0\n\t"
          : "+r" (tmp));
t2 = __rdtscp(&dummy);
t2 = __rdtscp(&dummy);
loadlatency = t2 - t1 - 60; // 60 is the overhead

The probability that loadlatency is 4 cycles is 97%. The probability that loadlatency is 8 cycles is 1.7%. The probability that loadlatency takes any other value is 1.3%. All of the other values are larger than 8 and are multiples of 4. I'll try to add an explanation later.
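Percentages like these can be obtained by collecting many loadlatency samples and tallying how often each value occurs. A minimal sketch of such a tally; MAX_LATENCY, latency_histogram, and bin_percent are hypothetical names, and the cap of 64 is an arbitrary choice rather than something taken from the measurements above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LATENCY 64  /* hypothetical cap; larger samples share the last bin */

/* Set counts[v] to the number of samples equal to v. Samples above
   MAX_LATENCY (including unsigned wrap-around from subtracting too large
   an overhead) are all counted in counts[MAX_LATENCY]. */
void latency_histogram(const uint64_t *samples, size_t n,
                       size_t counts[MAX_LATENCY + 1]) {
    assert(samples != NULL && counts != NULL);
    for (size_t v = 0; v <= MAX_LATENCY; v++)
        counts[v] = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t v = samples[i] > MAX_LATENCY ? MAX_LATENCY : samples[i];
        counts[v]++;
    }
}

/* Share of all n samples that fell into bin v, as a percentage. */
double bin_percent(const size_t *counts, size_t v, size_t n) {
    return n ? 100.0 * (double)counts[v] / (double)n : 0.0;
}
```

For example, if 8 out of 10 samples equal 4, bin_percent(counts, 4, 10) gives 80.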

Solution 2:

Some ideas:

  • Perhaps a[70] was prefetched into some level of cache besides L1?
  • Perhaps some optimization in DRAM causes this access to be fast. For instance, maybe the row buffer is left open after accessing a[30].

You should investigate other accesses besides a[30] and a[70] to see if you get different numbers. For example, do you get the same timings for a hit on a[30] followed by a[31] (which should be fetched in the same line as a[30] if you use aligned_alloc with 64-byte alignment)? And do other elements like a[69] and a[71] give the same timings as a[70]?
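To pick elements for such an experiment, it helps to check which accesses share a cache line. A rough sketch, assuming 64-byte lines and a 4-byte int (both typical on x86 but not guaranteed); ASSUMED_LINE_SIZE, alloc_line_aligned, and same_line are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ASSUMED_LINE_SIZE 64u  /* assumed cache line size in bytes */

static_assert(sizeof(int) == 4, "the offsets discussed here assume 4-byte int");

/* Allocate room for n ints, starting on a cache line boundary.
   aligned_alloc (C11) wants the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the buffer with free(). */
int *alloc_line_aligned(size_t n) {
    size_t bytes = n * sizeof(int);
    bytes = (bytes + ASSUMED_LINE_SIZE - 1) / ASSUMED_LINE_SIZE * ASSUMED_LINE_SIZE;
    return aligned_alloc(ASSUMED_LINE_SIZE, bytes);
}

/* Nonzero if both addresses fall into the same cache line. */
int same_line(const void *a, const void *b) {
    return (uintptr_t)a / ASSUMED_LINE_SIZE == (uintptr_t)b / ASSUMED_LINE_SIZE;
}
```

With a buffer from alloc_line_aligned, a[30] and a[31] sit at byte offsets 120 and 124 and share a line, a[69], a[70], and a[71] sit at offsets 276 to 284 and share another line, and a[30] and a[70] are in different lines.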