What is the Cost of an L1 Cache Miss?
Here is an attempt to provide insight into the relative cost of cache misses by analogy with baking chocolate chip cookies ...
Your hands are your registers. It takes you 1 second to drop chocolate chips into the dough.
The kitchen counter is your L1 cache, twelve times slower than registers. It takes 12 x 1 = 12 seconds to step to the counter, pick up the bag of walnuts, and empty some into your hand.
The fridge is your L2 cache, four times slower than L1. It takes 4 x 12 = 48 seconds to walk to the fridge, open it, move last night's leftovers out of the way, take out a carton of eggs, open the carton, put 3 eggs on the counter, and put the carton back in the fridge.
The cupboard is your L3 cache, three times slower than L2. It takes 3 x 48 = 2 minutes and 24 seconds to take three steps to the cupboard, bend down, open the door, root around to find the baking supply tin, extract it from the cupboard, open it, dig to find the baking powder, put it on the counter and sweep up the mess you spilled on the floor.
And main memory? That's the corner store, 5 times slower than L3. It takes 5 x 2:24 = 12 minutes to find your wallet, put on your shoes and jacket, dash down the street, grab a litre of milk, dash home, take off your shoes and jacket, and get back to the kitchen.
Note that all these accesses are constant complexity -- O(1) -- but the differences between them can have a huge impact on performance. Optimizing purely for big-O complexity is like deciding whether to add chocolate chips to the batter 1 at a time or 10 at a time, but forgetting to put them on your grocery list.
Moral of the story: Organize your memory accesses so the CPU has to go for groceries as rarely as possible.
Numbers were taken from the CPU Cache Flushing Fallacy blog post, which indicates that for a particular 2012-era Intel processor, the following is true:
- register access = 4 instructions per cycle
- L1 latency = 3 cycles (12 x register)
- L2 latency = 12 cycles (4 x L1, 48 x register)
- L3 latency = 38 cycles (about 3 x L2, 12 x L1, 144 x register)
- DRAM latency = 65 ns = 195 cycles on a 3 GHz CPU (about 5 x L3, 15 x L2, 60 x L1, 720 x register)
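If you want to check the arithmetic behind those multipliers, here is a minimal sketch in C. The macro and function names are mine, not from the blog post, and the 3 GHz clock is the same assumption the list uses to turn nanoseconds into cycles. A cost in "register operations" is taken to be cycles times the 4-per-cycle issue rate.

```c
#include <assert.h>
#include <stdio.h>

/* Figures quoted in the list above (one 2012-era Intel CPU).
   All names here are illustrative, not from the original post. */
#define ISSUE_WIDTH 4   /* register operations per cycle */
#define L1_CYCLES   3
#define L2_CYCLES   12
#define L3_CYCLES   38
#define DRAM_NS     65
#define CLOCK_GHZ   3   /* assumed clock: 1 ns = 3 cycles */

/* Convert a latency in cycles into "register operations" forgone. */
int in_register_ops(int cycles) {
    assert(cycles >= 0);
    return cycles * ISSUE_WIDTH;
}

/* DRAM latency in cycles at the assumed clock speed. */
int dram_cycles(void) {
    return DRAM_NS * CLOCK_GHZ;
}

/* Print each level's latency in cycles and in register operations. */
void print_table(void) {
    const char *names[] = {"L1", "L2", "L3", "DRAM"};
    const int cycles[] = {L1_CYCLES, L2_CYCLES, L3_CYCLES, dram_cycles()};
    for (int i = 0; i < 4; i++) {
        printf("%-4s %3d cycles = %3d register ops\n",
               names[i], cycles[i], in_register_ops(cycles[i]));
    }
}
```

Calling print_table() from main shows that the unrounded figures for L3 and DRAM come out to 152 and 780 register operations. The 144 and 720 in the list (and the cookie timings) come from rounding each step to a whole multiplier first.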
The Gallery of Processor Cache Effects also makes good reading on this topic.
While I can't offer an answer to whether or not the numbers make sense (I'm not well versed in the cache latencies, but for the record ~10 cycle L1 cache misses sounds about right), I can offer you Cachegrind as a tool to help you actually see the differences in cache performance between your 2 tests.
Cachegrind is a Valgrind tool (the framework that powers the always-lovely memcheck) which profiles cache and branch hits/misses. It will give you an idea of how many cache hits/misses you are actually getting in your program.
3.2ns for an L1 cache miss is entirely plausible. For comparison, on one particular modern multicore PowerPC CPU, an L1 miss is about 40 cycles -- a little longer for some cores than others, depending on how far they are from the L2 cache (yes really). An L2 miss is at least 600 cycles.
Cache is everything in performance; CPUs are so much faster than memory now that you're really almost optimizing for the memory bus instead of the core.
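To make "optimizing for the memory bus" a bit more concrete, here is a minimal sketch in C of the classic case: the same O(rows x cols) sum over a 2D grid stored as one flat array, walked in two different orders. The function names and the flat row-major layout are my assumptions for illustration. Both functions return the same total; only the order in which they touch memory differs.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols grid stored row by row in one flat array.
   Consecutive reads hit consecutive addresses, so each cache line
   that gets fetched is fully used before moving on. */
long sum_row_major(const int *grid, size_t rows, size_t cols) {
    assert(grid != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            total += grid[r * cols + c];
        }
    }
    return total;
}

/* Same sum, but walking down each column first. Consecutive reads
   are cols * sizeof(int) bytes apart, so on a large grid most reads
   land on a different cache line than the previous one. */
long sum_col_major(const int *grid, size_t rows, size_t cols) {
    assert(grid != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            total += grid[r * cols + c];
        }
    }
    return total;
}
```

On a grid much larger than your caches, timing the two is usually enough to see the difference, though how big it is depends on the CPU, the compiler and its flags, and the hardware prefetcher.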
Well yeah that does look like it will mainly be L1 cache misses.
10 cycles for an L1 cache miss does sound about reasonable, probably a little on the low side.
A read from RAM is going to take on the order of hundreds or maybe even thousands of cycles (I'm too tired to attempt the maths right now ;)), so it's still a huge win over that.
If you plan on using Cachegrind, please note that it is only a cache hit/miss simulator, so it won't always be accurate. For example, if you access some memory location, say 0x1234, in a loop 1000 times, Cachegrind will always show you only one cache miss (the first access), even if you have something like this in your loop:
    clflush 0x1234
On x86, that flush makes all 1000 accesses miss in the real cache.
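For anyone who wants to try this, here is a minimal sketch of such a loop in C. It assumes an x86 compiler that provides the SSE2 intrinsic _mm_clflush in <emmintrin.h>; the function name and the HAVE_CLFLUSH guard are mine. On other targets the flush is compiled out, so the function still runs but no longer demonstrates anything about the cache.

```c
#include <assert.h>
#include <stddef.h>

/* _mm_clflush is an SSE2 intrinsic; only use it where SSE2 is available. */
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0
#endif

/* Read *p `iterations` times, evicting its cache line after every read.
   With the flush in place, each read has to fetch the line again.
   `volatile` keeps the compiler from hoisting the load out of the loop. */
long read_with_flush(const volatile int *p, int iterations) {
    assert(p != NULL && iterations >= 0);
    long total = 0;
    for (int i = 0; i < iterations; i++) {
        total += *p;                      /* the access being measured */
#if HAVE_CLFLUSH
        _mm_clflush((const void *)p);     /* evict the line holding *p */
#endif
    }
    return total;
}
```

Call it with the address of any int and an iteration count of 1000, then compare what Cachegrind reports for that loop with what hardware performance counters say; per the point above, the two should disagree.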