I'm struggling to understand what happens when the first two levels of the Translation Lookaside Buffer (TLB) both result in misses.

I am unsure whether "page walking" occurs in special hardware circuitry, or whether the page tables are stored in the L2/L3 cache, or whether they only reside in main memory.


Solution 1:

(Some of this is x86 and Intel-specific. Most of the key points apply to any CPU that does hardware page walks. I also discuss ISAs like MIPS that handle TLB misses with software.)

Modern x86 microarchitectures have dedicated page-walk hardware. They can even speculatively do page walks to load TLB entries before a TLB miss actually happens. And to support hardware virtualization, the page walkers can handle guest page tables inside a host VM. (Guest physical memory = host virtual memory, more or less. VMware published a paper with a summary of EPT and benchmarks on Nehalem.)

Skylake can even have two page walks in flight at once; see Section 2.1.3 of Intel's optimization manual. (Intel also lowered the page-split load penalty from ~100 to ~5 or 10 extra cycles of latency, about the same as a cache-line split but with worse throughput. This may be related, or maybe adding a 2nd page-walk unit was a separate response to discovering that page-split accesses (and TLB misses?) were more important in real workloads than they had previously estimated.)

Some microarchitectures protect you from speculative page-walks by treating it as mis-speculation when an un-cached PTE is speculatively loaded but then modified with a store to the page table before the first real use of the entry. I.e. they snoop for stores to the page-table entries backing speculative-only TLB entries that haven't been architecturally referenced by any earlier instructions.

(Win9x depended on this, and not breaking important existing code is something CPU vendors care about. When Win9x was written, the current TLB-invalidation rules didn't exist yet so it wasn't even a bug; see Andy Glew's comments quoted below). AMD Bulldozer-family violates this assumption, giving you only what the x86 manuals say on paper.


The page-table loads generated by the page-walk hardware can hit in the L1, L2, or L3 caches. Broadwell perf counters, for example, can count page-walk hits in your choice of L1, L2, L3, or memory (i.e. cache miss). The event name is PAGE_WALKER_LOADS.DTLB_L1 for "Number of DTLB page walker hits in the L1+FB", and there are others for the ITLB and other levels of cache.

Since modern page tables use a radix-tree format, with page directory entries pointing to tables of page table entries, the higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. This is why you sometimes need to flush the TLB in cases where you might think you didn't need to. Intel and AMD actually do this, according to this paper (section 3). So does ARM, with their Intermediate table walk cache.
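To make the radix-tree walk concrete, here is a minimal C sketch (assuming x86-64 4-level paging with 4 KiB pages; not any vendor's actual implementation) of the loads the page-walk hardware performs. The phys_to_virt() helper is a hypothetical stand-in for dereferencing a physical address, and the sketch ignores huge pages, permission/NX checks, and Accessed/Dirty updates.

    #include <stdint.h>
    #include <stdbool.h>

    #define PTE_PRESENT  0x1ULL
    #define ADDR_MASK    0x000FFFFFFFFFF000ULL   /* bits 51:12 = physical address of next level */

    /* Hypothetical: lets this sketch dereference a physical address. */
    extern uint64_t *phys_to_virt(uint64_t phys);

    /* Returns true and writes the translated physical address on success;
     * false means a real access would raise #PF. */
    bool page_walk(uint64_t cr3, uint64_t vaddr, uint64_t *paddr)
    {
        uint64_t table = cr3 & ADDR_MASK;

        /* 9 index bits per level: PML4 (47:39), PDPT (38:30), PD (29:21), PT (20:12) */
        for (int shift = 39; shift >= 12; shift -= 9) {
            uint64_t idx   = (vaddr >> shift) & 0x1FF;
            uint64_t entry = phys_to_virt(table)[idx];  /* this load can hit in L1/L2/L3 */

            if (!(entry & PTE_PRESENT))
                return false;                           /* not present */

            table = entry & ADDR_MASK;                  /* the upper-level entries are the PDEs worth caching */
        }

        *paddr = table | (vaddr & 0xFFF);               /* low 12 bits are the page offset */
        return true;
    }

Each level is one dependent load, so a 4 KiB translation costs four serialized memory accesses when nothing is cached; caching the top few levels inside the walker turns most walks into just one or two loads.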

That paper says that page-walk loads on AMD CPUs ignore L1, but do go through L2. (Perhaps to avoid polluting L1, or to reduce contention for read ports). Anyway, this makes caching a few high-level PDEs (that each cover many different translation entries) inside the page-walk hardware even more valuable, because a chain of pointer-chasing is more costly with higher latency.

But note that Intel guarantees no negative caching of TLB entries. Changing a page from Invalid to Valid doesn't require invlpg. (So if a real implementation does want to do that kind of negative caching, it has to snoop or somehow still implement the semantics guaranteed by Intel manuals.)

There are old Cyrix CPUs that do perform negative caching, though. The common subset of x86 guarantees across vendors isn't always as strong as Intel's. 64-bit kernels should safely be able to change a PTE from not-present to present without invlpg, though, because those Cyrix chips were 32-bit-only. (Intel, AMD, and Via manuals all agree that it's safe; IDK of any other x86-64 vendors.)
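As a concrete (and hedged) illustration of those invalidation rules, here is kernel-style C for x86-64. The pte_t layout and the map_page()/remap_page() helpers are made up for this sketch; only the INVLPG instruction (wrapped in inline asm) is real, and a real SMP kernel would also need TLB shootdowns on other cores.

    #include <stdint.h>

    typedef uint64_t pte_t;
    #define PTE_PRESENT 0x1ULL

    static inline void invlpg(void *va)
    {
        __asm__ __volatile__("invlpg (%0)" :: "r"(va) : "memory");
    }

    /* not-present -> present: no negative caching is architecturally allowed
     * (on Intel/AMD/Via x86-64), so a later access just takes a TLB miss and
     * the page walk finds the new entry.  No invlpg needed. */
    void map_page(pte_t *pte, uint64_t phys, void *va)
    {
        *pte = phys | PTE_PRESENT;
        (void)va;                      /* nothing to invalidate */
    }

    /* present -> a different translation (or not-present): the old translation
     * may be cached in the TLB or in the walker's page-structure caches, so it
     * must be flushed on this CPU. */
    void remap_page(pte_t *pte, uint64_t new_phys, void *va)
    {
        *pte = new_phys | PTE_PRESENT;
        invlpg(va);
    }

The asymmetry is the whole point of the guarantee: the cheap direction (not-present to present) is exactly the one demand paging takes on every minor page fault.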


(Historical note: Andy Glew's answer to a duplicate of this question over on electronics.SE says that in P5 and earlier, hardware page-walk loads bypassed the internal L1 cache (it was usually write-through so this made pagewalk coherent with stores). IIRC, my Pentium MMX motherboard had L2 cache on the mobo, perhaps as a memory-side cache. Andy also confirms that P6 and later do load from the normal L1d cache.)

That other answer has some interesting links at the end, too, including the paper I linked earlier. It also seems to think the OS might update the TLB itself, rather than just the page table, on a page fault (i.e. when the HW pagewalk doesn't find an entry), and wonders whether HW page walking can be disabled on x86. (But actually the OS just modifies the page table in memory, and returning from #PF re-runs the faulting instruction so the HW pagewalk will succeed this time.) Perhaps the paper is thinking of ISAs like MIPS where software TLB management / miss-handling is possible.
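To spell out that #PF flow, here is a hedged C sketch of a demand-paging fault handler. read_cr2(), alloc_phys_page(), and pte_for() are hypothetical stand-ins for what a real kernel provides; note that there is no "write this TLB entry" step anywhere, because x86 has no instruction for that.

    #include <stdint.h>

    typedef uint64_t pte_t;
    #define PTE_PRESENT  0x1ULL
    #define PTE_WRITABLE 0x2ULL

    extern uint64_t read_cr2(void);           /* faulting virtual address (CR2) */
    extern uint64_t alloc_phys_page(void);    /* hand back a zeroed physical frame */
    extern pte_t   *pte_for(uint64_t vaddr);  /* find the PTE in the in-memory page tables */

    void page_fault_handler(uint64_t error_code)
    {
        uint64_t va  = read_cr2();
        pte_t   *pte = pte_for(va);

        (void)error_code;             /* a real handler decodes this to classify the fault */

        if (!(*pte & PTE_PRESENT)) {
            /* Demand paging: just update the page table in memory.  No invlpg
             * and no TLB write: when IRET re-runs the faulting instruction,
             * its TLB miss triggers a fresh HW page walk that finds this PTE. */
            *pte = alloc_phys_page() | PTE_PRESENT | PTE_WRITABLE;
            return;
        }

        /* Protection violations etc. would be handled here. */
    }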

I don't think it's actually possible to disable HW pagewalk on P5 (or any other x86). That would require a way for software to update TLB entries with a dedicated instruction (there isn't one), or with wrmsr or an MMIO store. Confusingly, Andy says (in a thread I quoted below) that software TLB handling was faster on P5. I think he meant it would have been faster if it had been possible. He was working at Imation (on MIPS) at the time, where SW page walk is an option (sometimes the only option), unlike on x86.

Or perhaps he meant using MSRs to set up TLB entries ahead of time in cases where you expect there not to already be one, avoiding some page walks. Apparently 386 / 486 had TLB-entry query / set access via special registers: https://retrocomputing.stackexchange.com/questions/21963/how-did-the-test-registers-work-on-the-i386-and-the-i486 But there's probably no P5 MSR equivalent for that 386/486 functionality.
AFAIK, there wasn't a way to have a TLB miss trap to a software handler (with paging disabled?) even on 386/486, so you couldn't fully avoid the HW page walker; you could only prime the TLB to avoid some TLB misses.


As Paul Clayton points out (on another question about TLB misses), the big advantage of hardware page-walks is that TLB misses don't necessarily stall the CPU. (Out-of-order execution proceeds normally, until the re-order buffer fills because the load/store can't retire. Retirement happens in-order, because the CPU can't officially commit anything that shouldn't have happened if a previous instruction faulted.)

BTW, it would probably be possible to build an x86 CPU that handles TLB misses by trapping to microcode instead of having a dedicated hardware state machine. This would be (much?) less performant, and maybe not worth triggering speculatively (since issuing uops from microcode means you can't be issuing instructions from the code that's running).

Microcoded TLB handling could in theory be non-terrible if you ran those uops in a separate hardware thread (interesting idea), SMT-style. It would need much less start/stop overhead than normal Hyperthreading has for switching from single-thread to both logical cores active (which has to wait for things to drain so it can partition the ROB, store queue, and so on), because it would start/stop extremely often compared to a usual logical core. But that may be possible if it's not really a fully separate thread, just some separate retirement state so cache misses in it don't block retirement of the main code, plus a couple of hidden internal registers for temporaries. The code it has to run is chosen by the CPU designers, so the extra HW thread doesn't need anywhere near the full architectural state of an x86 core. It rarely has to do any stores (maybe just for the accessed flags in PTEs?), so it wouldn't be bad to let those stores use the same store queue as the main thread. You'd just partition the front-end to mix in the TLB-management uops and let them execute out of order with respect to the main thread. If you could keep the number of uops per page walk small, it might not suck.

No CPUs that I'm aware of actually do "HW" page walks with microcode in a separate HW thread, but it is a theoretical possibility.


Software TLB handling: some RISCs are like this, not x86

In some RISC architectures (like MIPS), the OS kernel is responsible for handling TLB misses. TLB misses result in execution of the kernel's TLB miss interrupt handler. This means the OS is free to define its own page table format on such architectures. I guess marking a page as dirty after a write also requires a trap to an OS-provided routine, if the CPU doesn't know about the page-table format.
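For contrast with x86's hardware walker, here is a rough C-level sketch of what a MIPS-style software TLB-refill handler does. The cp0_* accessors and tlb_write_random() are hypothetical wrappers around the real CP0 registers (BadVAddr, EntryLo0/1) and the tlbwr instruction, and the flat linear page table is just one possible OS-chosen format; a real refill handler is a handful of assembly instructions at a fixed exception vector.

    typedef unsigned long vaddr_t;
    typedef unsigned long pte_t;

    extern pte_t *current_page_table;           /* OS-defined format: the hardware never reads it */
    extern vaddr_t cp0_read_badvaddr(void);     /* faulting virtual address */
    extern void cp0_write_entrylo(pte_t even, pte_t odd);
    extern void tlb_write_random(void);         /* tlbwr: drop the pair into a random TLB slot */

    void tlb_refill_handler(void)
    {
        vaddr_t va = cp0_read_badvaddr();
        unsigned long vpn2 = va >> 13;          /* MIPS TLB entries map an even/odd pair of 4 KiB pages */

        /* Because the walk is in software, the OS picks the table layout;
         * here, a flat table indexed by virtual page number. */
        pte_t even = current_page_table[vpn2 * 2];
        pte_t odd  = current_page_table[vpn2 * 2 + 1];

        cp0_write_entrylo(even, odd);
        tlb_write_random();                     /* returning from the exception re-runs the load/store */
    }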

This chapter from an operating systems textbook explains virtual memory, page tables, and TLBs. It describes the difference between software-managed TLBs (MIPS, SPARCv9) and hardware-managed TLBs (x86). The paper A Look at Several Memory Management Units, TLB-Refill Mechanisms, and Page Table Organizations shows some example code from what it says is the TLB miss handler in Ultrix, if you want a real example.


Other links

  • How does CPU make data request via TLBs and caches? - a duplicate of this question.
  • VIPT Cache: Connection between TLB & Cache? - the internals of a load port / load execution unit that accesses the dTLB in parallel with fetching tags/data from the indexed set.
  • What is PDE cache?
  • Measuring TLB miss handling cost in x86-64 - describes Westmere's perf counter for Page Walk Cycles (apparently new with 2nd-gen Nehalem = Westmere).
  • https://lwn.net/Articles/379748/ - Linux hugepage support/performance; talks some about PowerPC and x86, and about using oprofile to count page-walk cycles.
  • What Every Programmer Should Know About Memory?
  • Understanding TLB from CPUID results on Intel - my answer includes some background on TLBs, including why it wouldn't make sense to have a shared L3 TLB across cores. (Summary: because unlike data, page translations are thread-private. Also, more / better page-walk hardware and TLB prefetch do more to help reduce the average cost of an L1i/dTLB miss in more cases.)

Comments about TLB coherency from Andy Glew, one of the architects on Intel P6 (Pentium Pro / II / III), who later worked at AMD.

The main reason Intel started running the page table walks through the cache, rather than bypassing the cache, was performance. Prior to P6, page table walks were slow, not benefitting from cache, and were non-speculative. Slow enough that software TLB miss handling was a performance win¹. P6 sped TLB misses up by doing them speculatively, using the cache, and also by caching intermediate nodes like page directory entries.

By the way, AMD was reluctant to do TLB miss handling speculatively. I think because they were influenced by DEC VAX Alpha architects. One of the DEC Alpha architects told me rather emphatically that speculative handling of TLB misses, such as P6 was doing, was incorrect and would never work. When I arrived at AMD circa 2002 they still had something called a "TLB Fence" - not a fence instruction, but a point in the rop or microcode sequence where TLB misses either could or could not be allowed to happen - I am afraid that I do not remember exactly how it worked.

so I think that it is not so much that Bulldozer abandoned TLB and page table walking coherency, whatever that means, as that Bulldozer may have been the first AMD machine to do moderately aggressive TLB miss handling.

recall that when P6 was started P5 was not shipping: the existing x86es all did cache bypass page table walking in-order, non-speculatively, no asynchronous prefetches, but on write through caches. I.e. They WERE cache coherent, and the OS could rely on deterministic replacement of TLB entries. IIRC I wrote those architectural rules about speculative and non-deterministic cacheability, both for TLB entries and for data and instruction caches. You can't blame OSes like Windows and UNIX and Netware for not following page table and TLB management rules that did not exist at the time.

Footnote 1: This is the surprising claim I mentioned earlier, possibly referring to using MSRs to prime the TLB to hopefully avoid some page walks.


More from Andy Glew from the same thread, because these comments deserve to be in a full answer somewhere.

(2) one of my biggest regrets wrt P6 is that we did not provide Intra-instruction TLB consistency support. Some instructions access the same page more than once. It was possible for different uops in the same instruction to get different translations for the same address. If we had given microcode the ability to save a physical address translation, and then use that, things would have been better IMHO.

(2a) I was a RISC proponent when I joined P6, and my attitude was "let SW (microcode) do it".

(2a') one of the most embarrassing bugs was related to add-with-carry to memory. In early microcode. The load would go, the carry flag would be updated, and the store could fault - but the carry flag had already been updated, so the instruction could not be restarted. // it was a simple microcode fix, doing the store before the carry flag was written - but one extra uop was enough to make that instruction not fit in the "medium speed" ucode system.

(3) Anyway - the main "support" P6 and its descendants gave to handling TLB coherency issues was to rewalk the page tables at retirement before reporting a fault. This avoided confusing the OS by reporting a fault when the page tables said there should not be one.

(4) meta comment: I don't think that any architecture has properly defined rules for caching of invalid TLB entries. // AFAIK most processors do not cache invalid TLB entries - except possibly Itanium with its NAT (Not A Thing) pages. But there's a real need: speculative memory accesses may be to wild addresses, miss the TLB, do an expensive page table walk, slowing down other instructions and threads - and then doing it over and over again because the fact that "this is a bad address, no need to walk the page tables" is not remembered. // I suspect that DOS attacks could use this.

(4') worse, OSes may make implicit assumptions that invalid translations are never cached, and therefore not do a TLB invalidation or MP TLB shoot down when transitioning from invalid to valid. // Worse^2: imagine that you are caching interior nodes of the page table cache. Imagine that PD contains all invalid PDE; worse^3, that the PD contains valid PDEs that point to PTs that are all invalid. Are you still allowed to cache those PDEs? Exactly when does the OS need to invalidate an entry?

(4'') because MP TLB shoot downs using interprocessor interrupts were expensive, OS performance guys (like I used to be) are always making arguments like "we don't need to invalidate the TLB after changing a PTE from invalid to valid" or "from valid read-only to valid writable with a different address". Or "we don't need to invalidate the TLB after changing a PDE to point to a different PT whose PTEs are exactly the same as the original PT...". // Lots of great ingenious arguments. Unfortunately not always correct.

Some of my computer architect friends now espouse coherent TLBs: TLBs that snoop writes just like data caches. Mainly to allow us to build even more aggressive TLBs and page table caches, of both valid and invalid entries of leaf and interior nodes. And not to have to worry about OS guys' assumptions. // I am not there yet: too expensive for low end hardware. But might be worth doing at high end.

me: Holy crap, so that's where that extra ALU uop comes from in memory-destination ADC, even on Core2 and SnB-family? Never would have guessed, but had been puzzled by it.

Andy: often when you "do the RISC thing" extra instructions or micro instructions are required, in a careful order. Whereas if you have "CISCy" support, like special hardware support so that a single instruction is a transaction, either all done or all not done, shorter code sequences can be used.

Something similar applies to self modifying code: it was not so much that we wanted to make self modifying code run fast, as that the legacy mechanisms for self modifying code - draining the pipe for serializing instructions like CPUID - were slower than just snooping the Icache and pipeline. But, again, this applies to a high end machine: on a low end machine, the legacy mechanisms are fast enough and cheap.

Ditto memory ordering. High end snooping faster; low end draining cheaper.

It is hard to maintain this dichotomy.

It is pretty common that a particular implementation has to implement rules compatible with but stronger than the architectural statement. But not all implementations have to do it the same way.

This comment thread was on Andy's answer to a question about self-modifying code and seeing stale instructions; another case where real CPUs go above and beyond the requirements on paper, because it's actually easier to always snoop for stores near EIP/RIP than to re-sync only on branch instructions if you didn't keep track of what happened between branches.