Where is the Write-Combining Buffer located? x86

Solution 1:

Write buffers can have different purposes or different uses in different processors. This answer may not apply to processors not specifically mentioned. I'd like to emphasize that the term "write buffer" may mean different things in different contexts. This answer is about Intel and AMD processors only.

Write-Combining Buffers on Intel Processors

Each cache might be accompanied by zero or more line fill buffers (also called fill buffers). The collection of fill buffers at the L2 is called the super queue or superqueue (each entry in the super queue is a fill buffer). If the cache is shared between logical cores or physical cores, then the associated fill buffers are shared between the cores as well. Each fill buffer can hold a single cache line and additional information that describes the cache line (if it's occupied), including the address of the cache line, the memory type, and a set of validity bits, where the number of bits depends on the granularity of tracking the individual bytes of the cache line. In early processors (such as the Pentium II), only one of the fill buffers is capable of write-combining (and write-collapsing). The total number of line fill buffers, and of those capable of write-combining, has increased steadily with newer processors.
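To make the bookkeeping concrete, here is a minimal Python sketch of a single fill buffer entry as described above: one cache line of data, the line address, the memory type, and a set of validity bits whose count depends on the tracking granularity. This is a toy model only. The class name, field names and the `granularity` parameter are my own illustrative assumptions, not anything taken from Intel documentation.

```python
from dataclasses import dataclass, field
from typing import Optional

LINE_SIZE = 64  # bytes per cache line on current x86 processors


@dataclass
class FillBuffer:
    """Toy model of one fill buffer entry; all names are illustrative."""
    granularity: int = 1                 # bytes covered by one validity bit (assumed)
    line_addr: Optional[int] = None      # None while the entry is free
    mem_type: Optional[str] = None       # e.g. "WB", "WC", "UC"
    data: bytearray = field(default_factory=lambda: bytearray(LINE_SIZE))
    valid_mask: int = 0                  # one bit per tracked chunk of the line

    @property
    def num_valid_bits(self) -> int:
        # Coarser tracking granularity means fewer validity bits.
        return LINE_SIZE // self.granularity

    def allocate(self, addr: int, mem_type: str) -> None:
        """Claim the entry for the cache line that contains addr."""
        self.line_addr = addr & ~(LINE_SIZE - 1)
        self.mem_type = mem_type
        self.data = bytearray(LINE_SIZE)
        self.valid_mask = 0

    def fill(self, offset: int, chunk: bytes) -> None:
        """Record a piece of the line arriving at the given byte offset."""
        g = self.granularity
        end = offset + len(chunk)
        if offset < 0 or offset % g or len(chunk) % g or end > LINE_SIZE:
            raise ValueError("chunk must be aligned to the tracking granularity")
        self.data[offset:end] = chunk
        for unit in range(offset // g, end // g):
            self.valid_mask |= 1 << unit

    def is_complete(self) -> bool:
        """True once every tracked chunk of the line holds valid data."""
        return self.valid_mask == (1 << self.num_valid_bits) - 1
```

For example, with `granularity=8` the entry needs only 8 validity bits, and the line counts as complete after two 32-byte fills.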

Nehalem up to Broadwell include 10 fill buffers at each L1 data cache. Core and Core2 have 8 LFBs per physical core. According to this, there are 12 LFBs on Skylake. @BeeOnRope has observed that there are 20 LFBs on Cannon Lake. I could not find a clear statement in the manual that says LFBs are the same as WCBs on all of these microarchitectures. However, this article written by a person from Intel says:

Consult the Intel® 64 and IA-32 Architectures Optimization Reference Manual for the number of fill buffers in a particular processor; typically the number is 8 to 10. Note that sometimes these are also referred to as "Write Combining Buffers", since on some older processors only streaming stores were supported.

I think the term LFB was first introduced by Intel with the Intel Core microarchitecture, on which all of the 8 LFBs are WCBs as well. Basically, Intel sneakily renamed WCBs to LFBs at that time, and has not clarified this in its manuals since then.

That same quote also says that the term WCB was used on older processors because streaming loads were not supported on them. This could be interpreted as the LFBs are also used by streaming load requests (MOVNTDQA). However, Section 12.10.3 says that streaming loads fetch the target line into buffers called streaming load buffers, which are apparently physically different from the LFBs/WCBs.

A line fill buffer is used in the following cases:

(1) A fill buffer is allocated on a load miss (demand or prefetch) in the cache. If there was no fill buffer available, load requests keep piling up in the load buffers, which may eventually lead to stalling the issue stage. In case of a load request, the allocated fill buffer is used to temporarily hold requested lines from lower levels of the memory hierarchy until they can be written to the cache data array. But the requested part of the cache line can still be provided to the destination register even if the line has not yet been written to the cache data array. According to Patrick Fay (Intel):

If you search for 'fill buffer' in the PDF you can see that the Line fill buffer (LFB) is allocated after an L1D miss. The LFB holds the data as it comes in to satisfy the L1D miss but before all the data is ready to be written to the L1D cache.

(2) A fill buffer is allocated on a cacheable store to the L1 cache when the target line is not in a coherence state that allows modifications. My understanding is that for cacheable stores, only the RFO request is held in the LFB, but the data to be stored waits in the store buffer until the target line is fetched into the LFB entry allocated for it. This is supported by the following statement from Section 2.4.5.2 of the Intel optimization manual:

The L1 DCache can maintain up to 64 load micro-ops from allocation until retirement. It can maintain up to 36 store operations from allocation until the store value is committed to the cache, or written to the line fill buffers (LFB) in the case of non-temporal stores.

This suggests that cacheable stores are not committed to the LFB if the target line is not in the L1D. In other words, the store has to wait in the store buffer until either the target line is written into the LFB, and then the line is modified in the LFB, or the target line is written into the L1D, and then the line is modified in the L1D.

(3) A fill buffer is allocated on an uncacheable write-combining store in the L1 cache irrespective of whether the line is in the cache or its coherence state. WC stores to the same cache line can be combined and collapsed (multiple writes to the same location in the same line will make the last store in program order overwrite previous stores before they become globally observable) in a single LFB. Ordering is not maintained among the requests currently allocated in LFBs. So if there are two WCBs in use, there is no guarantee which will be evicted first, irrespective of the order of stores with respect to program order. That's why WC stores may become globally observable out of order even if all stores are retired and committed in order (although the WC protocol allows WC stores to be committed out of order). In addition, WCBs are not snooped, so their contents only become globally observable when they reach the memory controller. More information can be found in Section 11.3.1 of the Intel manual, Volume 3.

There are some AMD processors that use buffers that are separate from the fill buffers for non-temporal stores. There were also a number of WCB buffers in the P6 (the first to implement WCBs) and P4 dedicated to the WC memory type (they cannot be used for other memory types). On the early versions of P4, there are 4 such buffers. For the P4 versions that support hyperthreading, when hyperthreading is enabled and both logical cores are running, the WCBs are statically partitioned between the two logical cores. Modern Intel microarchitectures, however, competitively share all the LFBs, but I think they keep at least one available for each logical core to prevent starvation.

(4) The documentation of L1D_PEND_MISS.FB_FULL indicates that UC stores are allocated in the same LFBs (irrespective of whether the line is in the cache or its coherence state). Like cacheable stores, but unlike WC, UC stores are not combined in the LFBs.

(5) I've experimentally observed that requests from IN and OUT instructions are also allocated in LFBs. For more information, see: How do Intel CPUs that use the ring bus topology decode and handle port I/O operations.
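The combining and collapsing behaviour described for WC stores in case (3) can be illustrated with a small Python sketch. It replays a list of stores in program order and groups the written bytes per cache line, so that a later store to the same byte overwrites the earlier one before anything leaves the buffer. The function names are hypothetical, and the model assumes an unlimited number of buffers and says nothing about eviction order, which, as noted above, is not guaranteed on real hardware.

```python
LINE_SIZE = 64  # bytes per cache line


def combine_wc_stores(stores):
    """Combine WC stores per cache line.

    stores: list of (address, payload bytes) in program order.
    Returns {line_address: {byte_offset: value}}; each inner dict plays
    the role of one write-combining buffer with its valid bytes.
    """
    buffers = {}
    for addr, payload in stores:
        for i, value in enumerate(payload):
            byte_addr = addr + i
            line = byte_addr & ~(LINE_SIZE - 1)
            # Collapsing: a later store to the same byte replaces the
            # earlier value while the data is still in the buffer.
            buffers.setdefault(line, {})[byte_addr - line] = value
    return buffers


def is_full_line(buffer):
    """True if every byte of the line has been written."""
    return len(buffer) == LINE_SIZE
```

With the stores `[(0x1000, b"\x11\x22"), (0x1001, b"\x33"), (0x1040, b"\x44")]`, the first two are combined into one buffer for line 0x1000 and the byte at offset 1 collapses to 0x33, while the third store lands in a second buffer for line 0x1040.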

Additional information:

The fill buffers are managed by the cache controller, which is connected to other cache controllers at other levels (or the memory controller in case of the LLC). A fill buffer is not allocated when a request hits in the cache. So a store request that hits in the cache is performed directly in the cache and a load request that hits in the cache is directly serviced from the cache. A fill buffer is not allocated when a line is evicted from the cache. Evicted lines are written to their own buffers (called writeback buffers or eviction buffers). Here is a patent from Intel that discusses write combining for I/O writes.

I've run an experiment that is very similar to the one I've described here to determine whether a single LFB is allocated even if there are multiple loads to the same line. It turns out that that is indeed the case. The first load to a line that misses in the write-back L1D cache gets an LFB allocated for it. All later loads to the same cache line are blocked and a block code is written in their corresponding load buffer entries to indicate that they are waiting on the same request being held in that LFB. When the data arrives, the L1D cache sends a wake-up signal to the load buffer and all entries that are waiting on that line are woken up (unblocked) and scheduled to be issued to the L1D cache when at least one load port is available. Obviously the memory scheduler has to choose between the unblocked loads and the loads that have just been dispatched from the RS. If the line got evicted for whatever reason before all waiting loads get the chance to be serviced, then they will be blocked again and an LFB will be again allocated for that line. I've not tested the store case, but I think no matter what the operation is, a single LFB is allocated for a line. The request type in the LFB can be promoted from prefetch to demand load to speculative RFO to demand RFO when required. I also found out empirically that speculative requests that were issued from uops on a mispredicted path are not removed when flushing the pipeline. They might be demoted to prefetch requests. I'm not sure.
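The allocation behaviour observed in that experiment (one LFB per missing line, later loads to the same line blocked on it, all of them woken when the data arrives) can be sketched as a small Python model. It is only an illustration of the observed behaviour: the class name, the method names and the returned status strings are invented for this sketch, and the default pool size is a placeholder.

```python
LINE_SIZE = 64  # bytes per cache line


class L1MissTracker:
    """Toy model: at most one LFB per missing line, shared by all loads."""

    def __init__(self, num_lfbs=10):
        self.num_lfbs = num_lfbs
        self.lfbs = {}  # line address -> ids of loads waiting on that line

    def load_miss(self, load_id, addr):
        """Handle a load that missed in L1D and report what happened."""
        line = addr & ~(LINE_SIZE - 1)
        if line in self.lfbs:
            # A request for this line is already in flight: the load is
            # blocked and waits on the existing LFB.
            self.lfbs[line].append(load_id)
            return "blocked_on_existing_lfb"
        if len(self.lfbs) >= self.num_lfbs:
            # No free LFB: the load has to stay in the load buffer.
            return "no_lfb_available"
        self.lfbs[line] = [load_id]
        return "lfb_allocated"

    def data_arrived(self, addr):
        """Free the LFB and return every load to wake up (allocator included)."""
        line = addr & ~(LINE_SIZE - 1)
        return self.lfbs.pop(line, [])
```

With two LFBs, loads to 0x1000 and 0x1008 share one entry, a load to 0x2000 takes the second, and a load to 0x3000 finds no entry free until one of the lines arrives.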

Write-Combining Buffers on AMD Processors

As I mentioned earlier, according to an article, there are some AMD processors that use buffers that are separate from the fill buffers for non-temporal stores. I quote from the article:

On the older AMD processors (K8 and Family 10h), non-temporal stores used a set of four “write-combining registers” that were independent of the eight buffers used for L1 data cache misses.

The "on the older AMD processors" part got me curious. Did this change on newer AMD processors? It seems to me that this is still true on all newer AMD processors including the most recent Family 17h Processors (Zen). The WikiChip article on the Zen microarchitecture includes two figures that mention WC buffers: this and this. In the first figure, it's not clear how the WCBs are used. However, in the second one it's clear that the WCBs shown are indeed specifically used for NT writes (there is no connection between the WCBs and the L1 data cache). The source for the second figure seems to be these slides¹. I think that the first figure was made by WikiChip (which explains why the WCBs were placed in an indeterminate position). In fact, the WikiChip article does not say anything about the WCBs. But still, we can confirm that the WCBs shown are only used for NT writes by looking at Figure 7 from the Software Optimization Guide for AMD Family 17h Processors manual and the patent for the load and store queues for the Family 17h processors. The AMD optimization manual states that there are 4 WCBs per core in modern AMD processors. I think this applies to the K8 and all later processors. Unfortunately, nothing is said about the AMD buffers that play the role of Intel fill buffers.

1 Michael Clark, A New, High Performance x86 Core Design from AMD, 2016.

Solution 2:

In modern Intel CPUs, write-combining is done by the LFBs (line-fill-buffers), also used for other pending transfers from L1 <-> L2. Each core has 10 of these (since Nehalem). (Transfers between L2 and L3 use different buffers, called the "superqueue").

That's why Intel recommends avoiding too much other traffic when doing NT stores, to avoid early flushes of partially-filled LFBs caused by demand-loads allocating LFBs. https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers
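A rough Python sketch of why this matters: if NT-store buffers and demand-load misses compete for the same small pool of LFBs, load misses can push out a write-combining buffer before all 64 bytes have been written, turning one full-line write into partial writes. Everything about the replacement policy here is an assumption made for illustration (oldest entry is evicted when the pool is full, an evicted load entry is treated as already completed, and stores never cross a line boundary); real hardware policies are not documented at this level.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line


def simulate_lfb_pool(events, num_lfbs=10):
    """Count full-line and partial flushes of NT-store buffers.

    events: list of (kind, address, size), kind is "nt_store" or "load_miss".
    Returns (full_flushes, partial_flushes). Buffers still open at the end
    are not counted.
    """
    pool = OrderedDict()  # (kind, line) -> set of byte offsets written
    full_flushes = partial_flushes = 0
    for kind, addr, size in events:
        line = addr & ~(LINE_SIZE - 1)
        key = (kind, line)
        if key not in pool:
            if len(pool) == num_lfbs:
                # Assumed policy: the oldest entry gives up its LFB.
                (old_kind, _old_line), _ = pool.popitem(last=False)
                if old_kind == "nt_store":
                    partial_flushes += 1  # flushed before the line was full
            pool[key] = set()
        if kind == "nt_store":
            start = addr - line
            pool[key].update(range(start, start + size))
            if len(pool[key]) == LINE_SIZE:
                del pool[key]  # complete line leaves as one full write
                full_flushes += 1
    return full_flushes, partial_flushes
```

In this model, sixteen sequential 16-byte NT stores produce four full-line flushes and no partial ones, whereas interleaving load misses with the stores in a pool of two entries forces a partial flush.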

The "inside" of the LFBs have connections to L1d, the store buffer, and load ports.

The "outside" of the LFBs can talk to L2 or (probably with L2's help) go over the ring bus / mesh to memory controllers, or L3 for NT prefetch. Going off-core is probably not very different for L3 vs. memory; just a different type of message to send on the ring / mesh interconnect between cores; in Intel CPUs, the memory controllers are just another stop on the ring bus (in the "system agent"), like other cores with their slices of L3. @BeeOnRope suggests that L1 LFBs aren't really directly connected to the ring bus, and that requests that don't put data into L2 probably still go through the L2 superqueue buffers to the ring bus / mesh. This seems likely, so each core only needs one point of presence on the ring bus and arbitration for it between L2 and L1 happens inside the core.

NT store data enters an LFB directly from the store buffer, as well as probing L1d to see if it needs to evict that line first.

Normal store data enters an LFB when it's evicted from L1d, either to make room for a new line being allocated or in response to an RFO from another core that wants to read that line.

Normal loads (and stores) that miss in L1d need the cache to fetch that line, which also allocates an LFB to track the incoming line (and the request to L2). When data arrives, it's sent straight to a load buffer that's waiting for it, in parallel with placing it in L1d. (In CPU architecture terms, see "early restart" and "critical word first": the cache miss only blocks until the needed data arrives, the rest of the cache line arrives "in the background".) You (and the CPU architects at Intel) definitely don't want L2 hit latency to include placing the data in L1d and getting it back out again.

NT loads from WC memory (movntdqa) read directly from an LFB; the data never enters cache at all. LFBs already have a connection to load ports for early-restart of normal loads, so SSE4 was able to add movntdqa without a lot of extra cost in silicon, I think. It is special in that a miss will only fill an LFB directly from memory, bypassing L3/L2/L1, though. NT stores already need the LFBs to be able to talk to memory controllers.

Solution 3:

There are a number of buffers in the L1 cache.

This patent gives the following buffer types:

Snoop buffers (buffers that service M/E state snoops from other cores (read / RFO))
Writeback buffers (buffers that service M state evictions from L1)
Line fill buffers (buffers that service cacheable load/store L1 misses)
- Read buffers (service L1 read misses of cacheable temporal loads)
- Write buffers (service L1 write misses of cacheable temporal stores)
- Write combining line fill buffers (not sure, appears to be the same thing as a write combining dedicated buffer in this patent)
Dedicated buffers (buffers that service uncacheable loads/stores and are 'dedicated' for the purpose of fetching from memory and not L2 (but still pass the request through L2), and don't fill the cache line)
- Non write combining dedicated buffers (services UC loads/stores and WP stores)
- Write combining dedicated buffers (services USWC loads/stores)

The patent suggests these can all be functions of the same physical buffer, or they can be physically separate, with a set of buffers for each function. On Intel, the 12 LFBs on Skylake might be all there are, with the logical functions shared between them using a type or state field. In some embodiments, the line fill buffers can also handle USWC loads/stores. In some embodiments, dedicated buffers can handle cacheable non-temporal (NT) loads/stores that miss L1 (such that they do not 'fill' the L1d cache, like the name implies, taking advantage of the NT hint to prevent cache pollution).

'Write combining buffer' here implies USWC memory / non-temporality and inherent weak ordering and uncacheability, but the actual words 'write combining' do not imply any of these things, and could just be a concept on its own where regular write misses to the same store buffer are squashed and written into the same line fill buffer in program order. A patent suggests such functionality, so it is probable that regular temporal write buffers that aren't marked WC also have a combining functionality. Related: Are write-combining buffers used for normal writes to WB memory regions on Intel?

The x86-64 optimisation manual states (massive giveaway):

On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. Then the rest of line is read, and the bytes that have not been written are combined with the unmodified bytes in the returned line. Store ordering and visibility are also important issues for write combining. When a write to a write-combining buffer for a previously-unwritten cache line occurs, there will be a read-for-ownership (RFO). If a subsequent write happens to another write combining buffer, a separate RFO may be caused for that cache line. Subsequent writes to the first cache line and write-combining buffer will be delayed until the second RFO has been serviced to guarantee properly ordered visibility of the writes. If the memory type for the writes is write-combining, there will be no RFO since the line is not cached, and there is no such delay.

This is blatant evidence of the term 'write combining buffer' being used to describe regular write buffers that have purely the combining ability, where strong ordering is maintained. We also now know that it's not just non-temporal stores to any memory that allocate write combining buffers, but all writes (because non-temporal stores do not issue RFOs). The buffer is used to combine writes while an RFO is taking place so the stores can be completed and store buffer entries can be freed up (possibly multiple if they all write to the same cache line). The invalid bits indicate the bits to merge into the cache line when it arrives in E state. The LFB could be dumped to cache as soon as the line is present in cache, with all later writes to the line going directly to the cache line, or it could remain allocated to speed up further reads/writes until a deallocation condition occurs (e.g. it needs to be used for another purpose or an RFO arrives for the line, meaning it needs to be written back to the line).
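The merge step from the manual's description ("the bytes that have not been written are combined with the unmodified bytes in the returned line") can be written as a few lines of Python. This is only a sketch of the data movement; the function name and the dictionary representation of the buffered bytes are assumptions made for the example.

```python
LINE_SIZE = 64  # bytes per cache line


def merge_on_rfo(buffered_writes, fetched_line):
    """Merge stores buffered during an RFO with the line that came back.

    buffered_writes: {byte_offset: value} for bytes written while waiting.
    fetched_line: the 64 bytes returned by the RFO.
    Bytes that were written keep the stored value; all other bytes are
    taken from the fetched line.
    """
    if len(fetched_line) != LINE_SIZE:
        raise ValueError("fetched_line must be exactly one cache line")
    merged = bytearray(fetched_line)
    for offset, value in buffered_writes.items():
        merged[offset] = value
    return bytes(merged)
```

For instance, merging `{0: 0xAA, 63: 0xBB}` into a fetched line of `bytes(range(64))` changes only the first and last bytes and leaves offsets 1 to 62 as fetched.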

So it seems like nowadays, all buffers can be any type of logical buffer and all logical write buffers are write-combining buffers (unless UC), and the cache type determines the way the buffer is treated in terms of weak/strong ordering and whether RFOs are performed or whether it is written back to the cache. The cache type in the LFB comes either from the TLB (which acquires the cache type from the PMH, which analyses the PTE, PAT MSRs and MTRR MSRs and calculates the final cache type), or from the SAB (Store Address Buffer) after buffering the result of a speculative TLB lookup.

So now there are 6 types of buffers:

Write combining LFB (WB write miss / prefetch)
Read LFB (read miss / prefetch from anywhere other than UC and USWC)
Write combining dedicated buffer (WP write, WT write miss, USWC read/write, NT read/write to anywhere other than UC)
Dedicated buffer (UC read/write)
Snoop buffer
Eviction writeback buffer

These buffers are indexed by physical address and are scanned in parallel with the L1 cache and, if they contain valid data, can satisfy read/write hits faster and more efficiently until they are deallocated when a deallocation condition occurs. I think the '10 LFBs' value refers to the number of buffers available for the first 2 purposes. There is a separate FIFO queue for L1d writebacks.

Let's not forget the cache type order of precedence:

UC (Intel E bit)
USWC (PAT)
UC (MTRR)
UC (PAT)
USWC (MTRR) (if combined with WP or WT (PAT/MTRR): either logical AND or illegal: defaults to UC)
UC- (PAT)
WT WP (PAT/MTRR) (combining MTRRs in this rank results in the logical AND of the memory types; combining MTRR and PAT on this rank results in the logical AND (Intel) or is illegal and defaults to UC (AMD))
WB (PAT/MTRR)

MTRR here includes the default type where a range is not mapped by an MTRR. MTRR is the final type that results from the MTRRs having resolved any conflicts or defaults. Firstly, defaults are resolved to UC and rank the same as any UC MTRR, then any MTRRs that conflict are combined into a final MTRR. Then this MTRR is compared with the PAT and the E bit and the one with the highest precedence becomes the final memory type, although in some cases, they are an illegal combination that results in a different type being created. There is no UC- MTRR.

Description of cache types (temporal):

UC (Strong Uncacheable). Speculative reads and write combining are not allowed. Strongly ordered.
UC- (Weak Uncacheable) the same as UC except it is a lower precedence UC for the PAT
USWC (Uncacheable Speculative Write Combining) speculation and write combining are allowed. Reads and writes are not cached. Both reads and writes become weakly ordered with respect to other reads and writes.
WT (Write Through) reads are cacheable and behave like WB. WT writes that hit the L1 cache update both the L1 cache and external memory at the same time, whereas WT writes that miss the L1 cache only update external memory. Speculative reads and write combining are allowed. Strongly ordered.
WP (Write Protect) reads are cacheable and behave like WB. Writes are uncacheable and cause lines to be invalidated. Speculative reads are allowed. Strongly ordered.
WB (Write Back) everything is allowed. Strongly ordered.

Description of cache types (non-temporal):

NT UC no difference (UC overrides)
NT USWC no difference to USWC I think
NT WT I would think this behaves identically to NT WB. Seems so.
NT WP I'm not sure if WP overrides NT hint for writes only or reads as well. If it doesn't override reads, then reads presumably behave like NT WB, most likely.
NT WB In the patent at the top of the answer, NT reads can hit L1 cache and it uses a biased LRU policy that reduces pollution (which is something like forcing the set's tree PLRU to point to that way). Read misses act like USWC read misses: a write combining dedicated buffer is allocated, and it causes any aliasing lines in LLC or other cores or sockets to be written back to memory before reading the line from memory; reads are also weakly ordered. It is implementation specific as to what happens on modern Intel CPUs for NT WB reads -- the NT hint can be completely ignored and it behaves like WB (see full discussion). Write hits in L1 cache in some implementations can merge the write with the line in the L1 with a forced PLRU such that it is evicted next (as WB); alternatively, a write hit causes an eviction and then a write combining dedicated buffer is allocated as if there were a miss, which is written back as USWC (using WCiL(F)) on the deallocation condition. Write misses allocate a dedicated write combining buffer and it is written back to memory as USWC when deallocated, but if that miss results in an L2 hit, the write combining buffer is written to L2 immediately or on a deallocation condition, and this either causes an immediate eviction from L2 or forces the PLRU bits so it is the next eviction. Further reads/writes to the line continue to be satisfied by the buffer until it is deallocated. NT writes are weakly ordered. A write hit in L1/L2 that isn't in an M/E state may still result in a WiL to invalidate all other cores on the current and other sockets to get the E state; otherwise, it just invalidates the line, and when the USWC store is finally made, the LLC checks to see if any other cores on the current or a remote socket need to be invalidated.

If a full USWC store (opcode WCiLF) hits in the LLC cache, the Cbo sends IDI invalidates (for some reason invalidate IDI opcode (as part of egress request in the IPQ logical queue of the TOR) sent by Cbo is undocumented) to all cores with a copy and also always sends a QPI InvItoE regardless of whether there is a LLC miss or not, to the correct home agent based on SAD interleave rules. The store can only occur once all cores in the filter have responded to the invalidation and the home agent has also; after they have responded, the Cbo sends a WrPull_GO_I (which stands for Write Pull with globally observed notification and Invalidate Cache Line) of the data from L2 and sends the data to home. If a partial USWC store WCiL hits in the LLC cache, the same occurs, except if the line is now modified in the LLC slice (from a SnpInv it sent instead of an invalidate if the line was only present in one core -- I'm guessing it does do this and doesn't just send plain invalidates for WCiL like it does for WCiLF) or was modified in the LLC all along, the Cbo performs a WBMtoI/WbMtoIPtl to the home agent before performing a write enable bit writeback WcWrPtl for the USWC store. PATs operate on virtual addresses, so aliasing can occur, i.e. the same physical page can have multiple different cache policies. Presumably, WP write and UC read/write aliasing also has the same behaviour, but I'm not sure.

The core superqueue is an interface between L2 and L3. The SQ is also known as the 'off core requests buffer', and an offcore request is any request that has reached the SQ. Although, I believe entries are allocated for filling the L2 on an L1 writeback, which isn't really a 'request'. It therefore follows that OFFCORE_REQUESTS_BUFFER.SQ_FULL can happen when the L1D writeback pending FIFO requests buffer is full, suggesting that another entry in the SQ cannot be allocated if that buffer is full, suggesting that entries are allocated in the SQ and that buffer at the same time. As for an LFB, on an L2 hit, the data is provided directly to the LFB; otherwise, on a miss, it allocates an SQ entry and the data is provided to the LFB when the fetched data from both 32B IDI transactions is written into the SQ. A further L2 miss can hit the SQ and is squashed to the same entry (SQ_MISC.PROMOTION).

An RFO intent begins at the store buffer and if it hits the L1d cache in an M or E state, the write is performed and the RFO ends. If the line is in an I state, a LFB is allocated and the RFO propagates to L2, where it can be satisfied there if present in an M or E state (when a M line is written back to L2, it becomes an M state there with respect to L3). If it is an I state / not present, it is allocated in the SQ and an RFO or ItoM packet propagates to the corresponding LLC slice Cbo that handles the address range. The Cbo slice then invalidates other cores, using the snoop filter, which involves sending invalidate requests to cores (or snoop invalidates (SnpInv), if it is only present in one core -- which get the data as well, because the Cbo does not know whether this is modified or not). The Cbo waits until it receives acknowledgements of the invalidation from the cores (as well as the data if modified). The Cbo then indicates to the SQ of the requesting core that it now has exclusive access. It likely acknowledges this early because the Cbo may have to fetch from the memory controller, therefore it can acknowledge early that the data is not present in any other core. The SQ propagates this information to the L1d cache, which results in a globally observed bit being set in the LFB and the senior store can now retire from the SAB/SDB to free up its entry. When the data eventually arrives, it is propagated to the LFB, where it is merged into the invalid bits and then it is written to the cache upon a deallocation condition for that address or due to LFB resource constraints.
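The RFO flow in the paragraph above can be condensed into a short Python function that returns the sequence of steps for a given L1d and L2 line state. It is a summary of this answer's description, not a verified model of the hardware. The function name and the step strings are made up for the sketch, and only the M, E and I states discussed in that paragraph are handled.

```python
def rfo_path(l1_state, l2_state):
    """Return the steps an RFO takes, following the description above.

    l1_state, l2_state: "M", "E" or "I" (I also stands for not present).
    The S state is deliberately not modelled here.
    """
    for state in (l1_state, l2_state):
        if state not in ("M", "E", "I"):
            raise ValueError("only M, E and I are modelled in this sketch")
    if l1_state in ("M", "E"):
        # Hit in a writable state: the store completes in L1d.
        return ["write performed in L1d"]
    steps = ["allocate LFB", "propagate RFO to L2"]
    if l2_state in ("M", "E"):
        return steps + ["RFO satisfied by L2"]
    return steps + [
        "allocate SQ entry",
        "send RFO or ItoM to the LLC slice Cbo",
        "Cbo invalidates other cores via the snoop filter",
        "Cbo grants exclusive access to the requesting SQ",
        "globally observed bit set in LFB; senior store can retire",
        "data arrives and is merged into the LFB",
        "LFB written to the cache on deallocation",
    ]
```

Calling `rfo_path("I", "I")` lists the full off-core path, while `rfo_path("E", "I")` shows that a hit in a writable state never leaves L1d.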

If a WB line is present in L1 but in an S state, it may or may not allocate an LFB to merge stores before the line can be written to. If it is invalid / not present in L1, an LFB is allocated to merge stores. Then, if the line is present in L2 but is in an S state, a WiL packet is sent to the LLC slice (it only needs to invalidate other cores). It then informs the SQ of the requesting core that it now can transition it to an E state. This information is propagated to the L1d cache where the LFB can now be merged into the cache before a deallocation condition occurs for that address or due to LFB resource constraints.

ItoM is used instead of an RFO when it's assumed that the full line is going to be written to so it doesn't need a copy of the data already in the line, and it already has the data if it's in any other state (S, E, M). A theoretical StoI i.e. a WiL is the same thing as an RFO, same for E, all except for I, where ItoM and RFO differ in that the LLC doesn't need to send the data to the core for an ItoM. The name emphasises only the state changes. How it knows the whole line is going to be written to by stores, I don't know. Maybe the L1d cache can squash a bunch of sequential senior stores in the MOB all at once while it allocates an LFB, because the RFO is sent immediately upon allocation I thought (and then retires them all once the RFO arrives). I guess it has some further time for stores to arrive in the LFB (L2 lookup) before the opcode has to be generated. This also might be used by rep stos.

I'm assuming RFO IDI packets don't need to distinguish between demand lock RFO, prefetch RFO, demand regular RFO (non-prefetch), to correspond with the Xeon 5500 core events, but might for priority purposes (prioritise demand traffic over prefetch), otherwise only the core needs to know this information, this is either encoded in an RFO or there are separate undocumented opcodes. PrefRFO is sent by the core for prefetching into LLC.

L1i ostensibly lacking fill buffers implies the main benefit of the fill buffer is a location to store and combine stores and have store buffer entries free up more quickly. Since L1i does not perform any stores, this isn't necessary. I would have thought that it does have read LFBs still so that it can provide miss data while or before filling the cache, but subsequent reads are not sped up because I think the buffers are PIPT and their tags are scanned in parallel with the cache. Read LFBs would also squash reads to point to the LFB and prevent multiple lookups, as well as prevent the cache from blocking by tracking current misses in the LFBs MSHRs, so it's highly likely this functionality exists.