What branch misprediction does the Branch Target Buffer detect?

I am currently looking at the various parts of the CPU pipeline which can detect branch mispredictions. I have found these are:

  1. Branch Target Buffer (BPU CLEAR)
  2. Branch Address Calculator (BA CLEAR)
  3. Jump Execution Unit (not sure of the signal name here??)

I know what 2 and 3 detect, but I do not understand what misprediction is detected within the BTB. The BAC detects where the BTB has erroneously predicted a branch for a non-branch instruction, where the BTB has failed to detect a branch, or the BTB has mispredicted the target address for a x86 RET instruction. The execution unit evaluates the branch and determines if it was correct.

What type of misprediction is detected at the Branch Target Buffer? What exactly is detected as a misprediction here?

The only clue I could find was this inside Vol 3 of the Intel Developer Manuals (the two BPU CLEAR event counters at the bottom):

enter image description here

BPU predicted a taken branch after incorrectly assuming that it was not taken.

This seems to imply the prediction is not done "synchronously", but rather "asynchronously", hence the "after incorrectly assuming"??

UPDATE:

Ross, this is the CPU branch circuitry, from the original Intel Patent (hows that for "reading"?):

enter image description here

I don't see "Branch Prediction Unit" anywhere? Would it be reasonable that somebody having read this paper would assume that "BPU" is a lazy way of grouping the BTB Circuit, BTB Cache, BAC and RSB together??

So my question still stands, which component raises the BPU CLEAR signal?


Solution 1:

This is a good question! I think the confusion that it's causing is due to Intel's strange naming schemes which often overload terms standard in academia. I will try to both answer your question and also clear up the confusion I see in the comments.

First of all. I agree that in standard computer science terminology a branch target buffer isn't synonymous with branch predictor. However in Intel terminology the Branch Target Buffer (BTB) [in capitals] is something specific and contains both a predictor and a Branch Target Buffer Cache (BTBC) which is just a table of branch instructions and their targets on a taken outcome. This BTBC is what most people understand as a branch target buffer [lower case]. So what is the Branch Address Calculator (BAC) and why do we need it if we have a BTB?

So, you understand that modern processors are split into pipelines with multiple stages. Whether this is a simple pipelined processor or an out of order supersclar processor, the first stages are typically fetch then decode. In the fetch stage all we have is the address of the current instruction contained in the program counter (PC). We use the PC to load bytes from memory and send them to the decode stage. In most cases we increment the PC in order to load the subsequent instruction(s) but in other cases we process a control flow instruction which can modify the contents of the PC completely.

The purpose of the BTB is to guess if the address in the PC points to a branch instruction, and if so, what should the next address in the PC be? That's fine, we can use a predictor for conditional branches and the BTBC for the next address. If the prediction was right, that's great! If the prediction was wrong, what then? If the BTB is the only unit we have then we would have to wait until the branch reaches the issue/execute stage of the pipeline. We would have to flush the pipeline and start again. But not every situation needs to be resolved so late. This is where the Branch Address Calculator (BAC) comes in.

The BTB is used in the fetch stage of the pipeline but the BAC resides in the decode stage. Once the instruction we fetched is decoded, we actually have a lot more information which can be useful. The first new piece of information we know is: "is the instruction I fetched actually a branch?" In the fetch stage we have no idea and the BTB can only guess, but in the decode stage we know it for sure. It is possible that the BTB predicts a branch when in fact the instruction is not a branch; in this case the BAC will halt the fetch unit, fix the BTB, and reinitiate fetching correctly.

What about branches like unconditional relative and call? These can be validated at the decode stage. The BAC will check the BTB, see if there are entries in the BTBC and set the predictor to always predict taken. For conditional branches, the BAC cannot confirm if they are taken/not-taken yet, but it can at least validate the predicted address and correct the BTB in the event of a bad address prediction. Sometimes the BTB won't identify/predict a branch at all. The BAC needs to correct this and give the BTB new information about this instruction. Since the BAC doesn't have a conditional predictor of its own, it uses a simple mechanism (backwards branches taken, forward branches not taken).

Somebody will need to confirm my understanding of these hardware counters, but I believe they mean the following:

  • BACLEAR.CLEAR is incremented when the BTB in fetch does a bad job and the BAC in decode can fix it.
  • BPU_CLEARS.EARLY is incremented when fetch decides (incorrectly) to load the next instruction before the BTB predicts that it should actually load from the taken path instead. This is because the BTB requires multiple cycles and fetch uses that time to speculatively load a consecutive block of instructions. This can be due to Intel using two BTBs, one quick and the other slower but more accurate. It takes more cycles to get a better prediction.

This explains why the penalty of a detecting a misprediction in the BTB is 2/3 cycles whereas the detecting a misprediction in the BAC is 8 cycles.

Solution 2:

The fact BPU_CLEARS.EARLY description shows that this event occurs when the BPU predicts, correcting an assumption, implies there are 3 stages in the front end. Assume, predict and calculate. My current guess is that an early clear is flushing the stages of the pipeline that are before the stage that a prediction from the L1 BTB is even known, when the prediction is 'taken' as opposed to not taken.

The BTB set contains 4 ways for a maximum of 4 branches per 16 bytes (where all the ways in the set get tagged with the same tag indicating that particular 16 byte aligned block). Each way has an offset indicating the 4 LSBs of the address therefore the byte position within 16 bytes. Each entry also has a speculative bit, valid bit, pLRU bits, a speculative local BHR, a real local BHR, and each way shares the set BPT (PHT) as a second level of prediction. This may be alloyed with a GHR / speculative GHR.

I think the BPU provides a 64B prediction block to the uop cache and instruction cache (it used to be 32B, and it was 16B on P6). For the legacy route it needs to provide a 64 (or 32/16) byte prediction block which is a set of 64 bit masks which mark predicted branch instructions, prediction directions and which byte is a branch target. This information will be furnished by the L1 BTB while the fetch for the 64 byte line is underway such that 16 byte aligned (IFU has always been 16B) blocks that are read out of it with no used bits at all will not be fetched by the instruction predecoder (unused is the default because on architectures where the prediction block is smaller than the line size, the BPU may only provide bitmasks for 16 or 32B of the line). The BPU therefore provides the predictions made mask, the used/unused mask (marking bytes after first taken branch in the first prediction block and before the branch target in the second prediction block as unused and the rest used), the prediction directions mask; and the ILD provides the branch instructions mask. The first used byte in the prediction block is implicitly a branch target or the start of the instruction flow after a resteer or switch from the uop cache (DSP) to the legacy pipeline (MITE). The used bytes within the prediction block make up a prediction window.

enter image description here

Here is a P6 pipeline. Using this as an example, an early clear is in the 3rd cycle (13), when a prediction is made (and the target and entry type is read, and therefore unconditional branch targets are now known as well as conditional and their predictions). The first predicted taken branch target in the set for the 16 byte block is used. At this point, the 2 pipe stages before it have already been filled with fetches and beginning of lookups from the next sequential 16 byte blocks, which means that they need to be flushed if there is any taken prediction (otherwise it doesn't need to be as the next sequential 16 byte block is already beginning to be looked up in the pipestage before it), leaving a 2 cycle gap or bubble. The cache lookup occurs at the same time as the BTB lookup, so both the BTB and cache parallel 2 pipestages will have to be flushed, whereas the 3rd stage doesn't need to be flushed from the cache or the BTB because the IP is on a confirmed path and is the IP being currently looked up for the next one. In fact, in this P6 design, there is only a one cycle bubble for this early clear, because the new IP can be sent to the first stage to decode a set again on the high edge of clock while those other stages are being flushed.

This pipelining is more beneficial than waiting for the prediction before beginning a lookup on the next IP. This would mean a lookup every other cycle. This would give a throughput of 16 bytes of predictions every 2 cycles, so 8B/c. In the P6 pipelined scenario, the throughput is 16 bytes per cycle on a correct assumption and 8B/c on an incorrect assumption. Obviously faster. If we assume 2/3s of assumptions are correct for 1 in 9 instructions being a taken branch for 4 instructions per block, this gives a throughput of 16B per ((1*0.666)+2*0.333)) =1.332 cycles instead of 16B per 2 cycles.

If this is true, every taken branch will cause an early clear. This is however not the case when I use the event on my KBL. Hopefully the event is actually wrong because it is supposed to not be supported on my KBL, but does show a random low number, so hopefully it is counting something else. This also does not appear to be supported by the following https://gist.github.com/mattgodbolt/4e2cbb1c9aa97e0c5478 https://github.com/mattgodbolt/agner/blob/master/tests/branch.py. Given the 900k instructions and 100k early clears, I do not see how you can have an odd number of early clears if you use my definition of early clears and then look at his code. If we assume that the window is 32B for that CPU, then if you use an alignment of 4 on each branch instruction in that macro you get a clear every 8 instructions, because 8 will fit into the 32B aligned window.

I am not sure why Haswell and Ivy Bridge have such values for early and late clears but these events (0xe8) disappear starting with SnB, which happens to coincide with when the BTB was unified into a single structure. It also looks like the late clears counter is now counting the early clears event because it is the same number as the early clears on the Arrandale CPU, and the early clears event is now counting nothing. I'm also not sure why Nehalem has a 2 cycle bubble for early clears as the design of the L1 Nehalem BTB doesn't seem to change much from the P6 BTB, both 512 entries with 4 ways per set. It is probably because it has been broken down into more stages due to the higher clock speeds and hence also the longer L1i cache latency.

The late clear (BPU_CLEARS.LATE) appears to happen at the ILD. In the diagram above, the cache lookup takes only 2 cycles. In more recent processors, it apparently takes 4 cycles. This allows another L2 BTB to be inserted and a lookup in it to take place. 'MRU bypass' and 'MRU conflicts' could just mean that there was a miss in the MRU BTB or it could also mean that the prediction in the L2 is different to the one in L1 in the event that it uses a different prediction algorithm and history file. On my KBL, which does not support either event, I always get 0 for ILD_STALL.MRU but not BPU_CLEARS.LATE. The 3 cycle bubble comes from the BPU at stage 5 (which is also an ILD stage) resteering the pipeline at the low edge of stage 1 and flushing stages 2, 3 and 4 (which falls in line with cited L1i latencies of 4 cycles, as the L1i lookup occurs across stages 1–4 for a hit+ITLB hit). As soon as the prediction is made, the BTBs update the entries' speculative local BHR bits with the prediction that was made.

A BACLEAR happens for instance when the IQ compares the predictions-made mask with the branch instruction mask produced by the predecoder, and then for certain instruction types like a relative jump, it will check the sign bit to perform a static branch prediction. I'd imagine the static prediction happens as soon as it enters the IQ from the predecoder, such that instructions that immediately go the decoder contain the static prediction. The branch now being predicted taken will result in a BACLEAR_FORCE_IQ when the branch instruction is decoded, because there won't be a target to verify when the BAC calculates the absolute address of the relative conditional branch instruction, which is needs to verify when it is predicted taken.

The BAC at the decoders also makes sure the relative branches and direct branches have the correct branch target prediction after calculating the absolute address from the instruction itself and comparing it with it, if not, a BACLEAR is issued. For relative jumps, static prediction in the BAC uses the sign bit of the jump displacement to statically predict taken / not taken if a prediction has not been made but also overrides all return predictions as taken if the BTB does not support return entry types (it doesn't on P6 and makes no prediction, instead the BAC uses the BPU's RSB mechanism and it is the first point in the pipeline that a return instruction is acknowledged) and overrides all register indirect branch predictions as taken on P6 (because there is no IBTB) as it uses the statistic that more branches are taken that not. The BAC calculates and inserts the absolute target from the relative target into the uop and inserts the IP delta into the uop and inserts the fall through IP (NLIP) into the BPU's BIT, which may be tagged to the uop, or more likely the BIT entries work on a corresponding circular queue which will stall if there aren't enough BIT entries, and the indirect target prediction or known target is inserted into the uop 64 bit immediate field. These fields in the uop are used by the allocator for allocation into the RS/ROB later on. The BAC also informs the BTB of any spurious predictions (non branch instructions) that need their entries deallocating from the BTB. At the decoders, branch instructions are detected early in the logic (when prefixes are decoded and the instruction is examined to see if it can be decoded by the decoder) and the BAC is accessed in parallel with the rest. The BAC inserting the known or otherwise predicted target into the uop is known as converting an auop into a duop. The prediction is encoded into the uop opcode.

The BAC likely instructs the BTB to speculatively update its BTB for the detected branch instruction's IP. If the target is now known and no prediction was made for it (meaning it wasn't in the cache) -- it is still speculative as although the branch target is known for certain, it still could be on a speculative path, so is marked with a speculative bit -- this will now immediately provide early steers especially for unconditional branches now entering the pipeline but also for conditional, with a blank history so will predict not taken next time, rather than having to wait until retire).

enter image description here

The IQ above contains a bitmask field for branch prediction directions (BTBP) and branch predictions made / no prediction made (BTBH) (to distinguish which of the 0s in the BTBP are not taken as opposed to no prediction made) for each of the 8 instruction bytes in an IQ line as well as the target of a branch instruction, meaning there can only be one branch per IQ line and it ends the line. This diagram does not show the branch instruction mask produced by the predecoder that shows what instructions actually are branches such that the IQ knows what not-made predictions it needs to make a prediction for (and what ones are not branch instructions at all).

The IQ is a contiguous block of instruction bytes and the ILD populates 8-bit bitmasks which identify the first opcode byte (OpM) and instruction end byte (EBM) of each macroinstruction as it wraps round bytes into the IQ. It probably also provides bits indicating whether it is a complex instruction or a simple instruction (as suggested by the 'predecode bits' on many AMD patents). The gaps between these markers are implicitly prefix bytes for the following instruction. I'm thinking the IQ is designed such that the uops it issues in the IDQ/ROB will rarely outrun the IQ such that the head pointer in the IQ starts overwriting macroinstructions still tagged in the IDQ waiting to be allocated, and when it does, there is a stall, so the IDQ tags refer back to the IQ, which the allocator accesses. I think the ROB uses this uop tag as well. The IQ on SnB if 16 bytes * 40 entries contains 40 macroops in the worst case, 320 in the average case, 640 in the best case. The number of uops these produce will be much greater, so it will rarely outrun, and when it does, I guess it stalls decode until more instructions retire. The tail pointer contains the recently allocated tag by the ILD, the head pointer contains the next macroinstruction instruction waiting to retire, and the read pointer is the current tag to be consumed by the decoders (which moves towards the tail pointer). Although, this becomes difficult now that some if not the majority of the uops in the path come from the uop cache since SnB. The IQ may be allowed to outrun the back end in the event that uops are not tagged with the IQ entries (and the fields in the IQ are instead inserted into uops directly), and this will be detected and the pipeline will just be resteered from the beginning.

When the allocator allocates a physical destination (Pdst) for a branch micro-op into the ROB, the allocator provides the Pdst entry number to the BPU. The BPU inserts this into the correct BIT entry assigned by the BAC (which is probably is at the head of a circular queue of active BIT entries that are yet to be allocated a Pdst). The allocator also extracts fields from the uop and allocates the data into the RS.

enter image description here

The RS contains a field that indicates whether an instruction is a MSROM uop or a regular uop, which the allocator populates. The allocator also inserts the confirmed absolute target or the predicted absolute target into the immediate data and as a source, renames the flags register (or just a flag bit) and in the case of an indirect branch, there is also the renamed register that contains the address as another source. The Pdst in the PRF scheme would be the ROB entry, which as a Pdst would be the retirement macro-RIP or micro-IP register. The JEU writes the target or fallthrough to that register (it may not need to if the prediction is correct).

When the reservation station dispatches a branch micro-op to a jump execution unit located in the integer execution unit, the reservation station informs the BTB of the Pdst entry for the corresponding branch micro-op. In response, the BTB accesses the corresponding entry for the branch instruction in the BIT and the fall through IP (NLIP) is read out, decremented by the IP delta in the RS, and decoded to point to the set that the branch entry will be updated/allocated.

The outcome from the renamed flag register source Pdst to determine whether the branch is taken / not taken is then compared with the prediction in the opcode in the scheduler, and additionally, if the branch is indirect, the predicted target in the BIT is compared with the the address in the source Pdst (that was calculated and became available in the RS before it was scheduled and dispatched) and it is now known whether a correct prediction was made or not and whether the target is correct or not.

The JEU propagates an exception code to the ROB and flushes the pipeline (JEClear -- which flushes the whole pipeline before the allocate stage, as well as stalls the allocator) and redirects the next IP logic at the start of the pipeline using the fallthrough (in BIT) / target IP appropriately (as well as microsequencer if it is a microbranch misprediction; the RIP directed to the start of the pipeline will be the same one throughout the MSROM procedure). Speculative entries are deallocated and true BHRs are copied into the speculative BHRs. In the event there is a BOB in the PRF scheme, the BOB takes snapshots of the RAT state for every branch instruction and when there is a misprediction. The JEU rolls back the RAT state to that snapshot and the allocator can proceed immediately (which is particularly useful for microbranch misprediction as it is closer to the allocator therefore the bubble will not be as well hidden by the pipeline), rather than stalling the allocator and having to wait until retire for the retirement RAT state to be known and using that to restore the RAT and then clear the ROB (ROClear, which unstalls the allocator). With a BOB, the allocator can start issuing new instructions while the stale uops continue to execute, and when the branch is ready to retire, the ROClear only clears the uops between the retired misprediction and the new uops. If it is an MSROM uop, because it may have completed, the start of the pipeline still needs to be redirected to the MSROM uop again, but this time it will start at the redirected microip (this is the case with inline instructions (and it may be able to replay it out of the IQ). If a misprediction happens in an MSROM exception then it doesn't need to resteer the pipeline, just redirects it directly, because it has taken over the IDQ issue until the end of the procedure -- the issue may have already ended for inline issues.

The ROClear in response to the branch exception vector in the ROB actually happens on the second retirement stage RET2 (which is really the 3rd of 3 stages of typical retirement pipeline) when the uops are retired. The macroinstruction only retires and exceptions only trigger and the macroinstruction RIP only updates (with new target or increase by IP delta in the ROB) when the EOM uop (end of macroinstruction) marker retires, even if a non EOM instruction writes to it, it is not written to the RRF immediately unlike other registers -- anyway, the branch uop is likely going to be the final uop in typical branch macroinstruction handled by the decoders. If this is a microbranch in an MSROM procedure, it will not update the BTB; it updates the uIP when it retires, and the RIP is not updated until the end of the procedure.

If a generic non-mispredict exception occurs (i.e. one that requires a handler) during a MSROM macroop execution, once it has been handled, the microip that caused the exception is restored by the handler to the uIP register (in the event that it is passed to the handler when it is called), as well as the current RIP of the macroinstruction which triggered the exception, and when the exception handling ends, instruction fetch is resumed at this RIP+uIP: the macroinstruction is refetched and reattempted in the MSROM, which starts at the uIP signalled to it. The RRF write (or retirement RAT update on the PRF scheme) for previous uops in a complex non-MSROM macroinstruction may occur on the cycle before the EOM uop retires, which means that a restart can happen at a certain uop within a regular complex macroop and not just a MSROM macroop, and in this case, the instruction flow is restarted at the BPU at the RIP, and the complex decoder is configured with valid / invalid bits on the PLA cuop outputs. The uIP for this regular complex instruction that is used to configure the complex decoder valid bits is a value between 0-3, which I think the ROB sets to 0 at each EOM and increments for each microop retired, so that the non-MSROM complex instructions can be addressed, and for MSROM complex instructions, the MSROM routine contains a uop that tells the ROB the uIP of that instruction. The architectural RIP register however, which is updated by the IP delta only when the EOM uop retires is still pointing to the current macroop because the EOM uop failed to retire), which only happens for exceptions but not hardware interrupts, which can't interrupt MSROM procedures or complex instruction mid retirement (software interrupts are similar and trigger at the EOM -- the trap MSROM handler performs a macrojump to the RIP of the software trap handler once it has finished).

The BTB read and tag comparison happens in RET1 while the branch unit writes back the results, and then in the next cycle, perhaps also during RET1 (or maybe this is done in RET2), the tags in the set are compared and then, if there is a hit, a new history BHR is calculated; if there is a miss, an entry needs to be allocated at that IP with that target. Only once the uop retires in order (in RET2) can the the result be placed into the real history and the branch prediction algorithm is utilised to update the pattern table where an update is required. If the branch requires allocation, the replacement policy is utilised to decide the means for allocating the branch. If there is a hit, the target will already be correct for all direct and relative branches, so it doesn't have to be compared, in the event of no IBTB. The speculative bit is now removed from the entry if present. Finally, in the next cycle, the branch is written in the BTB cache by the BTB write control logic. The first part of the BTB lookup may be able to go ahead throughout RET1 and then may stall the BTB write pipeline until RET2 when the stage waiting to write to the BTB's ROB entry retires, otherwise, the lookup could be decoupled and the first part completes and writes data to, for instance, the BIT, and at RET2 the corresponding entry to the one retiring is just written back to the BTB (which would mean decoding the set again, comparing tags again and then writing the entry, so maybe not)

If P6 had a uop cache, the pipeline would be something like:

  • 1H: select IP
  • 1L: BTB set decode + cache set decode (physical/virtual index) + ITLB lookup + uop cache set decode
  • 2H: cache read + BTB read + uop cache read
  • 2L: cache tag compare + BTB tag compare + uop cache tag compare; if uop cache hit, stall until uop cache can issue, then clock gate legacy decode pipeline
  • 3H: predict, if taken, flush 3H,2L,2H,1L
  • 3L if taken, begin a 1L with new IP to decode new set and continue with current 16 byte block for which the branch instruction resides to 4L

As for the uop cache, because it is past the stage of the BAC, there is never going to be a bogus branch or an incorrect prediction for an unconditional branch or an incorrect target for a non-indirect branch. The uop cache will used the used/unused mask from the BPU to emit uops for instructions that begin at those bytes, and will use the prediction direction mask to change the macrobranch uops to a predicted not taken / predicted taken macrobranch uop (T/NT predictions are inserted into the uop itself). If it is predicted taken then it stops emitting uops for that 64B aligned block (again used to be 32B, previously 16B) and waits for the next window right behind it in the pipeline. The uop cache is going to know what uops are branches and probably statically predicts not taken to all non-predictions, or might have a more advanced static prediction. Indirect target predictions from the IBTB are inserted into the uop immediate field and then it will wait for the next BPU prediction block if this branch is also predicted taken. I would imagine the uop cache creates BIT entries and updates predictions in the BTBs, to ensure that uop cache and MITE (legacy decode) uops update the history in correct sequential order.