What is a Partial Flag Stall?
I was just going over this answer by Peter Cordes and he says,
Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead. Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.
I don't feel like I understand yet what a "partial flag stall" is. How do I know one has occurred? What triggers the event other than sometimes when flags are read? What does it mean to merge flags? In what condition are "some of the flags written" but a partial-flag merge doesn't happen? What do I need to know about flag stalls to understand them?
Solution 1:
Generally speaking, a partial flag stall occurs when a flag-consuming instruction reads one or more flags that were not written by the most recent flag-setting instruction.
So an instruction like inc that sets only some flags (it doesn't set CF) doesn't inherently cause a partial stall, but it will cause a stall if a subsequent instruction reads the flag (CF) that was not set by inc (without any intervening instruction that sets the CF flag). This also implies that instructions that write all interesting flags are never involved in partial stalls: when they are the most recent flag-setting instruction at the point a flag-reading instruction is executed, they must have written the consumed flag.
So, in general, an algorithm for statically determining whether a partial flags stall will occur is to look at each instruction that uses the flags (generally the jcc family and cmovcc and a few specialized instructions like adc) and then walk backwards to find the first instruction that sets any flag and check if it sets all of the flags read by the consuming instruction. If not, a partial flags stall will occur.
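The walk described above can be sketched in Python. This is only an illustration of the rule as stated, not a model of any real CPU: the flag tables are a small hand-picked subset, the name find_partial_flag_stalls is made up for this example, and a forward scan that remembers the most recent flag writer stands in for the backwards walk (the two give the same answer for straight-line code). Shifts and rotates are deliberately left out.

```python
# Illustrative subset of x86 flag behaviour; not a complete ISA table.
ALL_FLAGS = frozenset({"CF", "PF", "AF", "ZF", "SF", "OF"})

# Flags written by each instruction.
WRITES = {
    "add": ALL_FLAGS,
    "sub": ALL_FLAGS,
    "cmp": ALL_FLAGS,
    "adc": ALL_FLAGS,
    "inc": ALL_FLAGS - {"CF"},  # inc/dec leave CF untouched
    "dec": ALL_FLAGS - {"CF"},
}

# Flags read by each flag-consuming instruction.
READS = {
    "jc": {"CF"},
    "jnz": {"ZF"},
    "ja": {"CF", "ZF"},
    "jbe": {"CF", "ZF"},
    "adc": {"CF"},
}


def find_partial_flag_stalls(mnemonics):
    """Return indices of readers that need a flag the latest flag writer did not set."""
    stalls = []
    last_written = None  # flags set by the most recent flag-setting instruction
    for index, mnemonic in enumerate(mnemonics):
        needed = READS.get(mnemonic, set())
        # Check reads first so that adc sees the flags from before its own write.
        if needed and last_written is not None and not needed <= last_written:
            stalls.append(index)
        if mnemonic in WRITES:
            last_written = WRITES[mnemonic]
    return stalls
```

Called on the mnemonic list add, inc, jc it reports the jc, while add, inc, jnz reports nothing, because jnz only needs ZF and inc wrote it.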
Later architectures, starting with Sandy Bridge, don't suffer a partial flags stall per se, but still suffer a penalty in the form of an additional uop added to the front-end by the instruction in some cases. The rules are slightly different and apply to a narrower set of cases compared to the stall discussed above. In particular, the so-called flag merging uop is added only when a flag-consuming instruction reads from multiple flags and those flags were last set by different instructions. This means, for example, that instructions that examine a single flag never cause a merging uop to be emitted.
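The merging rule differs from the stall rule in that it tracks the last writer of each flag separately. A minimal Python sketch of that rule, under the same caveats as any toy model: the tables cover only a few instructions, and find_merge_uops is a hypothetical name for this example, not anything from Intel's documentation.

```python
# Illustrative subset of x86 flag behaviour; not a complete ISA table.
ALL_FLAGS = frozenset({"CF", "PF", "AF", "ZF", "SF", "OF"})
WRITES = {
    "add": ALL_FLAGS,
    "inc": ALL_FLAGS - {"CF"},  # inc leaves CF untouched
}
READS = {
    "jc": {"CF"},
    "jnz": {"ZF"},
    "ja": {"CF", "ZF"},
}


def find_merge_uops(mnemonics):
    """Return indices of readers whose input flags were last set by different instructions."""
    merges = []
    last_writer = {}  # flag name -> index of the instruction that last wrote it
    for index, mnemonic in enumerate(mnemonics):
        sources = {
            last_writer[flag]
            for flag in READS.get(mnemonic, ())
            if flag in last_writer
        }
        if len(sources) > 1:  # flags come from more than one producer
            merges.append(index)
        for flag in WRITES.get(mnemonic, ()):
            last_writer[flag] = index
    return merges
```

A reader of a single flag can never have more than one source in this model, which matches the statement that single-flag consumers never need a merging uop.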
Starting from Skylake (and probably starting from Broadwell), I find no evidence of any merging uops. Instead, the uop format has been extended to take up to 3 inputs, meaning that the separately renamed carry flag and the renamed-together SPAZO group flags can both be used as inputs to most instructions. Exceptions include instructions like cmovbe, which has two register inputs and whose condition be requires the use of both the C flag and one or more of the SPAZO flags. Most conditional moves use only one or the other of C and SPAZO flags, however, and take one uop.
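To see which conditions fall into the "needs both C and SPAZO" category, the condition codes can be tabulated. The sketch below is an assumption-laden illustration: the table lists the architectural flags each condition tests, the split into a "C" group and a "SPAZO" group follows the renaming described above, and the function names are invented for this example.

```python
# Architectural flags tested by some x86 condition codes.
CONDITION_FLAGS = {
    "b": {"CF"}, "ae": {"CF"},
    "e": {"ZF"}, "ne": {"ZF"},
    "s": {"SF"}, "o": {"OF"}, "p": {"PF"},
    "l": {"SF", "OF"}, "ge": {"SF", "OF"},
    "le": {"ZF", "SF", "OF"}, "g": {"ZF", "SF", "OF"},
    "be": {"CF", "ZF"}, "a": {"CF", "ZF"},
}


def flag_groups(condition):
    """Return which separately renamed flag groups a condition reads."""
    flags = CONDITION_FLAGS[condition]
    groups = set()
    if "CF" in flags:
        groups.add("C")
    if flags - {"CF"}:
        groups.add("SPAZO")
    return groups


def needs_both_groups(condition):
    """True for conditions such as be and a that read C and SPAZO together."""
    return len(flag_groups(condition)) == 2
```

Only the unsigned comparisons be and a need both groups here, which is consistent with cmovbe being singled out as the exception.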
Examples
Here are some examples. We discuss both "[partial flag] stalls" and "merge uops", but as above only at most one of the two applies to any given architecture, so something like "The following causes a stall and a merge uop to be emitted" should be read as "The following causes a stall [on those older architectures which have partial flag stalls] or a merge uop [on those newer architectures which use merge uops instead]".
Stall and merging uop
The following example will cause a stall and merging uop to be emitted on Sandy Bridge and Ivy Bridge, but not on Skylake:
add rbx, 5 ; sets CF, ZF, others
inc rax ; sets ZF, but not CF
ja label ; reads CF and ZF
The ja instruction reads CF and ZF, which were last set by the add and inc instructions, respectively, so a merge uop is inserted to unify the separately set flags for consumption by ja. On architectures that stall, a stall occurs because ja reads from CF, which was not set by the most recent flag-setting instruction.
Stall only
add rbx, 5 ; sets CF, ZF, others
inc rax ; sets ZF, but not CF
jc label ; reads CF
This causes a stall because, as in the prior example, CF is read, which is not set by the last flag-setting instruction (here inc). In this case, the stall could be avoided by simply swapping the order of the inc and add, since they are independent; the jc would then read only from the most recent flag-setting operation. There is no merge uop needed because the flags read (only CF) all come from the same add instruction.
Note: This case is under debate (see the comments) - but I cannot test it because I don't find evidence of any merging ops at all on my Skylake.
No stall or merging uop
add rbx, 5 ; sets CF, ZF, others
inc rax ; sets ZF, but not CF
jnz label ; reads ZF
Here there is no stall or merging uop needed, even though the last instruction (inc) only sets some flags, because the consuming jnz only reads (a subset of) flags set by the inc and no others. So this common looping idiom (usually with dec instead of inc) doesn't inherently cause a problem.
Here's another example that doesn't cause any stall or merge uop:
inc rax ; sets ZF, but not CF
add rbx, 5 ; sets CF, ZF, others
ja label ; reads CF and ZF
Here the ja does read both CF and ZF, and an inc is present which doesn't set CF (i.e., a partial flag writing instruction), but there is no problem because the add comes after the inc and writes all the relevant flags.
Shifts
The shift instructions sar, shr and shl, in both their variable and fixed count forms, behave differently (generally worse) than described above, and this varies a fair amount across architectures. This is probably due to their weird and inconsistent flag handling [1]. For example, on many architectures there is something like a partial flags stall when reading any flag after a shift instruction with a count other than 1. Even on the most recent architectures variable shifts have a significant cost of 3 uops due to flag handling (but there is no more "stall").
I'm not going to include all the gory details here, but I'd recommend looking for the word shift in Agner's microarch doc if you want all the details.
Some rotate instructions also have interesting flag related behavior in some cases similar to shifts.
[1] For example, setting different subsets of flags depending on whether the shift count is 0, 1 or some other value.
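As a rough illustration of that count-dependent behaviour, here is a Python summary of the architectural flag effects of shl, shr and sar as I understand them from the Intel instruction reference. It is a sketch under assumptions: it covers only 32- and 64-bit operands (where the masked count is always smaller than the operand size), and the function name and return shape are made up for this example. It says nothing about the performance cost on any particular microarchitecture.

```python
def shift_flag_effects(count, operand_bits=64):
    """Summarise which flags a shl/shr/sar writes for a given count (32/64-bit operands only)."""
    if operand_bits not in (32, 64):
        raise ValueError("this sketch only covers 32- and 64-bit operands")
    # The hardware masks the count to 5 bits (6 bits for 64-bit operands).
    count &= 0x3F if operand_bits == 64 else 0x1F
    if count == 0:
        # A zero count leaves every flag untouched.
        return {"written": set(), "undefined": set()}
    written = {"CF", "SF", "ZF", "PF"}
    undefined = {"AF"}
    if count == 1:
        written.add("OF")  # OF is only defined for 1-bit shifts
    else:
        undefined.add("OF")
    return {"written": written, "undefined": undefined}
```

With a variable count the hardware cannot know at decode time which of these three cases applies, which is one plausible reason the flag handling is expensive.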
Solution 2:
A flag-modifying uop may update only part of the flags register. The RAT has one entry for the flags/eflags/rflags register, plus a mask showing which flags are changed by the uop that caused the entry's physical register to be assigned. If a series of instructions read and write the same flag, a separate physical register is assigned for each write, and each read uses the previous physical register. Each of those registers holds that flag, with all other flags clear. That is why the current physical register cannot be used when a read targets a flag that is not in the mask of the flags RAT entry: it would read a clear bit and not the real state of the flag that has been left behind. On old microarchitectures, a stall occurs until the state of the flags register is valid in the RRF. This means waiting for each earlier flag-setting uop to retire and insert the bits it set into the RRF flags register; at retirement each uop is examined to determine the architectural registers it uses and the flags it changes, and uops are in an easier format to interpret than x86 macro-ops.
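A toy Python model of the single flags RAT entry with a mask may make this easier to follow. It only mirrors the description in this answer and is not a claim about how any real rename stage is built; the class names, the "ok"/"stall" return values and the use of None to mean "flags are valid in the RRF" are all assumptions of the sketch.

```python
from dataclasses import dataclass


@dataclass
class FlagsRatEntry:
    phys_reg: int    # physical register holding the most recent flag write
    mask: frozenset  # flags that the write actually produced


class FlagsRenamer:
    def __init__(self):
        self.next_phys = 0
        self.entry = None  # None: flags are fully valid in the retirement register file

    def write(self, flags):
        """Rename a flag-writing uop: new physical register, mask of written flags."""
        self.entry = FlagsRatEntry(self.next_phys, frozenset(flags))
        self.next_phys += 1
        return self.entry.phys_reg

    def read(self, flags):
        """Return 'ok' if the current register holds every flag read, else 'stall'."""
        if self.entry is None or set(flags) <= self.entry.mask:
            return "ok"
        return "stall"  # old microarchitectures wait for retirement here

    def retire_all(self):
        """Model every in-flight flag write retiring into the RRF."""
        self.entry = None
```

In this model an add followed by an inc leaves a mask without CF, so a following read of CF returns "stall" until retire_all is called.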
On microarchitectures that use the PRF scheme (SnB onwards), a merging uop is required to keep a unified flags register when there is no dedicated RRF register; otherwise the retirement RAT would be pointing to a meaningless physical register holding only one of the flags. The merging uop occurs after every partial-flags modifying instruction like inc or dec. add modifies all 6 status flags and therefore does not require a merge uop. I think this probably implies that status, control and system flags are renamed separately on the PRF scheme, given that add does not require a merging uop. Apparently the CF flag is renamed differently to the SPAZO cluster.
Partial register stalls are similar. The RAT has 2 entries to represent rax: an entry for al/ax/eax/rax (distinguished by a size indicator in the entry) and one for ah (both are updated on a write to ax, eax or rax to point to the same register). It needs only 2 entries because there are only 2 mutually exclusive registers. If a read from eax occurs before a previous write to one of the smaller registers retires, then the allocator stalls (because the ROB entry cannot have 2 dependencies for the same operand) until the full register is present in the RRF, and then it renames both entries to the RRF register for rax.
In later microarchitectures that use the PRF scheme, this is now difficult because a single RRF for rax
is no longer kept. Therefore, a merging uop needs to be used, which also happens to be faster than the stall method of the previous microarchitectures.
Merging uop implementations
- One implementation of the merging uop could be that it is inserted before every write to a partial flag / register, and the merging uop reads from the full register / flags register before writing it all to a new physical register. The write is then allocated the same register, which results in the write naturally merging itself in. The following read can then read any part of the register / any flag. This basically sets up a dependency chain between every partial-flag writing instruction and a previous flag writing instruction (partial or full) and between every partial register write and a previous (full / partial) write to the register. In this instance, the RAT never has partial renames.
- It could be allocated immediately after the write to a partial register. The merge uop takes the previous physical register (which will always be a full rax/eax write, or in the case of flags, a full status flag update, like that which is done by add or the merge uop) and the new physical register and combines them into the new physical register. This would suggest that the allocator inserts it. If it were inserted by the decoder, the allocator could allocate that uop in a different cycle, when the previous RAT pointer is unknown.
- It could be allocated immediately before a read that occurs from a register that has a non-unified state in the RAT. This would imply that the RAT tracks rax/eax separately to ax, al and ah. In this case, the 2 physical registers that need to be merged are taken from the RAT.
The optimisation manual implies it is one of the latter 2 scenarios: 'The merging uop occurs after every partial register write' (i.e. a write to ax, al or ah, but not eax).
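Whichever of these schemes the hardware uses, the data operation performed by a flag merge is simple to state: take the bits the partial write produced and fill in the rest from the previous full value. A minimal Python sketch, assuming the standard EFLAGS bit positions for the status flags; the function name merge_flags and the way the inputs are passed are invented for this illustration and do not describe the real uop.

```python
# Status flag bit positions in EFLAGS.
CF = 1 << 0
PF = 1 << 2
AF = 1 << 4
ZF = 1 << 6
SF = 1 << 7
OF = 1 << 11
SPAZO = SF | PF | AF | ZF | OF  # the flags written by inc/dec


def merge_flags(previous_full, partial_value, partial_mask):
    """Combine a partial flag write with the previous full flags value."""
    kept = previous_full & ~partial_mask    # flags the partial write did not touch
    fresh = partial_value & partial_mask    # flags the partial write produced
    return kept | fresh


# Hypothetical values: add left CF set and ZF clear, then inc produced ZF set.
after_add = CF
after_inc = ZF
merged = merge_flags(after_add, after_inc, SPAZO)
```

Here merged has both CF (carried over from the add) and ZF (from the inc) set, so a reader such as ja can take all of its inputs from one register.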