What is the stack engine in the Sandybridge microarchitecture?
-
As Agner Fog's microarch doc explains, the stack engine handles the `rsp+=8` / `rsp-=8` part of push/pop / call/ret in the issue stage of the pipeline (before issuing uops into the out-of-order (OoO) part of the core).
So the OoO execution part of the core only has to handle the load/store part, with an address generated by the stack engine. It occasionally has to insert a uop to sync its offset from `rsp`, either when the 8-bit displacement counter overflows or when the OoO core needs the value of `rsp` directly. For example, `sub rsp, 8` or `mov [rsp-8], eax` after a `call`, `ret`, `push` or `pop` typically causes an extra uop to be inserted on Intel CPUs. AMD CPUs apparently don't need extra sync uops.
Note that Agner's instruction tables show that Pentium-M and later decode `pop reg` to a single uop which runs only on the load port. But Pentium II/III decodes `pop eax` to 2 uops (1 ALU and 1 load), because there's no stack engine to handle the ESP adjustment outside of the out-of-order core. Besides taking extra uops, a long chain of push/pop and call/ret creates a serial dependency on ESP, so out-of-order execution has to chew through the ALU uops before a value is available for a `mov ebp, esp`, or an address for `mov eax, [esp+16]`.
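To make the sync-uop rule concrete, here is a toy Python model of the bookkeeping described above. It is only a sketch, not a description of the real hardware: the names `ToyStackEngine` and `count_sync_uops` are made up for illustration, the counter is assumed to be a signed 8-bit value (-128..127), and the model counts only the extra sync uops, not the load/store uops themselves.

```python
from dataclasses import dataclass

# Implicit rsp adjustment absorbed by the (modelled) stack engine.
STACK_DELTA = {"push": -8, "call": -8, "pop": 8, "ret": 8}


@dataclass
class ToyStackEngine:
    delta: int = 0      # rsp offset tracked at issue, not yet visible to the OoO core
    sync_uops: int = 0  # extra uops inserted to fold delta back into rsp

    def _sync(self) -> None:
        # A sync uop is only needed if there is a pending offset.
        if self.delta != 0:
            self.sync_uops += 1
            self.delta = 0

    def issue(self, mnemonic: str, uses_rsp_explicitly: bool = False) -> None:
        step = STACK_DELTA.get(mnemonic)
        if step is None:
            # Not a stack instruction: sync only if the OoO core needs rsp itself.
            if uses_rsp_explicitly:
                self._sync()
            return
        # Assumption: the offset counter is signed 8-bit; sync before it overflows.
        if not -128 <= self.delta + step <= 127:
            self._sync()
        self.delta += step


def count_sync_uops(program):
    """program is a list of (mnemonic, uses_rsp_explicitly) pairs."""
    engine = ToyStackEngine()
    for mnemonic, uses_rsp in program:
        engine.issue(mnemonic, uses_rsp)
    return engine.sync_uops


if __name__ == "__main__":
    # call; push; mov [rsp-8], eax; mov eax, [rsp+16]; pop; ret
    prog = [("call", False), ("push", False), ("mov", True),
            ("mov", True), ("pop", False), ("ret", False)]
    print(count_sync_uops(prog))
```

In this model only the first explicit use of `rsp` after a run of stack instructions costs a sync uop; the second `mov` finds the offset already folded back in.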
-
The P6 microarch family (PPro to Nehalem) stored the input values for a uop directly in the ROB. At issue/rename, "cold" register inputs are read from the architectural register file into the ROB (which can be a bottleneck due to limited read ports; see register-read stalls). After executing a uop, the result is written into the ROB for other uops to read. The architectural register file is updated with values from the ROB when uops retire.
SnB-family microarchitectures (and P4) have a physical register file, so the ROB stores register numbers (i.e. a level of indirection) instead of the data directly. Re-Order Buffer is still an excellent name for that part of the CPU.
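The level of indirection can be sketched with a toy renamer in Python. Everything here is hypothetical and heavily simplified (`ToyRenamer`, its field names, the 8-entry register file, and the single `add` operation are illustration only); the point is just that the ROB entries hold physical register numbers while the values live only in the register file.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRenamer:
    num_phys: int = 8                         # assumed size, for illustration only
    rat: dict = field(default_factory=dict)   # architectural name -> physical index
    prf: list = field(default_factory=list)   # physical register file: holds the values
    rob: list = field(default_factory=list)   # in-order entries: register numbers only
    free: list = field(default_factory=list)  # free physical registers

    def __post_init__(self):
        self.prf = [0] * self.num_phys
        self.free = list(range(self.num_phys))

    def define(self, arch_reg, value):
        """Give an architectural register an initial value."""
        phys = self.free.pop(0)
        self.prf[phys] = value
        self.rat[arch_reg] = phys

    def issue_add(self, dst, src1, src2):
        """Rename dst = src1 + src2 and allocate a ROB entry for it."""
        srcs = (self.rat[src1], self.rat[src2])
        old = self.rat.get(dst)               # previous mapping, freed at retirement
        new = self.free.pop(0)
        self.rat[dst] = new
        entry = {"dst": new, "srcs": srcs, "old": old, "done": False}
        self.rob.append(entry)
        return entry

    def execute(self, entry):
        """Read inputs from the PRF and write the result back to the PRF."""
        a, b = entry["srcs"]
        self.prf[entry["dst"]] = self.prf[a] + self.prf[b]
        entry["done"] = True

    def retire(self):
        """Retire the oldest uop; no data is copied, only a register is freed."""
        entry = self.rob[0]
        if not entry["done"]:
            raise RuntimeError("oldest uop has not executed yet")
        self.rob.pop(0)
        if entry["old"] is not None:
            self.free.append(entry["old"])
```

Note that `retire` moves no values at all, whereas in a P6-style design retirement copies the result from the ROB into the architectural register file.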
Note that SnB introduced AVX, with 256-bit vectors. Making every ROB entry big enough to store double-size vectors was presumably undesirable compared to only keeping them in a smaller FP register file.
SnB simplified the uop format to save power. This did sacrifice some uop micro-fusion capability, though: the decoders and uop cache can still micro-fuse memory operands that use 2-register (indexed) addressing modes, but those uops are "unlaminated" before issuing into the OoO core.
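As a rough illustration of that last point, the sketch below counts how many slots one load+ALU instruction such as `add eax, [mem]` would occupy in the decoders/uop cache versus at issue. It is a deliberately simplified SnB-style model with made-up names (`MemOperand`, `uop_slots`), and it ignores any special cases on other microarchitectures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemOperand:
    base: str
    index: Optional[str] = None  # set for 2-register (indexed) addressing modes


def uop_slots(mem: MemOperand):
    """Return (decode/uop-cache slots, issue slots) for one load+ALU instruction."""
    fused_slots = 1                      # load and ALU op micro-fused into one uop
    issue_slots = 2 if mem.index else 1  # indexed modes are unlaminated at issue
    return fused_slots, issue_slots


if __name__ == "__main__":
    print(uop_slots(MemOperand("rdi")))         # add eax, [rdi]
    print(uop_slots(MemOperand("rdi", "rcx")))  # add eax, [rdi+rcx]
```

So an indexed addressing mode still saves space in the uop cache in this model, but it costs an extra slot of issue bandwidth.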