What is the stack engine in the Sandybridge microarchitecture?

Like Agner Fog's microarch doc explains, the stack engine handles the rsp+=8 / rsp-=8 part of push/pop / call/ret in the issue stage of the pipeline (before issuing uops into the Out-of-Order (OoO) part of the core).

So the OoO execution part of the core only has to handle the load/store part, with an address generated by the stack engine. It occasionally has to insert a uop to sync its offset from rsp when the 8bit displacement counter overflows, or when the OoO core needs the value of rsp directly (e.g. sub rsp, 8, or mov [rsp-8], eax after a call, ret, push or pop typically cause an extra uop to be inserted on Intel CPUs. AMD CPUs apparently don't need extra sync uops).

Note that Agner's instruction tables show that Pentium-M and later decode pop reg to a single uop which runs only on the load port. But Pentium II/III decodes pop eax to 2 uops; 1 ALU and 1 load, because there's no stack-engine to handle the ESP adjustment outside of the out-of-order core. Besides taking extra uops, a long chain of push/pop and call/ret creates a serial dependency on ESP so out-of-order execution has to chew through the ALU uops before a value is available for a mov ebp, esp, or an address for mov eax, [esp+16].

The P6 microarch family (PPro to Nehalem) stored the input values for a uop directly in the ROB. At issue/rename, "cold" register inputs are read from the architectural register file into the ROB (which can be a bottleneck, due to limited read ports. See register-read stalls). After executing a uop, the result is written into the ROB for other uops to read. The architectural register file is updated with values from the ROB when uops retire.

SnB-family microarchitectures (and P4) have a physical register file, so the ROB stores register numbers (i.e. a level of indirection) instead of the data directly. Re-Order Buffer is still an excellent name for that part of the CPU.

Note that SnB introduced AVX, with 256b vectors. Making every ROB entry big enough to store double-size vectors was presumably undesirable compared to only keeping them in a smaller FP register file.

SnB simplified the uop format to save power. This did lead to a sacrifice in uop micro-fusion capability, though: the decoders and uop-cache can still micro-fuse memory operands using 2-register (indexed) addressing modes, but they're "unlaminated" before issuing into the OOO core.

What is the stack engine in the Sandybridge microarchitecture?

Related

Recent Posts