What is an assembly-level representation of pushl/popl %esp?

I'm trying to understand the behavior of pushing and popping the stack pointer register. In AT&T:

pushl %esp

and

popl %esp

Note that they store the computed value back into %esp.

I'm considering these instructions independently, not in sequence. I know that the value stored in %esp is always the value before the increment/decrement, but how could I represent the behavior in assembly language? This is what I've come up with so far.

For pushl %esp (ignoring FLAGS and the effect on the temporary register):

movl %esp, %edx     # 1. save value of %esp
subl  $4, %esp      # 2. decrement stack pointer
movl %edx, (%esp)   # 3. store old value of %esp on top of stack

For popl %esp:

movl (%esp), %esp   # You wouldn't need the increment portion.

Is this correct? If not, where am I going wrong?


As the Intel® 64 and IA-32 Architectures Software Developer's Manual: Combined Volumes says about push esp (actually in vol. 2, or the HTML scrape at https://www.felixcloutier.com/x86/push):

The PUSH ESP instruction pushes the value of the ESP register as it existed before the instruction was executed. If a PUSH instruction uses a memory operand in which the ESP register is used for computing the operand address, the address of the operand is computed before the ESP register is decremented.

And regarding pop esp (https://www.felixcloutier.com/x86/pop):

The POP ESP instruction increments the stack pointer (ESP) before data at the old top of stack is written into the destination.

and for a memory destination that uses ESP, such as pop 16(%esp):

If the ESP register is used as a base register for addressing a destination operand in memory, the POP instruction computes the effective address of the operand after it increments the ESP register.

So yes, your pseudo-code is correct except for modifying FLAGS and %edx.


Yes, those sequences are correct except for the effect on FLAGS, and of course push %esp doesn't clobber %edx. If you want to break it down into separate steps, imagine an internal temporary (see Footnote 1) rather than thinking of push as a primitive operation that snapshots its input (source operand) before doing anything else.

(Similarly pop DST can be modeled as pop %temp / mov %temp, DST, with all effects of pop finished before it evaluates and writes to the destination, even if that is or involves the stack pointer.)

push equivalents that work even in the ESP special cases

(In all of these, I'm assuming 32-bit compat or protected mode with SS configured normally, with stack address size matching the mode, if it's even possible for that not to be the case. The 64-bit mode equivalent with %rsp works the same way with -8 / +8. 16-bit mode doesn't allow (%sp) addressing modes so you'd have to consider this as pseudo-code.)

#push SRC         for any source operand including %esp or 1234(%esp)
   mov  SRC, %temp
   lea  -4(%esp), %esp         # esp-=4 without touching FLAGS
   mov  %temp, (%esp)

i.e. mov SRC, %temp ; push %temp
Or since we're describing an uninterruptible transaction anyway (a single push instruction),
we don't need to move ESP before storing:

#push %REG              # or immediate, but not memory source
   mov  %REG, -4(%esp)
   lea  -4(%esp), %esp

(This simpler version wouldn't assemble for real with a memory source, only a register or immediate, and it would be unsafe if an interrupt or signal handler ran between the mov and the LEA. In real assembly, mov mem, mem with two explicit addressing modes isn't encodeable, but push (%eax) is, because the memory destination is implicit. You could consider it as pseudo-code even for a memory source. But snapshotting in a temporary is a more realistic model of what happens internally, like the first block or mov SRC, %temp / push %temp.)

If you're talking about actually using such a sequence in a real program, I don't think there's a way to exactly duplicate push %esp without a temporary register (first version), or (second version) disabling interrupts or having an ABI with a red-zone. (Like x86-64 System V for non-kernel code, so you could duplicate push %rsp.)
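
The "snapshot the source first" model of push can also be checked with a toy simulation. This is only a sketch in Python, not an emulator: the register dict, the memory dict, and the push helper are names I made up for illustration, and the addresses are arbitrary.

```python
# Toy model: registers in a dict, memory as a dict of dword address -> value.
# All names here (push, regs, mem) are illustrative, not a real emulator API.

def push(regs, mem, read_src):
    """Model of push SRC: snapshot the source, then decrement ESP, then store."""
    temp = read_src(regs, mem)                      # mov SRC, %temp (sees the old ESP)
    regs["esp"] = (regs["esp"] - 4) & 0xFFFFFFFF    # lea -4(%esp), %esp
    mem[regs["esp"]] = temp                         # mov %temp, (%esp)

# push %esp: the stored value is ESP from before the instruction.
regs, mem = {"esp": 0x1000}, {}
push(regs, mem, lambda r, m: r["esp"])
push_esp_result = (regs["esp"], mem[0x0FFC])        # (0x0FFC, 0x1000)

# push 8(%esp): the source address is also computed from the old ESP.
regs, mem = {"esp": 0x1000}, {0x1008: 0xAAAA, 0x1004: 0xBBBB}
push(regs, mem, lambda r, m: m[r["esp"] + 8])
push_mem_result = (regs["esp"], mem[0x0FFC])        # (0x0FFC, 0xAAAA)
```

If the source address were computed after the decrement, the second case would read 0x1004 and push 0xBBBB instead.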

pop equivalents:

#pop DST   works for any operand
  mov  (%esp), %temp
  lea  4(%esp), %esp      # esp += 4 without touching FLAGS
  mov  %temp, DST         # even if DST is %esp or 1234(%esp)

i.e. pop %temp / mov %temp, DST. That accurately reflects the case where DST is a memory addressing mode that involves ESP: the value of ESP after the increment is used. I verified Intel's docs for this with push $5 ; pop -8(%esp). That copied the dword 5 to the dword right below the one written by push when I single-stepped it in GDB on a Skylake CPU. If -8(%esp) address calculation had happened using ESP before that instruction executed, there would have been a 4-byte gap.
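
The push $5 ; pop -8(%esp) experiment can be replayed on paper with a small Python model. This is a sketch under my own assumptions: the helper names and the starting ESP of 0x1000 are invented, and the address_after_increment flag exists only to compare the documented behavior with the hypothetical alternative.

```python
# Toy model of push $imm and pop disp(%esp); names are illustrative only.

def push_imm(regs, mem, value):
    regs["esp"] -= 4
    mem[regs["esp"]] = value

def pop_to_mem(regs, mem, disp, address_after_increment=True):
    """Model of pop disp(%esp) as pop %temp / mov %temp, disp(%esp)."""
    old_esp = regs["esp"]
    temp = mem[old_esp]                   # load from the old top of stack
    regs["esp"] = old_esp + 4             # ESP += 4
    base = regs["esp"] if address_after_increment else old_esp
    mem[base + disp] = temp               # store to the destination

def run(address_after_increment):
    regs, mem = {"esp": 0x1000}, {}
    push_imm(regs, mem, 5)                                # push $5
    pop_to_mem(regs, mem, -8, address_after_increment)    # pop -8(%esp)
    return regs["esp"], sorted(mem)       # final ESP, addresses that were written

documented = run(True)      # (0x1000, [0x0FF8, 0x0FFC]): adjacent dwords
hypothetical = run(False)   # (0x1000, [0x0FF4, 0x0FFC]): 4-byte gap at 0x0FF8
```

The documented ordering leaves the two copies of 5 in adjacent dwords, which matches what GDB showed; the other ordering would leave the gap.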

In the special case of pop %esp, yes, the write to the destination steps on the increment, simplifying to:

#pop %esp  # 3 uops on Skylake, 1 byte
   mov  (%esp), %esp             # 1 uop on Skylake.  3 bytes of machine-code size

Intel manuals have misleading pseudocode

Intel's pseudocode in the Operation sections of their instruction-set manual entries (SDM vol.2) does not accurately reflect the stack-pointer special cases. Only the extra paragraphs in the Description sections (quoted in @nrz's answer) get that right.

https://www.felixcloutier.com/x86/pop shows (for StackAddrSize = 32 and OperandSize = 32) a load into DEST and then incrementing ESP:

     DEST ← SS:ESP; (* Copy a doubleword *)
     ESP ← ESP + 4;

But that's misleading for pop %esp because it implies that ESP += 4 happens after ESP = load(SS:ESP). Correct pseudo-code would use

 if ... operand size etc.
     TEMP ← SS:ESP; (* Copy a doubleword *)
     ESP ← ESP + 4;

 ..
 // after all the if / else size blocks:
 DEST ← TEMP 
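
To see the difference for pop %esp concretely, here is a toy Python rendering of both versions of the pseudocode. The function names and the example addresses are mine, chosen only for illustration.

```python
# Toy model of pop %esp; mem maps a dword address to its value.

def pop_esp_literal(esp, mem):
    """Literal reading of the Operation section with DEST = ESP."""
    esp = mem[esp]      # DEST <- SS:ESP
    esp = esp + 4       # ESP <- ESP + 4
    return esp

def pop_esp_with_temp(esp, mem):
    """Corrected pseudocode: the destination write comes last."""
    temp = mem[esp]     # TEMP <- SS:ESP
    esp = esp + 4       # ESP <- ESP + 4
    esp = temp          # DEST <- TEMP, overwriting the increment
    return esp

mem = {0x1000: 0x2000}                        # the dword at the top of the stack
literal = pop_esp_literal(0x1000, mem)        # 0x2004
with_temp = pop_esp_with_temp(0x1000, mem)    # 0x2000, same as mov (%esp), %esp
```

Only the version with the temporary ends with ESP equal to the loaded value, which is what the Description section documents.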

Intel gets this right for other instructions like pshufb where the pseudo-code starts out with TEMP ← DEST to snapshot the original state of the read-write destination operand.

Similarly, https://www.felixcloutier.com/x86/push#operation shows RSP being decremented first, not showing the src operand being snapshotted before that. Only the extra paragraphs in the text Description section correctly handle that special case.
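
The same kind of toy Python comparison works for push %esp. It assumes the Operation pseudocode is read literally as "decrement the stack pointer, then store SRC"; the function names and the starting ESP are invented for the example.

```python
# Toy model of push %esp; function names are illustrative, not from any manual.

def push_esp_literal(esp, mem):
    """Read the Operation pseudocode literally: decrement, then store SRC."""
    esp -= 4            # ESP <- ESP - 4
    mem[esp] = esp      # store SRC, but SRC (ESP) is read after the decrement
    return esp

def push_esp_snapshot(esp, mem):
    """Behavior the Description section documents: SRC is read first."""
    temp = esp          # snapshot the source operand
    esp -= 4
    mem[esp] = temp
    return esp

mem_literal, mem_snapshot = {}, {}
esp_literal = push_esp_literal(0x1000, mem_literal)
esp_snapshot = push_esp_snapshot(0x1000, mem_snapshot)
stored_literal = mem_literal[esp_literal]       # 0x0FFC, the new ESP
stored_snapshot = mem_snapshot[esp_snapshot]    # 0x1000, the old ESP
```

Both versions leave ESP at the same place; they differ only in which value ends up on the stack.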


AMD's manual Volume 3: General-Purpose and System Instructions (March 2021) is similarly wrong about this (my emphasis):

Copies the value pointed to by the stack pointer (SS:rSP) to the specified register or memory location and then increments the rSP by 2 for a 16-bit pop, 4 for a 32-bit pop, or 8 for a 64-bit pop.

Unlike Intel, it doesn't even document the special cases of popping into the stack pointer itself or with a memory operand involving rSP. At least not here, and a search on push rsp or push esp didn't find anything.

(AMD uses rSP to mean SP / ESP / RSP depending on current stack-size attribute selected by SS.)

AMD doesn't have a pseudocode section like Intel does, at least not for supposedly simple instructions like push/pop. (There is one for pusha.)


Footnote 1: That could even be what happens on some CPUs (although I don't think so). For example on Skylake, Agner Fog measured push %esp as 2 uops for the front-end vs. 1 micro-fused store for pushing any other register.

We do know that Intel CPUs have some registers that get renamed like the architectural registers, but which are only accessible by microcode. e.g. https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ mentions "some extra architectural registers for internal use." So mov %esp, %temp / push %temp could in theory be how it's decoded.

But a more likely explanation is that the extra measured uops in a long sequence of push %esp instructions are just stack-sync uops, like we get any time the OoO back-end explicitly reads ESP after a push/pop operation. e.g. push %eax / mov %esp, %edx would also cause a stack-sync uop. (The "stack engine" is what avoids needing an extra uop for the esp -= 4 part of push)

push %esp is sometimes useful, e.g. to push the address of some stack space you just reserved:

  sub   $8, %esp
  push  %esp
  push  $fmt         # "%lf"
  call  scanf
  movsd 8(%esp), %xmm0

  # add $8, %esp    # balance out the pushes at some point, or just keep using that allocated space for something.  Or clean it up just before returning along with the space for your local var.

pop %esp costs 3 uops on Skylake: one load (p23) and two ALU uops that can run on any integer ALU port (2p0156). So it's even less efficient, but it has basically no use-cases. You can't usefully save/restore the stack pointer on the stack; if you know how to get to where you saved it, you can just restore it with add.