Why does this function push RAX to the stack as the first operation?
The 64-bit ABI requires that the stack is aligned to 16 bytes before a call
instruction.
call
pushes an 8-byte return address on the stack, which breaks the alignment, so the compiler needs to do something to align the stack again to a multiple of 16 before the next call
.
(The ABI design choice of requiring alignment before a call
instead of after has the minor advantage that if any args were passed on the stack, this choice makes the first arg 16B-aligned.)
Pushing a don't-care value works well, and can be more efficient than sub rsp, 8
on CPUs with a stack engine. (See the comments).
The reason push rax
is there is to align the stack back to a 16-byte boundary to conform to the 64-bit System V ABI in the case where je .LBB0_1
branch is taken. The value placed on the stack isn't relevant. Another way would have been subtracting 8 from RSP with sub rsp, 8
. The ABI states the alignment this way:
The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame.
Prior to the call to function f
the stack was 16-byte aligned per the calling convention. After control was transferred via a CALL to f
the return address was placed on the stack misaligning the stack by 8. push rax
is a simple way of subtracting 8 from RSP and realigning it again. If the branch is taken to call std::__throw_bad_function_call()
the stack will be properly aligned for that call to work.
In the case where the comparison falls through, the stack will appear just as it did at function entry once the add rsp, 8
instruction is executed. The return address of the CALLER to function f
will now be back at the top of the stack and the stack will be misaligned by 8 again. This is what we want because a TAIL CALL is being made with jmp qword ptr [rdi + 24]
to transfer control to the function a
. This will JMP to the function not CALL it. When function a
does a RET it will return directly back to the function that called f
.
At a higher optimization level I would have expected that the compiler should be smart enough to do the comparison, and let it fall through directly to the JMP. What is at label .LBB0_1
could then align the stack to a 16-byte boundary so that call std::__throw_bad_function_call()
works properly.
As @CodyGray pointed out, if you use GCC (not CLANG) with optimization level of -O2
or higher, the code produced does seem more reasonable. GCC 6.1 output from Godbolt is:
f(std::function<void ()>):
cmp QWORD PTR [rdi+16], 0 # MEM[(bool (*<T5fc5>) (union _Any_data &, const union _Any_data &, _Manager_operation) *)a_2(D) + 16B],
je .L7 #,
jmp [QWORD PTR [rdi+24]] # MEM[(const struct function *)a_2(D)]._M_invoker
.L7:
sub rsp, 8 #,
call std::__throw_bad_function_call() #
This code is more in line with what I would have expected. In this case it would appear that GCC's optimizer may handle this code generation better than CLANG.
In other cases, clang typically fixes up the stack before returning with a pop rcx
.
Using push
has an upside for efficiency in code-size (push
is only 1 byte vs. 4 bytes for sub rsp, 8
), and also in uops on Intel CPUs. (No need for a stack-sync uop, which you'd get if you access rsp
directly because the call
that brought us to the top of the current function makes the stack engine "dirty").
This long and rambling answer discusses the worst-case performance risks of using push rax
/ pop rcx
for aligning the stack, and whether or not rax
and rcx
are good choices of register. (Sorry for making this so long.)
(TL:DR: looks good, the possible downside is usually small and the upside in the common case makes this worth it. Partial-register stalls could be a problem on Core2/Nehalem if al
or ax
are "dirty", though. No other 64-bit capable CPU has big problems (because they don't rename partial regs, or merge efficiently), and 32-bit code needs more than 1 extra push
to align the stack by 16 for another call
unless it was already saving/restoring some call-preserved regs for its own use.)
Using push rax
instead of sub rsp, 8
introduces a dependency on the old value of rax
, so you'd think it might slow things down if the value of rax
is the result of a long-latency dependency chain (and/or a cache miss).
e.g. the caller might have done something slow with rax
that's unrelated to the function args, like var = table[ x % y ]; var2 = foo(x);
# example caller that leaves RAX not-ready for a long time
mov rdi, rax ; prepare function arg
div rbx ; very high latency
mov rax, [table + rdx] ; rax = table[ value % something ], may miss in cache
mov [rsp + 24], rax ; spill the result.
call foo ; foo uses push rax to align the stack
Fortunately out-of-order execution will do a good job here.
The push
doesn't make the value of rsp
dependent on rax
. (It's either handled by the stack engine, or on very old CPUs push
decodes to multiple uops, one of which updates rsp
independently of the uops that store rax
. Micro-fusion of the store-address and store-data uops let push
be a single fused-domain uop, even though stores always take 2 unfused-domain uops.)
As long as nothing depends on the output push rax
/ pop rcx
, it's not a problem for out-of-order execution. If push rax
has to wait because rax
isn't ready, it won't cause the ROB (ReOrder Buffer) to fill up and eventually block the execution of later independent instruction. The ROB would fill up even without the push
because the instruction that's slow to produce rax
, and whatever instruction in the caller consumes rax
before the call are even older, and can't retire either until rax
is ready. Retirement has to happen in-order in case of exceptions / interrupts.
(I don't think a cache-miss load can retire before the load completes, leaving just a load-buffer entry. But even if it could, it wouldn't make sense to produce a result in a call-clobbered register without reading it with another instruction before making a call
. The caller's instruction that consumes rax
definitely can't execute/retire until our push
can do the same.)
When rax
does become ready, push
can execute and retire in a couple cycles, allowing later instructions (which were already executed out of order) to also retire. The store-address uop will have already executed, and I assume the store-data uop can complete in a cycle or two after being dispatched to the store port. Stores can retire as soon as the data is written to the store buffer. Commit to L1D happens after retirement, when the store is known to be non-speculative.
So even in the worst case, where the instruction that produces rax
was so slow that it led to the ROB filling up with independent instructions that are mostly already executed and ready to retire, having to execute push rax
only causes a couple extra cycles of delay before independent instructions after it can retire. (And some of the caller's instructions will retire first, making a bit of room in the ROB even before our push
retires.)
A push rax
that has to wait will tie up some other microarchitectural resources, leaving one fewer entry for finding parallelism between other later instructions. (An add rsp,8
that could execute would only be consuming a ROB entry, and not much else.)
It will use up one entry in the out-of-order scheduler (aka Reservation Station / RS). The store-address uop can execute as soon as there's a free cycle, so only the store-data uop will be left. The pop rcx
uop's load address is ready, so it should dispatch to a load port and execute. (When the pop
load executes, it finds that its address matches the incomplete push
store in the store buffer (aka memory order buffer), so it sets up the store-forwarding which will happen after the store-data uop executes. This probably consumes a load buffer entry.)
Even an old CPUs like Nehalem has a 36 entry RS, vs. 54 in Sandybridge, or 97 in Skylake. Keeping 1 entry occupied for longer than usual in rare cases is nothing to worry about. The alternative of executing two uops (stack-sync + sub
) is worse.
(off topic)
The ROB is larger than the RS, 128 (Nehalem), 168 (Sandybridge), 224 (Skylake). (It holds fused-domain uops from issue to retirement, vs. the RS holding unfused-domain uops from issue to execution). At 4 uops per clock max frontend throughput, that's over 50 cycles of delay-hiding on Skylake. (Older uarches are less likely to sustain 4 uops per clock for as long...)
ROB size determines the out-of-order window for hiding a slow independent operation. (Unless register-file size limits are a smaller limit). RS size determines the out-of-order window for finding parallelism between two separate dependency chains. (e.g. consider a 200 uop loop body where every iteration is independent, but within each iteration it's one long dependency chain without much instruction-level parallelism (e.g. a[i] = complex_function(b[i])
). Skylake's ROB can hold more than 1 iteration, but we can't get uops from the next iteration into the RS until we're within 97 uops of the end of the current one. If the dep chain wasn't so much larger than RS size, uops from 2 iterations could be in flight most of the time.)
There are cases where push rax / pop rcx
can be more dangerous:
The caller of this function knows that rcx
is call-clobbered, so won't read the value. But it might have a false dependency on rcx
after we return, like bsf rcx, rax
/ jnz
or test eax,eax
/ setz cl
. Recent Intel CPUs don't rename low8 partial registers anymore, so setcc cl
has a false dep on rcx
. bsf
actually leaves its destination unmodified if the source is 0, even though Intel documents it as an undefined value. AMD documents leave-unmodified behaviour.
The false dependency could create a loop-carried dep chain. On the other hand, a false dependency can do that anyway, if our function wrote rcx
with instructions dependent on its inputs.
It would be worse to use push rbx
/pop rbx
to save/restore a call-preserved register that we weren't going to use. The caller likely would read it after we return, and we'd have introduced a store-forwarding latency into the caller's dependency chain for that register. (Also, it's maybe more likely that rbx
would be written right before the call
, since anything the caller wanted to keep across the call would be moved to call-preserved registers like rbx
and rbp
.)
On CPUs with partial-register stalls (Intel pre-Sandybridge), reading rax
with push
could cause a stall or 2-3 cycles on Core2 / Nehalem if the caller had done something like setcc al
before the call
. Sandybridge doesn't stall while inserting a merging uop, and Haswell and later don't rename low8 registers separately from rax
at all.
It would be nice to push
a register that was less likely to have had its low8 used. If compilers tried to avoid REX prefixes for code-size reasons, they'd avoid dil
and sil
, so rdi
and rsi
would be less likely to have partial-register issues. But unfortunately gcc and clang don't seem to favour using dl
or cl
as 8-bit scratch registers, using dil
or sil
even in tiny functions where nothing else is using rdx
or rcx
. (Although lack of low8 renaming in some CPUs means that setcc cl
has a false dependency on the old rcx
, so setcc dil
is safer if the flag-setting was dependent on the function arg in rdi
.)
pop rcx
at the end "cleans" rcx
of any partial-register stuff. Since cl
is used for shift counts, and functions do sometimes write just cl
even when they could have written ecx
instead. (IIRC I've seen clang do this. gcc more strongly favours 32-bit and 64-bit operand sizes to avoid partial-register issues.)
push rdi
would probably be a good choice in a lot of cases, since the rest of the function also reads rdi
, so introducing another instruction dependent on it wouldn't hurt. It does stop out-of-order execution from getting the push
out of the way if rax
is ready before rdi
, though.
Another potential downside is using cycles on the load/store ports. But they are unlikely to be saturated, and the alternative is uops for the ALU ports. With the extra stack-sync uop on Intel CPUs that you'd get from sub rsp, 8
, that would be 2 ALU uops at the top of the function.