Why can't kernel code use a Red Zone?
When building a 64-bit kernel for the x86_64 platform, it is highly recommended to instruct the compiler not to use the 128-byte Red Zone that the user-space ABI provides. (For GCC, the compiler flag is -mno-red-zone.)
The kernel would not be interrupt-safe if the Red Zone were enabled.
But why is that?
Quoting from the AMD64 ABI:
The 128-byte area beyond the location pointed to by %rsp is considered to be reserved and shall not be modified by signal or interrupt handlers. Therefore, functions may use this area for temporary data that is not needed across function calls. In particular, leaf functions may use this area for their entire stack frame, rather than adjusting the stack pointer in the prologue and epilogue. This area is known as the red zone.
Essentially, it's an optimization - the userland compiler knows exactly how much of the Red Zone is used at any given time (in the simplest implementation, the entire size of local variables) and can adjust %rsp accordingly before calling a sub-function.
Especially in leaf functions, this yields a performance benefit: there is no need to adjust %rsp, since we can be certain no unfamiliar code will run on this stack while the function executes. (POSIX signal handlers might be seen as a form of co-routine, but you can instruct the compiler to adjust the registers before using stack variables in a signal handler.)
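To make the leaf-function case concrete, here is a minimal sketch in C. The function name sum_of_squares is made up for illustration. When built for the x86-64 System V ABI, a compiler is allowed to keep tmp below %rsp without moving the stack pointer; whether it actually does so depends on the compiler, version and optimization level, and building with -mno-red-zone forbids it.

```c
#include <stdint.h>

/* Hypothetical leaf function: it calls nothing, so a compiler targeting the
 * x86-64 System V ABI may place `tmp` in the red zone (below %rsp) and skip
 * the usual stack pointer adjustment in the prologue and epilogue. */
uint64_t sum_of_squares(uint64_t a, uint64_t b)
{
    volatile uint64_t tmp[2];   /* volatile keeps the values in real stack slots */

    tmp[0] = a * a;
    tmp[1] = b * b;
    return tmp[0] + tmp[1];
}
```

Comparing the assembly from gcc -S with and without -mno-red-zone is one way to see whether the stack pointer adjustment appears for your toolchain.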
In kernel space, once you start thinking about interrupts, if those interrupts make any assumptions about %rsp, they will likely be incorrect - there is no certainty with regards to the utilization of the Red Zone. So you either assume all of it is dirty and needlessly waste stack space (effectively running with a guaranteed 128-byte local variable in every function), or you guarantee that the interrupts make no assumptions about %rsp - which is tricky.
In user space, the switch to a separate kernel stack on interrupts, plus the kernel skipping 128 bytes below %rsp when it sets up a signal frame, handles it for you.
In kernel-space, you're using the same stack that interrupts use. When an interrupt happens, the CPU pushes a return address and RFLAGS (in 64-bit mode the full frame is SS, RSP, RFLAGS, CS and RIP, 40 bytes in total). This clobbers at least 16 bytes below rsp. Even if you wanted to write an interrupt-handler that assumed the full 128 bytes of the red-zone were valuable, it would be impossible.
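The clobbering can be illustrated with a toy model in C. This is only a simulation under stated assumptions, not real interrupt handling: toy_stack, store_in_red_zone, simulate_interrupt_entry and red_zone_survives_interrupt are hypothetical names, the pushed values are placeholders, and the model ignores the stack alignment real hardware performs. The point is that however many words the CPU pushes, the first one lands in the slot directly below rsp, which is exactly where a red-zone local would live.

```c
#include <stdint.h>
#include <string.h>

#define STACK_WORDS 64

/* Toy model of a downward-growing stack; `rsp` is an index into `mem`. */
struct toy_stack {
    uint64_t mem[STACK_WORDS];
    int rsp;
};

/* What a red-zone leaf function does: store a local one slot below rsp
 * without moving rsp. */
static void store_in_red_zone(struct toy_stack *s, uint64_t value)
{
    s->mem[s->rsp - 1] = value;
}

/* Simulated interrupt entry on the same stack: the "hardware" pushes
 * n_words placeholder values starting at the current rsp. */
static void simulate_interrupt_entry(struct toy_stack *s, int n_words)
{
    for (int i = 0; i < n_words; i++) {
        s->rsp -= 1;
        s->mem[s->rsp] = 0xDEADBEEFULL;   /* placeholder for frame contents */
    }
}

/* Returns 1 if the red-zone value is still intact after the interrupt,
 * 0 if the pushed frame overwrote it. */
int red_zone_survives_interrupt(int n_words)
{
    struct toy_stack s;
    memset(&s, 0, sizeof s);
    s.rsp = STACK_WORDS - 8;

    store_in_red_zone(&s, 0xC0FFEEULL);
    int slot = s.rsp - 1;                 /* remember where the local lives */

    simulate_interrupt_entry(&s, n_words);
    return s.mem[slot] == 0xC0FFEEULL;
}
```

Calling red_zone_survives_interrupt with a frame of 2 words or of 5 words gives the same answer in this model: the local is gone before any handler code has had a chance to run.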
You could maybe have a kernel-internal ABI that had a small red-zone from rsp-16 to rsp-48 or something. (Small because kernel stack is valuable, and most functions don't need very much red-zone anyway.)
Interrupt handlers would have to sub rsp, 32 before pushing any registers (and restore it before iret).
This idea won't work if an interrupt handler can itself be interrupted before it runs sub rsp, 32, or after it restores rsp before an iret. There would be a window of vulnerability where valuable data is at rsp .. rsp-16.
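The nested-interrupt window can be sketched with a small simulation in C. It takes the simplified numbers from the scheme above at face value (a 16-byte hardware frame, a red zone covering rsp-16 to rsp-48, and a handler that skips 32 bytes first); kstack, hw_push_frame and small_red_zone_survives are hypothetical names, and this is a model of the argument rather than of real hardware.

```c
#include <stdint.h>
#include <string.h>

#define KSTACK_WORDS 64
#define FRAME_WORDS   2   /* the 16-byte frame assumed by the scheme */
#define SKIP_WORDS    4   /* models `sub rsp, 32` in the handler */

/* Toy downward-growing kernel stack; `rsp` is an index into `mem`. */
struct kstack {
    uint64_t mem[KSTACK_WORDS];
    int rsp;
};

/* Simulated hardware interrupt entry: push FRAME_WORDS placeholder words. */
static void hw_push_frame(struct kstack *s)
{
    for (int i = 0; i < FRAME_WORDS; i++)
        s->mem[--s->rsp] = 0xDEADBEEFULL;
}

/* Returns 1 if a value parked in the small red zone survives, 0 otherwise.
 * nested != 0 models a second interrupt arriving before the first handler
 * has executed its `sub rsp, 32`. */
int small_red_zone_survives(int nested)
{
    struct kstack s;
    memset(&s, 0, sizeof s);
    s.rsp = KSTACK_WORDS - 8;

    int slot = s.rsp - 3;          /* the word just below rsp-16 */
    s.mem[slot] = 0xC0FFEEULL;     /* interrupted function's red-zone data */

    hw_push_frame(&s);             /* first interrupt: fills rsp-8 and rsp-16 */
    if (nested)
        hw_push_frame(&s);         /* second interrupt lands in the red zone */

    s.rsp -= SKIP_WORDS;           /* handler skips the rest of the red zone */
    s.mem[--s.rsp] = 0x1234ULL;    /* handler saves a register */

    return s.mem[slot] == 0xC0FFEEULL;
}
```

In this model small_red_zone_survives(0) reports the data intact, while small_red_zone_survives(1) reports it overwritten, which is the window of vulnerability described above.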
Another practical problem with this scheme is that AFAIK gcc doesn't have configurable red-zone parameters. It's either on or off. So you'd have to add support for a kernel flavour of red-zone to gcc / clang if you wanted to take advantage of it.
Even if it was safe from nested interrupts, the benefits are pretty small. The difficulty of proving it's safe in a kernel might make it not worth it. (And as I said, I'm not at all sure it can be implemented safely, because I think nested interrupts are possible.)
(BTW, see the x86 tag wiki for links to the ABI documenting the red-zone, and other stuff.)