Why do x86-64 Linux system calls modify RCX, and what does the value mean?
I'm trying to allocate some memory in Linux with the sys_brk syscall. Here is what I tried:
BYTES_TO_ALLOCATE equ 0x08
section .text
global _start
_start:
mov rax, 12
mov rdi, BYTES_TO_ALLOCATE
syscall
mov rax, 60
syscall
Per the Linux calling convention, I expected the return value (a pointer to the allocated memory) to be in the rax register. I ran this in gdb, and after making the sys_brk syscall I noticed the following register contents:
Before syscall
rax 0xc 12
rbx 0x0 0
rcx 0x0 0
rdx 0x0 0
rsi 0x0 0
rdi 0x8 8
After syscall
rax 0x401000 4198400
rbx 0x0 0
rcx 0x40008c 4194444 ; <---- What does this value mean?
rdx 0x0 0
rsi 0x0 0
rdi 0x8 8
I do not quite understand the value in the rcx register in this case. Which register should I use as a pointer to the beginning of the 8 bytes I allocated with sys_brk?
Solution 1:
The system call return value is in rax, as always. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64.
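Note that sys_brk has a slightly different interface than the brk / sbrk POSIX functions; see the C library/kernel differences section of the Linux brk(2) man page. Specifically, Linux sys_brk sets the program break; the arg and return value are both pointers. So passing 8 asks the kernel to move the break to address 8. That request fails, and on failure the kernel returns the current break unchanged, so the 0x401000 in rax is just the existing break and nothing was allocated. To get 8 bytes, first call sys_brk with 0 to learn the current break, then call it again with that value plus 8. See Assembly x86 brk() call use. That answer needs upvotes because it's the only good one on that question.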
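A minimal sketch of that two-call pattern, assuming NASM syntax and a static build with nasm -f elf64 and ld. The names SYS_brk and SYS_exit are just constants chosen for this example, and the exit status is used only to signal whether the break moved.

```nasm
BYTES_TO_ALLOCATE equ 0x08
SYS_brk           equ 12
SYS_exit          equ 60

section .text
global _start
_start:
    mov     eax, SYS_brk
    xor     edi, edi                        ; brk(0) can never succeed, so the kernel
    syscall                                 ; just returns the current break in rax
    mov     rbx, rax                        ; rbx = old break = start of the new region

    lea     rdi, [rax + BYTES_TO_ALLOCATE]  ; requested new break
    mov     eax, SYS_brk
    syscall                                 ; rax = new break, or the old one on failure
    cmp     rax, rdi                        ; rdi is preserved across the syscall
    jne     .failed

    mov     qword [rbx], 42                 ; the 8 bytes at [rbx] are now usable
    xor     edi, edi                        ; exit status 0: allocation worked
    jmp     .exit
.failed:
    mov     edi, 1                          ; exit status 1: break did not move
.exit:
    mov     eax, SYS_exit
    syscall
```

The pointer to use is the old break saved in rbx, not the value returned by the second call, which points just past the end of the new region.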
The other interesting part of your question is:
I do not quite understand the value in the rcx register in this case
You're seeing the mechanics of how the syscall / sysret instructions are designed to allow the kernel to resume user-space execution but still be fast.
syscall doesn't do any loads or stores; it only modifies registers. Instead of using special registers to save a return address, it simply uses regular integer registers.
It's not a coincidence that RCX=RIP and R11=RFLAGS after the kernel returns to your user-space code: the 0x40008c you see in rcx is the address of the instruction right after syscall. The only way for this not to be the case is if a ptrace system call modified the process's saved rcx or r11 value while it was inside the kernel. (ptrace is the system call gdb uses.) In that case, Linux would use iret instead of sysret to return to user space, because the slower general-case iret can do that. (See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for some walk-through of Linux's system-call entry points, mostly the entry points from 32-bit processes rather than from syscall in a 64-bit process.)
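One way to see the RCX=RIP behavior without a debugger is to compare rcx against the address of the instruction that follows syscall. This is a sketch under assumptions: NASM syntax, a build with nasm -f elf64 and ld, and getpid picked only because it is a harmless call with no arguments. The constant names are illustrative.

```nasm
SYS_getpid equ 39
SYS_exit   equ 60

section .text
global _start
_start:
    mov     eax, SYS_getpid
    syscall                     ; the CPU copies the next instruction's address into rcx
.after:
    lea     rdx, [rel .after]   ; address of the instruction following syscall
    xor     edi, edi            ; assume a match: exit status 0
    cmp     rcx, rdx
    setne   dil                 ; exit status becomes 1 if rcx differs
    mov     eax, SYS_exit
    syscall
```

Running it and checking the exit status with echo $? should report 0 in the normal case, where nothing has used ptrace to alter the saved registers.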
Instead of pushing a return address onto the kernel stack (like int 0x80 does), syscall:
- sets RCX=RIP, R11=RFLAGS (so it's impossible for the kernel to even see the original values of those regs before you executed syscall).
- masks RFLAGS with a pre-configured mask from a config register (the IA32_FMASK MSR). This lets the kernel disable interrupts (IF) until it's done swapgs and setting rsp to point to the kernel stack. Even with cli as the first instruction at the entry point, there'd be a window of vulnerability. You also get cld for free by masking off DF, so rep movs / stos go upward even if user-space had used std.
  Fun fact: AMD's first proposed syscall / swapgs design didn't mask RFLAGS, but they changed it after feedback from kernel developers on the amd64 mailing list (in ~2000, a couple years before the first silicon).
- jumps to the configured syscall entry point (setting RIP from the IA32_LSTAR MSR, with the new CS selector coming from IA32_STAR). The old CS value isn't saved anywhere, I think.
- It doesn't do anything else; the kernel has to use swapgs to get access to an info block where it saved the kernel stack pointer, because rsp still has its value from user-space.
So the design of syscall requires a system-call ABI that clobbers registers, and that's why the values are what they are.
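The practical consequence is that nothing live should sit in rcx or r11 across a syscall. The sketch below, assuming NASM syntax and a build with nasm -f elf64 and ld, keeps a loop counter in rbx instead of the more traditional rcx, relying on the Linux x86-64 convention that the kernel preserves every register other than rax, rcx, and r11. The message and constant names are made up for the example.

```nasm
SYS_write equ 1
SYS_exit  equ 60

section .rodata
msg:     db "hi", 10
msg_len  equ $ - msg

section .text
global _start
_start:
    mov     ebx, 3              ; loop counter in rbx, which the kernel preserves
.again:
    mov     eax, SYS_write
    mov     edi, 1              ; fd 1 = stdout
    lea     rsi, [rel msg]
    mov     edx, msg_len
    syscall                     ; clobbers rax (return value), rcx, and r11
    dec     ebx
    jnz     .again

    xor     edi, edi            ; exit status 0
    mov     eax, SYS_exit
    syscall
```

With the counter in rcx instead, the first syscall would overwrite it with a code address and the loop count would be wrong.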