Why do Compilers put data inside .text(code) section of the PE and ELF files and how does the CPU distinguish between data and code?
Yes, their proposed binary randomizer needs to handle this case, because obfuscated binaries can exist, and hand-written code might do arbitrary things, either because the author didn't know better or for some weird reason.
But no, normal compilers don't do this for x86. This answer addresses the SO question as written, not the paper containing those claims:
Modern compilers aggressively interleave static data within code sections in both PE and ELF binaries for performance reasons
Citation needed! This is just plain false for x86 in my experience with compilers like GCC and clang, and some experience looking at asm output from MSVC and ICC.
Normal compilers put static read-only data into section .rodata (ELF platforms), or section .rdata (Windows). The .rodata section (and the .text section) are linked as part of the text segment, but all the read-only data for the whole executable or library is grouped together, and all the code is separately grouped together. (See: What's the difference of section and segment in ELF file format.) Or more recently, .rodata even goes in a separate ELF segment so it can be mapped noexec.
Intel's optimization guide says not to mix code/data, especially read+write data:
Assembly/Compiler Coding Rule 50. (M impact, L generality) If (hopefully read-only) data must occur on the same page as code, avoid placing it immediately after an indirect jump. For example, follow an indirect jump with its mostly likely target, and place the data after an unconditional branch.
Assembly/Compiler Coding Rule 51. (H impact, L generality) Always put code and data on separate pages. Avoid self-modifying code wherever possible. If code is to be modified, try to do it all at once and make sure the code that performs the modifications and the code being modified are on separate 4-KByte pages or on separate aligned 1-KByte subpages.
(Fun fact: Skylake actually has cache-line granularity for self-modifying-code pipeline nukes; it's safe on that recent high-end uarch to put read/write data within 64 bytes of code.)
Mixing code and data in the same page has near-zero advantage on x86, and wastes data-TLB coverage on code bytes and instruction-TLB coverage on data bytes. The same goes within 64-byte cache lines, where it wastes space in L1i / L1d. The only advantage is code+data locality for unified caches (L2 and L3), but that's not typically done. (e.g. after code-fetch brings a line into L2, fetching data from the same line could hit in L2 vs. having to go to RAM for data from another cache line.)
But with split L1iTLB and L1dTLBs, and the L2 TLB as a unified victim cache (I think), x86 CPUs are not optimized for this. An iTLB miss while fetching a "cold" function doesn't prevent a dTLB miss when reading bytes from the same cache line on modern Intel CPUs.
There is zero advantage for code-size on x86. x86-64's PC-relative addressing mode is [RIP + rel32], so it can address anything within +-2GiB of the current location. 32-bit x86 doesn't even have a PC-relative addressing mode.
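To make the code-size point concrete, here is a minimal Python sketch of how a rel32 displacement is computed. The function name `encode_rel32` and the addresses are made up for illustration; the only facts assumed are that the displacement is a signed 32-bit little-endian value measured from the end of the instruction.

```python
import struct

def encode_rel32(next_insn_addr, target_addr):
    """Return the 4 displacement bytes of a [RIP + rel32] operand.

    The displacement is relative to the address of the *next*
    instruction, i.e. the end of the instruction using it.
    """
    disp = target_addr - next_insn_addr
    if not -2**31 <= disp <= 2**31 - 1:
        raise ValueError(f"target is {disp:+#x} bytes away, outside rel32 range")
    return struct.pack("<i", disp)  # signed 32-bit, little-endian

# Hypothetical addresses: data 16 bytes away vs. data 1 MiB away.
near = encode_rel32(0x401007, 0x401017)
far = encode_rel32(0x401007, 0x401007 + (1 << 20))

print(near.hex(), far.hex())  # → 10000000 00001000
print(len(near), len(far))    # → 4 4
```

Both displacements take the same 4 bytes, so placing constants right next to the code that uses them does not shrink the instruction.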
Perhaps the author is thinking of ARM, where nearby static data allows PC-relative loads (with a small offset) to get 32-bit constants into registers? (This is called a "literal pool" on ARM, and you'll find them between functions.)
I assume they don't mean immediate data, like mov eax, 12345, where a 32-bit 12345 is part of the instruction encoding. That's not static data to be loaded with a load instruction; immediate data is a separate thing.
And obviously it's only for read-only data; writing near the instruction pointer will trigger a pipeline clear to handle the possibility of self-modifying code. And you generally want W^X (write or exec, not both) for your memory pages.
and how does the CPU can distinguish between code and data?
Incrementally. The CPU fetches bytes at RIP, and decodes them as instructions. After starting at the program entry point, execution proceeds following taken branches, and falling through not-taken branches, etc.
Architecturally, it doesn't care about bytes other than the ones it's currently executing, or that are being loaded/stored as data by an instruction. Recently-executed bytes will stick around in the L1-I cache, in case they're needed again, and same for data in L1-D cache.
Having data instead of other code right after an unconditional branch or a ret is not important. Padding between functions can be anything. There might be rare corner cases where data could stall pre-decode or decode stages if it has a certain pattern (because modern CPUs fetch/decode in wide blocks of 16 or 32 bytes, for example), but any later stages of the CPU are only looking at actual decoded instructions from the correct path. (Or from mis-speculation of a branch...)
So if execution reaches a byte, that byte is (part of) an instruction. This is totally fine for the CPU, but unhelpful for a program that wants to look through an executable and classify each byte as either/or.
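As a rough illustration of "code is whatever execution reaches", here is a toy Python model, not a real x86 decoder. It understands only three real encodings (0x90 nop, 0xEB jmp rel8, 0xC3 ret); the function name `reached_offsets` and the byte blob are invented for the example.

```python
def reached_offsets(code, entry=0):
    """Follow execution from `entry` and return the offsets at which
    an instruction was decoded.

    Toy model: knows only nop (0x90), jmp rel8 (0xEB) and ret (0xC3).
    """
    visited = []
    ip = entry
    while True:
        opcode = code[ip]
        visited.append(ip)
        if opcode == 0x90:      # nop: 1 byte, fall through
            ip += 1
        elif opcode == 0xEB:    # jmp rel8: target is relative to the end of the jmp
            rel8 = int.from_bytes(code[ip + 1:ip + 2], "little", signed=True)
            ip += 2 + rel8
        elif opcode == 0xC3:    # ret: stop the toy model here
            return visited
        else:
            raise ValueError(f"toy decoder does not know opcode {opcode:#04x}")

# nop; jmp +4; four bytes of "data"; ret
blob = bytes([0x90, 0xEB, 0x04]) + b"\xde\xad\xbe\xef" + bytes([0xC3])

print(reached_offsets(blob))  # → [0, 1, 7]
```

The four bytes at offsets 3..6 are never looked at as instructions, because no path of execution lands on them. A static tool scanning the blob linearly has no such guarantee, which is the classification problem described above.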
Code-fetch always checks permissions in the TLB, so it will fault if RIP points into a non-executable page. (NX bit in the page table entry).
But really as far as the CPU is concerned, there is no true distinction. x86 is a von Neumann architecture. An instruction can load its own code bytes if it wants.
e.g. movzx eax, byte ptr [rip - 1] sets EAX to 0x000000FF, loading the last byte of the rel32 = -1 = 0xffffffff displacement.
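The arithmetic behind that example can be checked with a small Python sketch that models memory as a dictionary. The load address 0x1000 is arbitrary and hypothetical; the instruction bytes are the standard encoding (0F B6 opcode, ModRM 0x05 selecting RIP + disp32, then the 4-byte displacement).

```python
# movzx eax, byte ptr [rip - 1]  ->  0F B6 05 FF FF FF FF
insn = bytes([0x0F, 0xB6, 0x05]) + (-1).to_bytes(4, "little", signed=True)

start = 0x1000                # hypothetical address where the instruction sits
next_rip = start + len(insn)  # RIP-relative means relative to the end of the instruction
effective = next_rip - 1      # the address [rip - 1] refers to

# Model memory as address -> byte, holding only this instruction.
memory = {start + i: b for i, b in enumerate(insn)}

eax = memory[effective]       # movzx zero-extends the loaded byte into EAX

print(insn.hex())             # → 0fb605ffffffff
print(f"{eax:#010x}")         # → 0x000000ff
```

The loaded byte is the last byte of the instruction's own displacement field, so the same byte is code when fetched and data when loaded.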
isnt this VERY bad for security considering that the code section is executable and CPU might by mistake execute a malicious data as code? (maybe attacker redirecting the program to that instruction? )
Read-only data in executable pages can be used as a Spectre gadget, or a gadget for return-oriented-programming (ROP) attacks. But usually there are already enough such gadgets in real code that it's not a big deal, I think.
But yes, that's a minor objection to this which is actually valid, unlike your other points.
Recently (2019 or late 2018), GNU Binutils ld has started putting the .rodata section in a separate page from the .text section (i.e. in a separate ELF segment) so it can be read-only without exec permission. This makes static read-only data non-executable, on ISAs like x86-64 where exec permission is separate from read permission.
The more things you can make non-executable the better, and mixing code+constants would require them to be executable.
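A small Python sketch of what separate segments buy you: each ELF program header carries permission bits (PF_X = 1, PF_W = 2, PF_R = 4, as defined by the ELF specification), so read-only data in its own segment can simply omit the exec bit. The segment names and the layout below are hypothetical, chosen only to resemble the split described above.

```python
# ELF program-header permission bits (p_flags), from the ELF specification.
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4

def describe(p_flags):
    """Render p_flags in the familiar 'rwx' style."""
    pairs = (("r", PF_R), ("w", PF_W), ("x", PF_X))
    return "".join(ch if p_flags & bit else "-" for ch, bit in pairs)

def violates_w_xor_x(p_flags):
    """True if a segment is both writable and executable."""
    return bool(p_flags & PF_W) and bool(p_flags & PF_X)

# Hypothetical layout: code, read-only data and writable data each in their own segment.
segments = {
    "text": PF_R | PF_X,
    "rodata": PF_R,
    "data": PF_R | PF_W,
}

for name, flags in segments.items():
    print(name, describe(flags), violates_w_xor_x(flags))
# → text r-x False
# → rodata r-- False
# → data rw- False
```

If constants were interleaved with code, they would have to live in the r-x segment, which is exactly what the separate read-only segment avoids.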
- Interleaving code and data will keep the data closer to the code that uses it. This will make the data accessible by simpler and faster instructions.
- The CPU doesn't; it is up to the programmer/compiler to make sure that the data is put in locations outside the actual program flow. If the program flow accidentally enters the data block, the CPU will interpret the data as instructions. Normally the data is placed between functions, but sometimes the compiler can add an extra branch instruction to make room for a data block inside a function.
- Normally this is not a problem, since the programmer or compiler makes sure that the data section is not entered by the program flow, but you are partially right: if an attacker manages to trick the CPU into executing the data, this will not be caught by the memory protection mechanisms.