Why do Compilers put data inside .text(code) section of the PE and ELF files and how does the CPU distinguish between data and code?

So I am referencing this paper:

Binary Stirring: Self-randomizing Instruction Addresses of Legacy x86 Binary Code

https://www.utdallas.edu/~hamlen/wartell12ccs.pdf

Code interleaved with data: Modern compilers aggressively interleave static data within code sections in both PE and ELF binaries for performance reasons. In the compiled binaries there is generally no means of distinguishing the data bytes from the code. Inadvertently randomizing the data along with the code breaks the binary, introducing difficulties for instruction-level randomizers. Viable solutions must somehow preserve the data whilst randomizing all the reachable code.

But I have some questions:

  1. How does this speed up the program?! I can only imagine this would make CPU execution more complex?

  2. And how does the CPU distinguish between code and data? Because as far as I remember, the CPU will execute each instruction one after the other in a linear way unless there is a jump type of instruction, so how can the CPU know which bytes inside the code section are code and which ones are data?

  3. Isn't this VERY bad for security, considering that the code section is executable and the CPU might by mistake execute malicious data as code? (Maybe an attacker redirects the program to that instruction?)

Solution

Yes, their proposed binary randomizer needs to handle this case, because obfuscated binaries can exist, or hand-written code might do arbitrary things because the author didn't know better or for some weird reason.

But no, normal compilers don't do this for x86. This answer addresses the SO question as written, not the paper containing those claims:

Modern compilers aggressively interleave static data within code sections in both PE and ELF binaries for performance reasons

Citation needed! This is just plain false for x86 in my experience with compilers like GCC and clang, and some experience looking at asm output from MSVC and ICC.

Normal compilers put static read-only data into section .rodata (ELF platforms), or section .rdata (Windows). The .rodata section (and the .text section) are linked as part of the text segment, but all the read-only data for the whole executable or library is grouped together, and all the code is separately grouped together (see: What's the difference of section and segment in ELF file format). Or, more recently, .rodata can even go in a separate ELF segment so it can be mapped noexec.


Intel's optimization guide says not to mix code/data, especially read+write data:

Assembly/Compiler Coding Rule 50. (M impact, L generality) If (hopefully read-only) data must occur on the same page as code, avoid placing it immediately after an indirect jump. For example, follow an indirect jump with its most likely target, and place the data after an unconditional branch.

Assembly/Compiler Coding Rule 51. (H impact, L generality) Always put code and data on separate pages. Avoid self-modifying code wherever possible. If code is to be modified, try to do it all at once and make sure the code that performs the modifications and the code being modified are on separate 4-KByte pages or on separate aligned 1-KByte subpages.

(Fun fact: Skylake actually has cache-line granularity for self-modifying-code pipeline nukes; it's safe on that recent high-end uarch to put read/write data within 64 bytes of code.)


Mixing code and data in the same page has near-zero advantage on x86, and wastes data-TLB coverage on code bytes, and wastes instruction-TLB coverage on data bytes. And same within 64-byte cache lines for wasting space in L1i / L1d. The only advantage is code+data locality for unified caches (L2 and L3), but that's not typically done. (e.g. after code-fetch brings a line into L2, fetching data from the same line could hit in L2 vs. having to go to RAM for data from another cache line.)

But with split L1iTLB and L1dTLBs, and the L2 TLB as a unified victim cache (I think?), x86 CPUs are not optimized for this. An iTLB miss while fetching a "cold" function doesn't prevent a dTLB miss when reading bytes from the same cache line on modern Intel CPUs.

There is zero advantage for code-size on x86. x86-64's PC-relative addressing mode is [RIP + rel32], so it can address anything within +-2GiB of the current location. 32-bit x86 doesn't even have a PC-relative addressing mode.

Perhaps the author is thinking of ARM, where nearby static data allows PC-relative loads (with a small offset) to get 32-bit constants into registers? (This is called a "literal pool" on ARM, and you'll find them between functions.)

I assume they don't mean immediate data, like mov eax, 12345, where a 32-bit 12345 is part of the instruction encoding. That's not static data to be loaded with a load instruction; immediate data is a separate thing.

And obviously it's only for read-only data; writing near the instruction pointer will trigger a pipeline clear to handle the possibility of self-modifying code. And you generally want W^X (write or exec, not both) for your memory pages.

And how does the CPU distinguish between code and data?

Incrementally. The CPU fetches bytes at RIP, and decodes them as instructions. After starting at the program entry point, execution proceeds following taken branches, and falling through not-taken branches, etc.

Architecturally, it doesn't care about bytes other than the ones it's currently executing, or that are being loaded/stored as data by an instruction. Recently-executed bytes will stick around in the L1-I cache, in case they're needed again, and same for data in L1-D cache.

Having data instead of other code right after an unconditional branch or a ret is not important. Padding between functions can be anything. There might be rare corner cases where data could stall pre-decode or decode stages if it has a certain pattern (because modern CPUs fetch/decode in wide blocks of 16 or 32 bytes, for example), but any later stages of the CPU are only looking at actual decoded instructions from the correct path. (Or from mis-speculation of a branch...)

So if execution reaches a byte, that byte is (part of) an instruction. This is totally fine for the CPU, but unhelpful for a program that wants to look through an executable and classify each byte as either/or.

Code-fetch always checks permissions in the TLB, so it will fault if RIP points into a non-executable page. (NX bit in the page table entry).

But really as far as the CPU is concerned, there is no true distinction. x86 is a von Neumann architecture. An instruction can load its own code bytes if it wants.

e.g. movzx eax, byte ptr [rip - 1] sets EAX to 0x000000FF, loading the last byte of the rel32 = -1 = 0xffffffff displacement.


Isn't this VERY bad for security, considering that the code section is executable and the CPU might by mistake execute malicious data as code? (Maybe an attacker redirects the program to that instruction?)

Read-only data in executable pages can be used as a Spectre gadget, or a gadget for return-oriented-programming (ROP) attacks. But usually there's already enough such gadgets in real code that it's not a big deal, I think.

But yes, that's a minor objection to this which is actually valid, unlike your other points.

Recently (2019 or late 2018), GNU Binutils ld has started putting the .rodata section in a separate page from the .text section so it can be read-only without exec permission. This makes static read-only data non-executable, on ISAs like x86-64 where exec permission is separate from read permission. i.e. in a separate ELF segment.

The more things you can make non-executable the better, and mixing code+constants would require them to be executable.
