Handling calls to (potentially) far away ahead-of-time compiled functions from JITed code

Problem Description


This question was put on hold as too broad, presumably because of the research I included in an effort to "show my work" instead of asking a low effort question. To remedy this, allow me to summarize the entire question in a single sentence (credit to @PeterCordes for this phrase):

How do I efficiently call (x86-64) ahead-of-time compiled functions (that I control, may be further than 2GB away) from JITed code (that I am generating)?

This alone, I suspect, would be put on hold as "too broad." In particular, it lacks a "what have you tried." So, I felt the need to add additional information showing my research/thinking and what I have tried. Below is a somewhat stream of consciousness of this.

Note that none of the questions posed below are ones I expect to be answered; they are more rhetorical. Their purpose is to demonstrate why I can't answer the above question (despite my research, I lack the experience in this area to make definitive statements such as @PeterCordes's "branch prediction hides the latency of fetching and checking the function pointer from memory, assuming that it predicts well."). Also note that the Rust component is largely irrelevant here, as this is an assembly issue. My reasoning for including it is that the ahead-of-time compiled functions are written in Rust, so I was unsure if there was something that Rust did (or instructed LLVM to do) that could be advantageous in this situation. It is totally acceptable for an answer to not consider Rust at all; in fact, I expect this will be the case.

Think of the following as scratch work on the back of a math exam:


Note: I muddled the term intrinsics here. As pointed out in the comments, "ahead-of-time compiled functions" is a better description. Below I'll abbreviate that as AOTC functions.

I'm writing a JIT in Rust (although Rust is only relevant to a bit of my question, the bulk of it relates to JIT conventions). I have AOTC functions that I've implemented in Rust that I need to be able to call from code emitted by my JIT. My JIT mmap(_, _, PROT_EXEC, MAP_ANONYMOUS | MAP_SHARED)s some pages for the jitted code. I have the addresses of my AOTC functions, but unfortunately they are much further away than a 32-bit offset. I'm trying to decide now how to emit calls to these AOTC functions. I've considered the following options (these are not questions to be answered, just demonstrating why I can't answer the core question of this SO thread myself):

  1. (Rust specific) Somehow make Rust place the AOTC functions close to (maybe on?) the heap so that the calls will be within a 32-bit offset. It's unclear whether that is possible with Rust (there is a way to specify custom linker args, but I can't tell what those are applied to, or whether I could target a single function for relocation. And even if I could, where would I put it?). It also seems like this could fail if the heap is large enough.

  2. (Rust specific) Allocate my JIT pages closer to the AOTC functions. This could be achieved with mmap(_, _, PROT_EXEC, MAP_FIXED), but I'm unsure how to pick an address that wouldn't clobber existing Rust code (while keeping within arch restrictions--is there a sane way to get those restrictions?).

  3. Create stubs within the JIT pages that handle the absolute jump (code below), then call the stubs. This has the benefit of the (initial) call site in the JITted code being a nice small relative call. But it feels wrong to have to jump through something. This seems like it would be detrimental to performance (perhaps interfering with RAS/jump address prediction). Additionally, it seems like this jump would be slower since its address is indirect and it depends on the mov for that address.

mov rax, {ABSOLUTE_AOTC_FUNCTION_ADDRESS}
jmp rax

  4. The reverse of (3), just inlining the above at each intrinsic call site in the JITed code. This resolves the indirection issue, but makes the JITted code larger (perhaps this has instruction cache and decoding consequences). It still has the issue that the jump is indirect and depends on the mov.

  5. Place the addresses of the AOTC functions on a PROT_READ (only) page near the JIT pages. Make all the call sites near, absolute indirect calls (code below). This removes the second level of indirection from (3). But the encoding of this instruction is unfortunately large (6 bytes), so it has the same issues as (4). Additionally, now instead of depending on a register, jumps unnecessarily (insofar as the address is known at JIT time) depend on memory, which certainly has performance implications (despite perhaps this page being cached?).

aotc_function_address:
    .quad 0xDEADBEEF

# Then at the call site
call qword ptr [rip+aotc_function_address]

  6. Futz with a segment register to place it closer to the AOTC functions so that calls can be made relative to that segment register. The encoding of such a call is long (so maybe this has decoding pipeline issues), but other than that this largely avoids lots of the tricky bits of everything before it. But, maybe calling relative to a non-cs segment performs poorly. Or maybe such futzing is not wise (messes with the Rust runtime, for example). (as pointed out by @prl, this doesn't work without a far call, which is terrible for performance)

  7. Not really a solution, but I could make the compiler 32-bit and not have this problem at all. That's not really a great solution and it also would prevent me from using the extended general purpose registers (of which I utilize all).

All of the options presented have drawbacks. Briefly, 1 and 2 are the only ones that don't seem to have performance impacts, but it's unclear if there is a non-hacky way to achieve them (or any way at all for that matter). 3-5 are independent of Rust, but have obvious performance drawbacks.

Given this stream of consciousness, I arrived at the following rhetorical questions (which don't need explicit answers) to demonstrate that I lack the knowledge to answer the core question of this SO thread by myself. I have struck them to make it abundantly clear that I am not posing all of these as part of my question.

  1. For approach (1), is it possible to force Rust to link certain extern "C" functions at a specific address (near the heap)? How should I choose such an address (at compile time)? Is it safe to assume that any address returned by mmap (or allocated by Rust) will be within a 32 bit offset of this location?

  2. For approach (2), how can I find a suitable place to place the JIT pages (such that it doesn't clobber existing Rust code)?

And some JIT (non-Rust) specific questions:

  1. For approach (3), will the stubs hamper performance enough that I should care? What about the indirect jmp? I know this somewhat resembles linker stubs, except as I understand linker stubs are at least only resolved once (so they don't need to be indirect?). Do any JITs employ this technique?

  2. For approach (4), if the indirect call in 3 is okay, is inlining the calls worth it? If JITs typically employ approach (3/4), is this option better?

  3. For approach (5), is the dependence of the jump on memory (given that the address is known at compile time) bad? Would that make it less performant than (3) or (4)? Do any JITs employ this technique?

  4. For approach (6), is such futzing unwise? (Rust specific) Is there a segment register available (not used by the runtime or ABI) for this purpose? Will calls relative to a non-cs segment be as performant as those relative to cs?

  5. And finally (and most importantly), is there a better approach (perhaps employed more commonly by JITs) that I'm missing here?

I can't implement (1) or (2) without my Rust questions having answers. I could, of course, implement and benchmark 3-5 (perhaps 6, although it would be nice to know about the segment register futzing beforehand), but given that these are vastly different approaches, I was hoping there was existing literature about this that I couldn't find, because I didn't know the right terms to google for (I'm also currently working on those benchmarks). Alternatively maybe someone who's delved into JIT internals can share their experience or what they've commonly seen?

I am aware of this question: Jumps for a JIT (x86_64). It differs from mine because it is talking about stringing together basic blocks (and the accepted solution is way too many instructions for a frequently called intrinsic). I am also aware of Call an absolute pointer in x86 machine code, which while it discusses similar topics to mine, is different, because I am not assuming that absolute jumps are necessary (approaches 1-2 would avoid them, for example).

Solution

Summary: try to allocate memory near your static code. But for calls that can't reach with rel32, fall back to call qword [rel pointer] or inline mov r64,imm64 / call r64.

Your mechanism 5. is probably best for performance if you can't make 2. work, but 4. is easy and should be fine. Direct call rel32 needs some branch prediction, too, but it's definitely still better.


Terminology: "intrinsic functions" should probably be "helper" functions. "Intrinsic" usually means either language built-in (e.g. Fortran meaning) or "not a real function, just something that inlines to a machine instruction" (C/C++ / Rust meaning, like for SIMD, or stuff like _mm_popcnt_u32(), _pdep_u32(), or _mm_mfence()). Your Rust functions are going to compile to real functions that exist in machine code that you're going to call with call instructions.


Yes, allocating your JIT buffers within +-2GiB of your target functions is obviously ideal, allowing rel32 direct calls.

The most straightforward would be to use a large static array in the BSS (which the linker will place within 2GiB of your code) and carve your allocations out of that. (Use mprotect (POSIX) or VirtualProtect (Windows) to make it executable).

Most OSes (Linux included) do lazy allocation for the BSS (COW mapping to the zero page, only allocating physical page frames to back that allocation when it's written, just like mmap without MAP_POPULATE), so it only wastes virtual address space to have a 512MiB array in the BSS that you only use the bottom 10kB of.

Don't make it larger than or close to 2GiB, though, because that will push other things in the BSS too far away. The default "small" code model (as described in the x86-64 System V ABI) puts all static addresses within 2GiB of each other for RIP-relative data addressing and rel32 call/jmp.

Downside: you'd have to write at least a simple memory allocator yourself, instead of working with whole pages with mmap/munmap. But that's easy if you don't need to free anything. Maybe just generate code starting at an address, and update a pointer once you get to the end and discover how long your code block is. (But that's not multi-threaded...) For safety, remember to check when you get to the end of this buffer and abort, or fall back to mmap.
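A minimal sketch of that static-pool approach, assuming x86-64 Linux (the `PROT_*` values and `mprotect` signature are the Linux ones; `CodePool`, `alloc_code`, and the pool size are illustrative names and choices, not from any real JIT):

```rust
use core::ptr::addr_of_mut;
use std::sync::atomic::{AtomicUsize, Ordering};

const POOL_SIZE: usize = 1 << 20; // 1 MiB demo pool; only touched pages get physical frames

// Page-aligned so mprotect can cover it exactly.
#[repr(C, align(4096))]
struct CodePool([u8; POOL_SIZE]);

// Lives in .bss: the linker keeps it within 2GiB of our code (small code model),
// and the kernel lazily backs only the pages we actually write.
static mut POOL: CodePool = CodePool([0; POOL_SIZE]);
static NEXT: AtomicUsize = AtomicUsize::new(0);

extern "C" {
    fn mprotect(addr: *mut core::ffi::c_void, len: usize, prot: i32) -> i32;
}
const PROT_RWX: i32 = 1 | 2 | 4; // PROT_READ | PROT_WRITE | PROT_EXEC (Linux values)

/// Bump-allocate `len` bytes of code space; None once the pool is exhausted.
fn alloc_code(len: usize) -> Option<*mut u8> {
    let off = NEXT.fetch_add(len, Ordering::Relaxed);
    if off + len > POOL_SIZE {
        return None;
    }
    Some(unsafe { (addr_of_mut!(POOL) as *mut u8).add(off) })
}

fn main() {
    unsafe {
        // One RWX mapping for the demo; a production JIT would flip W^X per page.
        if mprotect(addr_of_mut!(POOL) as *mut _, POOL_SIZE, PROT_RWX) != 0 {
            eprintln!("mprotect failed; this OS may not allow RWX here");
            return;
        }
        // Emit `mov eax, 42; ret` into the pool and call it.
        let code = alloc_code(16).unwrap();
        code.copy_from([0xB8, 42, 0, 0, 0, 0xC3].as_ptr(), 6);
        let f: extern "C" fn() -> i32 = std::mem::transmute(code);
        println!("jitted stub returned {}", f());
    }
}
```

Since the pool is linked into the BSS, a `call rel32` emitted anywhere in it can reach any of the program's own functions.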


If your absolute target addresses are in the low 2GiB of virtual address space, use mmap(MAP_32BIT) on Linux. (e.g. if your Rust code is compiled into a non-PIE executable for x86-64 Linux. But that won't be the case for PIE executables (common these days), or for targets in shared libraries. You can detect this at run-time by checking the address of one of your helper functions.)

In general (if MAP_32BIT isn't helpful/available), your best bet is probably mmap without MAP_FIXED, but with a non-NULL hint address that you think is free.

Linux 4.17 introduced MAP_FIXED_NOREPLACE which would let you easily search for a nearby unused region (e.g. step by 64MB and retry if you get EEXIST, then remember that address to avoid searching next time). Otherwise you could parse /proc/self/maps once at startup to find some unmapped space near the mapping that contains the address of one of your helper functions. They will be close together.

Note that older kernels which do not recognize the MAP_FIXED_NOREPLACE flag will typically (upon detecting a collision with a preexisting mapping) fall back to a "non-MAP_FIXED" type of behavior: they will return an address that is different from the requested address.
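That search loop could be sketched like this, assuming x86-64 Linux flag values and raw libc declarations (the 64MiB step, the 16-step limit, and the `map_near` name are arbitrary illustration choices). The `p == hint` check handles exactly the old-kernel fallback described above:

```rust
// x86-64 Linux constants (assumption: values match this target).
const PROT_READ: i32 = 1;
const PROT_WRITE: i32 = 2;
const MAP_PRIVATE: i32 = 0x02;
const MAP_ANONYMOUS: i32 = 0x20;
const MAP_FIXED_NOREPLACE: i32 = 0x100000;

extern "C" {
    fn mmap(addr: *mut core::ffi::c_void, len: usize, prot: i32,
            flags: i32, fd: i32, offset: i64) -> *mut core::ffi::c_void;
    fn munmap(addr: *mut core::ffi::c_void, len: usize) -> i32;
}

/// Try to map `len` bytes within ~1GiB of `near`, stepping 64MiB per attempt.
fn map_near(near: usize, len: usize) -> Option<*mut u8> {
    const STEP: usize = 64 << 20;
    for i in 1..=16usize {
        for &cand in &[near.wrapping_add(i * STEP), near.wrapping_sub(i * STEP)] {
            let hint = cand & !0xfff; // page-align the requested address
            let p = unsafe {
                mmap(hint as *mut _, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0)
            };
            if p as isize == -1 {
                continue; // EEXIST: that spot is occupied, try the next candidate
            }
            if p as usize == hint {
                return Some(p as *mut u8);
            }
            // Old kernel ignored the flag and mapped elsewhere: undo, keep looking.
            unsafe { munmap(p, len) };
        }
    }
    None // caller falls back to plain mmap plus mov r64,imm64 / call r64
}

fn main() {
    let target = main as usize; // stand-in for one of the AOT helper functions
    match map_near(target, 4096) {
        Some(p) => println!("mapped at {:p}, target at {:#x}", p, target),
        None => println!("no nearby slot; falling back to far calls"),
    }
}
```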

Mapping in the next higher or lower free page(s) would be ideal for keeping a non-sparse memory map, so the page table doesn't need too many different top-level page directories. (HW page tables are a radix tree.) And once you find a spot that works, make future allocations contiguous with that. If you end up using a lot of space there, the kernel can opportunistically use a 2MB hugepage, and having your pages contiguous again means they share the same parent page directory in the HW page tables, so iTLB misses triggering page walks may be slightly cheaper (if those higher levels stay hot in data caches, or even cached inside the pagewalk hardware itself). It's also more efficient for the kernel to track as one larger mapping. Of course, using more of an already-allocated page is even better, if there's room. Better code density at the page level helps the instruction TLB, and possibly also within a DRAM page (but that's not necessarily the same size as a virtual memory page).


Then as you do code-gen for each call, just check whether the target is in range for a call rel32 with off == (off as i32) as i64, else fall back to 10-byte mov r64,imm64 / call r64. (rustc will compile that check to movsxd/cmp, so checking every time has only trivial cost for JIT compile times.)

(Or 5-byte mov r32,imm32 if possible. OSes that don't support MAP_32BIT might still have the target addresses down there. Check for that with target == (target as u32) as u64. The 3rd mov-immediate encoding, 7-byte mov r/m64, sign_extended_imm32 is probably not interesting unless you're JITing kernel code for a kernel mapped in the high 2GiB of virtual address space.)
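That range check and fallback could look something like this in the emitter (a sketch; `emit_call` is an illustrative name, and the choice of RAX as the scratch register is just the REX-prefix-saving default mentioned later):

```rust
/// Emit the shortest usable call sequence at code-gen time.
/// `site` is the address the first byte of the call will be emitted at.
fn emit_call(site: u64, target: u64, out: &mut Vec<u8>) {
    // rel32 is measured from the end of the 5-byte call instruction.
    let off = target.wrapping_sub(site.wrapping_add(5)) as i64;
    if off == (off as i32) as i64 {
        out.push(0xE8); // call rel32: 5 bytes
        out.extend_from_slice(&(off as i32).to_le_bytes());
    } else if target == (target as u32) as u64 {
        // Target in the low 4GiB: 5-byte mov eax, imm32 zero-extends into RAX.
        out.push(0xB8); // mov eax, imm32
        out.extend_from_slice(&(target as u32).to_le_bytes());
        out.extend_from_slice(&[0xFF, 0xD0]); // call rax
    } else {
        out.extend_from_slice(&[0x48, 0xB8]); // mov rax, imm64 (10 bytes)
        out.extend_from_slice(&target.to_le_bytes());
        out.extend_from_slice(&[0xFF, 0xD0]); // call rax
    }
}

fn main() {
    let mut buf = Vec::new();
    emit_call(0x1000, 0x2000, &mut buf); // near target: 5-byte direct call
    emit_call(0x1000, 0x7fff_0000_0000, &mut buf); // far target: movabs + call rax
    println!("{:02x?}", buf);
}
```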

The beauty of checking and using a direct call whenever possible is that it decouples code-gen from any knowledge about allocating nearby pages or where the addresses come from, and just opportunistically makes good code. (You might record a counter or log once so you / your users at least notice if your nearby allocation mechanism is failing, because the perf diff won't typically be easily measurable.)


Alternatives to mov-imm / call reg

mov r64,imm64 is a 10-byte instruction that's a bit large to fetch/decode, and for the uop-cache to store. And may take an extra cycle to read from the uop cache on SnB-family according to Agner Fog's microarch pdf (https://agner.org/optimize). But modern CPUs have pretty good bandwidth for code-fetch, and robust front-ends.

If profiling finds that front-end bottlenecks are a big problem in your code, or large code size is causing eviction of other valuable code from L1 I-cache, I'd go with option 5.

BTW, if any of your functions are variadic, x86-64 System V requires that you pass AL = the number of XMM args, so you can't use RAX for the function pointer there; you could use r11 instead. It's call-clobbered and not used for arg-passing. But RAX (or another "legacy" register) will save a REX prefix on the call.


  1. Allocate Rust functions near where mmap will allocate

No, I don't think there's any mechanism to get your statically compiled functions near where mmap might happen to put new pages.

mmap has more than 4GB of free virtual address space to pick from. You don't know ahead of time where it's going to allocate. (Although I think Linux at least does keep some amount of locality, to optimize the HW page tables.)

You in theory could copy the machine code of your Rust functions, but they probably reference other static code/data with RIP-relative addressing modes.


  3. call rel32 to stubs that use mov/jmp reg

This seems like it would be detrimental to performance (perhaps interfering with RAS/jump address prediction).

The perf downside is only from having 2 total call/jump instructions for the front-end to get past before it can feed the back-end with useful instructions. It's not great; 5. is much better.

This is basically how the PLT works for calls to shared-library functions on Unix/Linux, and will perform the same. Calling through a PLT (Procedure Linking Table) stub function is almost exactly like this. So the performance impacts have been well-studied and compared with other ways of doing things. We know that dynamic library calls aren't a performance disaster.

Asterisk before an address and push instructions, where is it being pushed to? shows AT&T disassembly of one, or single-step a C program like main(){puts("hello"); puts("world");} if you're curious. (On the first call, it pushes an arg and jumps to a lazy dynamic linker function; on subsequent calls the indirect jump target is the address of the function in the shared library.)

Why does the PLT exist in addition to the GOT, instead of just using the GOT? explains more. The jmp whose address is updated by lazy linking is jmp qword [xxx@GOTPLT]. (And yes, the PLT really does use a memory-indirect jmp here, even on i386 where a jmp rel32 that gets rewritten would work. IDK if GNU/Linux ever historically used to rewrite the offset in a jmp rel32.)

The jmp is just a standard tailcall, and does not unbalance the Return-Address predictor Stack. The eventual ret in the target function will return to the instruction after the original call, i.e. to the address that call pushed onto the call stack and onto the microarchitectural RAS. Only if you used a push / ret (like a "retpoline" for Spectre mitigation) would you unbalance the RAS.

But the code in Jumps for a JIT (x86_64) that you linked is unfortunately terrible (see my comment under it). It will break the RAS for future returns. You'd think the call (to get a return address to be adjusted) would balance out the push/ret, but actually call +0 is a special case that doesn't go on the RAS in most CPUs: http://blog.stuffedcow.net/2018/04/ras-microbenchmarks. (Calling over a nop could change that, I guess, but the whole thing is totally insane vs. call rax unless it's trying to defend against Spectre exploits.) Normally on x86-64, you use a RIP-relative LEA to get a nearby address into a register, not call/pop.


  4. inline mov r64, imm64 / call reg

This is probably better than 3; the front-end cost of larger code size is probably lower than the cost of calling through a stub that uses jmp.

But this is also probably good enough, especially if your alloc-within-2GiB methods work well enough most of the time on most of the targets you care about.

There may be cases where it's slower than 5, though. Branch prediction hides the latency of fetching and checking the function pointer from memory, assuming that it predicts well. (And usually it will, or else it runs so infrequently that it's not performance-relevant.)


  5. call qword [rel nearby_func_ptr]

This is how gcc -fno-plt compiles calls to shared-library functions on Linux (call [rip + symbol@GOTPCREL]), and how Windows DLL function calls are normally done. (This is like one of the suggestions in http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/)

call [RIP-relative] is 6 bytes, only 1 byte larger than call rel32, so it has a negligible impact on code-size vs. calling a stub. Fun fact: you will sometimes see addr32 call rel32 in machine code (the address size prefix has no effect except for padding). This comes from a linker relaxing a call [RIP + symbol@GOTPCREL] to a call rel32 if the symbol with non-hidden ELF visibility was found in another .o during linking, not a different shared object after all.
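Emitting that 6-byte form is mechanical once you know where the pointer slot lives; the only constraint is that the slot itself must be rel32-reachable from the call site. A sketch (the function name is illustrative):

```rust
/// Emit `call qword ptr [rip + disp32]` at `site`, loading the target from the
/// 8-byte pointer stored at `slot`. Returns false if the slot is out of range.
fn emit_call_rip_indirect(site: u64, slot: u64, out: &mut Vec<u8>) -> bool {
    // disp32 is relative to the end of the 6-byte instruction.
    let disp = slot.wrapping_sub(site.wrapping_add(6)) as i64;
    if disp != (disp as i32) as i64 {
        return false; // the pointer slot must be within +-2GiB of the call site
    }
    // FF /2 with ModRM mod=00, reg=010, rm=101 selects RIP-relative addressing.
    out.extend_from_slice(&[0xFF, 0x15]);
    out.extend_from_slice(&(disp as i32).to_le_bytes());
    true
}

fn main() {
    let mut buf = Vec::new();
    // Pointer slot 0x1000 bytes after the call site:
    assert!(emit_call_rip_indirect(0x40_1000, 0x40_2000, &mut buf));
    println!("{:02x?}", buf); // FF 15 followed by the little-endian disp32
}
```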

For shared library calls, this is usually better than PLT stubs, with the only downside being slower program startup because it requires early binding (non-lazy dynamic linking). This isn't an issue for you; the target address is known ahead of code-gen time.

The patch author tested its performance vs. a traditional PLT on some unknown x86-64 hardware. Clang is maybe a worst-case scenario for shared library calls, because it makes many calls to small LLVM functions that don't take much time, and it's long running so early-binding startup overhead is negligible. After using gcc and gcc -fno-plt to compile clang, the time for clang -O2 -g to compile tramp3d goes from 41.6s (PLT) to 36.8s (-fno-plt). clang --help becomes slightly slower.

(x86-64 PLT stubs use jmp qword [symbol@GOTPLT], not mov r64,imm64/jmp though. A memory-indirect jmp is only a single uop on modern Intel CPUs, so it's cheaper on a correct prediction, but maybe slower on an incorrect prediction, especially if the GOTPLT entry misses in cache. If it's used frequently, it will typically predict correctly, though. But anyway a 10-byte movabs and a 2-byte jmp can fetch as a block (if it fits in a 16-byte aligned fetch block), and decode in a single cycle, so 3. is not totally unreasonable. But this is better.)

When allocating space for your pointers, remember that they're fetched as data, into L1d cache, and with a dTLB entry not iTLB. Don't interleave them with code, that would waste space in the I-cache on this data, and waste space in D-cache on lines that contain one pointer and mostly code. Group your pointers together in a separate 64-byte chunk from code so the line doesn't need to be in both L1I and L1D. It's fine if they're in the same page as some code; they're read-only so won't cause self-modifying-code pipeline nukes.
