Handling calls to (potentially) far away ahead-of-time compiled functions from JITed code


Problem Description


This question was put on hold as too broad, presumably because of the research I included in an effort to "show my work" instead of asking a low effort question. To remedy this, allow me to summarize the entire question in a single sentence (credit to @PeterCordes for this phrase):

How do I efficiently call (x86-64) ahead-of-time compiled functions (that I control, may be further than 2GB away) from JITed code (that I am generating)?

This alone, I suspect, would be put on hold as "too broad." In particular, it lacks a "what have you tried." So, I felt the need to add additional information showing my research/thinking and what I have tried. Below is a somewhat stream of consciousness of this.

Note that none of the questions posed below here are ones I expect to be answered; they are more rhetorical. Their purpose is to demonstrate why I can't answer the above question (despite my research, I lack the experience in this area to make definitive statements such as @PeterCordes's "branch prediction hides the latency of fetching and checking the function pointer from memory, assuming that it predicts well."). Also note that the Rust component is largely irrelevant here as this is an assembly issue. My reasoning for including it was the ahead-of-time compiled functions are written in Rust, so I was unsure if there was something that Rust did (or instructed LLVM to do) that could be advantageous in this situation. It is totally acceptable for an answer to not consider Rust at all; in fact, I expect this will be the case.

Think of the following as scratch work on the back of a math exam:


Note: I muddled the term intrinsics here. As pointed out in the comments, "ahead-of-time compiled functions" is a better description. Below I'll abbreviate that AOTC functions.

I'm writing a JIT in Rust (although Rust is only relevant to a bit of my question, the bulk of it relates to JIT conventions). I have AOTC functions that I've implemented in Rust that I need to be able to call from code emitted by my JIT. My JIT mmap(_, _, PROT_EXEC, MAP_ANONYMOUS | MAP_SHARED)s some pages for the jitted code. I have the addresses of my AOTC functions, but unfortunately they are much further away than a 32-bit offset. I'm trying to decide now how to emit calls to these AOTC functions. I've considered the following options (these are not questions to be answered, just demonstrating why I can't answer the core question of this SO thread myself):

  1. (Rust specific) Somehow make Rust place the AOTC functions close to (maybe on?) the heap so that the calls will be within a 32-bit offset. It's unclear that that is possible with Rust (There is a way to specify custom linker args, but I can't tell to what those are applied and if I could target a single function for relocation. And even if I could where do I put it?). It also seems like this could fail if the heap is large enough.

  2. (Rust specific) Allocate my JIT pages closer to the AOTC functions. This could be achieved with mmap(_, _, PROT_EXEC, MAP_FIXED), but I'm unsure how to pick an address that wouldn't clobber existing Rust code (while keeping within arch restrictions--is there a sane way to get those restrictions?).

  3. Create stubs within the JIT pages that handle the absolute jump (code below), then call the stubs. This has the benefit of the (initial) call site in the JITted code being a nice small relative call. But it feels wrong to have to jump through something. This seems like it would be detrimental to performance (perhaps interfering with RAS/jump address prediction). Additionally, it seems like this jump would be slower since its address is indirect and it depends on the mov for that address.

mov rax, {ABSOLUTE_AOTC_FUNCTION_ADDRESS}
jmp rax

  4. The reverse of (3), just inlining the above at each intrinsic call site in the JITed code. This resolves the indirection issue, but makes the JITted code larger (perhaps this has instruction cache and decoding consequences). It still has the issue that the jump is indirect and depends on the mov.

  5. Place the addresses of the AOTC functions on a PROT_READ (only) page near the JIT pages. Make all the call sites near, absolute indirect calls (code below). This removes the second level of indirection from (3). But the encoding of this instruction is unfortunately large (6 bytes), so it has the same issues as (4). Additionally, now instead of depending on a register, jumps unnecessarily (insofar as the address is known at JIT time) depend on memory, which certainly has performance implications (despite perhaps this page being cached?).

aotc_function_address:
    .quad 0xDEADBEEF

# Then at the call site
call qword ptr [rip+aotc_function_address]

  6. Futz with a segment register to place it closer to the AOTC functions so that calls can be made relative to that segment register. The encoding of such a call is long (so maybe this has decoding pipeline issues), but other than that this largely avoids lots of the tricky bits of everything before it. But, maybe calling relative to a non-cs segment performs poorly. Or maybe such futzing is not wise (messes with the Rust runtime, for example). (as pointed out by @prl, this doesn't work without a far call, which is terrible for performance)

  7. Not really a solution, but I could make the compiler 32-bit and not have this problem at all. That's not really a great solution and it also would prevent me from using the extended general purpose registers (of which I utilize all).
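To make option 3 concrete, the stub can be assembled to raw bytes; a minimal hypothetical sketch (`48 B8` encodes `mov rax, imm64` and `FF E0` encodes `jmp rax`):

```rust
/// Emit a 12-byte absolute-jump stub: mov rax, imm64 / jmp rax.
/// Hypothetical helper; a real JIT would write these bytes into the
/// stub area of its executable pages.
fn emit_absolute_jmp_stub(target: u64) -> [u8; 12] {
    let mut stub = [0u8; 12];
    stub[0] = 0x48; // REX.W prefix
    stub[1] = 0xB8; // mov rax, imm64
    stub[2..10].copy_from_slice(&target.to_le_bytes());
    stub[10] = 0xFF; // jmp rax
    stub[11] = 0xE0;
    stub
}
```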

All of the options presented have drawbacks. Briefly, 1 and 2 are the only ones that don't seem to have performance impacts, but it's unclear if there is a non-hacky way to achieve them (or any way at all for that matter). 3-5 are independent of Rust, but have obvious performance drawbacks.

Given this stream of consciousness, I arrived at the following rhetorical questions (which don't need explicit answers) to demonstrate that I lack the knowledge to answer the core question of this SO thread by myself. I have struck them to make it abundantly clear that I am not posing all of these as part of my question.

  1. For approach (1), is it possible to force Rust to link certain extern "C" functions at a specific address (near the heap)? How should I choose such an address (at compile time)? Is it safe to assume that any address returned by mmap (or allocated by Rust) will be within a 32 bit offset of this location?

  2. For approach (2), how can I find a suitable place to place the JIT pages (such that it doesn't clobber existing Rust code)?

And some JIT (non-Rust) specific questions:

  1. For approach (3), will the stubs hamper performance enough that I should care? What about the indirect jmp? I know this somewhat resembles linker stubs, except as I understand linker stubs are at least only resolved once (so they don't need to be indirect?). Do any JITs employ this technique?

  2. For approach (4), if the indirect call in 3 is okay, is inlining the calls worth it? If JITs typically employ approach (3/4) is this option better?

  3. For approach (5), is the dependence of the jump on memory (given that the address is known at compile time) bad? Would that make it less performant than (3) or (4)? Do any JITs employ this technique?

  4. For approach (6), is such futzing unwise? (Rust specific) Is there a segment register available (not used by the runtime or ABI) for this purpose? Will calls relative to a non-cs segment be as performant as those relative to cs?

  5. And finally (and most importantly), is there a better approach (perhaps employed more commonly by JITs) that I'm missing here?

I can't implement (1) or (2) without my Rust questions having answers. I could, of course, implement and benchmark 3-5 (perhaps 6, although it would be nice to know about the segment register futzing beforehand), but given that these are vastly different approaches, I was hoping there was existing literature about this that I couldn't find, because I didn't know the right terms to google for (I'm also currently working on those benchmarks). Alternatively maybe someone who's delved into JIT internals can share their experience or what they've commonly seen?

I am aware of this question: Jumps for a JIT (x86_64). It differs from mine because it is talking about stringing together basic blocks (and the accepted solution is way too many instructions for a frequently called intrinsic). I am also aware of Call an absolute pointer in x86 machine code, which while it discusses similar topics to mine, is different, because I am not assuming that absolute jumps are necessary (approaches 1-2 would avoid them, for example).

Solution

Summary: try to allocate memory near your static code. But for calls that can't reach with rel32, fall back to call qword [rel pointer] or inline mov r64,imm64 / call r64.

Your mechanism 5. is probably best for performance if you can't make 2. work, but 4. is easy and should be fine. Direct call rel32 needs some branch prediction, too, but it's definitely still better.


Terminology: "intrinsic functions" should probably be "helper" functions. "Intrinsic" usually means either language built-in (e.g. Fortran meaning) or "not a real function, just something that inlines to a machine instruction" (C/C++ / Rust meaning, like for SIMD, or stuff like _mm_popcnt_u32(), _pdep_u32(), or _mm_mfence()). Your Rust functions are going to compile to real functions that exist in machine code that you're going to call with call instructions.


Yes, allocating your JIT buffers within +-2GiB of your target functions is obviously ideal, allowing rel32 direct calls.

The most straightforward would be to use a large static array in the BSS (which the linker will place within 2GiB of your code) and carve your allocations out of that. (Use mprotect (POSIX) or VirtualProtect (Windows) to make it executable).

Most OSes (Linux included) do lazy allocation for the BSS (COW mapping to the zero page, only allocating physical page frames to back that allocation when it's written, just like mmap without MAP_POPULATE), so it only wastes virtual address space to have a 512MiB array in the BSS that you only use the bottom 10kB of.

Don't make it larger than or close to 2GiB, though, because that will push other things in the BSS too far away. The default "small" code model (as described in the x86-64 System V ABI) puts all static addresses within 2GiB of each other for RIP-relative data addressing and rel32 call/jmp.

Downside: you'd have to write at least a simple memory allocator yourself, instead of working with whole pages with mmap/munmap. But that's easy if you don't need to free anything. Maybe just generate code starting at an address, and update a pointer once you get to the end and discover how long your code block is. (But that's not multi-threaded...) For safety, remember to check when you get to the end of this buffer and abort, or fall back to mmap.
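A minimal sketch of such a bump allocator over a static BSS arena, assuming a fixed 1 MiB arena and no freeing (a real JIT would also mprotect the carved-out pages to PROT_READ|PROT_EXEC before executing them, which is omitted here):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const ARENA_SIZE: usize = 1 << 20; // 1 MiB; keep well under 2 GiB

// Lives in the BSS, which the linker places within 2 GiB of the
// program's code, so rel32 calls from code carved out of it can reach
// AOT-compiled functions. Pages are only physically allocated on write.
static mut JIT_ARENA: [u8; ARENA_SIZE] = [0; ARENA_SIZE];
static NEXT: AtomicUsize = AtomicUsize::new(0);

/// Bump-allocate `size` bytes (16-byte aligned) out of the arena.
/// Returns None when exhausted: the caller should fall back to mmap.
fn bump_alloc(size: usize) -> Option<*mut u8> {
    let size = (size + 15) & !15;
    let off = NEXT.fetch_add(size, Ordering::Relaxed);
    if off + size > ARENA_SIZE {
        return None;
    }
    Some(unsafe { (std::ptr::addr_of_mut!(JIT_ARENA) as *mut u8).add(off) })
}
```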


If your absolute target addresses are in the low 2GiB of virtual address space, use mmap(MAP_32BIT) on Linux. (e.g. if your Rust code is compiled into a non-PIE executable for x86-64 Linux. But that won't be the case for PIE executables (common these days), or for targets in shared libraries. You can detect this at run-time by checking the address of one of your helper functions.)

In general (if MAP_32BIT isn't helpful/available), your best bet is probably mmap without MAP_FIXED, but with a non-NULL hint address that you think is free.

Linux 4.17 introduced MAP_FIXED_NOREPLACE which would let you easily search for a nearby unused region (e.g. step by 64MB and retry if you get EEXIST, then remember that address to avoid searching next time). Otherwise you could parse /proc/self/maps once at startup to find some unmapped space near the mapping that contains the address of one of your helper functions; they will be close together.
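A Linux-specific sketch of that search loop (`mmap_near` is a hypothetical helper; the libc declarations and flag values are the ones for x86-64 Linux, declared by hand so the sketch needs no external crate):

```rust
use std::os::raw::c_void;

// Raw libc declarations and constants for x86-64 Linux.
extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: i32, flags: i32,
            fd: i32, off: i64) -> *mut c_void;
    fn munmap(addr: *mut c_void, len: usize) -> i32;
}
const PROT_READ: i32 = 0x1;
const PROT_WRITE: i32 = 0x2;
const MAP_PRIVATE: i32 = 0x02;
const MAP_ANONYMOUS: i32 = 0x20;
const MAP_FIXED_NOREPLACE: i32 = 0x100000; // Linux >= 4.17

/// Search for a free region within +-2 GiB of `anchor`, stepping 64 MiB.
fn mmap_near(anchor: usize, len: usize) -> Option<*mut c_void> {
    const STEP: usize = 64 << 20; // 64 MiB
    for i in 1..32usize { // 31 * 64 MiB stays within 2 GiB each way
        for &hint in &[anchor + i * STEP, anchor.saturating_sub(i * STEP)] {
            let hint = hint & !0xfff; // page-align the hint
            if hint == 0 { continue; }
            let p = unsafe {
                mmap(hint as *mut c_void, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0)
            };
            if p as isize == -1 { continue; } // EEXIST etc.: try next slot
            if p as usize == hint { return Some(p); }
            // Older kernels ignore MAP_FIXED_NOREPLACE and may return a
            // different address; undo and keep searching.
            unsafe { munmap(p, len); }
        }
    }
    None
}
```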

Note that older kernels which do not recognize the MAP_FIXED_NOREPLACE flag will typically (upon detecting a collision with a preexisting mapping) fall back to a "non-MAP_FIXED" type of behavior: they will return an address that is different from the requested address.

Allocating in the next higher or lower free page(s) would be ideal for having a non-sparse memory map, so the page table doesn't need too many different top-level page directories. (HW page tables are a radix tree.) And once you find a spot that works, make future allocations contiguous with that. If you end up using a lot of space there, the kernel can opportunistically use a 2MB hugepage, and having your pages contiguous again means they share the same parent page directory in the HW page tables so iTLB misses triggering page walks may be slightly cheaper (if those higher levels stay hot in data caches, or even cached inside the pagewalk hardware itself). And it's more efficient for the kernel to track as one larger mapping. Of course, using more of an already-allocated page is even better, if there's room. Better code density on a page level helps the instruction TLB, and possibly also within a DRAM page (but that's not necessarily the same size as a virtual memory page).


Then as you do code-gen for each call, just check whether the target is in range for a call rel32 with off == (off as i32) as i64, else fall back to the 10-byte mov r64,imm64 / call r64. (rustc will compile that check to movsxd/cmp, so checking every time has only trivial cost for JIT compile times.)

(Or 5-byte mov r32,imm32 if possible. OSes that don't support MAP_32BIT might still have the target addresses down there. Check for that with target == (target as u32) as u64. The 3rd mov-immediate encoding, 7-byte mov r/m64, sign_extended_imm32 is probably not interesting unless you're JITing kernel code for a kernel mapped in the high 2GiB of virtual address space.)
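That check-and-fall-back code-gen can be sketched as follows (hypothetical emitter; it assumes the absolute addresses of the call site and target are known at JIT time):

```rust
/// Emit a call to `target` at `site` (the address where the first byte
/// of the call instruction will live). Uses a direct call rel32 when
/// the target is reachable, else mov rax, imm64 / call rax.
fn emit_call(site: u64, target: u64) -> Vec<u8> {
    // rel32 is relative to the end of the 5-byte call instruction.
    let off = (target as i64).wrapping_sub((site as i64).wrapping_add(5));
    if off == (off as i32) as i64 {
        let mut v = vec![0xE8]; // call rel32
        v.extend_from_slice(&(off as i32).to_le_bytes());
        v
    } else {
        let mut v = vec![0x48, 0xB8]; // mov rax, imm64
        v.extend_from_slice(&target.to_le_bytes());
        v.extend_from_slice(&[0xFF, 0xD0]); // call rax
        v
    }
}
```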

The beauty of checking and using a direct call whenever possible is that it decouples code-gen from any knowledge about allocating nearby pages or where the addresses come from, and just opportunistically makes good code. (You might record a counter or log once so you / your users at least notice if your nearby allocation mechanism is failing, because the perf diff won't typically be easily measurable.)


Alternatives to mov-imm / call reg

mov r64,imm64 is a 10-byte instruction that's a bit large to fetch/decode, and for the uop-cache to store. And may take an extra cycle to read from the uop cache on SnB-family according to Agner Fog's microarch pdf (https://agner.org/optimize). But modern CPUs have pretty good bandwidth for code-fetch, and robust front-ends.

If profiling finds that front-end bottlenecks are a big problem in your code, or large code size is causing eviction of other valuable code from L1 I-cache, I'd go with option 5.

BTW, if any of your functions are variadic, x86-64 System V requires that you pass AL = number of XMM args; in that case you could use r11 for the function pointer. It's call-clobbered and not used for arg-passing. But RAX (or another "legacy" register) will save a REX prefix on the call.


  1. Allocate Rust functions near where mmap will allocate

No, I don't think there's any mechanism to get your statically compiled functions near where mmap might happen to put new pages.

mmap has more than 4GB of free virtual address space to pick from. You don't know ahead of time where it's going to allocate. (Although I think Linux at least does keep some amount of locality, to optimize the HW page tables.)

You in theory could copy the machine code of your Rust functions, but they probably reference other static code/data with RIP-relative addressing modes.


  3. call rel32 to stubs that use mov/jmp reg

This seems like it would be detrimental to performance (perhaps interfering with RAS/jump address prediction).

The perf downside is only from having 2 total call/jump instructions for the front-end to get past before it can feed the back-end with useful instructions. It's not great; 5. is much better.

This is basically how the PLT works for calls to shared-library functions on Unix/Linux, and will perform the same. Calling through a PLT (Procedure Linkage Table) stub function is almost exactly like this. So the performance impacts have been well-studied and compared with other ways of doing things. We know that dynamic library calls aren't a performance disaster.

Asterisk before an address and push instructions, where is it being pushed to? shows AT&T disassembly of one, or single-step a C program like main(){puts("hello"); puts("world");} if you're curious. (On the first call, it pushes an arg and jumps to a lazy dynamic linker function; on subsequent calls the indirect jump target is the address of the function in the shared library.)

Why does the PLT exist in addition to the GOT, instead of just using the GOT? explains more. The jmp whose address is updated by lazy linking is jmp qword [xxx@GOTPLT]. (And yes, the PLT really does use a memory-indirect jmp here, even on i386 where a jmp rel32 that gets rewritten would work. IDK if GNU/Linux ever historically used to rewrite the offset in a jmp rel32.)

The jmp is just a standard tailcall, and does not unbalance the Return-Address predictor Stack. The eventual ret in the target function will return to the instruction after the original call, i.e. to the address that call pushed onto the call stack and onto the microarchitectural RAS. Only if you used a push / ret (like a "retpoline" for Spectre mitigation) would you unbalance the RAS.

But the code in Jumps for a JIT (x86_64) that you linked is unfortunately terrible (see my comment under it). It will break the RAS for future returns. You'd think the call (to get a return address to be adjusted) would balance out the push/ret, but actually call +0 is a special case that doesn't go on the RAS in most CPUs: http://blog.stuffedcow.net/2018/04/ras-microbenchmarks. (calling over a nop could change that I guess, but the whole thing is totally insane vs. call rax unless it's trying to defend against Spectre exploits.) Normally on x86-64, you use a RIP-relative LEA to get a nearby address into a register, not call/pop.


  4. inline mov r64, imm64 / call reg

This is probably better than 3; the front-end cost of larger code-size is probably lower than the cost of calling through a stub that uses jmp.

But this is also probably good enough, especially if your alloc-within-2GiB methods work well enough most of the time on most of the targets you care about.

There may be cases where it's slower than 5. though. Branch prediction hides the latency of fetching and checking the function pointer from memory, assuming that it predicts well. (And usually it will, or else it runs so infrequently that it's not performance relevant.)


  5. call qword [rel nearby_func_ptr]

This is how gcc -fno-plt compiles calls to shared-library functions on Linux (call [rip + symbol@GOTPCREL]), and how Windows DLL function calls are normally done. (This is like one of the suggestions in http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/)

call [RIP-relative] is 6 bytes, only 1 byte larger than call rel32, so it has a negligible impact on code-size vs. calling a stub. Fun fact: you will sometimes see addr32 call rel32 in machine code (the address size prefix has no effect except for padding). This comes from a linker relaxing a call [RIP + symbol@GOTPCREL] to a call rel32 if the symbol with non-hidden ELF visibility was found in another .o during linking, not a different shared object after all.
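A sketch of emitting that 6-byte `FF 15` (call r/m64, RIP-relative) encoding; hypothetical helper, and note the pointer slot itself must be within ±2 GiB of the call site:

```rust
/// Emit `call qword [rip + disp32]` (FF /2). `ptr_slot` is the absolute
/// address of the 8-byte slot holding the target function pointer;
/// disp32 is relative to the *end* of the 6-byte instruction.
fn emit_call_rip_indirect(site: u64, ptr_slot: u64) -> Option<[u8; 6]> {
    let disp = (ptr_slot as i64).wrapping_sub((site as i64).wrapping_add(6));
    if disp != (disp as i32) as i64 {
        return None; // pointer slot not reachable with a disp32
    }
    let mut insn = [0xFF, 0x15, 0, 0, 0, 0];
    insn[2..].copy_from_slice(&(disp as i32).to_le_bytes());
    Some(insn)
}
```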

For shared library calls, this is usually better than PLT stubs, with the only downside being slower program startup because it requires early binding (non-lazy dynamic linking). This isn't an issue for you; the target address is known ahead of code-gen time.

The patch author tested its performance vs. a traditional PLT on some unknown x86-64 hardware. Clang is maybe a worst-case scenario for shared library calls, because it makes many calls to small LLVM functions that don't take much time, and it's long running so early-binding startup overhead is negligible. After using gcc and gcc -fno-plt to compile clang, the time for clang -O2 -g to compile tramp3d goes from 41.6s (PLT) to 36.8s (-fno-plt). clang --help becomes slightly slower.

(x86-64 PLT stubs use jmp qword [symbol@GOTPLT], not mov r64,imm64/jmp though. A memory-indirect jmp is only a single uop on modern Intel CPUs, so it's cheaper on a correct prediction, but maybe slower on an incorrect prediction, especially if the GOTPLT entry misses in cache. If it's used frequently, it will typically predict correctly, though. But anyway a 10-byte movabs and a 2-byte jmp can fetch as a block (if it fits in a 16-byte aligned fetch block), and decode in a single cycle, so 3. is not totally unreasonable. But this is better.)

When allocating space for your pointers, remember that they're fetched as data, into L1d cache, and with a dTLB entry not iTLB. Don't interleave them with code, that would waste space in the I-cache on this data, and waste space in D-cache on lines that contain one pointer and mostly code. Group your pointers together in a separate 64-byte chunk from code so the line doesn't need to be in both L1I and L1D. It's fine if they're in the same page as some code; they're read-only so won't cause self-modifying-code pipeline nukes.
