Handling calls to (potentially) far away ahead-of-time compiled functions from JITed code

Question

This question was put on hold as too broad, presumably because of the research I included in an effort to "show my work" instead of asking a low effort question. To remedy this, allow me to summarize the entire question in a single sentence (credit to @PeterCordes for this phrase):

How do I efficiently call (x86-64) ahead-of-time compiled functions (that I control, may be further than 2GB away) from JITed code (that I am generating)?

This alone, I suspect, would be put on hold as "too broad." In particular, it lacks a "what have you tried." So, I felt the need to add additional information showing my research/thinking and what I have tried. Below is a somewhat stream of consciousness of this.

Note that none of the questions posed below here are ones I expect to be answered; they are more rhetorical. Their purpose is to demonstrate why I can't answer the above question (despite my research, I lack the experience in this area to make definitive statements such as @PeterCordes's "branch prediction hides the latency of fetching and checking the function pointer from memory, assuming that it predicts well."). Also note that the Rust component is largely irrelevant here as this is an assembly issue. My reasoning for including it was the ahead-of-time compiled functions are written in Rust, so I was unsure if there was something that Rust did (or instructed LLVM to do) that could be advantageous in this situation. It is totally acceptable for an answer to not consider Rust at all; in fact, I expect this will be the case.

Think of the following as scratch work on the back of a math exam:

Note: I muddled the term intrinsics here. As pointed out in the comments, "ahead-of-time compiled functions" is a better description. Below I'll abbreviate that AOTC functions.

I'm writing a JIT in Rust (although Rust is only relevant to a bit of my question, the bulk of it relates to JIT conventions). I have AOTC functions that I've implemented in Rust that I need to be able to call from code emitted by my JIT. My JIT mmap(_, _, PROT_EXEC, MAP_ANONYMOUS | MAP_SHARED)s some pages for the jitted code. I have the addresses of my AOTC functions, but unfortunately they are much further away than a 32-bit offset. I'm trying to decide now how to emit calls to these AOTC functions. I've considered the following options (these are not questions to be answered, just demonstrating why I can't answer the core question of this SO thread myself):

  1. (Rust specific) Somehow make Rust place the AOTC functions close to (maybe on?) the heap so that the calls will be within a 32-bit offset. It's unclear whether that is possible with Rust (there is a way to specify custom linker args, but I can't tell what those are applied to, or whether I could target a single function for relocation. And even if I could, where would I put it?). It also seems like this could fail if the heap is large enough.

(Rust specific) Allocate my JIT pages closer to the AOTC functions. This could be achieved with mmap(_, _, PROT_EXEC, MAP_FIXED), but I'm unsure how to pick an address that wouldn't clobber existing Rust code (and that keeps within arch restrictions--is there a sane way to get those restrictions?).

Create stubs within the JIT pages that handle the absolute jump (code below), then call the stubs. This has the benefit of the (initial) call site in the JITted code being a nice small relative call. But it feels wrong to have to jump through something. This seems like it would be detrimental to performance (perhaps interfering with RAS/jump address prediction). Additionally, it seems like this jump would be slower since its address is indirect and it depends on the mov for that address.

mov rax, {ABSOLUTE_AOTC_FUNCTION_ADDRESS}
jmp rax
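For illustration, a minimal sketch of emitting such a stub's bytes from Rust (the encodings 48 B8 imm64 for mov rax, imm64 and FF E0 for jmp rax are standard x86-64; the function name is hypothetical, not from the question):

```rust
// Emit the machine code of an absolute-jump stub: mov rax, imm64 / jmp rax.
fn emit_absolute_jump_stub(target: u64) -> Vec<u8> {
    let mut code = Vec::with_capacity(12);
    code.extend_from_slice(&[0x48, 0xB8]);         // mov rax, imm64
    code.extend_from_slice(&target.to_le_bytes()); // the absolute AOTC address
    code.extend_from_slice(&[0xFF, 0xE0]);         // jmp rax
    code
}

fn main() {
    let stub = emit_absolute_jump_stub(0xDEAD_BEEF);
    assert_eq!(stub.len(), 12); // 2 + 8 + 2 bytes
    println!("{:02x?}", stub);
}
```

The stub is 12 bytes, so the JITted call site itself stays a compact 5-byte call rel32 to it.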

  1. The inverse of (3): just inline the mov/jmp pair above at every internal call site in the JITed code. This solves the indirection issue, but makes the JITed code larger (perhaps with instruction-cache and decoding consequences). There's still the issue that the jump is indirect and depends on the mov.

AOTC 函数的地址放在JIT页面附近的PROT_READ(仅)页面上.使所有呼叫站点都靠近绝对的间接呼叫(下面的代码).这从(2)中删除了第二个间接级别.但是不幸的是,该指令的编码很大(6个字节),因此它具有与(4)相同的问题.此外,现在不再依赖寄存器,而不必要地跳转(因为在JIT时已知地址),这取决于内存,这肯定会对性能产生影响(尽管可能正在缓存此页面?).

Place the addresses of the AOTC functions on a PROT_READ (only) page near the JIT pages. Make all the call sites near, absolute, indirect calls (code below). This removes the second level of indirection from (3). But the encoding of this instruction is unfortunately large (6 bytes), so it has the same issues as (4). Additionally, now instead of depending on a register, jumps unnecessarily (insofar as the address is known at JIT time) depend on memory, which certainly has performance implications (despite perhaps this page being cached?).

aotc_function_address:
    .quad 0xDEADBEEF

# Then at the call site
call qword ptr [rip+aotc_function_address]
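A sketch of how that 6-byte instruction could be encoded from Rust (FF 15 disp32 is the standard encoding of call qword [rip+disp32]; the helper name is mine). The displacement is relative to the end of the instruction:

```rust
// Encode `call qword ptr [rip+disp32]`. `call_site` is where the 6-byte
// instruction starts; `pointer_slot` holds the 8-byte function address.
fn emit_rip_relative_call(call_site: u64, pointer_slot: u64) -> Option<[u8; 6]> {
    // disp32 is relative to the address of the *next* instruction (call_site + 6).
    let disp = (pointer_slot as i64).wrapping_sub(call_site as i64 + 6);
    if disp != (disp as i32) as i64 {
        return None; // pointer slot not within +-2GiB of the call site
    }
    let mut insn = [0u8; 6];
    insn[0] = 0xFF;
    insn[1] = 0x15; // ModRM for /2 with RIP-relative addressing
    insn[2..].copy_from_slice(&(disp as i32).to_le_bytes());
    Some(insn)
}

fn main() {
    // Pointer slot 0x100 bytes past the call site: disp = 0x100 - 6 = 0xFA.
    let insn = emit_rip_relative_call(0x1000, 0x1100).unwrap();
    println!("{:02x?}", insn);
}
```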

  1. Futz with a segment register to place it closer to the AOTC functions so that calls can be made relative to that segment register. The encoding of such a call is long (so maybe this has decoding pipeline issues), but other than that this largely avoids lots of the tricky bits of everything before it. But, maybe calling relative to a non-cs segment performs poorly. Or maybe such futzing is not wise (messes with the Rust runtime, for example). (as pointed out by @prl, this doesn't work without a far call, which is terrible for performance)

Not really a solution, but I could make the compiler 32-bit and not have this problem at all. That's not really a great solution and it also would prevent me from using the extended general purpose registers (of which I utilize all).

All of the options presented have drawbacks. Briefly, 1 and 2 are the only ones that don't seem to have performance impacts, but it's unclear if there is a non-hacky way to achieve them (or any way at all for that matter). 3-5 are independent of Rust, but have obvious performance drawbacks.

Given this stream of consciousness, I arrived at the following rhetorical questions (which don't need explicit answers) to demonstrate that I lack the knowledge to answer the core question of this SO thread by myself. I have struck them out to make it abundantly clear that I am not posing all of these as part of my question.

  1. For approach (1), is it possible to force Rust to link certain extern "C" functions at a specific address (near the heap)? How should I choose such an address (at compile time)? Is it safe to assume that any address returned by mmap (or allocated by Rust) will be within a 32 bit offset of this location?

For approach (2), how can I find a suitable place to place the JIT pages (such that it doesn't clobber existing Rust code)?

And some JIT (non-Rust) questions:

  1. For approach (3), will the stubs hamper performance enough that I should care? What about the indirect jmp? I know this somewhat resembles linker stubs, except as I understand linker stubs are at least only resolved once (so they don't need to be indirect?). Do any JITs employ this technique?

For approach (4), if the indirect call in 3 is okay, is inlining the calls worth it? If JITs typically employ approach (3/4) is this option better?

For approach (5), is the dependence of the jump on memory (given that the address is known at compile time) bad? Would that make it less performant than (3) or (4)? Do any JITs employ this technique?

For approach (6), is such futzing unwise? (Rust specific) Is there a segment register available (not used by the runtime or ABI) for this purpose? Will calls relative to a non-cs segment be as performant as those relative to cs?

And finally (and most importantly), is there a better approach (perhaps employed more commonly by JITs) that I'm missing here?

I can't implement (1) or (2) without my Rust questions having answers. I could, of course, implement and benchmark 3-5 (perhaps 6, although it would be nice to know about the segment register futzing beforehand), but given that these are vastly different approaches, I was hoping there was existing literature about this that I couldn't find, because I didn't know the right terms to google for (I'm also currently working on those benchmarks). Alternatively maybe someone who's delved into JIT internals can share their experience or what they've commonly seen?

I am aware of this question: Jumps for a JIT (x86_64). It differs from mine because it is talking about stringing together basic blocks (and the accepted solution is way too many instructions for a frequently called intrinsic). I am also aware of Call an absolute pointer in x86 machine code, which while it discusses similar topics to mine, is different, because I am not assuming that absolute jumps are necessary (approaches 1-2 would avoid them, for example).

Answer

Summary: try to allocate memory near your static code. But for calls that can't reach with rel32, fall back to call qword [rel pointer] or inline mov r64,imm64 / call r64.

Your mechanism 5. is probably best for performance if you can't make 2. work, but 4. is easy and should be fine. Direct call rel32 needs some branch prediction, too, but it's definitely still better.

Terminology: "intrinsic functions" should probably be "helper" functions. "Intrinsic" usually means either language built-in (e.g. Fortran meaning) or "not a real function, just something that inlines to a machine instruction" (C/C++ / Rust meaning, like for SIMD, or stuff like _mm_popcnt_u32(), _pdep_u32(), or _mm_mfence()). Your Rust functions are going to compile to real functions that exist in machine code that you're going to call with call instructions.

Yes, allocating your JIT buffers within +-2GiB of your target functions is obviously ideal, allowing rel32 direct calls.

The most straightforward would be to use a large static array in the BSS (which the linker will place within 2GiB of your code) and carve your allocations out of that. (Use mprotect (POSIX) or VirtualProtect (Windows) to make it executable).
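A minimal sketch of that approach, assuming Linux/x86-64 (the names JitBuf/JIT_BUF, aot_helper, and the 1 MiB size are mine; mprotect is declared directly instead of via the libc crate to keep the sketch self-contained):

```rust
// Reserve a page-aligned buffer in the BSS; the linker places it within
// +-2GiB of the program text under the default small code model, so rel32
// calls emitted into it can reach the AOT-compiled functions.
#[repr(C, align(4096))]
struct JitBuf([u8; 1 << 20]); // 1 MiB for the sketch; up to ~512MiB is fine per the answer

static mut JIT_BUF: JitBuf = JitBuf([0u8; 1 << 20]);

extern "C" {
    fn mprotect(addr: *mut u8, len: usize, prot: i32) -> i32; // from libc
}
const PROT_READ: i32 = 1; // Linux values
const PROT_WRITE: i32 = 2;
const PROT_EXEC: i32 = 4;

// Stand-in for one of the AOT-compiled helper functions.
fn aot_helper() -> u64 { 42 }

fn main() {
    let buf = unsafe { std::ptr::addr_of_mut!(JIT_BUF.0) as *mut u8 };
    // BSS and text of the same binary stay within rel32 range of each other.
    let delta = (buf as i64).wrapping_sub(aot_helper as usize as i64);
    assert_eq!(delta, (delta as i32) as i64);
    // Carve executable pages out of the buffer (ignoring W^X hygiene here).
    let ret = unsafe { mprotect(buf, 1 << 20, PROT_READ | PROT_WRITE | PROT_EXEC) };
    assert_eq!(ret, 0);
    println!("jit buffer at {:p}, helper at {:#x}", buf, aot_helper as usize);
}
```

The #[repr(align(4096))] keeps the buffer page-aligned so mprotect can cover it exactly.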

Most OSes (Linux included) do lazy allocation for the BSS (COW mapping to the zero page, only allocating physical page frames to back that allocation when it's written, just like mmap without MAP_POPULATE), so it only wastes virtual address space to have a 512MiB array in the BSS that you only use the bottom 10kB of.

Don't make it larger than or close to 2GiB, though, because that will push other things in the BSS too far away. The default "small" code model (as described in the x86-64 System V ABI) puts all static addresses within 2GiB of each other for RIP-relative data addressing and rel32 call/jmp.

Downside: you'd have to write at least a simple memory allocator yourself, instead of working with whole pages with mmap/munmap. But that's easy if you don't need to free anything. Maybe just generate code starting at an address, and update a pointer once you get to the end and discover how long your code block is. (But that's not multi-threaded...) For safety, remember to check when you get to the end of this buffer and abort, or fall back to mmap.
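Such a minimal allocator can be sketched as a bump pointer over the buffer (single-threaded, never frees; the names are mine, not from the answer):

```rust
// Minimal bump allocator over a preallocated code buffer. It only hands out
// aligned chunks and reports exhaustion so the caller can abort or fall
// back to mmap.
struct BumpAlloc {
    next: usize, // offset of the first free byte
    len: usize,  // total buffer size
}

impl BumpAlloc {
    fn new(len: usize) -> Self {
        BumpAlloc { next: 0, len }
    }

    /// Reserve `size` bytes aligned to `align` (a power of two);
    /// returns the offset into the buffer, or None when full.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        let start = (self.next + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.len {
            return None; // out of room: abort, or fall back to mmap
        }
        self.next = end;
        Some(start)
    }
}

fn main() {
    let mut a = BumpAlloc::new(4096);
    assert_eq!(a.alloc(10, 16), Some(0));
    assert_eq!(a.alloc(1, 16), Some(16)); // rounded up to the next 16-byte boundary
    assert_eq!(a.alloc(8192, 16), None);  // doesn't fit
}
```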

If your absolute target addresses are in the low 2GiB of virtual address space, use mmap(MAP_32BIT) on Linux. (e.g. if your Rust code is compiled into a non-PIE executable for x86-64 Linux. But that won't be the case for PIE executables (common these days), or for targets in shared libraries. You can detect this at run-time by checking the address of one of your helper functions.)

In general (if MAP_32BIT isn't helpful/available), your best bet is probably mmap without MAP_FIXED, but with a non-NULL hint address that you think is free.

Linux 4.17 introduced MAP_FIXED_NOREPLACE which would let you easily search for a nearby unused region (e.g. step by 64MB and retry if you get EEXIST, then remember that address to avoid searching next time). Otherwise you could parse /proc/self/maps once at startup to find some unmapped space near the mapping that contains the address of one of your helper functions. They will be close together.
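The retry loop can be sketched independently of mmap itself; here the probe is a closure standing in for an mmap call with MAP_FIXED_NOREPLACE that either succeeds or fails with EEXIST (the function name and 64MiB step are from the description above, the rest is mine):

```rust
const STEP: u64 = 64 << 20; // 64 MiB between probe addresses

// Step upward from a hint near one of the helper functions, retrying while
// the probe reports the region as taken (true = mapping succeeded there).
fn find_nearby_region<F>(hint: u64, max_tries: u32, mut try_map: F) -> Option<u64>
where
    F: FnMut(u64) -> bool,
{
    (0..max_tries)
        .map(|i| hint + i as u64 * STEP)
        .find(|&addr| try_map(addr))
}

fn main() {
    // Simulate the first three 64MiB slots being occupied.
    let taken = [0x1000_0000u64, 0x1000_0000 + STEP, 0x1000_0000 + 2 * STEP];
    let got = find_nearby_region(0x1000_0000, 32, |a| !taken.contains(&a));
    assert_eq!(got, Some(0x1000_0000 + 3 * STEP));
}
```

In real code the closure would call mmap(addr as *mut _, len, prot, MAP_FIXED_NOREPLACE | ..., -1, 0) and treat EEXIST as "keep searching"; the found address would then be cached for later allocations.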

Note that older kernels which do not recognize the MAP_FIXED_NOREPLACE flag will typically (upon detecting a collision with a preexisting mapping) fall back to a "non-MAP_FIXED" type of behavior: they will return an address that is different from the requested address.

In the next higher or lower free page(s) would be ideal for having a non-sparse memory map so the page table doesn't need too many different top-level page directories. (HW page tables are a radix tree.) And once you find a spot that works, make future allocations contiguous with that. If you end up using a lot of space there, the kernel can opportunistically use a 2MB hugepage, and having your pages contiguous again means they share the same parent page directory in the HW page tables so iTLB misses triggering page walks may be slightly cheaper (if those higher levels stay hot in data caches, or even cached inside the pagewalk hardware itself). And it's more efficient for the kernel to track as one larger mapping. Of course, using more of an already-allocated page is even better, if there's room. Better code density on a page level helps the instruction TLB, and possibly also within a DRAM page (but that's not necessarily the same size as a virtual memory page).

Then as you do code-gen for each call, just check whether the target is in range for a call rel32 with off == (off as i32) as i64, else fall back to 10-byte mov r64,imm64 / call r64. (rustc will compile that check to movsxd/cmp, so checking every time only has trivial cost for JIT compile times.)

(Or 5-byte mov r32,imm32 if possible. OSes that don't support MAP_32BIT might still have the target addresses down there. Check for that with target == (target as u32) as u64. The 3rd mov-immediate encoding, 7-byte mov r/m64, sign_extended_imm32 is probably not interesting unless you're JITing kernel code for a kernel mapped in the high 2GiB of virtual address space.)
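Putting the two checks together, a per-call-site code-gen helper might look like this sketch (the encodings are standard x86-64: E8 rel32 is call rel32, B8 imm32 is mov eax, 48 B8 imm64 is movabs rax, FF D0 is call rax; the function name is mine):

```rust
// Pick the cheapest call encoding that reaches `target` from `call_site`.
fn emit_call(call_site: u64, target: u64) -> Vec<u8> {
    // rel32 is relative to the end of the 5-byte call instruction.
    let off = (target as i64).wrapping_sub(call_site as i64 + 5);
    if off == (off as i32) as i64 {
        let mut v = vec![0xE8]; // call rel32
        v.extend_from_slice(&(off as i32).to_le_bytes());
        v
    } else if target == (target as u32) as u64 {
        // Target in the low 4GiB (e.g. MAP_32BIT): mov eax, imm32 zero-extends.
        let mut v = vec![0xB8]; // mov eax, imm32 (5 bytes total)
        v.extend_from_slice(&(target as u32).to_le_bytes());
        v.extend_from_slice(&[0xFF, 0xD0]); // call rax
        v
    } else {
        let mut v = vec![0x48, 0xB8]; // movabs rax, imm64 (10 bytes)
        v.extend_from_slice(&target.to_le_bytes());
        v.extend_from_slice(&[0xFF, 0xD0]); // call rax
        v
    }
}

fn main() {
    // Near target: direct call, rel32 = 0x2000 - (0x1000 + 5) = 0xFFB.
    println!("{:02x?}", emit_call(0x1000, 0x2000));
    // Far targets fall back to 7- or 12-byte sequences.
    println!("{}", emit_call(0x7FFF_F000_0000, 0x1234).len());
    println!("{}", emit_call(0x1000, 0x7FFF_0000_0000).len());
}
```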

The beauty of checking and using a direct call whenever possible is that it decouples code-gen from any knowledge about allocating nearby pages or where the addresses come from, and just opportunistically makes good code. (You might record a counter or log once so you / your users at least notice if your nearby allocation mechanism is failing, because the perf diff won't typically be easily measurable.)

mov r64,imm64 is a 10-byte instruction that's a bit large to fetch/decode, and for the uop-cache to store. And may take an extra cycle to read from the uop cache on SnB-family according to Agner Fog's microarch pdf (https://agner.org/optimize). But modern CPUs have pretty good bandwidth for code-fetch, and robust front-ends.

If profiling finds that front-end bottlenecks are a big problem in your code, or large code size is causing eviction of other valuable code from L1 I-cache, I'd go with option 5.

BTW, if any of your functions are variadic, x86-64 System V requires that you pass AL = the number of XMM args, so you could use r11 for the function pointer. It's call-clobbered and not used for arg-passing. But RAX (or another "legacy" register) will save a REX prefix on the call.

  1. Allocate Rust functions near where mmap will allocate

No, I don't think there's any mechanism to get your statically compiled functions near where mmap might happen to put new pages.

mmap has more than 4GB of free virtual address space to pick from. You don't know ahead of time where it's going to allocate. (Although I think Linux at least does keep some amount of locality, to optimize the HW page tables.)

You in theory could copy the machine code of your Rust functions, but they probably reference other static code/data with RIP-relative addressing modes.

  1. call rel32 to stubs that use mov/jmp reg

This seems like it would be detrimental to performance (perhaps interfering with RAS/jump address prediction).

The perf downside is only from having 2 total call/jump instructions for the front-end to get past before it can feed the back-end with useful instructions. It's not great; 5. is much better.

This is basically how the PLT works for calls to shared-library functions on Unix/Linux, and will perform the same. Calling through a PLT (Procedure Linking Table) stub function is almost exactly like this. So the performance impacts have been well-studied and compared with other ways of doing things. We know that dynamic library calls aren't a performance disaster.

Asterisk before an address and push instructions, where is it being pushed to? shows AT&T disassembly of one, or single-step a C program like main(){puts("hello"); puts("world");} if you're curious. (On the first call, it pushes an arg and jumps to a lazy dynamic linker function; on subsequent calls the indirect jump target is the address of the function in the shared library.)

Why does the PLT exist in addition to the GOT, instead of just using the GOT? explains more. The jmp whose address is updated by lazy linking is jmp qword [xxx@GOTPLT]. (And yes, the PLT really does use a memory-indirect jmp here, even on i386 where a jmp rel32 that gets rewritten would work. IDK if GNU/Linux ever historically used to rewrite the offset in a jmp rel32.)

The jmp is just a standard tailcall, and does not unbalance the Return-Address predictor Stack. The eventual ret in the target function will return to the instruction after the original call, i.e. to the address that call pushed onto the call stack and onto the microarchitectural RAS. Only if you used a push / ret (like a "retpoline" for Spectre mitigation) would you unbalance the RAS.

But the code in Jumps for a JIT (x86_64) that you linked is unfortunately terrible (see my comment under it). It will break the RAS for future returns. You'd think the call (done to get a return address that then gets adjusted) would balance out the push/ret, but actually call +0 is a special case that doesn't go on the RAS in most CPUs: http://blog.stuffedcow.net/2018/04/ras-microbenchmarks. (calling over a nop could change that, I guess, but the whole thing is totally insane vs. call rax unless it's trying to defend against Spectre exploits.) Normally on x86-64, you use a RIP-relative LEA to get a nearby address into a register, not call/pop.

  1. Inline mov r64, imm64 / call reg

This is probably better than 3; the front-end cost of larger code-size is probably lower than the cost of calling through a stub that uses jmp.

But this is also probably good enough, especially if your alloc-within-2GiB methods work well enough most of the time on most of the targets you care about.

There may be cases where it's slower than 5. though. Branch prediction hides the latency of fetching and checking the function pointer from memory, assuming that it predicts well. (And usually it will, or else it runs so infrequently that it's not performance relevant.)

  1. call qword [rel nearby_func_ptr]

This is how gcc -fno-plt compiles calls to shared-library functions on Linux (call [rip + symbol@GOTPCREL]), and how Windows DLL function calls are normally done. (This is like one of the suggestions in http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/)

call [RIP-relative] is 6 bytes, only 1 byte larger than call rel32, so it has a negligible impact on code-size vs. calling a stub. Fun fact: you will sometimes see addr32 call rel32 in machine code (the address size prefix has no effect except for padding). This comes from a linker relaxing a call [RIP + symbol@GOTPCREL] to a call rel32 if the symbol with non-hidden ELF visibility was found in another .o during linking, not a different shared object after all.

For shared library calls, this is usually better than PLT stubs, with the only downside being slower program startup because it requires early binding (non-lazy dynamic linking). This isn't an issue for you; the target address is known ahead of code-gen time.

The patch author tested its performance vs. a traditional PLT on some unknown x86-64 hardware. Clang is maybe a worst-case scenario for shared library calls, because it makes many calls to small LLVM functions that don't take much time, and it's long running so early-binding startup overhead is negligible. After using gcc and gcc -fno-plt to compile clang, the time for clang -O2 -g to compile tramp3d goes from 41.6s (PLT) to 36.8s (-fno-plt). clang --help becomes slightly slower.

(x86-64 PLT stubs use jmp qword [symbol@GOTPLT], not mov r64,imm64/jmp though. A memory-indirect jmp is only a single uop on modern Intel CPUs, so it's cheaper on a correct prediction, but maybe slower on an incorrect prediction, especially if the GOTPLT entry misses in cache. If it's used frequently, it will typically predict correctly, though. But anyway a 10-byte movabs and a 2-byte jmp can fetch as a block (if it fits in a 16-byte aligned fetch block), and decode in a single cycle, so 3. is not totally unreasonable. But this is better.)

When allocating space for your pointers, remember that they're fetched as data, into L1d cache, and with a dTLB entry not iTLB. Don't interleave them with code, that would waste space in the I-cache on this data, and waste space in D-cache on lines that contain one pointer and mostly code. Group your pointers together in a separate 64-byte chunk from code so the line doesn't need to be in both L1I and L1D. It's fine if they're in the same page as some code; they're read-only so won't cause self-modifying-code pipeline nukes.
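As a sketch of that bookkeeping (hypothetical helper; 64-byte cache lines assumed): round the pointer-table base up to a cache-line boundary so a line holds either code or pointers, never both:

```rust
const LINE: usize = 64; // cache-line size assumed here

/// Given the current end-of-code offset in the JIT buffer, return the
/// offset where a 64-byte-aligned pointer-table chunk can start.
fn pointer_table_offset(code_end: usize) -> usize {
    (code_end + LINE - 1) & !(LINE - 1)
}

fn main() {
    assert_eq!(pointer_table_offset(0), 0);
    assert_eq!(pointer_table_offset(1), 64);
    assert_eq!(pointer_table_offset(130), 192);
    // Eight 8-byte function pointers fit exactly in one line.
    assert_eq!(LINE / std::mem::size_of::<u64>(), 8);
}
```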
