How can I write self-modifying code that runs efficiently on modern x64 processors?


Question

I'm trying to speed up a variable-bitwidth integer compression scheme and I'm interested in generating and executing assembly code on-the-fly. Currently a lot of time is spent on mispredicted indirect branches, and generating code based on the series of bitwidths as found seems to be the only way to avoid this penalty.
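
For context, the hot path currently looks something like the following sketch (hypothetical; the names and the 17-entry decoder table are illustrative, not my actual code):

#include <stdint.h>
#include <stddef.h>

typedef const uint8_t *(*decoder_fn)(const uint8_t *src, uint32_t *dst);

extern decoder_fn decoders[17];   /* decoders[w] decodes one w-bit chunk */

const uint8_t *decode_all(const uint8_t *src, uint32_t *dst,
                          const uint8_t *widths, size_t n)
{
    for (size_t i = 0; i < n; i++)
        src = decoders[widths[i]](src, dst++);  /* the mispredicted indirect call */
    return src;
}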

The general technique is referred to as "subroutine threading" (or "call threading", although this has other definitions as well). The goal is to take advantage of the processor's efficient call/ret prediction so as to avoid stalls. The approach is well described here: http://webdocs.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-berndl.pdf

The generated code will be simply a series of calls followed by a return. If there were 5 'chunks' of widths [4,8,8,4,16], it would look like:

call $decode_4
call $decode_8
call $decode_8
call $decode_4
call $decode_16
ret

In actual use, it will be a longer series of calls, long enough that each series is likely to be unique and called only once. Generating and calling the code is well documented, both here and elsewhere. But I haven't found much discussion of efficiency beyond a simple "don't do it" or a well-considered "there be dragons". Even the Intel documentation speaks mostly in generalities:

8.1.3 Handling Self- and Cross-Modifying Code

The act of a processor writing data into a currently executing code segment with the intent of executing that data as code is called self-modifying code. IA-32 processors exhibit model-specific behavior when executing self modified code, depending upon how far ahead of the current execution pointer the code has been modified. ... Self-modifying code will execute at a lower level of performance than non-self-modifying or normal code. The degree of the performance deterioration will depend upon the frequency of modification and specific characteristics of the code.

11.6 SELF-MODIFYING CODE

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a snoop of an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache. The latter behavior means that programs that self-modify code can cause severe degradation of performance when run on the Pentium 4 and Intel Xeon processors.

While there is a performance counter to determine whether bad things are happening (C3 04 MACHINE_CLEARS.SMC: Number of self-modifying-code machine clears detected) I'd like to know more specifics, particularly for Haswell. My impression is that as long as I can write the generated code far enough ahead of time that the instruction prefetch has not gotten there yet, and as long as I don't trigger the SMC detector by modifying code on the same page (quarter-page?) as anything currently being executed, then I should get good performance. But all the details seem extremely vague: how close is too close? How far is far enough?
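
For measurement, that counter can be programmed directly; here is a minimal sketch, assuming Linux perf_event_open and the usual raw encoding (umask << 8) | event = 0x04C3 for the C3/04 event listed above:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

static int open_smc_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0x04C3;            /* MACHINE_CLEARS.SMC: umask 0x04, event 0xC3 */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* usage:
 *   int fd = open_smc_counter();
 *   ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
 *   ... run the generated code ...
 *   ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
 *   uint64_t clears; read(fd, &clears, sizeof(clears));
 */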

Trying to make these into specific questions:

  1. What is the maximum distance ahead of the current instruction that the Haswell prefetcher ever runs?

  2. What is the maximum distance behind the current instruction that the Haswell 'trace cache' might contain?

  3. What is the actual penalty in cycles for a MACHINE_CLEARS.SMC event on Haswell?

  4. How can I run the generate/execute cycle in a predicted loop while preventing the prefetcher from eating its own tail?

  5. How can I arrange the flow so that each piece of generated code is always "seen for the first time" and not stepping on instructions already cached?

Answer

This doesn't have to be self-modifying code at all - it can be dynamically created code instead, i.e. runtime-generated "trampolines".

Meaning you keep a (global) function pointer around that'll redirect to a writable/executable mapped section of memory - in which you then actively insert the function calls you wish to make.
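
A minimal sketch of that setup (assuming POSIX mmap; note that hardened systems enforcing W^X may refuse a writable+executable mapping and require mapping twice or toggling with mprotect):

#include <sys/mman.h>
#include <stddef.h>

typedef void (*decode_chain_fn)(void);
static decode_chain_fn g_chain;         /* the (global) function pointer */

static int setup_code_buffer(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    /* emit the call series into buf, then invoke it via g_chain() */
    g_chain = (decode_chain_fn)buf;
    return 0;
}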

The main difficulty with this is that call is IP-relative (as are most jmps), so you'll have to calculate the offset between the memory location of your trampoline and the "target funcs". That as such is simple enough - but combine it with 64-bit code and you run into the fact that call can only encode relative displacements within +-2GB; beyond that it becomes more complex - you'd need to call through a linkage table.

So you'd essentially create code like (/me severely UN*X biased, hence AT&T assembly, and some references to ELF-isms):

.Lstart_of_modifyable_section:
callq 0f
callq 1f
callq 2f
callq 3f
callq 4f
....
ret
.align 32
0:        jmpq tgt0
.align 32
1:        jmpq tgt1
.align 32
2:        jmpq tgt2
.align 32
3:        jmpq tgt3
.align 32
4:        jmpq tgt4
.align 32
...

This can be created at compile time (just make a writable text section), or dynamically at runtime.

You then, at runtime, patch the jump targets. That's similar to how the .plt ELF Section (PLT = procedure linkage table) works - just that there, it's the dynamic linker which patches the jmp slots, while in your case, you do that yourself.

If you go for all runtime, then a table like the above is easily creatable even from C/C++; start with data structures like:

#include <stdint.h>

struct __attribute__((packed)) call_tbl_entry {
    uint8_t call_opcode;           // 0xE8 = callq rel32
    int32_t call_displacement;
};

union jmp_tbl_entry {
    uint8_t cacheline[32];         // pad each jump slot to a full cache line
    struct __attribute__((packed)) {
        uint8_t jmp_opcode[6];     // FF 25 00 00 00 00 = jmpq *0(%rip),
                                   // i.e. 64-bit absolute jump via the 8 bytes below
        uint64_t jmp_tgtaddress;
    } tbl;
};

struct mytbl {
    struct call_tbl_entry calltbl[NUM_CALL_SLOTS];
    uint8_t ret_opcode;            // 0xC3 = ret, ends the call series
    union jmp_tbl_entry jmptbl[NUM_CALL_SLOTS];
};

The only critical and somewhat system-dependent thing here is the "packed" nature of this that one needs to tell the compiler about (i.e. not to pad the call array out), and that one should cacheline-align the jump table.

You need to set calltbl[i].call_opcode = 0xE8 and calltbl[i].call_displacement = (int32_t)((uint8_t *)&jmptbl[i] - (uint8_t *)&calltbl[i + 1]) (the displacement is relative to the end of the call instruction), initialize the empty/unused jump table with memset(&jmptbl, 0xC3 /* RET */, sizeof(jmptbl)), and then just fill in the jump opcode and target address fields as you need.
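
Putting that together, the initialization could look roughly like this (a sketch against the structs above; 0xE8 is the callq rel32 opcode, and rel32 counts from the first byte after each call instruction):

#include <string.h>
#include <stddef.h>
#include <stdint.h>

/* write the 64-bit indirect-jump encoding and its target into one slot */
static void patch_slot(union jmp_tbl_entry *slot, void (*target)(void))
{
    static const uint8_t jmp_rip0[6] = { 0xFF, 0x25, 0, 0, 0, 0 }; /* jmpq *0(%rip) */
    memcpy(slot->tbl.jmp_opcode, jmp_rip0, sizeof(jmp_rip0));
    slot->tbl.jmp_tgtaddress = (uint64_t)(uintptr_t)target;
}

static void fill_table(struct mytbl *t)
{
    /* unused jump slots start out as RET bytes, so a stray call just returns */
    memset(t->jmptbl, 0xC3, sizeof(t->jmptbl));
    t->ret_opcode = 0xC3;                        /* terminates the call series */

    for (size_t i = 0; i < NUM_CALL_SLOTS; i++) {
        t->calltbl[i].call_opcode = 0xE8;        /* callq rel32 */
        /* rel32 counts from the first byte after the call instruction */
        t->calltbl[i].call_displacement =
            (int32_t)((uint8_t *)&t->jmptbl[i] -
                      (uint8_t *)&t->calltbl[i + 1]);
        /* then: patch_slot(&t->jmptbl[i], chosen_decoder); */
    }
}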
