How can I write self-modifying code that runs efficiently on modern x64 processors?


Question


I'm trying to speed up a variable-bitwidth integer compression scheme and I'm interested in generating and executing assembly code on-the-fly. Currently a lot of time is spent on mispredicted indirect branches, and generating code based on the series of bitwidths as found seems to be the only way avoid this penalty.


The general technique is referred to as "subroutine threading" (or "call threading", although this has other definitions as well). The goal is to take advantage of the processors efficient call/ret prediction so as to avoid stalls. The approach is well described here: http://webdocs.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-berndl.pdf


The generated code will be simply a series of calls followed by a return. If there were 5 'chunks' of widths [4,8,8,4,16], it would look like:

call $decode_4
call $decode_8
call $decode_8
call $decode_4
call $decode_16
ret


In actual use, it will be a longer series of calls, with a sufficient length that each series will likely be unique and only called once. Generating and calling the code is well documented, both here and elsewhere. But I haven't found much discussion of efficiency beyond a simple "don't do it" or a well-considered "there be dragons". Even the Intel documentation speaks mostly in generalities:
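To make the idea concrete, here is a hedged sketch (not the questioner's actual code; the helper name and buffer handling are my own assumptions) of how such a call chain could be emitted into a byte buffer at runtime. The buffer would of course need to live in executable memory before being run:

```c
#include <stdint.h>
#include <string.h>

/* Emit one "call rel32" (0xE8 + 4-byte displacement) per decoder target,
 * followed by a single "ret" (0xC3). Returns the number of bytes written.
 * Each call's displacement is relative to the end of that 5-byte
 * instruction, i.e. to the start of the next slot. */
static size_t emit_call_chain(uint8_t *buf, void (**targets)(void), size_t n)
{
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        buf[off] = 0xE8;  /* call rel32 opcode */
        int32_t disp = (int32_t)((uintptr_t)targets[i]
                                 - ((uintptr_t)buf + off + 5));
        memcpy(buf + off + 1, &disp, sizeof disp);
        off += 5;
    }
    buf[off++] = 0xC3;    /* ret */
    return off;
}
```

Note that a rel32 call only reaches targets within +-2GB of the chain itself, which is the same displacement limitation discussed in the answer.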



8.1.3 Handling Self- and Cross-Modifying Code


The act of a processor writing data into a currently executing code segment with the intent of executing that data as code is called self-modifying code. IA-32 processors exhibit model-specific behavior when executing self modified code, depending upon how far ahead of the current execution pointer the code has been modified. ... Self-modifying code will execute at a lower level of performance than non-self-modifying or normal code. The degree of the performance deterioration will depend upon the frequency of modification and specific characteristics of the code.

11.6 Self-Modifying Code


A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a snoop of an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache. The latter behavior means that programs that self-modify code can cause severe degradation of performance when run on the Pentium 4 and Intel Xeon processors.


While there is a performance counter to determine whether bad things are happening (C3 04 MACHINE_CLEARS.SMC: Number of self-modifying-code machine clears detected) I'd like to know more specifics, particularly for Haswell. My impression is that as long as I can write the generated code far enough ahead of time that the instruction prefetch has not gotten there yet, and as long as I don't trigger the SMC detector by modifying code on the same page (quarter-page?) as anything currently being executed, then I should get good performance. But all the details seem extremely vague: how close is too close? How far is far enough?
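For what it's worth, that counter can be programmed on Linux via perf_event_open() using a raw event encoding; the sketch below is illustrative (the helper names are mine; 0xC3/0x04 are the event/umask codes quoted above):

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Build a PERF_TYPE_RAW config value for an Intel core event:
 * umask in bits 15:8, event select in bits 7:0.
 * MACHINE_CLEARS.SMC is event 0xC3, umask 0x04 -> config 0x04C3. */
static uint64_t raw_event_config(uint8_t event, uint8_t umask)
{
    return ((uint64_t)umask << 8) | event;
}

/* Open a counter for the current thread; returns -1 when perf is
 * unavailable (e.g. a restrictive perf_event_paranoid setting). */
static int open_smc_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_RAW;
    attr.config = raw_event_config(0xC3, 0x04); /* MACHINE_CLEARS.SMC */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}
```

Reading the counter before and after the generate/execute cycle would show directly whether SMC machine clears are being triggered.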


Trying to make these into specific questions:



  1. What is the maximum distance ahead of the current instruction that the Haswell prefetcher ever runs?



  2. What is the maximum distance behind the current instruction that the Haswell 'trace cache' might contain?


  3. What is the actual penalty in cycles for a MACHINE_CLEARS.SMC event on Haswell?


  4. How can I run the generate/execute cycle in a predicted loop while preventing the prefetcher from eating its own tail?


  5. How can I arrange the flow so that each piece of generated code is always "seen for the first time" and not stepping on instructions already cached?


Answer


This doesn't have to be self-modifying code at all - it can be dynamically created code instead, i.e. runtime-generated "trampolines".


Meaning you keep a (global) function pointer around that'll redirect to a writable/executable mapped section of memory - in which you then actively insert the function calls you wish to make.


The main difficulty with this is that call is IP-relative (as are most jmp), so you'll have to calculate the offset between the memory location of your trampoline and the target functions. That in itself is simple enough - but combine it with 64bit code, and you run into the fact that call can only encode relative displacements in the +-2GB range; beyond that it becomes more complex, and you'd need to call through a linkage table.


So you'd essentially create code like (/me severely UN*X biased, hence AT&T assembly, and some references to ELF-isms):

.Lstart_of_modifyable_section:
callq 0f
callq 1f
callq 2f
callq 3f
callq 4f
....
ret
.align 32
0:        jmpq tgt0
.align 32
1:        jmpq tgt1
.align 32
2:        jmpq tgt2
.align 32
3:        jmpq tgt3
.align 32
4:        jmpq tgt4
.align 32
...


This can be created at compile time (just make a writable text section), or dynamically at runtime.


You then, at runtime, patch the jump targets. That's similar to how the ELF .plt section (PLT = procedure linkage table) works - just that there it's the dynamic linker which patches the jmp slots, while in your case you do that yourself.


If you go for all-runtime generation, then a table like the above is easily created even from C/C++; start with data structures like:

#include <stdint.h>

struct __attribute__((packed)) call_tbl_entry {
    uint8_t call_opcode;          /* 0xE8 = call rel32 */
    int32_t call_displacement;
};

union jmp_tbl_entry {
    uint8_t cacheline[32];        /* pad each slot to a cache line */
    struct __attribute__((packed)) {
        uint8_t  jmp_opcode[2];   /* 64bit absolute jump */
        uint64_t jmp_tgtaddress;
    } tbl;
};

struct mytbl {
    struct call_tbl_entry calltbl[NUM_CALL_SLOTS];
    uint8_t ret_opcode;           /* 0xC3 = ret */
    union jmp_tbl_entry jmptbl[NUM_CALL_SLOTS];
};


The only critical and somewhat system-dependent thing here is the "packed" nature of these structures, which one needs to tell the compiler about (i.e. not to pad the call array out), and the fact that one should cacheline-align the jump table.


You need to make calltbl[i].call_displacement = (int32_t)(&jmptbl[i]-&calltbl[i+1]), initialize the empty/unused jump table with memset(&jmptbl, 0xC3 /* RET */, sizeof(jmptbl)) and then just fill the fields with the jump opcode and target address as you need.
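Putting those steps together, a hedged sketch of the initialization (the structures are the answer's own; NUM_CALL_SLOTS, the 0xE8/0xC3 opcode bytes, and the init_tbl helper are illustrative assumptions) might look like:

```c
#include <stdint.h>
#include <string.h>

#define NUM_CALL_SLOTS 4

struct __attribute__((packed)) call_tbl_entry {
    uint8_t call_opcode;          /* 0xE8 = call rel32 */
    int32_t call_displacement;
};

union jmp_tbl_entry {
    uint8_t cacheline[32];        /* pad each slot to a cache line */
    struct __attribute__((packed)) {
        uint8_t  jmp_opcode[2];   /* 64bit absolute jump */
        uint64_t jmp_tgtaddress;
    } tbl;
};

/* All members have alignment 1, so no padding sneaks in between the
 * call array, the ret byte, and the jump table. */
struct mytbl {
    struct call_tbl_entry calltbl[NUM_CALL_SLOTS];
    uint8_t ret_opcode;
    union jmp_tbl_entry jmptbl[NUM_CALL_SLOTS];
};

static void init_tbl(struct mytbl *t)
{
    /* Unused jump slots become plain "ret" bytes. */
    memset(t->jmptbl, 0xC3, sizeof t->jmptbl);
    t->ret_opcode = 0xC3;
    for (int i = 0; i < NUM_CALL_SLOTS; i++) {
        t->calltbl[i].call_opcode = 0xE8;
        /* rel32 is measured from the end of this call instruction,
         * which in a packed layout is exactly &calltbl[i + 1]. */
        t->calltbl[i].call_displacement =
            (int32_t)((uintptr_t)&t->jmptbl[i]
                      - (uintptr_t)&t->calltbl[i + 1]);
    }
}
```

After this, each jmptbl[i].tbl.jmp_tgtaddress slot gets patched with the address of the decoder you want call slot i to reach.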
