Observing stale instruction fetching on x86 with self-modifying code

Question

I've been told and have read from Intel's manuals that it is possible to write instructions to memory, but the instruction prefetch queue has already fetched the stale instructions and will execute those old instructions. I have been unsuccessful in observing this behavior. My methodology is as follows.

The Intel software development manual states from section 11.6 that

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.

So, it looks like if I hope to execute stale instructions, I need to have two different linear addresses refer to the same physical page. So, I memory map a file to two different addresses.

int fd = open("code_area", O_RDWR | O_CREAT, S_IRWXU | S_IRWXG | S_IRWXO);
assert(fd>=0);
write(fd, zeros, 0x1000);
uint8_t *a1 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_FILE | MAP_SHARED, fd, 0);
uint8_t *a2 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_FILE | MAP_SHARED, fd, 0);
assert(a1 != a2);
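
As a sanity check on the aliasing idea, the following is a minimal sketch (not the asker's code) that maps one page of a temporary file at two linear addresses and confirms that a store through one mapping is visible through the other. The function name `alias_is_shared` is hypothetical, and the mappings are deliberately not executable, so it only demonstrates the shared physical page, not instruction fetch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the same 4 KiB of a temporary file at two linear addresses and check
 * that a store through one mapping is visible through the other.
 * Returns 1 if visible, 0 if not, -1 if the setup failed. */
int alias_is_shared(void)
{
    FILE *f = tmpfile();                 /* temp file, removed on close */
    if (!f)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, 0x1000) != 0) {    /* make the file one page long */
        fclose(f);
        return -1;
    }

    uint8_t *a1 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    uint8_t *a2 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a1 == MAP_FAILED || a2 == MAP_FAILED) {
        if (a1 != MAP_FAILED) munmap(a1, 0x1000);
        if (a2 != MAP_FAILED) munmap(a2, 0x1000);
        fclose(f);
        return -1;
    }
    assert(a1 != a2);                    /* two distinct linear addresses */

    a2[0] = 0x90;                        /* store through the second mapping */
    int shared = (a1[0] == 0x90);        /* observe through the first mapping */

    munmap(a1, 0x1000);
    munmap(a2, 0x1000);
    fclose(f);
    return shared;
}
```

Using `ftruncate` instead of writing a zero buffer avoids needing a separate `zeros` array; that is a design choice of this sketch only.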

I have an assembly function that takes a single argument, a pointer to the instruction I want to change.

fun:
    push %rbp
    mov %rsp, %rbp

    xorq %rax, %rax # Return value 0

# A far jump simulated with a far return
# Push the current code segment %cs, then the address we want to far jump to

    xorq %rsi, %rsi
    mov %cs, %rsi
    pushq %rsi
    leaq copy(%rip), %r15
    pushq %r15
    lretq

copy:
# Overwrite the two nops below with `inc %eax'. We will notice the change if the
# return value is 1, not zero. The passed in pointer at %rdi points to the same physical
# memory location of fun_ins, but the linear addresses will be different.
    movw $0xc0ff, (%rdi)

fun_ins:
    nop   # Two NOPs gives enough space for the inc %eax (opcode FF C0)
    nop
    pop %rbp
    ret
fun_end:
    nop

In C, I copy the code to the memory mapped file. I invoke the function from linear address a1, but I pass a pointer to a2 as the target of the code modification.

#define DIFF(a, b) ((long)(b) - (long)(a))
long sz = DIFF(fun, fun_end);
memcpy(a1, fun, sz);
void *tochange = a2 + DIFF(fun, fun_ins);   /* address of fun_ins inside the a2 alias */
int val = ((int (*)(void*))a1)(tochange);

If the CPU picked up the modified code, val==1. Otherwise, if the stale instructions were executed (two nops), val==0.

I've run this on a 1.7GHz Intel Core i5 (2011 macbook air) and an Intel(R) Xeon(R) CPU X3460 @ 2.80GHz. Every time, however, I see val==1 indicating the CPU always notices the new instruction.

Has anyone experienced the behavior I want to observe? Is my reasoning correct? I'm a little confused about the manual mentioning P6 and Pentium processors but not my Core i5 processor. Perhaps something else is going on that causes the CPU to flush its instruction prefetch queue? Any insight would be very helpful!

Recommended answer

I think you should check the MACHINE_CLEARS.SMC performance counter (part of the MACHINE_CLEARS event) of the CPU. It is available on Sandy Bridge, which is used in your MacBook Air, and also on your Xeon, which is Nehalem (search the event lists for "smc"). You can use oprofile, perf or Intel's VTune to find its value:

http://software.intel.com/sites/products/documentation/doclib/iss/2013/amplifier/lin/ug_docs/GUID-F0FD7660-58B5-4B5D-AA9A-E1AF21DDCA0E.htm

Machine Clears

Metric Description

Certain events require the entire pipeline to be cleared and restarted from just after the last retired instruction. This metric measures three such events: memory ordering violations, self-modifying code, and certain loads to illegal address ranges.

Possible Issues

A significant portion of execution time is spent handling machine clears. Examine the MACHINE_CLEARS events to determine the specific cause.

SMC:http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/win_reference/snb/events/machine_clears.html

MACHINE_CLEARS Event Code: 0xC3 SMC Mask: 0x04
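
To turn the event code and mask above into something a profiler accepts, here is a small sketch of the usual raw-event layout on Intel core PMUs (event select in bits 0-7, unit mask in bits 8-15). The helper name `raw_event` is an assumption for illustration, not part of any tool's API.

```c
#include <assert.h>
#include <stdint.h>

/* Combine an event select code and a unit mask into a raw PMU event value:
 * event select in bits 0-7, unit mask in bits 8-15. */
static inline uint64_t raw_event(uint8_t event, uint8_t umask)
{
    return ((uint64_t)umask << 8) | event;
}

/* MACHINE_CLEARS.SMC as quoted above: event 0xC3, mask 0x04. */
static inline uint64_t machine_clears_smc(void)
{
    return raw_event(0xC3, 0x04);
}
```

With this encoding, the counter could be requested with something like `perf stat -e r04c3 ./your_test`, where `your_test` stands for the asker's program; whether the event exists under that encoding depends on the microarchitecture.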

Self-modifying code (SMC) detected.

Number of self-modifying-code machine clears detected.

Intel also says this about SMC (http://software.intel.com/en-us/forums/topic/345561, linked from Intel Performance Bottleneck Analyzer's taxonomy):

This event fires when self-modifying code is detected. This can be typically used by folks who do binary editing to force it to take certain path (e.g. hackers). This event counts the number of times that a program writes to a code section. Self-modifying code causes a severe penalty in all Intel 64 and IA-32 processors. The modified cache line is written back to the L2 and LLC caches. Also, the instructions would need to be re-loaded hence causing performance penalty.

I think you will see some such events. If you do, then the CPU was able to detect the act of self-modifying the code and raised a "machine clear", a full restart of the pipeline. The first stages are fetch, and they will ask the L2 cache for the new opcode. I'm very interested in the exact count of SMC events per execution of your code; this would give us some estimate of the latencies. (SMC is counted in some units where 1 unit is assumed to be 1.5 CPU cycles; see B.6.2.6 of the Intel optimization manual.)

Intel says the pipeline is "restarted from just after the last retired instruction", so I think the last retired instruction will be the mov, and your nops are already in the pipeline. But the SMC condition will be raised at the mov's retirement and it will kill everything in the pipeline, including the nops.

This SMC-induced pipeline restart is not cheap. Agner has some measurements in optimizing_assembly.pdf, "17.10 Self-modifying code (All processors)" (I think any Core2/Core iX is like PM here):

The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for P1, 31 for PMMX, and 150-300 for PPro, P2, P3, PM. The P4 will purge the entire trace cache after self-modifying code. The 80486 and earlier processors require a jump between the modifying and the modified code in order to flush the code cache. ...

Self-modifying code is not considered good programming practice. It should be used only if the gain in speed is substantial and the modified code is executed so many times that the advantage outweighs the penalties for using self-modifying code.

Using different linear addresses to defeat the SMC detector was recommended here: https://stackoverflow.com/a/10994728/196561. I'll try to find the actual Intel documentation... I can't actually answer your real question right now.

There may be some hints here: Optimization manual, 248966-026, April 2012, "3.6.9 Mixing Code and Data":

Placing writable data in the code segment might be impossible to distinguish from self-modifying code. Writable data in the code segment might suffer the same performance penalty as self-modifying code.

And in the next section:

Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.
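
The subpage rule quoted above can be expressed as a simple address comparison. This is only a sketch of the arithmetic, assuming the subpages are naturally aligned 1 KB and 2 KB blocks; the helper names are hypothetical, and the manual does not say which kind of address the hardware actually compares.

```c
#include <assert.h>
#include <stdint.h>

/* Two addresses fall in the same naturally aligned 2^shift-byte block
 * exactly when their upper bits match. */
static inline int same_block(uint64_t a, uint64_t b, unsigned shift)
{
    assert(shift < 64);
    return (a >> shift) == (b >> shift);
}

/* A store into the 1-KByte subpage that is currently being executed. */
static inline int store_in_executing_1k(uint64_t store, uint64_t exec_ip)
{
    return same_block(store, exec_ip, 10);
}

/* A fetch from the 2-KByte subpage that is currently being written. */
static inline int fetch_in_written_2k(uint64_t fetch_ip, uint64_t store)
{
    return same_block(fetch_ip, store, 11);
}
```

In the question's test the store and the patched instruction are only a few bytes apart within the page, so both conditions would hold whenever the comparison is made on page offsets or physical addresses.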

So there is possibly some circuitry that tracks intersections of writable and executable subpages.

You can try to do the modification from another thread (cross-modifying code), but very careful thread synchronization and pipeline flushing are needed (you may want to brute-force some delays in the writer thread; a CPUID just after the synchronization is desirable). But you should know that THEY already fixed this using "nukes"; check patent US6857064.

I'm a little confused about the manual mentioning P6 and Pentium processors

This is possible if you fetched, decoded and executed some stale version of Intel's instruction manual. You can reset the pipeline and check this version: Order Number 325462-047US, June 2013, "11.6 SELF-MODIFYING CODE". This version still doesn't say anything about newer CPUs, but it mentions that when you modify code through a different virtual address, the behavior may not be compatible between microarchitectures (it may work on your Nehalem/Sandy Bridge and may not work on... Skymont):

11.6 SELF-MODIFYING CODE A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a snoop of an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache. The latter behavior means that programs that self-modify code can cause severe degradation of performance when run on the Pentium 4 and Intel Xeon processors.

In practice, the check on linear addresses should not create compatibility problems among IA-32 processors. Applications that include self-modifying code use the same linear address for modifying and fetching the instruction.

Systems software, such as a debugger, that might possibly modify an instruction using a different linear address than that used to fetch the instruction, will execute a serializing operation, such as a CPUID instruction, before the modified instruction is executed, which will automatically resynchronize the instruction cache and prefetch queue. (See Section 8.1.3, "Handling Self- and Cross-Modifying Code," for more information about the use of self-modifying code.)
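
The sequence the manual describes for system software (modify through one linear address, then serialize before executing the modified instruction) might look roughly like the sketch below. It assumes GCC or Clang inline assembly; `serialize_with_cpuid` and `patch_via_alias` are hypothetical names, and on non-x86 targets the sketch falls back to a plain compiler barrier so that it still compiles.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Execute CPUID, which is a serializing instruction on x86. */
static inline void serialize_with_cpuid(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)ebx;
    (void)edx;
#else
    __asm__ volatile("" ::: "memory");   /* not x86: compiler barrier only */
#endif
}

/* Write new instruction bytes through an alias mapping, then serialize
 * on this processor before the patched code is executed. */
void patch_via_alias(uint8_t *alias, const uint8_t *bytes, size_t len)
{
    assert(alias != NULL && bytes != NULL);
    memcpy(alias, bytes, len);
    serialize_with_cpuid();
}
```

Note that the serializing instruction has to run on the processor that will execute the modified code, so in a cross-modifying setup the executing thread would need its own serialization point.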

For Intel486 processors, a write to an instruction in the cache will modify it in both the cache and memory, but if the instruction was prefetched before the write, the old version of the instruction could be the one executed. To prevent the old instruction from being executed, flush the instruction prefetch unit by coding a jump instruction immediately after any write that modifies an instruction

REAL update: I googled for "SMC Detection" (with quotes) and found some details on how modern Core2/Core iX CPUs detect SMC, and also many errata lists with Xeons and Pentiums hanging in the SMC detector:

  1. http://www.google.com/patents/US6237088 System and method for tracking in-flight instructions in a pipeline @ 2001

DOI 10.1535/itj.1203.03 (google for it; there is a free version at citeseerx.ist.psu.edu): the "INCLUSION FILTER" was added in Penryn to lower the number of false SMC detections; the "existing inclusion detection mechanism" is pictured in Fig. 9

http://www.google.com/patents/US6405307 - older patent on SMC detection logic

According to patent US6237088 (FIG. 5, summary) there is a "line address buffer" (holding many linear addresses, one address per fetched instruction; in other words, a buffer full of fetched IPs with cache-line precision). Every store, or more exactly the "store address" phase of every store, is fed into a parallel comparator to check whether the store intersects any of the currently executing instructions.
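
A software model of that comparator may make the idea concrete. The sketch below is a loose illustration, not the patent's design: it assumes 64-byte cache lines, a 16-entry buffer, and comparison on linear addresses, and all names are hypothetical. The hardware would compare all entries in parallel, where this code loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SHIFT 6      /* assumption: 64-byte cache lines */
#define LAB_ENTRIES 16    /* hypothetical buffer size */

typedef struct {
    uint64_t line[LAB_ENTRIES];   /* linear address >> LINE_SHIFT per fetched line */
    size_t count;
} line_address_buffer;

/* Record the cache line of an instruction that has been fetched. */
void lab_record_fetch(line_address_buffer *lab, uint64_t fetch_ip)
{
    assert(lab->count < LAB_ENTRIES);
    lab->line[lab->count++] = fetch_ip >> LINE_SHIFT;
}

/* Return 1 if a store address falls in any recorded line, else 0. */
int lab_store_hits(const line_address_buffer *lab, uint64_t store_addr)
{
    uint64_t line = store_addr >> LINE_SHIFT;
    for (size_t i = 0; i < lab->count; i++)
        if (lab->line[i] == line)
            return 1;
    return 0;
}
```

Under the linear-address assumption, a store made through a second mapping of the same physical page would not match any entry, which is exactly the gap the question tries to exploit.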

Neither patent clearly says whether physical or logical addresses are used in the SMC logic... The L1i in Sandy Bridge is VIPT (virtually indexed, physically tagged: virtual address for the index and physical address in the tag) according to http://nick-black.com/dankwiki/index.php/Sandy_Bridge, so we have the physical address at the time when the L1 cache returns data. I think Intel may use physical addresses in the SMC detection logic.

Moreover, http://www.google.com/patents/US6594734 @ 1999 (published 2003; just remember that the CPU design cycle is around 3-5 years) says in the "Summary" section that SMC detection is now in the TLB and uses physical addresses (in other words: please, don't try to fool the SMC detector):

Self modifying code is detected using a translation lookaside buffer .. [which] has physical page addresses stored therein over which snoops can be performed using the physical memory address of a store into memory. ... To provide finer granularity than a page of addresses, FINE HIT bits are included with each entry in the cache associating information in the cache to portions of a page within memory.

(The portions of a page, referred to as quadrants in patent US6594734, sound like 1K subpages, don't they?)
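
If a quadrant is simply a quarter of the page (an assumption; the patent text quoted here does not spell out the layout), the FINE HIT bit for an address could be computed as in this sketch. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Index (0..3) of the quarter of the page that contains addr.
 * page_size must be a power of two, e.g. 0x1000 for a 4 KB page. */
static inline unsigned quadrant(uint64_t addr, uint64_t page_size)
{
    assert(page_size >= 4 && (page_size & (page_size - 1)) == 0);
    return (unsigned)((addr & (page_size - 1)) / (page_size / 4));
}

/* One bit per quadrant, as a stand-in for the patent's FINE HIT bits. */
static inline unsigned fine_hit_bit(uint64_t addr, uint64_t page_size)
{
    return 1u << quadrant(addr, page_size);
}
```

For a 4 KB page each quadrant covers 1 KB, matching the 1-KByte subpage wording in the optimization manual; for a 4 MB page the same scheme would give 1 MB quadrants.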

Then they say

Therefore snoops, triggered by store instructions into memory, can perform SMC detection by comparing the physical address of all instructions stored within the instruction cache with the address of all instructions stored within the associated page or pages of memory. If there is an address match, it indicates that a memory location was modified. In the case of an address match, indicating an SMC condition, the instruction cache and instruction pipeline are flushed by the retirement unit and new instructions are fetched from memory for storage into the instruction cache.

Because snoops for SMC detection are physical and the ITLB ordinarily accepts as an input a linear address to translate into a physical address, the ITLB is additionally formed as a content-addressable memory on the physical addresses and includes an additional input comparison port (referred to as a snoop port or reverse translation port)

So, to detect SMC, they force stores to forward the physical address back to the instruction buffer via a snoop (similar snoops are delivered from other cores/CPUs or from DMA writes to our caches...). If the snoop's physical address conflicts with cache lines stored in the instruction buffer, the pipeline is restarted via an SMC signal delivered from the iTLB to the retirement unit. You can imagine how many CPU clocks are wasted in such a snoop loop from the dTLB via the iTLB to the retirement unit (it can't retire the next "nop" instruction, although it was executed earlier than the mov and has no side effects). But WAT? The ITLB has a physical address input and a second CAM (big and hot) just to support and defend against crazy and cheating self-modifying code.

PS: And what if we work with huge pages (4M or maybe 1G)? The L1 TLB has huge-page entries, and there may be a lot of false SMC detections for 1/4 of a 4 MB page...

PPS: There is a theory that the erroneous handling of SMC with different linear addresses was present only in early P6/PPro/P2...
