Observing stale instruction fetching on x86 with self-modifying code

Question

I've been told, and have read in Intel's manuals, that it is possible to write instructions to memory while the instruction prefetch queue has already fetched the stale instructions, so that the old instructions are the ones executed. I have been unsuccessful in observing this behavior. My methodology is as follows.

Section 11.6 of the Intel software development manual states:

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.

So it looks like, if I hope to execute stale instructions, I need two different linear addresses that refer to the same physical page. So I memory-map a file at two different addresses.

int fd = open("code_area", O_RDWR | O_CREAT, S_IRWXU | S_IRWXG | S_IRWXO);
assert(fd>=0);
write(fd, zeros, 0x1000);
uint8_t *a1 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_FILE | MAP_SHARED, fd, 0);
uint8_t *a2 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_FILE | MAP_SHARED, fd, 0);
assert(a1 != a2);
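
Before relying on the two mappings, it may be worth confirming that they really alias the same memory. The following is only a minimal sketch of such a check, not part of the original experiment: the function name `mappings_alias` and the scratch-file path are made up for illustration, and the mappings are left non-executable because the check only needs loads and stores.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a byte stored through one mapping is visible through the
 * other, 0 if it is not, and -1 if the mappings could not be created. */
int mappings_alias(void)
{
    char path[] = "/tmp/code_area_XXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file lives on until fd is closed */
    if (ftruncate(fd, 0x1000) != 0) {
        close(fd);
        return -1;
    }

    /* Two shared mappings of the same file page, as in the question. */
    uint8_t *a1 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    uint8_t *a2 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    int result = -1;
    if (a1 != MAP_FAILED && a2 != MAP_FAILED) {
        assert(a1 != a2);                    /* distinct linear addresses */
        a1[0] = 0x90;                        /* store through the first alias */
        result = (a2[0] == 0x90);            /* load through the second alias */
    }
    if (a1 != MAP_FAILED)
        munmap(a1, 0x1000);
    if (a2 != MAP_FAILED)
        munmap(a2, 0x1000);
    return result;
}
```

A return value of 1 means both linear addresses reach the same page, which is the precondition the rest of the experiment depends on.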

I have an assembly function that takes a single argument, a pointer to the instruction I want to change.

fun:
    push %rbp
    mov %rsp, %rbp

    xorq %rax, %rax # Return value 0

# A far jump simulated with a far return
# Push the current code segment %cs, then the address we want to far jump to

    xorq %rsi, %rsi
    mov %cs, %rsi
    pushq %rsi
    leaq copy(%rip), %r15
    pushq %r15
    lretq

copy:
# Overwrite the two nops below with `inc %eax'. We will notice the change if the
# return value is 1, not zero. The passed in pointer at %rdi points to the same physical
# memory location of fun_ins, but the linear addresses will be different.
    movw $0xc0ff, (%rdi)

fun_ins:
    nop   # Two NOPs gives enough space for the inc %eax (opcode FF C0)
    nop
    pop %rbp
    ret
fun_end:
    nop

In C, I copy the code to the memory mapped file. I invoke the function from linear address a1, but I pass a pointer to a2 as the target of the code modification.

#define DIFF(a, b) ((long)(b) - (long)(a))
long sz = DIFF(fun, fun_end);
memcpy(a1, fun, sz);
void *tochange = a2 + DIFF(fun, fun_ins);  /* same offset as fun_ins, reached through the a2 alias */
int val = ((int (*)(void*))a1)(tochange);

If the CPU picked up the modified code, val==1. Otherwise, if the stale instructions were executed (two nops), val==0.
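
For readers who want to try this without an assembler, the experiment can be approximated in C alone by copying raw machine code into the mapping. This is a cut-down sketch under several assumptions: it targets x86-64 with the System V calling convention (first argument in %rdi), it leaves out the far return used in the assembly above, and the names `run_smc_probe`, `probe_code` and the scratch-file path are hypothetical. It makes no claim about which value a given CPU returns.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* x86-64 machine code for a cut-down probe (no far return):
 *   xor  %eax, %eax          31 C0
 *   movw $0xc0ff, (%rdi)     66 C7 07 FF C0
 *   nop; nop                 90 90   <- target, overwritten with FF C0 (inc %eax)
 *   ret                      C3
 */
static const uint8_t probe_code[] = {
    0x31, 0xC0, 0x66, 0xC7, 0x07, 0xFF, 0xC0, 0x90, 0x90, 0xC3
};
enum { PROBE_NOP_OFFSET = 7, PROBE_PAGE = 0x1000 };

/* Returns 1 if the modified instruction ran, 0 if the stale nops ran,
 * and -1 if the probe could not be set up on this platform. */
int run_smc_probe(void)
{
#if defined(__x86_64__)
    char path[] = "/tmp/smc_probe_XXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, PROBE_PAGE) != 0) {
        close(fd);
        return -1;
    }

    /* a1 is the alias we execute from, a2 the alias we write through. */
    uint8_t *a1 = mmap(NULL, PROBE_PAGE, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_SHARED, fd, 0);
    uint8_t *a2 = mmap(NULL, PROBE_PAGE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    close(fd);

    int val = -1;
    if (a1 != MAP_FAILED && a2 != MAP_FAILED) {
        memcpy(a1, probe_code, sizeof probe_code);
        int (*fn)(void *) = (int (*)(void *))(uintptr_t)a1;
        val = fn(a2 + PROBE_NOP_OFFSET);     /* run via a1, patch via a2 */
    }
    if (a1 != MAP_FAILED)
        munmap(a1, PROBE_PAGE);
    if (a2 != MAP_FAILED)
        munmap(a2, PROBE_PAGE);
    return val;
#else
    return -1;                               /* not x86-64: nothing to probe */
#endif
}
```

The function returns -1 rather than aborting when an executable shared mapping is refused (for example on a noexec filesystem), so it can be called in a loop and the 0/1 results tallied.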

I've run this on a 1.7GHz Intel Core i5 (2011 MacBook Air) and an Intel(R) Xeon(R) CPU X3460 @ 2.80GHz. Every time, however, I see val==1, indicating that the CPU always notices the new instruction.

Does anyone have experience with the behavior I want to observe? Is my reasoning correct? I'm a little confused that the manual mentions P6 and Pentium processors but says nothing about my Core i5 processor. Perhaps something else is going on that causes the CPU to flush its instruction prefetch queue? Any insight would be very helpful!

Recommended answer

I think you should check the MACHINE_CLEARS.SMC performance counter (part of the MACHINE_CLEARS event) of the CPU. It is available on Sandy Bridge [1], which is used in your MacBook Air, and also on your Xeon, which is Nehalem [2] (search for "smc"). You can use oprofile, perf or Intel's VTune to find its value:

http://software.intel.com/sites/products/documentation/doclib/iss/2013/amplifier/lin/ug_docs/GUID-F0FD7660-58B5-4B5D-AA9A-E1AF21DDCA0E.htm

Machine Clears

Metric Description

Certain events require the entire pipeline to be cleared and restarted from just after the last retired instruction. This metric measures three such events: memory ordering violations, self-modifying code, and certain loads to illegal address ranges.

Possible Issues

A significant portion of execution time is spent handling machine clears. Examine the MACHINE_CLEARS events to determine the specific cause.

SMC: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/win_reference/snb/events/machine_clears.html

MACHINE_CLEARS Event Code: 0xC3 SMC Mask: 0x04

Self-modifying code (SMC) detected.

Number of self-modifying-code machine clears detected.
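
As a rough illustration of how the event code and mask above turn into something a profiler accepts, the sketch below packs them into the usual Intel raw-event layout (event select in bits 0-7, unit mask in bits 8-15) and formats the `rNNNN` string that perf understands. The helper names are hypothetical, and whether event 0xC3 with mask 0x04 exists on a particular CPU still has to be checked against that CPU's event list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Pack an event select and unit mask into a raw event value:
 * bits 0-7 hold the event code, bits 8-15 hold the unit mask. */
uint64_t raw_event_config(unsigned event, unsigned umask)
{
    assert(event <= 0xFF && umask <= 0xFF);
    return ((uint64_t)umask << 8) | event;
}

/* Format the value as a perf raw event name, e.g. "r04c3".
 * Returns the number of characters snprintf would have written. */
int perf_raw_event_name(char *buf, size_t len, unsigned event, unsigned umask)
{
    return snprintf(buf, len, "r%04llx",
                    (unsigned long long)raw_event_config(event, umask));
}
```

With the values quoted above, the resulting name can be passed to perf, for example as `perf stat -e r04c3 ./a.out`, to count SMC machine clears for one run of the test program.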

Intel also says the following about SMC at http://software.intel.com/en-us/forums/topic/345561 (linked from the Intel Performance Bottleneck Analyzer's taxonomy):

This event fires when self-modifying code is detected. This can be typically used by folks who do binary editing to force it to take certain path (e.g. hackers). This event counts the number of times that a program writes to a code section. Self-modifying code causes a severe penalty in all Intel 64 and IA-32 processors. The modified cache line is written back to the L2 and LLC caches. Also, the instructions would need to be re-loaded hence causing performance penalty.

I think you will see some such events. If so, the CPU was able to detect the act of self-modifying the code and raised a "Machine Clear", a full restart of the pipeline. The first stages are fetch, and they will ask the L2 cache for the new opcode. I'm very interested in the exact count of SMC events per execution of your code; this would give us some estimate of the latencies. (SMC is counted in units where 1 unit is assumed to be 1.5 CPU cycles, per B.6.2.6 of the Intel optimization manual.)

We can see that Intel says "restarted from just after the last retired instruction", so I think the last retired instruction will be the mov, and your nops are already in the pipeline. But the SMC condition will be raised at the mov's retirement, and it will kill everything in the pipeline, including the nops.

This SMC-induced pipeline restart is not cheap. Agner has some measurements in Optimizing_assembly.pdf, "17.10 Self-modifying code (All processors)" (I think any Core2/Core iX is like the PM here):

The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for P1, 31 for PMMX, and 150-300 for PPro, P2, P3, PM. The P4 will purge the entire trace cache after self-modifying code. The 80486 and earlier processors require a jump between the modifying and the modified code in order to flush the code cache. ...

Self-modifying code is not considered good programming practice. It should be used only if the gain in speed is substantial and the modified code is executed so many times that the advantage outweighs the penalties for using self-modifying code.

Using different linear addresses to defeat the SMC detector was recommended here: http://stackoverflow.com/a/10994728/196561. I'll try to find the actual Intel documentation. I can't actually answer your real question right now.

There may be some hints here: Optimization manual (http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf), 248966-026, April 2012, "3.6.9 Mixing Code and Data":

Placing writable data in the code segment might be impossible to distinguish from self-modifying code. Writable data in the code segment might suffer the same performance penalty as self-modifying code.

And the next section:

Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.
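
To make the subpage rule concrete, here is a small sketch of a helper that tells whether a store address and a code address land in the same naturally aligned block. Reading "subpage" as an aligned 1 KB or 2 KB block is an assumption on my part, and the function name `same_subpage` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addresses a and b fall inside the same naturally aligned block
 * of `size` bytes. `size` must be a power of two (1024 or 2048 here). */
bool same_subpage(uintptr_t a, uintptr_t b, uintptr_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    uintptr_t mask = ~(size - 1);            /* clears the offset within a block */
    return (a & mask) == (b & mask);
}
```

Under that reading, a data buffer placed 2 KB or more away from the nearest executed instruction, and aligned accordingly, stays clear of both the 1-KByte and the 2-KByte condition quoted above.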

So there is possibly some circuitry that tracks intersections of writable and executable subpages.

You can try doing the modification from another thread (cross-modifying code), but very careful thread synchronization and pipeline flushing are needed (you may want to brute-force some delays in the writer thread; a CPUID just after the synchronization is desirable). But you should know that Intel has already fixed this using "nukes"; check patent US6857064.

I'm a little confused about the manual mentioning P6 and Pentium processors

This is possible if you had fetched, decoded and executed some stale version of Intel's instruction manual. You can reset the pipeline and check this version: Order Number 325462-047US, June 2013, "11.6 SELF-MODIFYING CODE". This version still says nothing about newer CPUs, but it mentions that when you modify code through a different virtual address, the behavior may not be compatible between microarchitectures (it may work on your Nehalem/Sandy Bridge and may not work on, say, Skymont):

11.6 SELF-MODIFYING CODE A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a snoop of an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache. The latter behavior means that programs that self-modify code can cause severe degradation of performance when run on the Pentium 4 and Intel Xeon processors.

In practice, the check on linear addresses should not create compatibility problems among IA-32 processors. Applications that include self-modifying code use the same linear address for modifying and fetching the instruction.

Systems software, such as a debugger, that might possibly modify an instruction using a different linear address than that used to fetch the instruction, will execute a serializing operation, such as a CPUID instruction, before the modified instruction is executed, which will automatically resynchronize the instruction cache and prefetch queue. (See Section 8.1.3, "Handling Self- and Cross-Modifying Code," for more information about the use of self-modifying code.)
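
The manual's advice for the different-linear-address case can be sketched as follows: patch through the alias, then execute a serializing instruction before the patched code runs. This is an illustrative sketch, not Intel's reference sequence; the function names are hypothetical, the inline assembly assumes GCC or Clang, and on non-x86 targets the serializing step compiles to nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Execute CPUID, which is a serializing instruction on x86. */
static inline void serialize_with_cpuid(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)ebx;
    (void)edx;
#endif
}

/* Write new instruction bytes through `alias` (a linear address that may
 * differ from the one used to fetch the code), then serialize before the
 * caller jumps to the patched code. */
void patch_then_serialize(uint8_t *alias, const uint8_t *bytes, size_t len)
{
    assert(alias != NULL && bytes != NULL);
    memcpy(alias, bytes, len);
    serialize_with_cpuid();
}
```

In the experiment from the question, this would correspond to issuing the CPUID between the store through a2 and the first execution of the patched bytes through a1.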

For Intel486 processors, a write to an instruction in the cache will modify it in both the cache and memory, but if the instruction was prefetched before the write, the old version of the instruction could be the one executed. To prevent the old instruction from being executed, flush the instruction prefetch unit by coding a jump instruction immediately after any write that modifies an instruction

REAL update: I googled for "SMC Detection" (with quotes), and there are some details on how modern Core2/Core iX processors detect SMC, and also many errata lists with Xeons and Pentiums hanging in the SMC detector:


http://www.google.com/patents/US6237088 - "System and method for tracking in-flight instructions in a pipeline" @ 2001

DOI 10.1535/itj.1203.03 (google for it; there is a free version at citeseerx.ist.psu.edu) - the "INCLUSION FILTER" was added in Penryn to lower the number of false SMC detections; the "existing inclusion detection mechanism" is pictured in Fig. 9

http://www.google.com/patents/US6405307 - an older patent on SMC detection logic

According to patent US6237088 (FIG. 5, summary) there is a "Line address buffer" (holding many linear addresses, one address per fetched instruction; in other words, a buffer full of fetched IPs with cache-line precision). Every store, or more exactly the "store address" phase of every store, is fed into a parallel comparator to check whether the store intersects any of the currently executing instructions.

Neither patent says clearly whether physical or logical addresses are used in the SMC logic. The L1i in Sandy Bridge is VIPT (virtually indexed, physically tagged: virtual address for the index and physical address in the tag) according to http://nick-black.com/dankwiki/index.php/Sandy_Bridge, so we have the physical address at the time the L1 cache returns data. I think Intel may use physical addresses in the SMC detection logic.

Even more, http://www.google.com/patents/US6594734 @ 1999 (published 2003; just remember that the CPU design cycle is around 3-5 years) says in the "Summary" section that SMC detection is now in the TLB and uses physical addresses (in other words: please don't try to fool the SMC detector):

Self modifying code is detected using a translation lookaside buffer .. [which] has physical page addresses stored therein over which snoops can be performed using the physical memory address of a store into memory. ... To provide finer granularity than a page of addresses, FINE HIT bits are included with each entry in the cache associating information in the cache to portions of a page within memory.

(The portions of a page, referred to as quadrants in patent US6594734, sound like 1K subpages, don't they?)

Then they say:

Therefore snoops, triggered by store instructions into memory, can perform SMC detection by comparing the physical address of all instructions stored within the instruction cache with the address of all instructions stored within the associated page or pages of memory. If there is an address match, it indicates that a memory location was modified. In the case of an address match, indicating an SMC condition, the instruction cache and instruction pipeline are flushed by the retirement unit and new instructions are fetched from memory for storage into the instruction cache.

Because snoops for SMC detection are physical and the ITLB ordinarily accepts as an input a linear address to translate into a physical address, the ITLB is additionally formed as a content-addressable memory on the physical addresses and includes an additional input comparison port (referred to as a snoop port or reverse translation port)

So, to detect SMC, they force the stores to forward the physical address back to the instruction buffer via a snoop (similar snoops are delivered from other cores/CPUs or from DMA writes to our caches). If the snoop's physical address conflicts with cache lines stored in the instruction buffer, the pipeline is restarted via an SMC signal delivered from the iTLB to the retirement unit. You can imagine how many CPU clocks are wasted in such a snoop loop from the dTLB via the iTLB to the retirement unit (it can't retire the next "nop" instruction, although it was executed earlier than the mov and has no side effects). But what? The ITLB has a physical-address input and a second CAM (big and hot) just to support and defend against crazy and cheating self-modifying code.

PS: And what if we work with huge pages (4M or maybe 1G)? The L1 TLB has huge-page entries, and there may be a lot of false SMC detections for 1/4 of a 4 MB page...

PPS: There is a variant claim that the erroneous handling of SMC with different linear addresses was present only in early P6/PPro/P2...
