How many memory barriers instructions does an x86 CPU have?


Question


I have found out that an x86 CPU has the following memory barrier instructions: mfence, lfence, and sfence.

Does an x86 CPU have only these three memory barrier instructions, or are there more?

Answer

sfence (SSE1) and mfence / lfence (SSE2) are the only instructions that are named for their memory fence/barrier functionality. Unless you're using NT loads or stores and/or WC memory, only mfence is needed for memory ordering.
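As a rough sketch of the NT-store case where sfence matters: the function below fills a buffer with non-temporal stores and then issues sfence, so that the NT stores are ordered before any later store (for example, a "ready" flag). The name fill_streaming and the HAVE_SSE2_NT macro are made up for this illustration; the intrinsics _mm_stream_si32 and _mm_sfence are the standard ones from immintrin.h. On non-x86 targets it falls back to plain stores plus a release fence, which is only an approximation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_SSE2_NT 1   // hypothetical macro, only for this sketch
#endif

// Fill dst[0..n) with value using non-temporal (streaming) stores.
// NT stores are weakly ordered, so a store fence is needed before
// publishing the buffer to another thread with an ordinary store.
void fill_streaming(int* dst, std::size_t n, int value) {
#ifdef HAVE_SSE2_NT
    for (std::size_t i = 0; i < n; ++i) {
        _mm_stream_si32(dst + i, value);   // movnti
    }
    _mm_sfence();  // order the NT stores before all later stores
#else
    // Fallback for non-x86 builds: ordinary stores plus a release fence.
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = value;
    }
    std::atomic_thread_fence(std::memory_order_release);
#endif
}
```

A caller would typically set its flag with a normal (release) store right after fill_streaming returns. With only ordinary loads and stores on write-back memory, none of this is needed.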

(Note that lfence on Intel CPUs is also a barrier for out-of-order execution, so it can serialize rdtsc, and is useful for Spectre mitigation to prevent speculative execution. On AMD, there's an MSR that has to be set, otherwise lfence is basically a nop (4/cycle throughput). That MSR was introduced with Spectre-mitigation microcode updates, and is normally set by updated kernels.)
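A minimal sketch of the rdtsc use mentioned above, assuming GCC or Clang on x86 with x86intrin.h available. The function name timestamp_fenced and the HAVE_X86_TSC macro are invented for this example. On other targets it falls back to std::chrono::steady_clock, which is not a cycle counter.

```cpp
#include <cassert>
#include <cstdint>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__SSE2__)
#include <x86intrin.h>
#define HAVE_X86_TSC 1   // hypothetical macro, only for this sketch
#else
#include <chrono>
#endif

// Read a timestamp with lfence on both sides, so that rdtsc does not
// execute before earlier instructions have completed and later
// instructions do not start before rdtsc has executed.
// Caveat from the text above: on AMD this relies on the MSR that makes
// lfence dispatch-serializing being set (updated kernels normally do that).
std::uint64_t timestamp_fenced() {
#ifdef HAVE_X86_TSC
    _mm_lfence();
    std::uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
#else
    // Fallback: monotonic clock ticks, not CPU cycles.
    auto now = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<std::uint64_t>(now.count());
#endif
}
```

Usage would be to call timestamp_fenced() before and after the code being timed and subtract. The result is in TSC reference ticks, not core clock cycles.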


Instructions with a lock prefix, like lock add [mem], eax, are also full memory barriers; see "Does lock xchg have the same behavior as mfence?". (They are possibly not as strong as mfence for ordering NT loads from WC memory, though; see "Do locked instructions provide a barrier between weakly-ordered accesses?".) xchg [mem], reg has an implicit lock prefix, so it's also a barrier.

In my testing on Skylake, locked instructions do block reordering of NT stores with regular stores, using this code: https://godbolt.org/g/7Q9xgz

xchg seems to be a good way to do a seq-cst store, especially on Intel hardware like Skylake, where mfence (like lfence) also blocks out-of-order execution of pure ALU instructions: see the bottom of this answer.

AMD also recommends using xchg or other locked instructions instead of mfence. (mfence is documented in the AMD manuals as serializing on AMD, so it will always have the penalty of blocking OoO exec).


For sequential-consistency stores or full barriers on 32-bit targets without SSE, compilers typically use lock or [esp], 0 or other no-op locked instruction just for the memory-barrier effect. That's what g++7.3 -O3 -m32 -mno-sse does for std::atomic_thread_fence(std::memory_order_seq_cst);.
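To make the full-barrier case concrete, here is a small Dekker-style sketch in C++. The names flag_a, flag_b, try_enter and leave are hypothetical. The point is the StoreLoad ordering: the store to our own flag must become globally visible before we load the other flag, which is exactly what std::atomic_thread_fence(std::memory_order_seq_cst) provides. On x86 a compiler will typically implement that fence with mfence or a dummy locked instruction, depending on compiler version and target options.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical per-thread "I want in" flags, for illustration only.
std::atomic<int> flag_a{0};
std::atomic<int> flag_b{0};

// Announce intent, then check the other side. Without a full barrier
// between the store and the load, x86's store buffer could let both
// threads see the other's flag as 0 and both "enter".
bool try_enter(std::atomic<int>& mine, std::atomic<int>& other) {
    mine.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad barrier
    return other.load(std::memory_order_relaxed) == 0;
}

// Withdraw intent so the other side can enter later.
void leave(std::atomic<int>& mine) {
    mine.store(0, std::memory_order_release);
}
```

Writing mine.store(1, std::memory_order_seq_cst) and a seq_cst load instead of the explicit fence would give the same guarantee; compilers usually turn such a store into xchg on x86, in line with the recommendation above.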

But anyway, neither mfence nor locked insns are architecturally defined as serializing on Intel, regardless of implementation details on some CPUs.


Fully serializing instructions like cpuid are also full memory barriers, draining the store buffer as well as flushing the pipeline. The question "Does lock xchg have the same behavior as mfence?" has relevant quotes from Intel's manual.

On Intel processors, the following are architecturally serializing instructions (From: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-273.html):

  • Privileged serializing instructions — INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV to control register, MOV (to debug register), WBINVD, and WRMSR.

    Exceptions: MOV CR8 isn't serializing. WRMSR to the IA32_TSC_DEADLINE MSR (MSR index 6E0H) and the X2APIC MSRs (MSR indices 802H to 83FH) are not serializing.

  • Non-privileged serializing instructions — CPUID, IRET (1), and RSM

On AMD processors, the following are architecturally serializing instructions:

  • Privileged serializing instructions — INVD, INVLPG, LGDT, LIDT, LLDT, LTR, MOV to control register, MOV (to debug register), WBINVD, WRMSR, and SWAPGS.

  • Non-privileged serializing instructions — MFENCE, CPUID, IRET, and RSM

The term "[fully] serializing instruction" on Intel processors means the same exact thing as on AMD processors except for one difference: a cache line flushing operation from CLFLUSH (but not CLFLUSHOPT) is ordered with respect to later instructions by only MFENCE on AMD processors.


in / out (and their string-copy versions ins and outs) are full memory barriers, and also partially serializing (like lfence). The docs say they delay execution of the next instruction until after "the data phase" of the I/O transaction.


Footnotes:

(1) According to BJ137 (Sandy Bridge), HSD152 (Haswell), BDM103 (Broadwell):

Problem: An IRET instruction that results in a task switch by returning from a nested task does not serialize the processor (contrary to the Software Developer’s Manual Vol. 3 section titled "Serializing Instructions").

Implication: Software which depends on the serialization property of IRET during task switching may not behave as expected. Intel has not observed this erratum to impact the operation of any commercially available software.

Workaround: None identified. Software can execute an MFENCE instruction immediately prior to the IRET instruction if serialization is needed.
