How many memory barrier instructions does an x86 CPU have?

Question

I have found out that x86 CPUs have the following memory barrier instructions: mfence, lfence, and sfence.

Do x86 CPUs have only these three memory barrier instructions, or are there more?

Solution

sfence (SSE1) and mfence / lfence (SSE2) are the only instructions that are named for their memory fence/barrier functionality. Unless you're using NT loads or stores and/or WC memory, only mfence is needed for memory ordering.
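To illustrate the NT-store case where the weaker fences matter, here is a minimal sketch. It assumes an x86 target with SSE2 and a compiler that ships <immintrin.h>. The function name fill_and_publish and its parameters are made up for this example. It writes a buffer with non-temporal stores and uses sfence so those stores are ordered before the flag store that publishes them.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <immintrin.h>  // _mm_stream_si32, _mm_sfence (assumes x86 with SSE2)

// Hypothetical producer: fill `buf` with NT stores, then publish via `ready`.
void fill_and_publish(int* buf, std::size_t n, int value,
                      std::atomic<bool>& ready) {
    assert(buf != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        _mm_stream_si32(buf + i, value);  // weakly-ordered NT store (movnti)
    }
    // Without this fence, the NT stores above could become globally visible
    // after the flag store below, because NT stores are not ordered with
    // respect to later regular stores.
    _mm_sfence();
    ready.store(true, std::memory_order_release);  // plain mov on x86
}
```

A consumer would spin on ready with an acquire load before reading buf. With only regular stores, the sfence would be unnecessary on x86.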

(Note that lfence on Intel CPUs is also a barrier for out-of-order execution, so it can serialize rdtsc, and is useful for Spectre mitigation to prevent speculative execution. On AMD, there's an MSR that has to be set, otherwise lfence is basically a nop (4/cycle throughput). That MSR was introduced with Spectre-mitigation microcode updates, and is normally set by updated kernels.)
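As a rough sketch of the rdtsc use mentioned in the note, the following wraps rdtsc in lfence on both sides. It assumes GCC or Clang on x86 (for <x86intrin.h>). The helper names fenced_rdtsc, tsc_ticks and demo_ticks are invented for illustration. On AMD the behaviour depends on the MSR setting described above.

```cpp
#include <cassert>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc, _mm_lfence (assumes GCC/Clang on x86)

// Read the time-stamp counter with lfence before and after, so that on Intel
// earlier instructions have completed before rdtsc executes and later
// instructions do not start until the counter has been read.
inline std::uint64_t fenced_rdtsc() {
    _mm_lfence();
    std::uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

// Hypothetical helper: run `f` once and return the elapsed TSC ticks.
template <class F>
std::uint64_t tsc_ticks(F&& f) {
    std::uint64_t start = fenced_rdtsc();
    f();
    std::uint64_t stop = fenced_rdtsc();
    return stop - start;
}

// Usage sketch: time a small summation loop.
inline std::uint64_t demo_ticks() {
    int sum = 0;
    std::uint64_t ticks = tsc_ticks([&] {
        for (int i = 0; i < 1000; ++i) sum += i;
    });
    assert(sum == 499500);  // 0 + 1 + ... + 999
    return ticks;
}
```

The result is in TSC reference ticks, not core clock cycles, so it is only meaningful for relative comparisons on the same machine.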


locked instructions like lock add [mem], eax are also full memory barriers; see the Q&A "Does lock xchg have the same behavior as mfence?". (Although they are possibly not as strong as mfence for ordering NT loads from WC memory; see "Do locked instructions provide a barrier between weakly-ordered accesses?".) xchg [mem], reg has an implicit lock prefix, so it's also a barrier.

In my testing on Skylake, locked instructions do block reordering of NT stores with regular stores with this code https://godbolt.org/g/7Q9xgz.

xchg seems to be a good way to do a seq-cst store, especially on Intel hardware like Skylake where mfence also blocks out-of-order execution of pure ALU instructions, as lfence does: See the bottom of this answer.

AMD also recommends using xchg or other locked instructions instead of mfence. (mfence is documented in the AMD manuals as serializing on AMD, so it will always have the penalty of blocking OoO exec).
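In C++ terms, a seq-cst store can be sketched as below. The names shared_value, seq_cst_store, seq_cst_store_via_exchange and check_seq_cst_store_helpers are made up for this example. Which instruction sequence you get for the plain store (xchg, or mov followed by mfence) depends on the compiler and its version, so treat the comments as typical rather than guaranteed code generation.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> shared_value{0};

// Sequentially consistent store. On x86, compilers typically emit either
// xchg [mem], reg or mov + mfence for this.
void seq_cst_store(int v) {
    shared_value.store(v, std::memory_order_seq_cst);
}

// The same effect spelled as an exchange whose old value is discarded.
// An atomic exchange maps naturally to xchg, which carries an implicit lock.
void seq_cst_store_via_exchange(int v) {
    (void)shared_value.exchange(v, std::memory_order_seq_cst);
}

// Tiny single-threaded sanity check of the two helpers.
void check_seq_cst_store_helpers() {
    seq_cst_store(1);
    assert(shared_value.load() == 1);
    seq_cst_store_via_exchange(2);
    assert(shared_value.load() == 2);
}
```

Inspecting the compiler's assembly output (for example with -S) is the reliable way to see which form a given toolchain picks.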


For sequential-consistency stores or full barriers on 32-bit targets without SSE, compilers typically use the instruction lock or [esp], 0 (a locked OR of zero into the top of the stack) or another no-op locked instruction just for the memory-barrier effect. That's what g++ 7.3 -O3 -m32 -mno-sse does for std::atomic_thread_fence(std::memory_order_seq_cst);.
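A minimal sketch of where such a full barrier matters is the classic store-buffering pattern below. The names sb_x, sb_y, sb_thread_a, sb_thread_b and sb_check_sequential are hypothetical. Each function is meant to run on its own thread. With the seq-cst fences in place, the C++ memory model forbids both functions returning 0. Without them, x86's store buffer allows that outcome.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> sb_x{0};
std::atomic<int> sb_y{0};

// Thread A: store to x, full barrier, then load y.
int sb_thread_a() {
    sb_x.store(1, std::memory_order_relaxed);
    // Full barrier: on x86 this compiles to mfence or a dummy locked
    // instruction, draining the store buffer before the load below.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return sb_y.load(std::memory_order_relaxed);
}

// Thread B: the mirror image of thread A.
int sb_thread_b() {
    sb_y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return sb_x.load(std::memory_order_relaxed);
}

// Single-threaded sanity check: calling the two sides back to back is one
// legal interleaving, and it must not yield r1 == 0 && r2 == 0.
void sb_check_sequential() {
    sb_x.store(0);
    sb_y.store(0);
    int r1 = sb_thread_a();
    int r2 = sb_thread_b();
    assert(!(r1 == 0 && r2 == 0));
}
```

To see the barrier's effect, one would run the two functions concurrently many times with and without the fences and count how often both return 0.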

But anyway, neither mfence nor locked instructions are architecturally defined as serializing on Intel, regardless of implementation details on some CPUs.


Fully serializing instructions like cpuid are also full memory barriers, draining the store buffer as well as flushing the pipeline. The Q&A "Does lock xchg have the same behavior as mfence?" has relevant quotes from Intel's manual.

On Intel processors, the following are architecturally serializing instructions (From: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-273.html):

  • Privileged serializing instructions — INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV to control register, MOV (to debug register), WBINVD, and WRMSR.

    Exceptions: MOV CR8 isn't serializing. WRMSR to the IA32_TSC_DEADLINE MSR (MSR index 6E0H) and the X2APIC MSRs (MSR indices 802H to 83FH) are not serializing.

  • Non-privileged serializing instructions — CPUID, IRET (1), and RSM

On AMD processors, the following are architecturally serializing instructions:

  • Privileged serializing instructions — INVD, INVLPG, LGDT, LIDT, LLDT, LTR, MOV to control register, MOV (to debug register), WBINVD, WRMSR, and SWAPGS.

  • Non-privileged serializing instructions — MFENCE, CPUID, IRET, and RSM

The term "[fully] serializing instruction" means exactly the same thing on Intel processors as on AMD processors, with one difference: on AMD processors, a cache-line flush from CLFLUSH (but not CLFLUSHOPT) is ordered with respect to later instructions only by MFENCE.


in / out (and their string-copy versions ins and outs) are full memory barriers, and also partially serializing (like lfence). The docs say they delay execution of the next instruction until after "the data phase" of the I/O transaction.


Footnotes:

(1) According to BJ137 (Sandy Bridge), HSD152 (Haswell), BDM103 (Broadwell):

Problem: An IRET instruction that results in a task switch by returning from a nested task does not serialize the processor (contrary to the Software Developer’s Manual Vol. 3 section titled "Serializing Instructions").

Implication: Software which depends on the serialization property of IRET during task switching may not behave as expected. Intel has not observed this erratum to impact the operation of any commercially available software.

Workaround: None identified. Software can execute an MFENCE instruction immediately prior to the IRET instruction if serialization is needed.
