Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?

Question

As we know from a previous answer to "Does it make any sense instruction LFENCE in processors x86/x86_64?", we cannot use SFENCE instead of MFENCE for sequential consistency.

An answer there suggests that MFENCE = SFENCE + LFENCE, i.e. that LFENCE does something without which we cannot provide sequential consistency.

LFENCE makes the following reordering impossible:

SFENCE
LFENCE
MOV reg, [addr]

-- To -->

MOV reg, [addr]
SFENCE
LFENCE

For example, the reordering MOV [addr], reg LFENCE --> LFENCE MOV [addr], reg is produced by a mechanism, the store buffer, which reorders stores with later loads to improve performance, and LFENCE does not prevent it. SFENCE disables this mechanism.

What mechanism does LFENCE disable to make this reordering impossible (x86 has no invalidate queue)?

And is the reordering SFENCE MOV reg, [addr] --> MOV reg, [addr] SFENCE possible only in theory, or also in reality? If it is possible in reality, what mechanism causes it and how does it work?

Recommended answer

x86 fence instructions can be briefly described as follows:

  • MFENCE prevents any later loads or stores from becoming globally observable before any earlier loads or stores. It drains the store buffer before later loads can execute (see footnote 1).

  • LFENCE blocks instruction dispatch (Intel's terminology) until all earlier instructions retire. This is currently implemented by draining the ROB (ReOrder Buffer) before later instructions can issue into the back-end.

  • SFENCE only orders stores against other stores, i.e. it prevents NT stores from committing from the store buffer ahead of SFENCE itself. Otherwise SFENCE is just like a plain store that moves through the store buffer. Think of it as a divider placed on a grocery-store checkout conveyor belt that stops NT stores from getting grabbed early. It does not necessarily force the store buffer to be drained before it retires from the ROB, so putting LFENCE after it doesn't add up to MFENCE.

  • A "serializing instruction" like CPUID (and IRET, etc.) drains everything (ROB, store buffer) before later instructions can issue into the back-end. MFENCE + LFENCE would also do that, but true serializing instructions might have other effects as well that I am not aware of.

These descriptions are a little ambiguous about exactly what kinds of operations are ordered, and there are some differences across vendors (e.g. SFENCE is stronger on AMD) and even between processors from the same vendor. Refer to Intel's manuals and specification updates and AMD's manuals and revision guides for more information. There are also many other discussions of these instructions on SO and in other places, but read the official sources first. The descriptions above are, I think, the minimum on-paper behaviour specified across vendors.

Footnote 1: OoO exec of later stores doesn't need to be blocked by MFENCE; executing them just writes data into the store buffer. In-order commit already orders them after earlier stores, and commit after retirement orders them with respect to loads (because x86 requires loads to complete, not just to start, before they can retire, as part of ensuring load ordering). Remember that x86 hardware is built to disallow reordering other than StoreLoad.

The Intel manual Volume 2, number 325383-072US, describes SFENCE as an instruction that "ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible." Volume 3, Section 11.10, says that the store buffer is drained when SFENCE is used. The correct interpretation of that statement is exactly the earlier statement from Volume 2, so SFENCE can be said to drain the store buffer in that sense. There is no guarantee about the point in SFENCE's lifetime at which earlier stores achieve GO (global observability). For any earlier store, it could happen before, at, or after retirement of SFENCE. What the point of GO is depends on several factors and is beyond the scope of the question. See: Why "movnti" followed by an "sfence" guarantees persistent ordering?.

MFENCE does have to prevent NT stores from reordering with other stores, so it has to include whatever SFENCE does, as well as draining the store buffer. It also has to prevent reordering of weakly-ordered SSE4.1 NT loads (MOVNTDQA) from WC memory, which is harder because the normal rules that give load ordering for free no longer apply to those. Guaranteeing this is why a Skylake microcode update strengthened (and slowed) MFENCE so that it also drains the ROB, like LFENCE. It might still be possible for MFENCE to be lighter-weight than that, given HW support for optionally enforcing ordering of NT loads in the pipeline.

The main reason SFENCE + LFENCE is not equal to MFENCE is that SFENCE + LFENCE doesn't block StoreLoad reordering, so it is not sufficient for sequential consistency. Only mfence (or a locked operation, or a real serializing instruction like cpuid) will do that. See Jeff Preshing's Memory Reordering Caught in the Act for a case where only a full barrier is sufficient.
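As a rough illustration of the "store, full barrier, load" pattern this paragraph refers to, here is a minimal C11 sketch. The helper name publish_then_check is hypothetical and not from the answer. It assumes a C11 compiler with <stdatomic.h>; on x86, compilers typically implement the seq_cst fence with mfence or a locked instruction, but that is a compiler choice rather than something the code spells out.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical helper (not from the answer): publish our flag, then
 * check the other side's flag. This is one half of a Dekker-style
 * handshake, the pattern that needs StoreLoad ordering. */
static int publish_then_check(atomic_int *mine, atomic_int *other)
{
    assert(mine != other);

    /* Relaxed store: on x86 an ordinary mov, which may sit in the
     * store buffer for a while before it commits to L1d. */
    atomic_store_explicit(mine, 1, memory_order_relaxed);

    /* Full barrier: orders the store above before the load below.
     * An SFENCE + LFENCE pair in this position would not do that. */
    atomic_thread_fence(memory_order_seq_cst);

    /* Relaxed load of the other side's flag. */
    return atomic_load_explicit(other, memory_order_relaxed);
}
```

If two threads each call this with the arguments swapped, the full fence means at most one of them can get 0 back. Replacing the fence with a weaker one allows both to see 0, which is the outcome Preshing's experiment catches.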

From Intel's instruction-set reference manual entry for sfence:

The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible.

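but

It is not ordered with respect to memory loads or the LFENCE instruction.

LFENCE forces earlier instructions to "complete locally" (i.e. retire from the out-of-order part of the core), but for a store or SFENCE that only means putting data or a marker into the memory-order buffer, not flushing it so that the store becomes globally visible. In other words, SFENCE "completion" (retirement from the ROB) does not include flushing the store buffer.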

This is like what Preshing describes in Memory Barriers Are Like Source Control Operations, where StoreStore barriers aren't "instant". Later in that article, he explains why a #StoreStore + #LoadLoad + #LoadStore barrier doesn't add up to a #StoreLoad barrier. (x86 LFENCE has some extra serialization of the instruction stream, but since it doesn't flush the store buffer, the reasoning still holds.)

LFENCE is not fully serializing like cpuid (which is as strong a memory barrier as mfence or a locked instruction). It is just a LoadLoad + LoadStore barrier, plus some execution serialization that may have started as an implementation detail but is now enshrined as a guarantee, at least on Intel CPUs. It is useful with rdtsc, and for blocking branch speculation to mitigate Spectre.

BTW, SFENCE is a no-op for WB (normal) stores.

It orders WC stores (such as movnt, or stores to video RAM) with respect to any stores, but not with respect to loads or LFENCE. A store-store barrier only does anything for normal stores on a CPU that is normally weakly ordered. You don't need SFENCE unless you are using NT stores or memory regions mapped WC. If it did guarantee draining the store buffer before it could retire, you could build MFENCE out of SFENCE + LFENCE, but that isn't the case on Intel.
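The one case where SFENCE matters can be sketched in C with compiler intrinsics. This is only an illustration under assumptions: fill_then_publish is a hypothetical helper, the x86 path uses the _mm_stream_si32 and _mm_sfence intrinsics from <emmintrin.h> (SSE2, baseline on x86-64), and the #else branch exists only so the sketch builds on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdatomic.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_stream_si32 (movnti) and _mm_sfence */
#endif

/* Hypothetical helper: fill buf using NT stores, then set *ready so
 * that a consumer which sees ready == 1 also sees the filled buffer. */
static void fill_then_publish(int *buf, size_t n, int value,
                              atomic_int *ready)
{
    assert(buf != NULL && ready != NULL);

    for (size_t i = 0; i < n; i++) {
#if defined(__x86_64__)
        _mm_stream_si32(&buf[i], value);  /* weakly-ordered NT store */
#else
        buf[i] = value;                   /* portable fallback only */
#endif
    }

#if defined(__x86_64__)
    /* Keep the NT stores ahead of the flag store below. Without the
     * NT stores above, this fence would have nothing to do. */
    _mm_sfence();
#endif

    /* The flag is an ordinary (WB) store; on x86 a release store is a
     * plain mov and does not by itself order earlier NT stores. */
    atomic_store_explicit(ready, 1, memory_order_release);
}
```

Note that nothing here orders the stores against later loads; that would still need a full barrier.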

The real concern is StoreLoad reordering between a store and a load, not between a store and barriers, so you should look at a case with a store, then a barrier, then a load.

mov  [var1], eax
sfence
lfence
mov   eax, [var2]

can become globally visible (i.e. commit to L1d cache) in this order:

lfence
mov   eax, [var2]     ; load stays after LFENCE

mov  [var1], eax      ; store becomes globally visible before SFENCE
sfence                ; can reorder with LFENCE
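The effect can be mimicked with a deliberately simplified, single-threaded toy model of two cores, each with a FIFO store buffer. Everything in it (Core, store, load, fence, both_loads_see_zero) is made up for illustration and is not how any real CPU is implemented. It only encodes the two assumptions stated above: SFENCE + LFENCE does not drain the store buffer before later loads run, while MFENCE does.

```c
#include <assert.h>
#include <stdbool.h>

enum { X = 0, Y = 1, NVARS = 2, SB_CAP = 4 };

typedef struct { int addr, val; } Store;
typedef struct { Store sb[SB_CAP]; int n; } Core;  /* FIFO store buffer */

static int mem[NVARS];  /* the "globally visible" state (think L1d) */

/* Executing a store only places it in the core's store buffer. */
static void store(Core *c, int addr, int val) {
    assert(c->n < SB_CAP);
    c->sb[c->n++] = (Store){addr, val};
}

/* Commit buffered stores to memory, oldest first. */
static void drain(Core *c) {
    for (int i = 0; i < c->n; i++) mem[c->sb[i].addr] = c->sb[i].val;
    c->n = 0;
}

/* A load forwards from the youngest matching buffered store of its
 * own core, otherwise it reads globally visible memory. */
static int load(const Core *c, int addr) {
    for (int i = c->n - 1; i >= 0; i--)
        if (c->sb[i].addr == addr) return c->sb[i].val;
    return mem[addr];
}

/* full == true models MFENCE (drain before later loads).
 * full == false models SFENCE + LFENCE: the FIFO already commits in
 * order, and nothing forces a drain. */
static void fence(Core *c, bool full) {
    if (full) drain(c);
}

/* One interleaving of the store-buffer litmus test:
 * core 0: X = 1; fence; r0 = Y      core 1: Y = 1; fence; r1 = X
 * Returns true if both loads observed 0. */
static bool both_loads_see_zero(bool full) {
    Core c0 = {0}, c1 = {0};
    mem[X] = mem[Y] = 0;
    store(&c0, X, 1);
    store(&c1, Y, 1);
    fence(&c0, full);
    fence(&c1, full);
    int r0 = load(&c0, Y);
    int r1 = load(&c1, X);
    drain(&c0);  /* the buffers do commit eventually */
    drain(&c1);
    return r0 == 0 && r1 == 0;
}
```

In this model both_loads_see_zero(false) returns true and both_loads_see_zero(true) returns false: with the weaker fence pair each store is still sitting in its buffer when the other core's load runs. The model explores only one interleaving, so it shows that the outcome is possible, not how often it happens on hardware.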
