Does a memory barrier act both as a marker and as an instruction?

Question

I have read different things about how a memory barrier works.

For example, the user Johan's answer in this question says that a memory barrier is an instruction that the CPU executes.

While the user Peter Cordes's comment in this question says the following about how the CPU reorders instructions:

It reads faster than it can execute, so it can see a window of upcoming instructions. For details, see some of the links in the x86 tag wiki, like Agner Fog's microarch pdf, and also David Kanter's writeup of Intel's Haswell design. Of course, if you had simply googled "out of order execution", you'd find the wikipedia article, which you should read.

So I'm guessing based on the above comment that if a memory barrier exists between the instructions, the CPU will see this memory barrier, which causes the CPU not to reorder the instructions, so this means that a memory barrier is a "marker" for the CPU to see and not to execute.

Now my guess is that a memory barrier acts both as a marker and as an instruction for the CPU to execute.

For the marker part, the CPU sees the memory barrier between the instructions, which causes the CPU not to reorder the instructions.

As for the instruction part, the CPU will execute the memory barrier instruction, which causes the CPU to do things like flushing the store buffer, and then the CPU will continue to execute the instructions after the memory barrier.

Am I correct?

Answer

No, mfence is not serializing on the instruction stream, and lfence (which is) doesn't flush the store buffer.

(In practice on Skylake, mfence does block out-of-order execution of later ALU instructions, not just loads. (Proof: experiment details at the bottom of this answer). So it's implemented as an execution barrier, even though on paper it's not required to be one. But lock xchg doesn't, and is also a full barrier.)

I'd suggest reading Jeff Preshing's Memory Barriers Are Like Source Control Operations article, to get a better understanding of what memory barriers need to do, and what they don't need to do. They don't (need to) block out-of-order execution in general.

A memory barrier restricts the order that memory operations can become globally visible, not (necessarily) the order in which instructions execute. Go read @BeeOnRope's updated answer to your previous question again: Does an x86 CPU reorder instructions? to learn more about how memory reordering can happen without OoO exec, and how OoO exec can happen without memory reordering.
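
To make "memory visibility, not instruction order" concrete, here is a minimal store-buffering litmus test in C11 (my own sketch, not from the answer above): without a full barrier, x86 allows both threads to read 0 (StoreLoad reordering via the store buffer); with a seq_cst fence (compiled to mfence or a lock-prefixed instruction on x86) between each store and load, that outcome is forbidden. All names here are illustrative.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y;   /* the two shared flags */
static int r1, r2;        /* what each thread observed */

static void *thread_a(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* full barrier: store above is
                                                  globally visible before the
                                                  load below can sample y */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *thread_b(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* same full barrier */
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/* Run both threads once; with the fences, r1 == 0 && r2 == 0 is impossible:
   at least one thread must see the other's store. */
static void run_litmus(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
}
```

Note that the fences don't stop the CPU from executing other, unrelated instructions out of order around them; they only constrain when these stores and loads become visible relative to each other.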

Stalling the pipeline and flushing buffers is one (low-performance) way to implement barriers, used on some ARM chips, but higher-performance CPUs with more tracking of memory ordering can have cheaper memory barriers that only restrict ordering of memory operations, not all instructions. And for memory ops, they control order of access to L1d cache (at the other end of the store buffer), not necessarily the order that stores write their data into the store buffer.

x86 already needs lots of memory-order tracking for normal loads/stores for high performance while maintaining its strongly-ordered memory model where only StoreLoad reordering is allowed to be visible to observers outside the core (i.e. stores can be buffered until after later loads). (Intel's optimization manual uses the term Memory Order Buffer, or MOB, instead of store buffer, because it has to track load ordering as well. It has to do a memory-ordering machine clear if it turns out that a speculative load took data too early.) Modern x86 CPUs preserve the illusion of respecting the memory model while actually executing loads and stores aggressively out of order.

mfence can do its job just by writing a marker into the memory-order buffer, without being a barrier for out-of-order execution of later ALU instructions. This marker must at least prevent later loads from executing until the mfence marker reaches the end of the store buffer. (As well as ordering NT stores and operations on weakly-ordered WC memory).

(But again, simpler behaviour is a valid implementation choice, for example not letting any stores after an mfence write data to the store buffer until all earlier loads have retired and earlier stores have committed to L1d cache. i.e. fully drain the MOB / store buffer. I don't know exactly what current Intel or AMD CPUs do.)

On Skylake specifically, my testing shows mfence is 4 uops for the front-end (fused domain), and 2 uops that actually execute on execution ports (one for port2/3 (load/store-address), and one for port4 (store-data)). Presumably it's a special kind of uop that writes a marker into the memory-order buffer. The 2 uops that don't need an execution unit might be similar to lfence. I'm not sure if they block the front-end from even issuing a later load, but hopefully not because that would stop later independent ALU operations from being executed.

lfence is an interesting case: as well as being a LoadLoad + LoadStore barrier (even for weakly-ordered loads; normal loads/stores are already ordered), lfence is also a weak execution barrier (note that mfence isn't, just lfence). It can't execute until all earlier instructions have "completed locally". Presumably that means "retired" from the out-of-order core.

But a store can't commit to L1d cache until after it retires anyway (i.e. after it's known to be non-speculative), so waiting for stores to retire from the ROB (ReOrder Buffer for uops) isn't the same thing as waiting for the store buffer to empty. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?.

So yes, the CPU pipeline does have to "notice" lfence before it executes, presumably in the issue/rename stage. My understanding is that lfence can't issue until the ROB is empty. (On Intel CPUs, lfence is 2 uops for the front-end, but neither of them need execution units, according to Agner Fog's testing. http://agner.org/optimize/.)

lfence is even cheaper on AMD Bulldozer-family: 1 uop with 4-per-clock throughput. IIRC, it's not partially-serializing on those CPUs, so you can only use lfence; rdtsc to stop rdtsc from sampling the clock early on Intel CPUs.
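
The lfence; rdtsc idiom mentioned above can be sketched with intrinsics like this (x86-only; the wrapper's name is mine, not from the answer):

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_lfence, __rdtsc */

/* Read the TSC without letting it sample early or late: the first lfence
   waits for all earlier instructions to complete locally before rdtsc runs,
   and the second keeps later work from starting before the timestamp is
   taken. On Intel CPUs this is what makes timing measurements meaningful. */
static inline uint64_t rdtsc_serialized(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
```

Two back-to-back calls from the same thread should never observe the counter going backwards, since each read is ordered with respect to the surrounding instructions.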

For fully serializing instructions like cpuid or iret, it would also wait until the store buffer has drained. (They're full memory barriers, as strong as mfence). Or something like that; they're multiple uops so maybe only the last one does the serializing, I'm not sure which side of the barrier the actual work of cpuid happens on (or if it can't overlap with either earlier or later instructions). Anyway, the pipeline itself has to notice serializing instructions, but the full memory-barrier effect might be from uops that do what mfence does.

On AMD Bulldozer-family, sfence is as expensive as mfence, and may be as strong a barrier. (The x86 docs set a minimum for how strong each kind of barrier is; they don't prevent them from being stronger because that's not a correctness problem). Ryzen is different: sfence has one per 20c throughput, while mfence is 1 per 70c.

sfence is very cheap on Intel (a uop for port2/port3, and a uop for port4), and just orders NT stores wrt. normal stores, not flushing the store buffer or serializing execution. It can execute at one per 6 cycles.

sfence doesn't drain the store buffer before retiring. It doesn't become globally visible itself until all preceding stores have become globally visible first, but this is decoupled from the execution pipeline by the store buffer. The store buffer is always trying to drain itself (i.e. commit stores to L1d) so sfence doesn't have to do anything special, except for putting a special kind of mark in the MOB that stops NT stores from reordering past it, unlike the marks that regular stores put which only order wrt. regular stores and later loads.
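
A sketch of the one case where sfence actually matters on Intel: ordering a weakly-ordered NT (non-temporal) store before a normal release store that publishes it. The function and variable names are illustrative, not from the answer.

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_si32, _mm_sfence */
#include <stdatomic.h>

static int payload;
static atomic_int ready;

/* Write the payload with an NT store (weakly ordered, bypasses the cache),
   then publish it. Without the sfence, the release store to `ready` would
   NOT be guaranteed to order after the NT store, so a reader could see
   ready == 1 but a stale payload. */
void publish(int value) {
    _mm_stream_si32(&payload, value);
    _mm_sfence();   /* NT store becomes globally visible before the flag */
    atomic_store_explicit(&ready, 1, memory_order_release);
}
```

For ordinary (non-NT) stores to write-back memory, the release store alone would suffice on x86; sfence is only needed to fence the weakly-ordered store.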

It reads faster than it can execute, so it can see a window of upcoming instructions.

See this answer I wrote which is a more detailed version of my comment. It goes over some basics of how a modern x86 CPU finds and exploits instruction-level parallelism by looking at instructions that haven't executed yet.

In code with high ILP, recent Intel CPUs can actually bottleneck on the front-end fairly easily; the back-end has so many execution units that it's rarely a bottleneck unless there are data dependencies or cache misses, or you use a lot of a single instruction that can only run on limited ports. (e.g. vector shuffles). But any time the back-end doesn't keep up with the front-end, the out-of-order window starts to fill with instructions to find parallelism in.
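
To make the ILP point concrete, here is the classic illustration (my own example, not from the answer): a single-accumulator sum is bound by the latency of one long dependency chain, while splitting the work into independent accumulators gives the out-of-order core parallel chains to execute.

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous one, so the loop runs
   at one add per floating-point add latency. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: four independent dependency chains that the
   out-of-order core can keep in flight at once. */
double sum_ilp(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two functions can give slightly different results for general floating-point data, since the multi-accumulator version reassociates the additions; for small integer-valued inputs the results are identical.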
