Globally Invisible load instructions

Problem Description

Can some load instructions never become globally visible because of store-to-load forwarding? To put it another way, if a load instruction gets its value from the store buffer, it never has to read from the cache.
Since it is generally stated that a load becomes globally visible when it reads from the L1D cache, loads that do not read from the L1D should be globally invisible.

Recommended Answer

The concept of global visibility for loads is tricky, because a load doesn't modify the global state of memory, and other threads can't directly observe it.

But once the dust settles after out-of-order / speculative execution, we can tell what value the load got if the thread stores it somewhere, or branches based on it. This observable behaviour of the thread is what's important. (Or we could observe it with a debugger, and/or just reason about what values a load could possibly see, if an experiment is difficult.)

At least on strongly-ordered CPUs like x86, all CPUs can agree on a total order of stores becoming globally visible, updating the single coherent+consistent cache+memory state. On x86, where StoreStore reordering isn't allowed, this TSO (Total Store Order) agrees with program-order of each thread. (I.e. the total order is some interleaving of program order from each thread). SPARC TSO is also this strongly ordered.
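
A minimal C++11 sketch of what that buys you (the variable names data and flag are hypothetical, not from the question): because the stores become globally visible in program order, a reader that sees flag set must also see data.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message-passing litmus test.  On a TSO machine like x86, the two stores in
// writer() become globally visible in program order, so a reader that sees
// flag == 1 must also see data == 1.  release/acquire is used so the
// *compiler* can't reorder them either; on x86 both compile to plain mov
// instructions -- the hardware's TSO already provides the ordering.
std::atomic<int> data{0};
std::atomic<int> flag{0};

void writer() {
    data.store(1, std::memory_order_relaxed);
    flag.store(1, std::memory_order_release);       // plain mov on x86
}

void reader() {
    if (flag.load(std::memory_order_acquire) == 1)  // plain mov on x86
        assert(data.load(std::memory_order_relaxed) == 1);  // never fires
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}
```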

(Correctly observing the global order of your own stores relative to other stores requires mfence or similar: otherwise store-forwarding means you can see your own stores right away, before they become visible to other cores. x86 TSO is basically program order plus store-forwarding.)
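
A minimal sketch of that effect (hypothetical variables x and y), the classic store-buffer litmus test where each thread sees its own store early:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void threadA() {
    x.store(1, std::memory_order_release);   // plain mov: waits in the store buffer
    r1 = y.load(std::memory_order_acquire);  // may execute before x = 1 is globally visible
}

void threadB() {
    y.store(1, std::memory_order_release);
    r2 = x.load(std::memory_order_acquire);
}

int main() {
    std::thread a(threadA), b(threadB);
    a.join();
    b.join();
    // r1 == 0 && r2 == 0 is a legal outcome on x86: each thread's load runs
    // while its own store is still sitting in its store buffer (StoreLoad
    // reordering).  Making all four accesses memory_order_seq_cst (the stores
    // then compile to xchg, or mov + mfence) forces the store buffer to drain
    // before the load and forbids that outcome.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```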

(For cache-bypassing stores, global visibility is when they're flushed from private write-combining buffers into DRAM. Intel Line Fill Buffers or any equivalent private write-combining mechanism where store data is still not visible to other CPUs is effectively part of the store buffer for our reordering purposes.)

On a weakly-ordered ISA, threads A and B might not agree on the order of stores X and Y done by threads C and D, even if the reading threads use acquire-loads to make sure their own loads aren't reordered. i.e. there might not be a global order of stores at all, let alone one that matches program order.
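
The canonical litmus test for this is IRIW (Independent Reads of Independent Writes); a minimal sketch, with hypothetical variable names:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> X{0}, Y{0};   // written by threads C and D respectively
int a1, a2, b1, b2;            // results observed by reader threads A and B

void writerC() { X.store(1, std::memory_order_release); }
void writerD() { Y.store(1, std::memory_order_release); }

void readerA() {               // reads X then Y, both acquire
    a1 = X.load(std::memory_order_acquire);
    a2 = Y.load(std::memory_order_acquire);
}
void readerB() {               // reads Y then X, both acquire
    b1 = Y.load(std::memory_order_acquire);
    b2 = X.load(std::memory_order_acquire);
}

int main() {
    std::thread tc(writerC), td(writerD), ta(readerA), tb(readerB);
    tc.join(); td.join(); ta.join(); tb.join();
    // Outcome a1==1, a2==0, b1==1, b2==0 means A saw X's store before Y's
    // while B saw Y's store before X's: the readers disagree on the order of
    // two independent stores.  C++ acquire/release allows this (and it is
    // observable on POWER); it is forbidden on x86, and forbidden everywhere
    // if all the accesses use memory_order_seq_cst.
}
```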

The IBM POWER ISA is that weak, and so is the C++11 memory model (Will two atomic writes to different locations in different threads always be seen in the same order by other threads?). But the mechanism in practice on POWER is that (retired aka graduated) stores become visible to some other cores before they become globally visible by committing to L1d cache. Cache itself really is coherent even in POWER systems, like all normal CPUs, and allows sequential-consistency to be recovered with barriers. These multiple-order effects only happen due to SMT (multiple logical CPUs on one physical CPU) providing a way to see stores from other logical cores without going through cache.

(One possible mechanism would be letting other logical threads snoop non-speculative stores from the store buffer even before they commit to L1d, keeping only not-yet-retired stores private to a logical thread. This could reduce inter-thread latency slightly. x86 can't do this because it would break the strong memory model; Intel's HT statically partitions the store buffer when two threads are active on a core. But as @BeeOnRope comments, an abstract model of what reorderings are allowed is probably a better approach for reasoning about correctness. Just because you can't think of a HW mechanism to cause a reordering doesn't mean it can't happen.)

Weakly-ordered ISAs that aren't as weak as POWER (in practice and/or on paper) still do reordering in the local store buffer of each core if barriers or release-stores aren't used, though. On many CPUs there is a global order for all stores, but it's not some interleaving of program order. OoO CPUs have to track memory order so a single thread doesn't need barriers to see its own stores in order, but allowing stores to commit from the store buffer to L1d out of program order could certainly improve throughput (especially if there are multiple stores pending for the same line, but committing in program order would evict that line from a set-associative cache between the stores, e.g. with a nasty histogram access pattern).

The above is still only about store visibility, not loads. Can we explain the value seen by every load as being read from global memory/cache at some point (disregarding any load-ordering rules)?

If so, then all the load results can be explained by putting all the stores and loads by all threads into some combined order, reading and writing a coherent global state of memory.

It turns out that no, we can't; the store buffer breaks this. Partial store-to-load forwarding gives us a counter-example (on x86, for example): a narrow store followed by a wide load can merge data from the store buffer with data from the L1d cache from before the store becomes globally visible. Real x86 CPUs actually do this, and we have real experiments to prove it.
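
A sketch of that counter-example (hypothetical names; union type punning and mixed-size concurrent access are outside the C++ memory model, so this models the machine-level behaviour on little-endian x86 rather than well-defined C++):

```cpp
#include <cstdint>

// Two adjacent bytes; each "core" function represents code running on a
// different physical core.
union Pair {
    uint8_t  b[2];
    uint16_t whole;
};
volatile Pair p = {{0, 0}};

uint16_t core0() {
    p.b[0] = 1;        // narrow store: sits in core 0's store buffer
    return p.whole;    // wider reload: the low byte can be merged in from the
                       // store buffer while the high byte comes from L1d as it
                       // was *before* core 1's store arrives
}

void core1() {
    p.b[1] = 1;        // independent narrow store on another core
}

// core0() can return 0x0001 even if core 1's store becomes globally visible
// before core 0's.  In that case the global sequence of 16-bit values is
// 0x0000 -> 0x0100 -> 0x0101, so the observed value 0x0001 never exists in
// the global coherent state at any instant: the load result can't be
// explained as a single read of coherent memory at some point in the order.
```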

If you only look at full store-forwarding, where the load only takes its data from one store in the store buffer, you could argue that the load is delayed by the store buffer. i.e. that the load appears in the global total load-store order right after the store that makes that value globally visible.

(This global total load-store order isn't an attempt to create an alternative memory-ordering model; it has no way to describe x86's actual load ordering rules.)

If a store from another core changes the surrounding bytes, an atomic wide load could read a value that never existed, and never will exist, in the global coherent state.

See my answer on Can x86 reorder a narrow store with a wider load that fully contains it?, and Alex's answer for experimental proof that such reordering can happen, making the proposed locking scheme in that question invalid. A store and then a reload from the same address isn't a StoreLoad memory barrier.
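
Roughly, the idea in that question was a two-thread lock built from one flag byte per thread plus a wide reload; the sketch below is a reconstruction with hypothetical names (again machine-level pseudocode rather than well-defined C++), showing where partial forwarding breaks it:

```cpp
#include <cstdint>

union LockWord {
    uint8_t  flag[2];    // one claim byte per thread
    uint16_t both;
};
volatile LockWord lk = {{0, 0}};

bool try_lock(int tid) {              // tid is 0 or 1
    lk.flag[tid] = 1;                 // narrow store: claim goes into the store buffer
    uint16_t seen = lk.both;          // wide reload of both bytes
    // Hoped-for behaviour: the wide load fully contains the narrow store, so
    // the CPU would have to wait for the store to commit, making the reload
    // see the other thread's up-to-date byte.
    // Actual behaviour: the CPU can merge my byte from the store buffer with
    // a stale copy of the other byte from L1d, before my store is globally
    // visible -- so both threads can see the other byte as 0 and both
    // "acquire" the lock.
    return (seen & (0xFFu << (8 * (1 - tid)))) == 0;   // other byte still clear?
}
```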

Some people (e.g. Linus Torvalds) describe this by saying the store buffer isn't coherent. (Linus was replying to someone else who had independently invented the same invalid locking idea.)

Another Q&A involving the store buffer and coherency: How to set bits of a bit vector efficiently in parallel?. You can do some non-atomic ORs to set bits, then come back and check for missed updates due to conflicts with other threads. But you need a StoreLoad barrier (e.g. an x86 lock or instruction) to make sure you don't just see your own stores when you reload.
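
A sketch of the role the barrier plays there (hypothetical names; this shows only the barrier part, not a complete lost-update recovery protocol):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

std::atomic<uint64_t> bitmap[1024];   // hypothetical shared bit vector

void set_bit_cheap(std::size_t i) {
    std::atomic<uint64_t>& word = bitmap[i / 64];
    const uint64_t mask = 1ull << (i % 64);

    // Cheap, non-atomic read-modify-write: a concurrent update to the same
    // word can be lost (ours or theirs).
    word.store(word.load(std::memory_order_relaxed) | mask,
               std::memory_order_relaxed);

    // StoreLoad barrier (mfence or a locked instruction on x86).  Without it,
    // the reload below could be satisfied by forwarding our own store from
    // the store buffer, so we would always see our bit set and never detect
    // that another thread's store wiped it out.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // Re-check after our store is globally visible; repair with a real
    // atomic RMW if our bit went missing.
    if (!(word.load(std::memory_order_relaxed) & mask))
        word.fetch_or(mask, std::memory_order_relaxed);
}
```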

This definition agrees with x86 manuals which say that loads aren't reordered with other loads. i.e. they load (in program order) from the local core's view of memory.

The load itself can become globally visible independently of whether any other thread could ever load that value from that address.

Although perhaps it would make more sense not to talk about "global visibility" of cacheable loads at all, because they're pulling data from somewhere, not doing anything with a visible effect. Only uncacheable loads (e.g. from an MMIO region) should be considered visible side-effects.

(On x86, uncacheable stores and loads are very strongly ordered, so store-forwarding to an uncacheable load is, I think, impossible. Unless maybe the store was done via a WB mapping of the same physical page as the UC load is accessing.)
