Globally Invisible load instructions


Question



Can some of the load instructions be never globally visible due to store load forwarding ? To put it another way, if a load instruction gets its value from the store buffer, it never has to read from the cache.
As it is generally stated that a load is globally visible when it reads from the L1D cache, the ones that do not read from the L1D should make it globally invisible.

Solution

The concept of global visibility for loads is tricky, because a load doesn't modify the global state of memory, and other threads can't directly observe it.

But once the dust settles after out-of-order / speculative execution, we can tell what value the load got if the thread stores it somewhere, or branches based on it. This observable behaviour of the thread is what's important. (Or we could observe it with a debugger, and/or just reason about what values a load could possibly see, if an experiment is difficult.)


At least on strongly-ordered CPUs like x86, all CPUs can agree on a total order of stores becoming globally visible, updating the single coherent+consistent cache+memory state. On x86, where StoreStore reordering isn't allowed, this TSO (Total Store Order) agrees with program-order of each thread. (I.e. the total order is some interleaving of program order from each thread). SPARC TSO is also this strongly ordered.

(For cache-bypassing stores, global visibility is when they're flushed from non-coherent write-combining buffers into DRAM.)

On a weakly-ordered ISA, threads A and B might not agree on the order of stores X and Y done by threads C and D, even if the reading threads use acquire-loads to make sure their own loads aren't reordered. i.e. there might not be a global order of stores at all, let alone having it not be the same as program order.

The IBM POWER ISA is that weak, and so is the C++11 memory model (Will two atomic writes to different locations in different threads always be seen in the same order by other threads?). That would seem to conflict with the model of stores becoming globally visible when they commit from the store buffer to L1d cache. But @BeeOnRope says in comments that the cache really is coherent, and allows sequential-consistency to be recovered with barriers. These multiple-order effects only happen due to SMT (multiple logical CPUs on one physical CPU) causing extra-weird local reordering.

(One possible mechanism would be letting other logical threads snoop non-speculative stores from the store buffer even before they commit to L1d, only keeping not-yet-retired stores private to a logical thread. This could reduce inter-thread latency slightly. x86 can't do this because it would break the strong memory model; Intel's HT statically partitions the store buffer when two threads are active on a core. But as @BeeOnRope comments, an abstract model of what reorderings are allowed is probably a better approach for reasoning about correctness. Just because you can't think of a HW mechanism to cause a reordering doesn't mean it can't happen.)

Weakly-ordered ISAs that aren't as weak as POWER still do reordering in the local store buffer of each core, if barriers or release-stores aren't used, though. On many CPUs there is a global order for all stores, but it's not some interleaving of program order. OoO CPUs have to track memory order so a single thread doesn't need barriers to see its own stores in order, but allowing stores to commit from the store buffer to L1d out of program order could certainly improve throughput (especially if there are multiple stores pending for the same line, but program order would evict the line from a set-associative cache between each store. e.g. a nasty histogram access pattern.)


Let's do a thought experiment about where load data comes from

The above is still only about store visibility, not loads. Can we explain the value seen by every load as being read from global memory/cache at some point (disregarding any load-ordering rules)?

If so, then all the load results can be explained by putting all the stores and loads by all threads into some combined order, reading and writing a coherent global state of memory.

It turns out that no, we can't, the store buffer breaks this: partial store-to-load forwarding gives us a counter-example (on x86 for example). A narrow store followed by a wide load can merge data from the store buffer with data from the L1d cache from before the store becomes globally visible. Real x86 CPUs actually do this, and we have the real experiments to prove it.

If you only look at full store-forwarding, where the load only takes its data from one store in the store buffer, you could argue that the load is delayed by the store buffer. i.e. that the load appears in the global total load-store order right after the store that makes that value globally visible.

(This global total load-store order isn't an attempt to create an alternative memory-ordering model; it has no way to describe x86's actual load ordering rules.)


Partial store-forwarding exposes the fact that load data doesn't always come from the global coherent cache domain.

If a store from another core changes the surrounding bytes, an atomic wide load could read a value that never existed, and never will exist, in the global coherent state.

See my answer on Can x86 reorder a narrow store with a wider load that fully contains it?, and Alex's answer for experimental proof that such reordering can happen, making the proposed locking scheme in that question invalid. A store and then a reload from the same address isn't a StoreLoad memory barrier.

Some people (e.g. Linus Torvalds) describe this by saying the store buffer isn't coherent. (Linus was replying to someone else who had independently invented the same invalid locking idea.)

Another Q&A involving the store buffer and coherency: How to set bits of a bit vector efficiently in parallel?. You can do some non-atomic ORs to set bits, then come back and check for missed updates due to conflicts with other threads. But you need a StoreLoad barrier (e.g. an x86 locked instruction or mfence) to make sure you don't just see your own stores when you reload.


A load becomes globally visible when it reads its data. Normally from L1d, but the store buffer or MMIO or uncacheable memory are other possible sources.

This definition agrees with x86 manuals which say that loads aren't reordered with other loads. i.e. they load (in program order) from the local core's view of memory.

The load itself can become globally visible independently of whether any other thread could ever load that value from that address.

