How does Intel x86 implement total order over stores?
Question
x86 guarantees a total order over all stores due to its TSO memory model. My question is whether anyone has an idea how this is actually implemented.

I have a good impression of how all 4 fences are implemented, so I can explain how local order is preserved. But the 4 fences just give program order; they won't give you TSO (I know TSO allows older stores to jump in front of newer loads, so only 3 out of the 4 fences are needed).

A total order over all memory actions on a single address is the responsibility of coherence. But I would like to know how Intel (Skylake in particular) implements a total order on stores across multiple addresses.
The x86 TSO memory model basically amounts to program-order plus a store buffer with store-forwarding.
Most of the resulting guarantees are fairly easy in theory for hardware to implement by simply having a store buffer and coherent shared memory; a store buffer insulates OoO exec from the in-order commit requirement (and from cache-miss stores), and makes it possible to speculatively execute stores and reloads.
All cores can agree on a total order in which all stores happened. Or more accurately, cores can't disagree on any part of the total order they can actually observe. Simultaneous stores to 2 different lines are simultaneous, so any observations are compatible with either order in a hypothetical total order.
This happens automatically if the only way to make a store visible to any other core makes it visible to all cores simultaneously. i.e. by committing to coherent L1d. This makes IRIW reordering impossible. (MESI ensures that a store can't commit to L1d unless it's exclusively owned by this core: no other cores have a valid copy.) (A core observing its own stores needs a full barrier or it will observe its own stores via store forwarding, not the global total order. Typical IRIW litmus tests are considering 4 total threads so no local reloads.)
In fact it's rare for any hardware not to have this property; some POWER CPUs can store-forward between SMT threads on the same physical core, making it possible for 2 readers to disagree about the order of stores by 2 writers (IRIW reordering). Even though x86 CPUs also often have SMT (e.g. Intel's HyperThreading), the memory model requires them not to store-forward between logical cores. That's fine; they statically partition the store buffer anyway. What will be used for data exchange between threads are executing on one Core with HT?. And also What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? for experimental testing.
The only reordering that happens is local, within each CPU core, between its accesses to that globally coherent shared state. (That's why local memory barriers that just make this core wait for stuff to happen, e.g. for the store buffer to drain, can recover sequential consistency on top of x86 TSO. The same applies even to weaker memory models, BTW: just local reordering on top of MESI coherency.)
The rest of these guarantees apply to each (logical) CPU core individually. (Q&A about how this can create synchronization between cores.)
Stores become visible in program order: in-order commit from the store buffer to L1d cache. (Store buffer entries are allocated in program order during issue/rename). This means cache miss stores must stall the store buffer, not letting younger stores commit. See Why doesn't RFO after retirement break memory ordering? for a simple mental model of this, and some detail on what Skylake may actually do (with committing data from store misses into LFBs while waiting for the cache lines to arrive).
Loads don't reorder with later stores: easy: require loads to fully complete (have taken data from L1d cache) before they can retire. Since retirement is in order, and a store can't commit to L1d until after it retires (becomes non-speculative), we get LoadStore ordering for free¹.
Loads take data from coherent cache (memory) in program order. This is the hard one: loads access global state (cache) when they execute, unlike stores where the store buffer can absorb the mismatch between OoO exec and in-order commit. Actually making every load dependent on previous loads would prevent hit-under-miss and kill a lot of the benefits of out-of-order execution for code that involved memory.
In practice, Intel CPUs speculate aggressively that a cache line that's present now will still be present when it's architecturally allowed for the load to happen (after earlier loads execute). If that isn't the case, nuke the pipeline (memory order mis-speculation). There's a perf counter event for this.
In practice everything can be more complicated to chase a bit more performance, or a lot more for speculative early loads.
(In C++ terms, this is at least as strong as acq_rel, but also covers behaviour of things that might be UB in C++. For example, a load partially overlapping a recent store to a location another thread might also be reading or writing, allowing this core to load a value that never appeared or will appear in memory for other threads to load. See Globally Invisible load instructions.)
Related Q&As:
- What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? - causes machine_clears.memory_ordering
- C++ How is release-and-acquire achieved on x86 only using MOV? - MESI cache coherency is a key part of all this.
- how are barriers/fences and acquire, release semantics implemented microarchitecturally?
- C++ How is release-and-acquire achieved on x86 only using MOV?
- Globally Invisible load instructions
- Why flush the pipeline for Memory Order Violation caused by other logical processors?
- How does memory reordering help processors and compilers?
- Does a memory barrier ensure that the cache coherence has been completed? - that's not even the right mental model.
Footnote 1:
Some OoO exec weakly-ordered CPUs can do LoadStore reordering, presumably by letting loads retire from the ROB as long as the load has checked permissions and requested the cache line (for a miss), even if the data hasn't actually arrived yet. This needs some tracking of the destination register not being ready, separate from the usual instruction scheduler.
LoadStore reordering is actually easier to understand on an in-order pipeline, where we know special handling for cache-miss loads is needed for acceptable performance. How is load->store reordering possible with in-order commit?