How does Intel x86 implement total order over stores


Question


x86 guarantees a total order over all stores due to its TSO memory model. My question is whether anyone has an idea how this is actually implemented.

I have a good idea of how all 4 fences are implemented, so I can explain how local order is preserved. But the 4 fences alone just give program order; they won't give you TSO (I know TSO allows older stores to jump in front of newer loads, so only 3 of the 4 fences are needed).

Total order over all memory actions on a single address is the responsibility of coherence. But I would like to know how Intel (Skylake in particular) implements a total order on stores over multiple addresses.

Solution

The x86 TSO memory model basically amounts to program-order plus a store buffer with store-forwarding.

Most of the resulting guarantees are fairly easy in theory for hardware to implement by simply having a store buffer and coherent shared memory; a store buffer insulates OoO exec from the in-order commit requirement (and from cache-miss stores), and makes it possible to speculatively execute stores and reloads.

  • All cores can agree on a total order in which all stores happened. Or more accurately, cores can't disagree on any part of the total order they can actually observe. Simultaneous stores to 2 different lines are simultaneous, so any observations are compatible with either order in a hypothetical total order.

    This happens automatically if the only way to make a store visible to any other core makes it visible to all cores simultaneously. i.e. by committing to coherent L1d. This makes IRIW reordering impossible. (MESI ensures that a store can't commit to L1d unless it's exclusively owned by this core: no other cores have a valid copy.) (A core observing its own stores needs a full barrier or it will observe its own stores via store forwarding, not the global total order. Typical IRIW litmus tests are considering 4 total threads so no local reloads.)

    In fact it's rare for any hardware not to have this property; some POWER CPUs can store-forward between SMT threads on the same physical core, making it possible for 2 readers to disagree about the order of stores by 2 writers (IRIW reordering). Even though x86 CPUs also often have SMT (e.g. Intel's Hyper-Threading), the memory model requires them not to store-forward between logical cores. That's fine; they statically partition the store buffer anyway. See What will be used for data exchange between threads executing on one Core with HT?, and also What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper-siblings? for experimental testing.

The only reordering that happens is local, within each CPU core, between its accesses to that globally coherent shared state. (That's why local memory barriers that just make this core wait for stuff to happen, e.g. for the store buffer to drain, can recover sequential consistency on top of x86 TSO. The same applies even to weaker memory models, BTW: just local reordering on top of MESI coherency.)
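A minimal sketch of such a local barrier (an illustration under my own naming, not code from the answer): a full fence between a store and a later load blocks the one reordering TSO allows, by making this core wait for its own store buffer to drain. On x86, compilers typically emit mfence (or use a locked RMW instruction) for this fence.

```cpp
// A full barrier between a store and a later load recovers seq_cst
// on top of TSO, purely by making the local core wait.
#include <atomic>

std::atomic<int> x{0}, y{0};

int store_then_load() {
    x.store(1, std::memory_order_relaxed);               // plain mov
    std::atomic_thread_fence(std::memory_order_seq_cst); // mfence on x86:
                                                         // wait for store
                                                         // buffer to drain
    return y.load(std::memory_order_relaxed);            // can't be satisfied
                                                         // before the store
                                                         // is globally visible
}
```

Note the fence is purely local: it doesn't push anything out to other cores, it only delays this core's later load, which is the point the paragraph above makes.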

The rest of these guarantees apply to each (logical) CPU core individually. (Q&A about how this can create synchronization between cores.)

  • Stores become visible in program order: in-order commit from the store buffer to L1d cache. (Store buffer entries are allocated in program order during issue/rename). This means cache miss stores must stall the store buffer, not letting younger stores commit. See Why doesn't RFO after retirement break memory ordering? for a simple mental model of this, and some detail on what Skylake may actually do (with committing data from store misses into LFBs while waiting for the cache lines to arrive).

  • Loads don't reorder with later stores: easy: require loads to fully complete (have taken data from L1d cache) before they can retire. Since retirement is in order, and a store can't commit to L1d until after it retires (becomes non-speculative), we get LoadStore ordering for free¹.

  • Loads take data from coherent cache (memory) in program order. This is the hard one: loads access global state (cache) when they execute, unlike stores where the store buffer can absorb the mismatch between OoO exec and in-order commit. Actually making every load dependent on previous loads would prevent hit-under-miss and kill a lot of the benefits of out-of-order execution for code that involved memory.

    In practice, Intel CPUs speculate aggressively that a cache line that's present now will still be present when it's architecturally allowed for the load to happen (after earlier loads execute). If that isn't the case, nuke the pipeline (memory order mis-speculation). There's a perf counter event for this.

In practice everything can be more complicated to chase a bit more performance, or a lot more for speculative early loads.

(In C++ terms, this is at least as strong as acq_rel, but also covers behaviour of things that might be UB in C++. For example, a load partially overlapping a recent store to a location another thread might also be reading or writing, allowing this core to load a value that never appeared or will appear in memory for other threads to load. Globally Invisible load instructions)



Footnote 1:
Some OoO exec weakly-ordered CPUs can do LoadStore reordering, presumably by letting loads retire from the ROB as long as the load has checked permissions and requested the cache line (for a miss), even if the data hasn't actually arrived yet. Some separate tracking of the destination register not being ready is needed, outside the usual instruction scheduler.

LoadStore reordering is actually easier to understand on an in-order pipeline, where we know special handling for cache-miss loads is needed for acceptable performance. How is load->store reordering possible with in-order commit?
