Why doesn't RFO after retirement break memory ordering?

Question

I thought that I understood how an L1D write miss is handled, but thinking carefully about it made me confused.

Here is the assembly snippet:

;rdi contains some valid 64-byte-aligned pointer
;rsi contains some data
mov [rdi], rsi
mov [rdi + 0x40], rsi        
mov [rdi + 0x20], rsi

Assume that the [rdi] and [rdi + 0x40] lines are not in the Exclusive or Modified state in L1d. Then I can imagine the following sequence of actions:

  1. mov [rdi], rsi retires.
  2. mov [rdi], rsi tries to write data into L1d. An RFO is initiated, and the data is placed into the WC buffer.
  3. mov [rdi + 0x40], rsi retires (mov [rdi], rsi has already retired, so that's possible).
  4. mov [rdi + 0x40], rsi initiates an RFO for the next cache line; the data is placed into the WC buffer.
  5. mov [rdi + 0x20], rsi retires (mov [rdi + 0x40], rsi has already retired, so that's possible).
  6. mov [rdi + 0x20], rsi notices that the RFO for [rdi] is already in progress. The data is placed into the WC buffer.

BOOM! The [rdi] RFO happens to finish before the [rdi + 0x40] RFO, so the data of mov [rdi], rsi and mov [rdi + 0x20], rsi can now be committed to the cache. That breaks memory ordering.

How is such a case handled to maintain correct memory ordering?
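
For concreteness, here is a minimal C++ litmus-test sketch of the ordering guarantee in question (an illustration with assumed names and layout; the release/acquire operations compile to plain mov instructions on x86, so the hardware has to provide the ordering the assert relies on):

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// One 64-byte cache line holding two 8-byte slots.
struct alignas(64) Line {
    std::atomic<uint64_t> first{0};   // offset 0x00 within the line
    std::atomic<uint64_t> second{0};  // stands in for offset 0x20
};

Line line0, line1;  // two distinct cache lines, like [rdi] and [rdi + 0x40]

void writer() {
    // Same program order as the asm: [rdi], [rdi + 0x40], [rdi + 0x20].
    // On x86 these release stores compile to plain mov stores.
    line0.first.store(1, std::memory_order_release);
    line1.first.store(1, std::memory_order_release);
    line0.second.store(1, std::memory_order_release);
}

void reader() {
    // Wait until the third store is visible...
    while (line0.second.load(std::memory_order_acquire) == 0) { }
    // ...then the second store must be visible as well: stores become
    // globally visible in program order (also guaranteed by C++ acq/rel here).
    assert(line1.first.load(std::memory_order_acquire) == 1);
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}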

Answer

Starting an RFO can be separate from placing the store data into an LFB; e.g. starting RFOs early for entries that aren't yet at the head of the store buffer can allow memory-level parallelism for stores. What you've proved is that for that to happen, store data can't always move into an LFB (Line Fill Buffer, also used for NT / WC stores).

If an RFO could only happen by moving store data from the store buffer (SB) into an LFB, then yes, you could only RFO for the head of the SB, not in parallel for any graduated entry. (A "graduated" store is one whose uops have retired from the ROB, i.e. become non-speculative). But if you don't have that requirement, you could RFO even earlier, even speculatively, but you probably wouldn't want to.[1]

(Given @BeeOnRope's findings about how multiple cache-miss stores to the same line can commit into an LFB, and then another LFB for another line, this might be the mechanism for having multiple RFOs in flight, not just the SB head. We'd have to check if an ABA store pattern limited memory-level parallelism. If that's the case, then maybe starting an RFO is the same as moving the data from the SB to an LFB, freeing that SB entry. But note that the new head of the SB still couldn't commit until those pending RFOs complete and commit the stores from the LFBs.)

On a store miss, the store buffer entry holds the store data until the RFO is complete, and commits straight into L1d (flipping the line from Exclusive to Modified state). Strong ordering is ensured by in-order commit from the head of the store buffer.[2]
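
As an illustration only, here is a toy software model of that rule (a simplified sketch, not a description of the real hardware): RFOs may be outstanding for several graduated entries at once, but data leaves the store buffer only from the head, in program order, so the scenario in the question cannot commit the [rdi + 0x20] store ahead of the [rdi + 0x40] one:

#include <cstdint>
#include <deque>
#include <iostream>

struct StoreEntry {
    uint64_t line;      // cache-line address (addr & ~63)
    uint64_t data;
    bool rfo_done;      // true once the line is owned (Exclusive/Modified) in L1d
};

struct StoreBuffer {
    std::deque<StoreEntry> entries;  // front = oldest (head), back = youngest (tail)

    // Allocation happens in program order, at the tail.
    void add(uint64_t addr, uint64_t data) {
        entries.push_back({addr & ~uint64_t{63}, data, false});
    }

    // An RFO may complete for any entry, giving memory-level parallelism.
    void rfo_complete(uint64_t line) {
        for (auto &e : entries)
            if (e.line == line) e.rfo_done = true;
    }

    // But commit to L1d is only allowed from the head, so a younger store whose
    // line is already owned (the [rdi + 0x20] case) cannot overtake an older
    // store whose RFO is still pending.
    void try_commit() {
        while (!entries.empty() && entries.front().rfo_done) {
            std::cout << "commit data " << entries.front().data
                      << " to line 0x" << std::hex << entries.front().line
                      << std::dec << '\n';
            entries.pop_front();
        }
    }
};

int main() {
    StoreBuffer sb;
    sb.add(0x1000, 1);  // mov [rdi], rsi
    sb.add(0x1040, 2);  // mov [rdi + 0x40], rsi
    sb.add(0x1020, 3);  // mov [rdi + 0x20], rsi  (same line as the first store)

    sb.rfo_complete(0x1000);  // the RFO for line [rdi] happens to finish first
    sb.try_commit();          // commits store 1 only; store 3 is ready but not the head
    sb.rfo_complete(0x1040);  // line [rdi + 0x40] arrives
    sb.try_commit();          // now stores 2 and 3 commit, still in program order
}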

As @HadiBrais wrote in an answer to Where is the Write-Combining Buffer located? x86:

My understanding is that for cacheable stores, only the RFO request is held in the LFB, but the data to be stored waits in the store buffer until the target line is fetched into the LFB entry allocated for it. This is supported by the following statement from Section 2.4.5.2 of the Intel optimization manual:

The L1 DCache can maintain up to 64 load micro-ops from allocation until retirement. It can maintain up to 36 store operations from allocation until the store value is committed to the cache, or written to the line fill buffers (LFB) in the case of non-temporal stores.

This is pretty much fine for thinking about performance tuning, but probably not for MDS vulnerabilities that can speculatively use stale data that faulting loads read from an LFB or whatever.

Any store coalescing or other tricks must necessarily respect the memory model.

We know CPUs can't violate their memory model, and that speculation + roll back isn't an option for commit to globally-visible state like L1d, or for graduated stores in general because the uops are gone from the ROB. They've already happened as far as local OoO exec is concerned, it's just a matter of when they'll become visible to other cores. Also we know that LFBs themselves are not globally visible. (There's some indication that LFBs are snooped by loads from this core, like the store buffer, but as far as MESI states they're more like an extension of the store buffer.)
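
As a side note, the classic "store buffering" litmus test makes the "already happened locally, not yet visible to other cores" point concrete: each thread's store can sit in its own (not globally visible) store buffer while its load reads the other variable, so both threads may read 0. Below is a hypothetical C++ sketch of that test (relaxed atomics are assumed so the stores compile to plain mov with no fence; the outcome is also permitted by the C++ memory model itself, so this illustrates rather than proves the hardware behaviour):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};   // the two shared variables of the litmus test
std::atomic<int> start{0};     // crude start gate so both threads race
int r1, r2;

void t1() {
    while (start.load(std::memory_order_acquire) == 0) { }
    x.store(1, std::memory_order_relaxed);    // plain mov store: sits in the store buffer
    r1 = y.load(std::memory_order_relaxed);   // plain mov load: may "pass" the store globally
}

void t2() {
    while (start.load(std::memory_order_acquire) == 0) { }
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    int both_zero = 0;
    const int iters = 20000;
    for (int i = 0; i < iters; ++i) {
        x = 0; y = 0; r1 = r2 = -1; start = 0;
        std::thread a(t1), b(t2);
        start.store(1, std::memory_order_release);
        a.join(); b.join();
        if (r1 == 0 && r2 == 0) ++both_zero;  // both stores were still only locally visible
    }
    std::printf("r1 == 0 && r2 == 0 in %d of %d runs\n", both_zero, iters);
}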

@BeeOnRope has done some more experiments, finding some evidence that a series of stores like AAABBCCCC can drain into three LFBs, for lines A, B, C. See the RWT thread with an experiment that demonstrates the 4x perf difference predicted by this theory.

This implies that the CPU can track order between LFBs, although still not within a single LFB of course. A sequence like AAABBCCCCA (or ABA) would not be able to commit past the final A store because the "current head" LFB is for line C, and there's already an LFB waiting for line A to arrive. A 4th line (D) would be ok, opening a new LFB, but adding to an already-open LFB waiting for an RFO that isn't the head is not ok. See @Bee's summary in comments.
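
For reference, here is a rough sketch of the kind of microbenchmark that can probe this (a hypothetical construction, not @BeeOnRope's actual RWT experiment; the buffer size, access pattern, and timing method are arbitrary assumptions, and a trustworthy measurement needs perf counters, warm-up, and repeated runs). It compares an interleaving that revisits a line after touching another one (ABAB) with one that finishes each line before moving on (AABB), over a buffer large enough that most stores miss in cache:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const size_t kLine  = 64;
    const size_t kLines = size_t{1} << 22;   // ~256 MiB of lines: mostly cache misses
    std::vector<uint8_t> buf(kLines * kLine);

    auto run = [&](bool aba) {
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i + 1 < kLines; i += 2) {
            volatile uint8_t *a = buf.data() + i * kLine;
            volatile uint8_t *b = buf.data() + (i + 1) * kLine;
            if (aba) { *a = 1; *b = 1; *a = 2; *b = 2; }   // A B A B: revisit line A after B
            else     { *a = 1; *a = 2; *b = 1; *b = 2; }   // A A B B: finish line A first
        }
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    };

    std::printf("ABAB pattern: %.1f ms\n", run(true));
    std::printf("AABB pattern: %.1f ms\n", run(false));
}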

All of this is only tested for Intel CPUs, AFAIK.

(This section not updated in light of @BeeOnRope's new discovery).

There's also no solid evidence of any kind of store merging / coalescing in the store buffer on modern Intel or AMD CPUs, or of using a WC buffer (LFB on Intel) to hold store data while waiting for a cache line to arrive. See discussion in comments under Are two store buffer entries needed for split line/page stores on recent Intel?. We can't rule out some minor form of it near the commit end of the store buffer.

We know that some weakly-ordered RISC microarchitectures definitely do merge stores before they commit, especially to create a full 4-byte or 8-byte write of a cache ECC granule to avoid an RMW cycle. But Intel CPUs don't have any penalty for narrow or unaligned stores within a cache line.

For a while @BeeOnRope and I thought there was some evidence of store coalescing, but we've changed our minds. Size of store buffers on Intel hardware? What exactly is a store buffer? has some more details (and links to older discussions).

(Update: and now there is finally evidence of store coalescing, and an explanation of a mechanism that makes sense.)

Footnote 1: An RFO costs shared bandwidth and steals the line from other cores, slowing them down. And you might lose the line again before you get to actually commit into it if you RFO too early. LFBs are also needed for loads, which you don't want to starve (because execution stalls when waiting for load results). Loads are fundamentally different from stores, and generally prioritized.

So waiting at least for the store to graduate is a good plan, and maybe only initiating RFOs for the last few store-buffer entries before the head. (You need to check if L1d already owns the line before starting an RFO, and that takes a cache read port for at least the tags, although not data. I might guess that the store buffer checks 1 entry at a time and marks an entry as likely not needing an RFO.) Also note that 1 SB entry could be a misaligned cache-split store and touch 2 cache lines, requiring up to 2 RFOs...
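
For example (a hypothetical illustration, not taken from the answer), a single 8-byte store placed 4 bytes before a line boundary covers the last bytes of one line and the first bytes of the next:

#include <cstdint>
#include <cstring>

alignas(64) char buf[128];  // two consecutive 64-byte cache lines

// One store-buffer entry for this store covers bytes 60..67 of buf: the last
// 4 bytes of the first line and the first 4 bytes of the second line, so
// committing it may require owning (RFO-ing) both lines.
void split_store(uint64_t v) {
    std::memcpy(buf + 60, &v, sizeof v);  // typically compiles to one unaligned 8-byte mov
}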

Footnote 2: Store buffer entries are allocated in program order (at the tail of the buffer), as instructions / uops are issued into the out-of-order back end and have back-end resources allocated for them. (e.g. a physical register for uops that write a register, a branch-order-buffer entry for conditional branch uops that might mispredict.) See also Size of store buffers on Intel hardware? What exactly is a store buffer?. In-order alloc and commit guarantee program-order visibility of stores. The store buffer insulates globally-visible commit from out-of-order speculative execution of store-address and store-data uops (which write store-buffer entries), and decouples execution in general from waiting for cache-miss stores, until the store buffer fills up.

PS Intel calls the store buffer + load buffers collectively the memory order buffer (MOB), because they need to know about each other to track speculative early loads. This isn't relevant to your question, only for the case of speculative early loads and detecting memory-order mis-speculation and nuking the pipeline.

For retired store instructions (more specifically their "graduated" store buffer entries), it is just the store buffer that has to commit to L1d in program order.
