The ordering of L1 cache controller to process memory requests from CPU


Problem description

Under the total store order (TSO) memory consistency model, an x86 CPU has a write buffer to buffer write requests, and can serve reordered read requests from the write buffer. It says that the write requests in the write buffer will exit and be issued toward the cache hierarchy in FIFO order, which is the same as program order.

I am curious about:

To serve the write requests issued from the write buffer, does the L1 cache controller handle the write requests, finish the cache coherence of the write requests, and insert data into the L1 cache in the same order as the issue order?

Solution

Your terminology is unusual. You say "finish the cache coherence"; what actually happens is that the core has to get (exclusive) ownership of the cache line before it can modify it. At the instant/cycle that modification happens, it becomes part of the view of memory contents shared by all participants in the cache-coherency protocol.

So yes, you do "finish the cache coherence" = get exclusive ownership before the store can even enter cache and become globally visible = available for requests to share that cache line. The cache always maintains coherence (that's the point of MESI); it doesn't get out of sync and then wait for coherence. I think your confusion stems from your mental model not matching that reality.

(Weakly-ordered architectures have mind-bending possibilities like not all cores seeing the stores from two other cores in the same order; that can happen by private store-forwarding between SMT threads on one physical core letting another logical core see a store ahead of commit to L1d = global visibility.)
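
To make that concrete, here is a sketch of the classic IRIW (independent reads of independent writes) litmus test in C++; the thread functions and variable names are mine, purely for illustration. With relaxed atomics, a weakly-ordered machine such as POWER may let the two readers disagree about which store happened first; x86-TSO forbids that outcome.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer_x() { x.store(1, std::memory_order_relaxed); }
void writer_y() { y.store(1, std::memory_order_relaxed); }

void reader_xy() {
    r1 = x.load(std::memory_order_relaxed);
    r2 = y.load(std::memory_order_relaxed);
}

void reader_yx() {
    r3 = y.load(std::memory_order_relaxed);
    r4 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(writer_x), b(writer_y), c(reader_xy), d(reader_yx);
    a.join(); b.join(); c.join(); d.join();
    // Outcome r1==1, r2==0, r3==1, r4==0 means the two readers disagree
    // about which store happened first. Allowed on e.g. POWER; forbidden
    // under x86-TSO, where stores become visible in one total order.
}
```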


I think you know some of this, but let me start from the basics.

L1 cache in each core participates in the cache-coherency protocol that keeps its cache coherent with the other caches in the coherency domain (e.g. L2 and L3, and L1 in other cores, but not video-RAM caches inside the GPU).

Loads become globally visible at the instant their data is read from L1 cache (or from the store buffer or from uncacheable RAM or MMIO). MFENCE can force them to wait for earlier stores to become globally visible before sampling L1, to avoid StoreLoad reordering.
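
For example, the classic store-buffering (Dekker-style) litmus test shows why that matters; this is a sketch with illustrative names, not anything from the original question:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_release);   // plain mov on x86
    r1 = y.load(std::memory_order_acquire);  // may sample L1d before the
}                                            // store above leaves the store buffer

void thread2() {
    y.store(1, std::memory_order_release);
    r2 = x.load(std::memory_order_acquire);
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join(); b.join();
    // r1 == 0 && r2 == 0 is allowed: each load was served while both stores
    // were still sitting in store buffers. That is exactly the StoreLoad
    // reordering TSO permits. Making the operations memory_order_seq_cst
    // adds a full barrier on x86 (MFENCE or a locked instruction), which
    // rules out the 0,0 outcome.
}
```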

Stores become globally visible at the instant their data is committed to L1 cache. The conditions required before this can happen are (there's a small code sketch after the list):

  • It's finished executing: the data+address is in a store buffer entry. (i.e. the store-address and store-data uops executed on the appropriate ports once their inputs were ready, writing the address and data into the store buffer, aka Memory Order Buffer on Intel CPUs).

  • It's retired from the out-of-order part of the core, and thus known to be non-speculative. Before retirement, we don't know that it and all preceding instructions won't fault, or that it wasn't in the shadow of a branch mispredict or other mis-speculation.

    Retirement can only happen after it's finished executing, but isn't tied to commitment to L1d. The store buffer can continue to track a non-speculative store that will definitely happen eventually even after the ROB (out-of-order execution ReOrder Buffer) has forgotten about the store instruction.

  • All preceding loads/stores/fences are already globally visible (because of x86's memory ordering rules). This excludes weakly-ordered ops (NT stores); other loads/stores can pass them.

  • The cache line is in the Exclusive or Modified state of the MESI/MESIF/MOESI cache-coherence protocol, in the L1d cache of the current core. This can take a long time if the RFO (read for ownership) encounters a cache miss in outer levels of cache, or contention with other cores that also want exclusive access to write, or atomically RMW, a cache line.
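
Putting the list together, here is a minimal sketch of the commit check; the struct and all names are invented for illustration and are not Intel's actual logic:

```cpp
// Illustrative store-buffer entry; every name here is made up for the sketch.
struct StoreBufferEntry {
    bool executed;  // store-address + store-data uops have both run
    bool retired;   // past the ROB, so known to be non-speculative
    // address, data, size ... omitted
};

// A store may commit to L1d only when it has executed and retired, is the
// oldest entry (x86 commits in program order, so all earlier stores are
// already globally visible), and the line is held in Exclusive or Modified
// state in this core's L1d.
bool can_commit(const StoreBufferEntry& e, bool is_oldest_entry,
                bool line_owned_exclusively) {
    return e.executed && e.retired && is_oldest_entry && line_owned_exclusively;
}
```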

See Wikipedia's MESI article for diagrams of allowed state transitions, and details. The key point is that coherency is achieved by only allowing a core to modify its copy of a cache line when it's sure that no other caches contain that line, so that it's impossible for two conflicting copies of the same line to exist.
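
As a toy illustration of that invariant (my own sketch, not the article's state machine or real hardware logic): writes are only legal in Exclusive or Modified state, and snoops force downgrades before another core can proceed.

```cpp
// Toy model of MESI states for a single cache line (illustrative only).
enum class Mesi { Modified, Exclusive, Shared, Invalid };

// The invariant: a core may write its copy only when no other cache has one.
bool can_write(Mesi s) {
    return s == Mesi::Modified || s == Mesi::Exclusive;
}

// Writing from Shared/Invalid first requires an RFO (read-for-ownership),
// which invalidates every other cache's copy, leaving us Exclusive.
Mesi local_write(Mesi s) {
    if (!can_write(s)) s = Mesi::Exclusive;  // after the RFO completes
    return Mesi::Modified;                   // the write dirties the line
}

// Another core's read snoop: downgrade to Shared (writing back if dirty).
Mesi snoop_read(Mesi s) {
    return s == Mesi::Invalid ? Mesi::Invalid : Mesi::Shared;
}

// Another core's RFO snoop: our copy must be invalidated.
Mesi snoop_rfo(Mesi) { return Mesi::Invalid; }
```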

Intel CPUs actually use MESIF, while AMD CPUs use MOESI, to allow cache->cache transfer of dirty data instead of write-back to a shared outer cache like the basic MESI protocol requires.

Also note that modern Intel designs (before Skylake-AVX512) use a large shared inclusive L3 cache as a backstop for cache coherency, so snoop requests don't actually have to be broadcast to all cores; they just check L3 tags (which contain extra metadata to track which core is caching what. Intel's L3 is tag-inclusive even for lines that inner caches have in Exclusive or Modified state and are thus Invalid in L3; see this paper for more details of a simplified version of what Intel does).

Also related: I wrote an answer recently about why we have small/fast L1 + larger L2/L3, instead of one big cache, including some links to other cache-related stuff.


Back to the actual question:

Yes, stores are committed to L1 in program order, because that's the order that x86 requires them to become globally visible. L1-commit order is the same thing as global-visibility order.
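
This is why a flag-publishing (message-passing) pattern needs no barrier instructions at all on x86; a hedged C++ sketch, with illustrative names:

```cpp
#include <atomic>
#include <thread>

int payload;                      // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // first store
    ready.store(true, std::memory_order_release);  // second store: plain mov on x86
    // x86 commits these two stores to L1d in program order, so no core can
    // observe ready == true while payload still holds its old value.
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // plain mov on x86
    // Guaranteed to read payload == 42 here.
}

int main() {
    std::thread c(consumer), p(producer);
    p.join(); c.join();
}
```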

Instead of "finish the cache coherence", instead you should say "get ownership of the cache line". This involves communicating with other caches using the cache coherency protocol, so I guess you probably meant "finish getting exclusive ownership using the cache coherency protocl".

The memory ordering part of the MESI wiki article points out that buffering stores in a store queue is separate from out-of-order execution in general.

The store buffer decouples commit to L1d from OoO exec retirement. This can potentially hide a lot more store latency than the regular out-of-order window size. However, retired stores must eventually happen (in the right order) even if an interrupt arrives, so allowing lots of retired but not committed stores can increase interrupt latency.

The store buffer tries to commit retired stores to L1d as quickly as it can, but it's restricted by the memory ordering rules. (i.e. other cores will see stores soon; you don't need a fence to flush the store buffer unless you need the current thread to wait for that to happen before a later load in this thread. e.g. for sequentially-consistent stores.)
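
For instance, compare a sequentially-consistent store with a release store (a sketch with illustrative names; the exact instruction choice is up to the compiler):

```cpp
#include <atomic>

std::atomic<int> flag{0}, other{0};

int seq_cst_version() {
    // seq_cst store: compilers typically emit `xchg` (or `mov` + `mfence`)
    // on x86, draining the store buffer before any later load samples L1d.
    flag.store(1, std::memory_order_seq_cst);
    return other.load(std::memory_order_seq_cst);  // plain mov on x86
}

int release_version() {
    // Release store: just a plain `mov`; the store buffer is not flushed,
    // so this later load can be served before the store is globally visible.
    flag.store(1, std::memory_order_release);
    return other.load(std::memory_order_acquire);
}
```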

On a weakly-ordered ISA, later stores can commit to L1d while an earlier store is still waiting for a cache miss. (But you'd still need a memory order buffer to preserve the illusion of a single core running instructions in program order.)

The store buffer can have multiple cache misses in flight at once, because even on strongly-ordered x86 it can send an RFO for a cache line before that store is the oldest one in the buffer.
