How do the store buffer and Line Fill Buffer interact with each other?


I was reading the MDS attack paper RIDL: Rogue In-Flight Data Load. They discuss how the Line Fill Buffer can cause leakage of data. There is the About the RIDL vulnerabilities and the "replaying" of loads question that discusses the micro-architectural details of the exploit.

One thing that isn't clear to me after reading that question is why we need a Line Fill Buffer if we already have a store buffer.

John McCalpin discusses how the store buffer and Line Fill Buffer are connected in How does WC-buffer relate to LFB? on the Intel forums, but that doesn't really make things clearer to me.

> For stores to WB space, the store data stays in the store buffer until after the retirement of the stores. Once retired, data can written to the L1 Data Cache (if the line is present and has write permission), otherwise an LFB is allocated for the store miss. The LFB will eventually receive the "current" copy of the cache line so that it can be installed in the L1 Data Cache and the store data can be written to the cache. Details of merging, buffering, ordering, and "short cuts" are unclear.... One interpretation that is reasonably consistent with the above would be that the LFBs serve as the cacheline-sized buffers in which store data is merged before being sent to the L1 Data Cache. At least I think that makes sense, but I am probably forgetting something....

I've just recently started reading up on out-of-order execution so please excuse my ignorance. Here is my idea of how a store would pass through the store buffer and Line Fill Buffer.

  1. A store instruction gets scheduled in the front-end.
  2. It executes in the store unit.
  3. The store request is put in the store buffer (an address and the data)
  4. An invalidate read request is sent from the store buffer to the cache system
  5. If it misses the L1d cache, then the request is put in the Line Fill Buffer
  6. The Line Fill Buffer forwards the invalidate read request to L2
  7. Some cache receives the invalidate read and sends its cache line
  8. The store buffer applies its value to the incoming cache line
  9. Uh? The Line Fill Buffer marks the entry as invalid


Questions

  1. Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests?
  2. Is the ordering of events correct in my description?

Solution

Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests?

The store buffer is used to track stores, in order, both before they retire and after they retire but before they commit to the L1 cache2. The store buffer conceptually is a totally local thing which doesn't really care about cache misses. The store buffer deals in "units" of individual stores of various sizes. Chips like Intel Skylake have store buffers of 50+ entries.

The line fill buffers primarily deal with both loads and stores that miss in the L1 cache. Essentially, it is the path from the L1 cache to the rest of the memory subsystem and deals in cache line sized units. We don't expect the LFB to get involved if the load or store hits in the L1 cache1. Intel chips like Skylake have many fewer LFB entries, probably 10 to 12.
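The granularity contrast can be sketched as a toy model. The field names, class names, and exact sizes below are illustrative (Skylake-like figures from the text), not an authoritative description of the hardware:

```python
from dataclasses import dataclass, field

CACHE_LINE = 64  # bytes

@dataclass
class StoreBufferEntry:
    """One entry per individual store: its own address and size."""
    addr: int
    size: int          # 1, 2, 4, 8, ... bytes -- whatever the store wrote
    data: bytes
    senior: bool = False  # True once the store has retired

@dataclass
class LineFillBufferEntry:
    """One entry per in-flight cache-line miss (load or store)."""
    line_addr: int     # aligned to CACHE_LINE
    data: bytearray = field(default_factory=lambda: bytearray(CACHE_LINE))

STORE_BUFFER_ENTRIES = 56  # Skylake-class store buffer: 50+ entries
LFB_ENTRIES = 12           # far fewer LFBs, ~10 to 12

# A 4-byte store occupies one store-buffer entry; if it misses,
# one LFB entry is allocated for the whole 64-byte line it falls in.
sb = [StoreBufferEntry(addr=0x1004, size=4, data=b"\xde\xad\xbe\xef")]
lfbs = [LineFillBufferEntry(line_addr=0x1004 & ~(CACHE_LINE - 1))]
```

Note that many stores to the same line would each take a store-buffer entry but share a single LFB entry, which is one reason the two structures can have such different sizes.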

Is the ordering of events correct in my description?

Pretty close. Here's how I'd change your list:

  1. A store instruction gets decoded and split into store-data and store-address uops, which are renamed, scheduled and have a store buffer entry allocated for them.
  2. The store uops execute in any order or simultaneously (the two sub-items can execute in either order depending mostly on which has its dependencies satisfied first).

    1. The store data uop writes the store data into the store buffer.
    2. The store address uop does the V-P translation and writes the address(es) into the store buffer.

  3. At some point when all older instructions have retired, the store instruction retires. This means that the instruction is no longer speculative and the results can be made visible. At this point, the store remains in the store buffer and is called a senior store.
  4. The store now waits until it is at the head of the store buffer (it is the oldest not committed store), at which point it will commit (become globally observable) into the L1, if the associated cache line is present in the L1 in MESIF Modified or Exclusive state. (i.e. this core owns the line)
  5. If the line is not present in the required state (either missing entirely, i.e., a cache miss, or present but in a non-exclusive state), permission to modify the line and (sometimes) the line data must be obtained from the memory subsystem: this allocates an LFB for the entire line, if one is not already allocated. This is a so-called request for ownership (RFO), which means that the memory hierarchy should return the line in an exclusive state suitable for modification, as opposed to a shared state suitable only for reading (this invalidates copies of the line present in any other private caches).

An RFO to convert Shared to Exclusive still has to wait for a response to make sure all other caches have invalidated their copies. The response to such an invalidate doesn't need to include a copy of the data because this cache already has one. It can still be called an RFO; the important part is gaining ownership before modifying a line.

  6. In the miss scenario the LFB eventually comes back with the full contents of the line, which is committed to the L1 and the pending store can now commit3.
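Steps 4 through 6 can be sketched as a toy simulation. Everything here (the dict-based L1 keyed by line address, string MESIF states, the function names) is invented for illustration; real hardware is far more complex:

```python
CACHE_LINE = 64

def try_commit_head_store(store_buffer, l1, lfb_requests):
    """Attempt to commit the oldest (head) senior store.

    store_buffer: list of (addr, data) tuples, oldest first
    l1: dict mapping line address -> MESIF state string
    lfb_requests: dict of line address -> in-flight request
    """
    addr, _data = store_buffer[0]
    line = addr & ~(CACHE_LINE - 1)
    state = l1.get(line, "Invalid")
    if state in ("Modified", "Exclusive"):
        # Step 4: this core owns the line -> commit (globally observable).
        l1[line] = "Modified"        # line is dirtied by the store
        store_buffer.pop(0)          # entry freed; data write to L1 elided
        return "committed"
    # Step 5: miss (or Shared) -> allocate an LFB entry and issue an RFO.
    if line not in lfb_requests:
        lfb_requests[line] = "RFO pending"
    return "waiting for RFO"

def rfo_complete(line, l1, lfb_requests):
    # Step 6: the line comes back in Exclusive state and is installed
    # in L1; the LFB entry is freed and the head store can retry.
    l1[line] = "Exclusive"
    del lfb_requests[line]

# Usage: a store to a missing line waits, then commits after the RFO.
l1, lfbs, sb = {}, {}, [(0x2000, b"\x2a")]
try_commit_head_store(sb, l1, lfbs)   # -> "waiting for RFO"
rfo_complete(0x2000, l1, lfbs)
try_commit_head_store(sb, l1, lfbs)   # -> "committed"
```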

This is a rough approximation of the process. Some details may differ on some or all chips, including details which are not well understood.

As one example, in the above order, the store miss lines are not fetched until the store reaches the head of the store queue. In reality, the store subsystem may implement a type of RFO prefetch where the store queue is examined for upcoming stores and if the lines aren't present in L1, a request is started early (the actual visible commit to L1 still has to happen in order, on x86, or at least "as if" in order).
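A hypothetical RFO-prefetch pass over the store queue might look like the following sketch. This is purely illustrative; the actual scan policy, trigger conditions, and LFB allocation rules are undocumented:

```python
CACHE_LINE = 64

def rfo_prefetch(store_queue, owned_lines, lfb_requests, max_lfbs=12):
    """Scan upcoming stores (oldest to youngest) and start RFOs early.

    store_queue: list of store addresses, oldest first
    owned_lines: set of line addresses already in L1 in M/E state
    lfb_requests: dict of line address -> in-flight request
    """
    for addr in store_queue:
        line = addr & ~(CACHE_LINE - 1)
        if line in owned_lines or line in lfb_requests:
            continue                      # already owned or already in flight
        if len(lfb_requests) >= max_lfbs:
            break                         # out of LFBs; stop scanning
        lfb_requests[line] = "RFO pending"  # start the request early

# Two stores to the same line trigger only one early RFO.
lfbs = {}
rfo_prefetch([0x100, 0x140, 0x104], owned_lines=set(), lfb_requests=lfbs)
```

The key point the sketch captures is that prefetching only starts the requests early; the visible commits to L1 still happen in program order, as the paragraph above notes.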

So the request and LFB use may occur as early as when step 3 completes (if RFO prefetch applies only after a store retires), or perhaps even as early as when 2.2 completes, if junior stores are subject to prefetch.

As another example, step 6 describes the line coming back from the memory hierarchy and being committed to the L1, then the store commits. It is possible that the pending store is actually merged instead with the returning data and then that is written to L1. It is also possible that the store can leave the store buffer even in the miss case and simply wait in the LFB, freeing up some store buffer entries.
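The "merged with the returning data" variant amounts to splicing the store's bytes into the cache-line-sized LFB entry before the line is written to L1. A minimal sketch, with all names made up:

```python
CACHE_LINE = 64

def merge_store_into_line(line_data: bytearray, offset: int,
                          store_data: bytes) -> bytearray:
    """Overlay pending store bytes onto a returning cache line."""
    line_data[offset:offset + len(store_data)] = store_data
    return line_data

# A 2-byte store at offset 8 merged into the line the LFB brought back;
# the rest of the line keeps the data returned from memory.
line = bytearray(CACHE_LINE)   # stand-in for the returned line contents
merge_store_into_line(line, 8, b"\xff\xff")
```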


1 In the case of stores that hit in the L1 cache, there is a suggestion that the LFBs are actually involved: that each store actually enters a combining buffer (which may just be an LFB) prior to being committed to the cache, such that a series of stores targeting the same cache line get combined in the cache and only need to access the L1 once. This isn't proven but in any case it is not really part of the main use of LFBs (more obvious from the fact we can't even really tell if it is happening or not).

2 The buffers that hold stores before and after retirement might be two entirely different structures, with different sizes and behaviors, but here we'll refer to them as one structure.

3 The described scenario involves the store that misses waiting at the head of the store buffer until the associated line returns. An alternate scenario is that the store data is written into the LFB used for the request, and the store buffer entry can be freed. This potentially allows some subsequent stores to be processed while the miss is in progress, subject to the strict x86 ordering requirements. This could increase store MLP.
