How do the store buffer and Line Fill Buffer interact with each other?


Problem Description


I was reading the MDS attack paper RIDL: Rogue In-Flight Data Load. They discuss how the Line Fill Buffer can cause leakage of data. There is the About the RIDL vulnerabilities and the "replaying" of loads question that discusses the micro-architectural details of the exploit.

One thing that isn't clear to me after reading that question is why we need a Line Fill Buffer if we already have a store buffer.

John McCalpin discusses how the store buffer and Line Fill Buffer are connected in How does WC-buffer relate to LFB? on the Intel forums, but that doesn't really make things clearer to me.

    For stores to WB space, the store data stays in the store buffer until after the retirement of the stores. Once retired, data can be written to the L1 Data Cache (if the line is present and has write permission), otherwise an LFB is allocated for the store miss. The LFB will eventually receive the "current" copy of the cache line so that it can be installed in the L1 Data Cache and the store data can be written to the cache. Details of merging, buffering, ordering, and "short cuts" are unclear.... One interpretation that is reasonably consistent with the above would be that the LFBs serve as the cacheline-sized buffers in which store data is merged before being sent to the L1 Data Cache. At least I think that makes sense, but I am probably forgetting something....

I've just recently started reading up on out-of-order execution, so please excuse my ignorance. Here is my idea of how a store would pass through the store buffer and Line Fill Buffer.

1. A store instruction gets scheduled in the front-end.
2. It executes in the store unit.
3. The store request is put in the store buffer (an address and the data).
4. An invalidate read request is sent from the store buffer to the cache system.
5. If it misses the L1d cache, then the request is put in the Line Fill Buffer.
6. The Line Fill Buffer forwards the invalidate read request to L2.
7. Some cache receives the invalidate read and sends its cache line.
8. The store buffer applies its value to the incoming cache line.
9. Uh? The Line Fill Buffer marks the entry as invalid.


Questions

1. Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests?
2. Is the ordering of events correct in my description?

Solution

Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests?

The store buffer is used to track stores, in order, both before they retire and after they retire but before they commit to the L1 cache [2]. The store buffer conceptually is a totally local thing which doesn't really care about cache misses. The store buffer deals in "units" of individual stores of various sizes. Chips like Intel Skylake have store buffers of 50+ entries.

The line fill buffers primarily deal with both loads and stores that miss in the L1 cache. Essentially, they are the path from the L1 cache to the rest of the memory subsystem, and they deal in cache-line-sized units. We don't expect the LFB to get involved if the load or store hits in the L1 cache [1]. Intel chips like Skylake have many fewer LFB entries, probably 10 to 12.
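The granularity distinction can be sketched with a toy model (a hypothetical Python sketch, not a description of real hardware; the entry counts are rough Skylake-like figures from the text, and `track` is an invented illustrative function): the store buffer takes one entry per individual store in program order, while the LFB pool takes at most one entry per missing 64-byte line, so many stores to the same missing line share a single LFB.

```python
# Toy model of the store-buffer vs. LFB distinction (illustrative only).
LINE = 64            # cache line size in bytes
SB_ENTRIES = 56      # store buffer: one entry per individual store
LFB_ENTRIES = 12     # line fill buffers: one entry per missing cache line

def track(stores, l1_lines):
    """stores: list of (address, size) in program order.
    l1_lines: set of line base addresses already present in L1."""
    store_buffer = []   # units: individual stores, of various sizes
    lfbs = set()        # units: whole cache lines that miss in L1
    for addr, size in stores:
        assert len(store_buffer) < SB_ENTRIES, "store buffer full: stall"
        store_buffer.append((addr, size))
        line = addr // LINE * LINE
        if line not in l1_lines and line not in lfbs:
            assert len(lfbs) < LFB_ENTRIES, "no free LFB: stall"
            lfbs.add(line)  # one LFB covers every store to this missing line
    return len(store_buffer), len(lfbs)

# Eight 8-byte stores into the same missing line: 8 SB entries, only 1 LFB.
sb, lfb = track([(i * 8, 8) for i in range(8)], l1_lines=set())
print(sb, lfb)   # 8 1
```

Stores to eight different missing lines would instead occupy eight LFBs, which is why scattered store misses can run out of LFBs long before the store buffer fills.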

Is the ordering of events correct in my description?

Pretty close. Here's how I'd change your list:

1. A store instruction gets decoded and split into store-data and store-address uops, which are renamed, scheduled and have a store buffer entry allocated for them.
2. The store uops execute in any order or simultaneously (the two sub-items can execute in either order, depending mostly on which has its dependencies satisfied first).

   1. The store-data uop writes the store data into the store buffer.
   2. The store-address uop does the virtual-to-physical translation and writes the address(es) into the store buffer.

3. At some point, when all older instructions have retired, the store instruction retires. This means that the instruction is no longer speculative and the results can be made visible. At this point, the store remains in the store buffer and is called a senior store.
4. The store now waits until it is at the head of the store buffer (i.e., it is the oldest uncommitted store), at which point it will commit (become globally observable) into the L1, if the associated cache line is present in the L1 in the MESIF Modified or Exclusive state (i.e., this core owns the line).
5. If the line is not present in the required state (either missing entirely, i.e., a cache miss, or present but in a non-exclusive state), permission to modify the line, and (sometimes) the line data, must be obtained from the memory subsystem: this allocates an LFB for the entire line, if one is not already allocated. This is a so-called request for ownership (RFO), which means that the memory hierarchy should return the line in an exclusive state suitable for modification, as opposed to a shared state suitable only for reading (this invalidates copies of the line present in any other private caches).

An RFO to convert Shared to Exclusive still has to wait for a response to make sure all other caches have invalidated their copies. The response to such an invalidate doesn't need to include a copy of the data because this cache already has one. It can still be called an RFO; the important part is gaining ownership before modifying a line.

6. In the miss scenario, the LFB eventually comes back with the full contents of the line, which is committed to the L1, and the pending store can now commit [3].
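The commit path in steps 4-6 can be sketched as a small state machine (again a hypothetical Python sketch of the behavior described above, not real hardware; `commit_head` and `rfo_complete` are invented names): the store at the head of the store buffer commits immediately if the line is already in Modified or Exclusive state, and otherwise allocates an LFB, issues an RFO, and commits only after the line returns in Exclusive state.

```python
# Sketch of steps 4-6: committing the senior store at the head of the
# store buffer. Coherence states follow MESIF; only M/E/S/I are used here.
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

def commit_head(store_buffer, l1_state, lfbs):
    """store_buffer: line addresses of senior stores, oldest first.
    l1_state: dict line -> MESIF state. lfbs: set of lines with an LFB."""
    line = store_buffer[0]                 # only the oldest store may commit
    if l1_state.get(line, I) in (M, E):    # step 4: this core owns the line
        store_buffer.pop(0)
        l1_state[line] = M                 # commit: now globally observable
        return "committed"
    if line not in lfbs:                   # step 5: miss, or Shared state
        lfbs.add(line)                     # allocate an LFB and issue an RFO
    return "waiting for RFO"

def rfo_complete(line, l1_state, lfbs):
    """Step 6: the line returns in Exclusive state and the LFB is freed."""
    lfbs.discard(line)
    l1_state[line] = E

sb, l1, lfbs = [0x40], {}, set()
print(commit_head(sb, l1, lfbs))   # waiting for RFO (cold miss)
rfo_complete(0x40, l1, lfbs)
print(commit_head(sb, l1, lfbs))   # committed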

This is a rough approximation of the process. Some details may differ on some or all chips, including details which are not well understood.

As one example, in the above order, the lines for store misses are not fetched until the store reaches the head of the store queue. In reality, the store subsystem may implement a type of RFO prefetch, where the store queue is examined for upcoming stores and, if the lines aren't present in L1, a request is started early (the actual visible commit to L1 still has to happen in order, on x86, or at least "as if" in order).

So the request and LFB use may occur as early as when step 3 completes (if RFO prefetch applies only after a store retires), or perhaps even as early as when step 2.2 completes, if junior stores are subject to prefetch.
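The RFO-prefetch idea can be sketched in the same toy style (hypothetical; the real scan depth and trigger conditions are not publicly documented, and `rfo_prefetch` is an invented name): the store queue is scanned behind the head and RFOs are launched early for any line not already in L1, while commit itself remains strictly in queue order.

```python
# Toy sketch of RFO prefetch: requests may start early and out of order,
# but the visible commit to L1 still happens strictly in store-queue order.
def rfo_prefetch(store_queue, l1_lines, lfbs, scan_depth=4):
    """Launch early RFOs for the first scan_depth queued stores whose
    line is neither in L1 nor already tracked by an LFB.
    Returns the lines newly requested, oldest first."""
    launched = []
    for line in store_queue[:scan_depth]:
        if line not in l1_lines and line not in lfbs:
            lfbs.add(line)   # request starts before the store is senior
            launched.append(line)
    return launched

queue = [0x0, 0x40, 0x40, 0x80]   # queued line addresses, oldest first
lfbs = set()
print(rfo_prefetch(queue, l1_lines={0x80}, lfbs=lfbs))  # [0, 64]
```

Note that the duplicate 0x40 entry launches only one request, and 0x80 launches none because it already hits in L1.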

As another example, step 6 describes the line coming back from the memory hierarchy and being committed to the L1, after which the store commits. It is possible that the pending store is instead merged with the returning data, and then that result is written to L1. It is also possible that the store can leave the store buffer even in the miss case and simply wait in the LFB, freeing up some store buffer entries.
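The merge variant mentioned above is just a byte-level splice, which can be made concrete (a hypothetical sketch; `merge_store_into_line` is an invented helper, not a real hardware interface): the pending store's bytes overwrite their offsets within the returning line before the result is installed in L1.

```python
# Toy illustration of merging a pending store with the returning line.
LINE = 64

def merge_store_into_line(line_data, offset, store_bytes):
    """Combine the pending store's bytes with the cache line returning
    from the memory hierarchy; the merged line is what gets written to L1."""
    assert offset + len(store_bytes) <= LINE
    merged = bytearray(line_data)
    merged[offset:offset + len(store_bytes)] = store_bytes
    return bytes(merged)

# A 4-byte store at offset 8 merged into a line of zeros returning from L2.
line = merge_store_into_line(bytes(LINE), 8, b"\xde\xad\xbe\xef")
print(line[8:12])   # b'\xde\xad\xbe\xef'
print(line[0:8])    # b'\x00\x00\x00\x00\x00\x00\x00\x00'
```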


[1] In the case of stores that hit in the L1 cache, there is a suggestion that the LFBs are actually involved: that each store actually enters a combining buffer (which may just be an LFB) prior to being committed to the cache, such that a series of stores targeting the same cache line get combined and only need to access the L1 once. This isn't proven, but in any case it is not really part of the main use of LFBs (made more obvious by the fact that we can't even really tell whether it is happening or not).

[2] The buffers that hold stores before and after retirement might be two entirely different structures, with different sizes and behaviors, but here we'll refer to them as one structure.

[3] The described scenario involves the store that misses waiting at the head of the store buffer until the associated line returns. An alternate scenario is that the store data is written into the LFB used for the request, and the store buffer entry can be freed. This potentially allows some subsequent stores to be processed while the miss is in progress, subject to the strict x86 ordering requirements. This could increase store MLP (memory-level parallelism).
