Where is the Write-Combining Buffer located? x86

Question

How is the Write-Combine buffer physically hooked up? I have seen block diagrams illustrating a number of variants:


  • Between the L1 cache and the memory controller

  • Between the CPU's store buffer and the memory controller

  • Between the CPU's AGUs and/or store units

Is this microarchitecture dependent?

Answer

Write buffers can have different purposes or different uses in different processors. This answer may not apply to processors not specifically mentioned. I'd like to emphasize that the term "write buffer" may mean different things in different contexts. This answer is about Intel and AMD processors only.

Each cache might be accompanied with zero or more line fill buffers (also called fill buffers). The collection of fill buffers at L2 is called the super queue or superqueue (each entry in the super queue is a fill buffer). If the cache is shared between logical cores or physical cores, then the associated fill buffers are shared as well between the cores. Each fill buffer can hold a single cache line and additional information that describes the cache line (if it's occupied), including the address of the cache line, the memory type, and a set of validity bits where the number of bits depends on the granularity of tracking the individual bytes of the cache line. In early processors (such as the Pentium II), only one of the fill buffers is capable of write-combining (and write-collapsing). The total number of line buffers and those capable of write-combining has increased steadily with newer processors.
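
To make this concrete, here is a purely conceptual C sketch of the kind of state an LFB entry tracks; the field names, widths, and layout are my own illustration based on the description above, not actual hardware:

    #include <stdint.h>

    /* Purely conceptual sketch of an LFB entry -- not real hardware state.
     * Field names and widths are illustrative assumptions based on the
     * description above (64-byte line, per-byte validity tracking). */
    struct lfb_entry {
        uint64_t line_address;  /* address of the cache line being tracked       */
        uint8_t  memory_type;   /* effective memory type, e.g. WB, WC, UC        */
        uint64_t valid_bits;    /* one bit per byte of the 64-byte line          */
        uint8_t  data[64];      /* line data as it arrives or is write-combined  */
        uint8_t  request_type;  /* prefetch, demand load, speculative/demand RFO */
        uint8_t  occupied;      /* whether this fill buffer is currently in use  */
    };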

Nehalem up to Broadwell include 10 fill buffers at each L1 data cache. Core and Core2 have 8 LFBs per physical core. According to this, there are 12 LFBs on Skylake. @BeeOnRope has observed that there are 20 LFBs on Cannon Lake. I could not find a clear statement in the manual that says LFBs are the same as WCBs on all of these microarchitectures. However, this article written by a person from Intel says:


Consult the Intel® 64 and IA-32 Architectures Optimization Reference Manual for the number of fill buffers in a particular processor; typically the number is 8 to 10. Note that sometimes these are also referred to as "Write Combining Buffers", since on some older processors only streaming stores were supported.

I think the term LFB was first introduced by Intel with the Intel Core microarchitecture, on which all of the 8 LFBs are WCBs as well. Basically, Intel sneakily renamed WCBs to LFBs at that time, but did not clarify this in their manuals since then.

That same quote also says that the term WCB was used on older processors because streaming loads were not supported on them. This could be interpreted as meaning that the LFBs are also used by streaming load requests (MOVNTDQA). However, Section 12.10.3 says that streaming loads fetch the target line into buffers called streaming load buffers, which are apparently physically different from the LFBs/WCBs.
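
For reference, a streaming load is issued with the MOVNTDQA intrinsic. The following is a minimal C sketch; it assumes src points to 16-byte-aligned, WC-mapped memory (setting up a WC mapping requires OS/driver support and is outside the scope of this sketch):

    #include <immintrin.h>  /* SSE4.1 intrinsics */

    /* Illustrative only: copy one 64-byte line from a (hypothetically) WC-mapped
     * buffer using streaming loads (MOVNTDQA). On WC memory these go through the
     * streaming-load buffers instead of being cached; on WB memory they behave
     * like ordinary loads. Both pointers are assumed to be 16-byte aligned. */
    void copy_line_streaming(const void *src, void *dst)
    {
        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;
        for (int i = 0; i < 4; i++) {
            __m128i v = _mm_stream_load_si128((__m128i *)&s[i]);  /* MOVNTDQA */
            _mm_store_si128(&d[i], v);
        }
    }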

A line fill buffer is used in the following situations:

(1) A fill buffer is allocated on a load miss (demand or prefetch) in the cache. If no fill buffer is available, load requests keep piling up in the load buffers, which may eventually lead to stalling the issue stage. In the case of a load request, the allocated fill buffer is used to temporarily hold requested lines from lower levels of the memory hierarchy until they can be written to the cache data array. But the requested part of the cache line can still be provided to the destination register even if the line has not yet been written to the cache data array. According to Patrick Fay (Intel):


If you search for 'fill buffer' in the PDF you can see that the Line fill buffer (LFB) is allocated after an L1D miss. The LFB holds the data as it comes in to satisfy the L1D miss, but before all the data is ready to be written to the L1D cache.

(2) A fill buffer is allocated on a cacheable store to the L1 cache when the target line is not in a coherence state that allows modifications. My understanding is that for cacheable stores, only the RFO request is held in the LFB, but the data to be stored waits in the store buffer until the target line is fetched into the LFB entry allocated for it. This is supported by the following statement from Section 2.4.5.2 of the Intel optimization manual:


The L1 DCache can maintain up to 64 load micro-ops from allocation until retirement. It can maintain up to 36 store operations from allocation until the store value is committed to the cache, or written to the line fill buffers (LFB) in the case of non-temporal stores.

This suggests that cacheable stores are not committed to the LFB if the target line is not in the L1D. In other words, the store has to wait in the store buffer until either the target line is written into the LFB, and then the line is modified in the LFB, or the target line is written into the L1D, and then the line is modified in the L1D.

(3) A fill buffer is allocated on an uncacheable write-combining store in the L1 cache, irrespective of whether the line is in the cache or its coherence state. WC stores to the same cache line can be combined and collapsed in a single LFB (multiple writes to the same location in the same line will make the last store in program order overwrite previous stores before they become globally observable). Ordering is not maintained among the requests currently allocated in LFBs. So if there are two WCBs in use, there is no guarantee which will be evicted first, irrespective of the order of the stores in program order. That's why WC stores may become globally observable out of order even if all stores are retired/committed in order (although the WC protocol allows WC stores to be committed out of order). In addition, WCBs are not snooped and so they only become globally observable when they reach the memory controller. More information can be found in Section 11.3.1 of the Intel manual V3.
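
Because of this lack of ordering, code that uses non-temporal (WC) stores normally ends with SFENCE before handing the buffer to another agent. A minimal sketch, assuming dst is 16-byte aligned and bytes is a multiple of 16:

    #include <immintrin.h>
    #include <stddef.h>

    /* Fill a buffer with non-temporal stores (MOVNTDQ). Stores to the same
     * 64-byte line may be combined in a WCB/LFB and may become globally
     * observable out of order; the trailing SFENCE orders them with respect
     * to later stores. Assumes dst is 16-byte aligned and bytes % 16 == 0. */
    void fill_nt(void *dst, __m128i value, size_t bytes)
    {
        __m128i *d = (__m128i *)dst;
        for (size_t i = 0; i < bytes / sizeof(__m128i); i++)
            _mm_stream_si128(&d[i], value);  /* non-temporal store */
        _mm_sfence();                        /* drain/order the WC buffers */
    }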

There are some AMD processors that use buffers that are separate from the fill buffers for non-temporal stores. There were also a number of WCB buffers in the P6 (the first to implement WCBs) and the P4 dedicated to the WC memory type (they cannot be used for other memory types). On the early versions of the P4, there were 4 such buffers. For the P4 versions that support hyperthreading, when hyperthreading is enabled and both logical cores are running, the WCBs are statically partitioned between the two logical cores. Modern Intel microarchitectures, however, competitively share all the LFBs, but I think they keep at least one available for each logical core to prevent starvation.

(4) The documentation of L1D_PEND_MISS.FB_FULL indicates that UC stores are allocated in the same LFBs (irrespective of whether the line is in the cache or its coherence state). Like cacheable stores, but unlike WC, UC stores are not combined in the LFBs.

(5) I've experimentally observed that requests from IN and OUT instructions are also allocated in LFBs. For more information, see: How do Intel CPUs that use the ring bus topology decode and handle port I/O operations.
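
For completeness, this is what the port I/O in that experiment boils down to; a sketch only, since IN/OUT require ring 0 or an ioperm/iopl grant on Linux, and port 0x80 (the traditional POST/debug port) is used here purely as an example:

    #include <stdint.h>

    /* x86 port I/O helpers (GCC/Clang inline asm). Executing these from user
     * space without ioperm/iopl raises #GP; shown only to illustrate what an
     * IN/OUT request looks like at the instruction level. */
    static inline void outb(uint16_t port, uint8_t val)
    {
        __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
    }

    static inline uint8_t inb(uint16_t port)
    {
        uint8_t val;
        __asm__ volatile ("inb %1, %0" : "=a"(val) : "Nd"(port));
        return val;
    }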

Additional information:

The fill buffers are managed by the cache controller, which is connected to other cache controllers at other levels (or the memory controller in the case of the LLC). A fill buffer is not allocated when a request hits in the cache. So a store request that hits in the cache is performed directly in the cache and a load request that hits in the cache is directly serviced from the cache. A fill buffer is not allocated when a line is evicted from the cache. Evicted lines are written to their own buffers (called writeback buffers or eviction buffers). Here is a patent from Intel that discusses write combining for I/O writes.

I've run an experiment that is very similar to the one I've described here to determine whether a single LFB is allocated even if there are multiple loads to the same line. It turns out that this is indeed the case. The first load to a line that misses in the write-back L1D cache gets an LFB allocated for it. All later loads to the same cache line are blocked and a block code is written in their corresponding load buffer entries to indicate that they are waiting on the same request being held in that LFB. When the data arrives, the L1D cache sends a wake-up signal to the load buffer and all entries that are waiting on that line are woken up (unblocked) and scheduled to be issued to the L1D cache when at least one load port is available. Obviously the memory scheduler has to choose between the unblocked loads and the loads that have just been dispatched from the RS. If the line gets evicted for whatever reason before all waiting loads get the chance to be serviced, then they will be blocked again and an LFB will again be allocated for that line. I've not tested the store case, but I think no matter what the operation is, a single LFB is allocated for a line. The request type in the LFB can be promoted from prefetch to demand load to speculative RFO to demand RFO when required. I also found out empirically that speculative requests that were issued from uops on a mispredicted path are not removed when flushing the pipeline. They might be demoted to prefetch requests. I'm not sure.
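
The experiment was roughly of the following shape (a simplified sketch, not the exact code: flush a line, issue several independent loads to it back to back, and time them; in practice you would pin the thread, repeat many times, and ideally watch a fill-buffer-related performance counter such as the L1D_PEND_MISS.FB_FULL event mentioned above):

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
    #include <x86intrin.h>   /* __rdtscp */
    #include <stdint.h>
    #include <stdio.h>

    static char line[64] __attribute__((aligned(64)));

    int main(void)
    {
        unsigned aux;
        volatile char sink;

        _mm_clflush(line);   /* make sure the line misses in L1D */
        _mm_mfence();

        uint64_t t0 = __rdtscp(&aux);
        /* Several loads to the same line: only the first should allocate an
         * LFB; the later ones block on that LFB and are woken up when the
         * data arrives. */
        sink = line[0];
        sink = line[8];
        sink = line[16];
        sink = line[24];
        uint64_t t1 = __rdtscp(&aux);

        (void)sink;
        printf("cycles: %llu\n", (unsigned long long)(t1 - t0));
        return 0;
    }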

I mentioned earlier, based on an article, that some AMD processors use buffers for non-temporal stores that are separate from the fill buffers. I quote from the article:


On the older AMD processors (K8 and Family 10h), non-temporal stores used a set of four "write-combining registers" that were independent of the eight buffers used for L1 data cache misses.

The "on the older AMD processors" part got me curious. Did this change on newer AMD processors? It seems to me that this is still true on all newer AMD processors including the most recent Family 17h Processors (Zen). The WikiChip article on the Zen mircoarchitecture includes two figures that mention WC buffers: this and this. In the first figure, it's not clear how the WCBs are used. However, in the second one it's clear that the WCBs shown are indeed specifically used for NT writes (there is no connection between the WCBs and the L1 data cache). The source for the second figure seems to be these slides1. I think that the first figure was made by WikiChip (which explains why the WCBs were placed in an indeterminate position). In fact, the WikiChip article does not say anything about the WCBs. But still, we can confirm that the WCBs shown are only used for NT writes by looking at Figure 7 from the Software Optimization Guide for AMD Family 17h Processors manual and the patent for the load and store queues for the Family 17h processors. The AMD optimization manual states that there are 4 WCBs per core in modern AMD processors. I think this applies to the K8 and all later processors. Unfortunately, nothing is said about the AMD buffers that play the role of Intel fill buffers.

[1] Michael Clark, A New, High Performance x86 Core Design from AMD, 2016.
