When does the CPU flush a value in the store buffer to the L1 cache?


Question

Core A writes value x to its store buffer, waits for invalidation acks, and then flushes x to the cache. Does it wait for just one ack or for all acks? And how does it know how many acks there are across all the CPUs?

Answer

It isn't clear to me what you mean by "invalidation ack", but let's assume you mean a snoop/invalidation originating from another core that is requesting ownership of the same line.

In this case, the stores in the store buffer are generally free to ignore such invalidations from other cores, since stores in the store buffer are not yet globally visible. A store only becomes globally visible when it commits to L1 at some point after it has retired. At that point¹ the cache controller will make an RFO (request for ownership) for the associated line if it isn't already in the cache; it is essentially at that point that the store becomes globally visible. The L1 cache controller doesn't need to know how many other invalidations are in flight, because they are mediated by some higher-level component in the system as part of the MESI protocol, and once it has the line in the E state it is guaranteed to be the exclusive owner.
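
As a concrete illustration of "not yet globally visible", here is a minimal store-buffering litmus test (my own example, not part of the original answer): each thread's store can still be sitting in that core's store buffer when its subsequent load executes, so both threads can read the stale value 0, the classic store-to-load reordering that even x86 permits.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Shared variables; both are reset to 0 before every iteration.
std::atomic<int> x{0}, y{0};

int main() {
    const int iterations = 100000;
    int both_zero = 0;

    for (int i = 0; i < iterations; ++i) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);   // may sit in this core's store buffer...
            r1 = y.load(std::memory_order_relaxed);  // ...while this load already executes
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        // r1 == r2 == 0 is only possible because neither store was globally
        // visible (still buffered) when the other thread's load ran.
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    std::printf("both loads saw 0 in %d of %d runs\n", both_zero, iterations);
    return 0;
}
```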

In short, invalidations from other cores have little effect on stores in the store buffer², since those become globally visible at a single point based on an RFO request. It is loads that have already executed that are more likely to be affected by invalidation activity on another core, especially on strongly ordered platforms such as x86 which don't allow visible load-load reordering. The so-called MOB (memory order buffer) on x86, for example, is responsible for tracking whether invalidations potentially break the ordering rules.
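
To make the load side concrete, here is a small message-passing sketch (my example, not from the original answer). On x86 the release store and acquire load compile to ordinary MOV instructions, because the hardware already forbids visible load-load and store-store reordering; the MOB's role is to watch incoming invalidations and re-execute any load that was speculatively performed too early, so the guarantee still holds.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0};
std::atomic<int> flag{0};

int main() {
    std::thread producer([] {
        data.store(42, std::memory_order_relaxed);
        flag.store(1, std::memory_order_release);  // a plain store on x86: no fence needed
    });
    std::thread consumer([] {
        while (flag.load(std::memory_order_acquire) == 0) {  // a plain load on x86
            // spin until the flag is published
        }
        // If the two loads were visibly reordered this could read 0; the
        // acquire/release pairing (enforced on x86 by the MOB) guarantees 42.
        assert(data.load(std::memory_order_relaxed) == 42);
    });
    producer.join();
    consumer.join();
    return 0;
}
```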

Perhaps the "acks" you were talking about are the responses from other cores to the writing core's request to obtain or upgrade its ownership of the line so that it can write to it: i.e., invalidating copies of the line in the other CPUs and so on.

This is commonly known as issuing an RFO which, when successful, leaves the line in the E state in the requesting core.

Most CPUs are layered, with a variety of different agents working together to ensure coherency. In practice, this means that a CPU doesn't need to wait for up to N-1 "acks" from the other N-1 cores on an N-CPU system, but rather just for a single reply from a higher-level component which is itself in charge of sending invalidations to the other CPUs and collecting their responses.

One example could be a single-socket multi-core CPU with private L1 and L2 caches and a shared L3. A core might send its RFO down to the L3, which might send invalidation requests to all cores, wait for their responses, and then acknowledge the RFO to the requesting core. Alternatively, the L3 may store some bits indicating which cores could possibly have a copy of the line, so that it only needs to send requests to those cores (the role the L3 takes on in that case is sometimes referred to as a snoop filter).
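
A toy sketch of that idea (my own illustration under simplifying assumptions, not any vendor's actual design): the shared L3 keeps a per-line bit mask of possible sharers, snoops only those cores on an RFO, and then returns a single acknowledgment granting the requester the line in the Exclusive state, which is why the requesting core never has to count acks itself.

```cpp
#include <bitset>
#include <cstdint>
#include <unordered_map>

constexpr int kNumCores = 8;

struct DirectoryEntry {
    std::bitset<kNumCores> sharers;  // which cores might hold a copy of this line
};

class SnoopFilterL3 {
public:
    // Handle a request-for-ownership from `requester` for cache line `addr`.
    // Returns once the requester is the exclusive owner; this return is the
    // single "ack" the requesting core waits on, no matter how many cores
    // actually had to be snooped.
    void handle_rfo(int requester, std::uint64_t addr) {
        DirectoryEntry& entry = directory_[addr];
        for (int core = 0; core < kNumCores; ++core) {
            if (core != requester && entry.sharers.test(core)) {
                send_invalidate(core, addr);  // snoop only cores that may have the line
                entry.sharers.reset(core);
            }
        }
        entry.sharers.set(requester);  // requester is now the sole (Exclusive) owner
    }

private:
    void send_invalidate(int core, std::uint64_t addr) {
        // In real hardware this is an on-die snoop message; here it is a placeholder.
        (void)core;
        (void)addr;
    }

    std::unordered_map<std::uint64_t, DirectoryEntry> directory_;
};

int main() {
    SnoopFilterL3 l3;
    l3.handle_rfo(/*requester=*/0, /*addr=*/0x1000);  // core 0 gets the line in E state
    l3.handle_rfo(/*requester=*/3, /*addr=*/0x1000);  // core 3's RFO invalidates core 0's copy
    return 0;
}
```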

Since all communication between agents passes through the L3, it is able to keep everything consistent. In the case of a multi-socket system, things get more complicated: the L3 on the local socket may again get the request and may pass it over to the other socket to do the same type of invalidation there. Again, a snoop filter might be involved, or other schemes may exist, and the behavior may even be configurable!

For example, in Intel's Broadwell Xeon architecture, there are four different configurable snoop modes:

Broadwell offers four different snoop modes: a reintroduction of Home Snoop with Directory and Opportunistic Snoop Broadcast (HS with DIR + OSB), previously available on Ivy Bridge, and three snoop modes that were available on Haswell: Early Snoop, Home Snoop, and Cluster on Die Mode (COD). Table 5 maps the memory bandwidth and latency trade-offs that will vary across each of the different modes. Most workloads will find that Home Snoop with Directory and Opportunistic Snoop Broadcast will be the best choice.

... with different performance tradeoffs (see Table 5 in that document).

The rest of that document goes into some detail about how the various modes work.

So I guess the short answer is "it's complicated and depends on the detailed design, and possibly even on user-configurable settings".

¹ Or potentially at some earlier point, since an optimized implementation might "look ahead" in the store buffer and issue RFOs (so-called "RFO prefetches") for upcoming stores even before they become the most senior store.

² Invalidations may, however, complicate the RFO prefetches mentioned in the first footnote, since they mean there is a window during which the line can be "stolen back" by another core, making the RFO prefetch wasted work. A sophisticated implementation might have a predictor that varies RFO prefetch aggressiveness based on monitoring whether this occurs.
