Size of store buffers on Intel hardware? What exactly is a store buffer?


Question


The Intel optimization manual talks about the number of store buffers that exist in many parts of the processor, but does not seem to talk about the size of the store buffers. Is this public information, or is the size of a store buffer kept as a microarchitectural detail?

The processors I am looking into are primarily Broadwell and Skylake, but information about others would be nice as well.

Also, what do store buffers do, exactly?

Solution

Related: what is a store buffer? and a basic / beginner-friendly intro to the concept of buffers in can a speculatively executed cpu branch contain opcodes that access RAM?

Also How do the store buffer and Line Fill Buffer interact with each other? has a good description of the steps in executing a store instruction and how it eventually commits to L1d cache.


The store buffer as a whole is composed of multiple entries.

Each core has its own store buffer1 to decouple execution and retirement from commit into L1d cache. Even an in-order CPU benefits from a store buffer to avoid stalling on cache-miss stores, because unlike loads they just have to become visible eventually. (No practical CPUs use a sequential-consistency memory model, so at least StoreLoad reordering is allowed, even in x86 and SPARC-TSO).

For speculative / out-of-order CPUs, it also makes it possible to roll back a store after detecting an exception or other mis-speculation in an older instruction, without speculative stores ever being globally visible. This is obviously essential for correctness! (You can't roll back other cores, so you can't let them see your store data until it's known to be non-speculative.)


When both logical cores are active (hyperthreading), Intel partitions the store buffer in two; each logical core gets half. Loads from one logical core only snoop its own half of the store buffer2. See What will be used for data exchange between threads are executing on one Core with HT?

The store buffer commits data from retired store instructions into L1d as fast as it can, in program order (to respect x86's strongly-ordered memory model3). Requiring stores to commit as they retire would unnecessarily stall retirement for cache-miss stores. Retired stores still in the store buffer are definitely going to happen and can't be rolled back, so they can actually hurt interrupt latency. (Interrupts aren't technically required to be serializing, but any stores done by an IRQ handler can't become visible until after existing pending stores are drained. And iret is serializing, so even in the best case the store buffer drains before returning.)

It's a common(?) misconception that the store buffer has to be explicitly flushed for data to become visible to other threads. Memory barriers don't cause the store buffer to be flushed; full barriers make the current core wait until the store buffer drains itself before allowing any later loads to happen (i.e. read L1d). Atomic RMW operations have to wait for the store buffer to drain before they can lock a cache line and do both their load and store to that line without allowing it to leave MESI Modified state, thus stopping any other agent in the system from observing it during the atomic operation.

To implement x86's strongly ordered memory model while still microarchitecturally allowing early / out-of-order loads (and later checking if the data is still valid when the load is architecturally allowed to happen), load buffer + store buffer entries collectively form the Memory Order Buffer (MOB). (If a cache line isn't still present when the load was allowed to happen, that's a memory-order mis-speculation.) This structure is presumably where mfence and locked instructions can put a barrier that blocks StoreLoad reordering without blocking out-of-order execution. (Although mfence on Skylake does block OoO exec of independent ALU instructions, as an implementation detail.)

movnt cache-bypassing stores (like movntps) also go through the store buffer, so they can be treated as speculative just like everything else in an OoO exec CPU. But they commit directly to an LFB (Line Fill Buffer), aka write-combining buffer, instead of to L1d cache.


Store instructions on Intel CPUs decode to store-address and store-data uops (micro-fused into one fused-domain uop). The store-address uop just writes the address (and probably the store width) into the store buffer, so later loads can set up store->load forwarding or detect that they don't overlap. The store-data uop writes the data.

Store-address and store-data can execute in either order, whichever is ready first: the allocate/rename stage that writes uops from the front-end into the ROB and RS in the back end also allocates a load-buffer or store-buffer entry for load or store uops at issue time, or stalls until one is available. Since allocation and commit happen in-order, that probably means older/younger is easy to keep track of, because it can just be a circular buffer that doesn't have to worry about old long-lived entries still being in use after wrapping around. (Unless cache-bypassing / weakly-ordered NT stores can do that? They can commit to an LFB (Line Fill Buffer) out of order. Unlike normal stores, they commit directly to an LFB for transfer off-core, rather than to L1d.)


but what is the size of an entry?

Store buffer sizes are measured in entries, not bits.

Narrow stores don't "use less space" in the store buffer; they still use exactly 1 entry.

Skylake's store buffer has 56 entries (wikichip), up from 42 in Haswell/Broadwell, and 36 in SnB/IvB (David Kanter's HSW writeup on RealWorldTech has diagrams). You can find numbers for most earlier x86 uarches in Kanter's writeups on RWT, or Wikichip's diagrams, or various other sources.

SKL/BDW/HSW also have 72 load buffer entries, SnB/IvB have 64. This is the number of in-flight load instructions that either haven't executed or are waiting for data to arrive from outer caches.


The size in bits of each entry is an implementation detail that has zero impact on how you optimize software. Similarly, we don't know the size in bits of a uop (in the front-end, in the ROB, in the RS), or TLB implementation details, or many other things, but we do know how many ROB and RS entries there are, and how many TLB entries of different types there are in various uarches.

Intel doesn't publish circuit diagrams for their CPU designs and (AFAIK) these sizes aren't generally known, so we can't even satisfy our curiosity about design details / tradeoffs.


Write coalescing in the store buffer:

Back-to-back narrow stores to the same cache line can (probably?) be combined aka coalesced in the store buffer before they commit, so it might only take one cycle on a write port of L1d cache to commit multiple stores.

We know for sure that some non-x86 CPUs do this, and we have some evidence / reason to suspect that Intel CPUs might do this. But if it happens, it's limited. @BeeOnRope and I currently think Intel CPUs probably don't do any significant merging. And if they do, the most plausible case is that entries at the end of the store buffer (ready to commit to L1d) that all go to the same cache line might merge into one buffer, optimizing commit if we're waiting for an RFO for that cache line. See discussion in comments on Are two store buffer entries needed for split line/page stores on recent Intel?. I proposed some possible experiments but haven't done them.

Earlier stuff about possible store-buffer merging:

See discussion starting with this comment: Are write-combining buffers used for normal writes to WB memory regions on Intel?

And also Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake may be relevant.

We know for sure that some weakly-ordered ISAs like Alpha 21264 did store coalescing in their store buffer, because the manual documents it, along with its limitations on what it can commit and/or read to/from L1d per cycle. Also PowerPC RS64-II and RS64-III, with less detail, in docs linked from a comment here: Are there any modern CPUs where a cached byte store is actually slower than a word store?

People have published papers on how to do (more aggressive?) store coalescing in TSO memory models (like x86), e.g. Non-Speculative Store Coalescing in Total Store Order.

Coalescing could allow a store-buffer entry to be freed before its data commits to L1d (presumably only after retirement), if its data is copied to a store to the same line. This could only happen if no stores to other lines separate them, or else it would cause stores to commit (become globally visible) out of program order, violating the memory model. But we think this can happen for any 2 stores to the same line, even the first and last byte.

A problem with this idea is that SB entry allocation is probably a ring buffer, like the ROB. Releasing entries out of order would mean hardware would need to scan every entry to find a free one, and then if they're reallocated out of order then they're not in program order for later stores. That could make allocation and store-forwarding much harder so it's probably not plausible.

As discussed in Are two store buffer entries needed for split line/page stores on recent Intel?, it would make sense for an SB entry to hold all of one store even if it spans a cache-line boundary. Cache line boundaries become relevant when committing to L1d cache on leaving the SB. We know that store-forwarding can work for stores that split across a cache line. That seems unlikely if they were split into multiple SB entries in the store ports.


Terminology: I've been using "coalescing" to talk about merging in the store buffer, vs. "write combining" to talk about NT stores that combine in an LFB before (hopefully) doing a full-line write with no RFO. Or stores to WC memory regions which do the same thing.

This distinction / convention is just something I made up. According to discussion in comments, this might not be standard computer architecture terminology.

Intel's manuals (especially the optimization manual) are written over many years by different authors, and also aren't consistent in their terminology. Take most parts of the optimization manual with a grain of salt especially if it talks about Pentium4. The new sections about Sandybridge and Haswell are reliable, but older parts might have stale advice that's only / mostly relevant for P4 (e.g. inc vs. add 1), or the microarchitectural explanations for some optimization rules might be confusing / wrong. Especially section 3.6.10 Write Combining. The first bullet point about using LFBs to combine stores while waiting for lines to arrive for cache-miss stores to WB memory just doesn't seem plausible, because of memory-ordering rules. See discussion between me and BeeOnRope linked above, and in comments here.


Footnote 1:

A write-combining cache to buffer write-back (or write-through) from inner caches would have a different name. e.g. Bulldozer-family uses 16k write-through L1d caches, with a small 4k write-back buffer. (See Why do L1 and L2 Cache waste space saving the same data? for details and links to even more details. See Cache size estimation on your system? for a rewrite-an-array microbenchmark that slows down beyond 4k on a Bulldozer-family CPU.)

Footnote 2: Some POWER CPUs let other SMT threads snoop retired stores in the store buffer: this can cause different threads to disagree about the global order of stores from other threads. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?

Footnote 3: non-x86 CPUs with weak memory models can commit retired stores in any order, allowing more aggressive coalescing of multiple stores to the same line, and making a cache-miss store not stall commit of other stores.
