Does software prefetching allocate a Line Fill Buffer (LFB)?

Question

I've realized that Little's Law limits how fast data can be transferred at a given latency and with a given level of concurrency. If you want to transfer something faster, you either need larger transfers, more transfers "in flight", or lower latency. For the case of reading from RAM, the concurrency is limited by the number of Line Fill Buffers.

A Line Fill Buffer is allocated when a load misses the L1 cache. Modern Intel chips (Nehalem, Sandy Bridge, Ivy Bridge, Haswell) have 10 LFB's per core, and thus are limited to 10 outstanding cache misses per core. If RAM latency is 70 ns (plausible), and each transfer is 128 Bytes (64B cache line plus its hardware prefetched twin), this limits bandwidth per core to: 10 * 128B / 75 ns = ~16 GB/s. Benchmarks such as single-threaded Stream confirm that this is reasonably accurate.
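
For reference, here is that Little's Law arithmetic written out as a tiny stand-alone calculation; the constants are just the figures quoted above (10 LFBs, 128 B per transfer, ~75 ns latency), not new measurements:

#include <stdio.h>

int main(void) {
    /* Figures quoted above, not measured values. */
    double lfbs       = 10.0;   /* outstanding L1 misses per core              */
    double bytes_each = 128.0;  /* 64 B line plus its hardware-prefetched twin */
    double latency_ns = 75.0;   /* approximate round trip to RAM               */

    /* Little's Law: concurrency = bandwidth * latency, so
       bandwidth = concurrency / latency.  Bytes per nanosecond is GB/s. */
    double gb_s  = lfbs * bytes_each / latency_ns;
    double gib_s = gb_s * 1e9 / (1024.0 * 1024.0 * 1024.0);

    printf("per-core bound: %.1f GB/s (%.1f GiB/s)\n", gb_s, gib_s);
    return 0;
}

Bytes per nanosecond works out to about 17 GB/s, i.e. roughly 16 GiB/s, which is the ballpark figure quoted above.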

The obvious way to reduce the latency would be prefetching the desired data with x64 instructions such as PREFETCHT0, PREFETCHT1, PREFETCHT2, or PREFETCHNTA so that it doesn't have to be read from RAM. But I haven't been able to speed anything up by using them. The problem seems to be that the _mm_prefetch() instructions themselves consume LFB's, so they too are subject to the same limits. Hardware prefetches don't touch the LFB's, but also will not cross page boundaries.
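
As a concrete illustration of the approach being described, here is a minimal sketch of issuing software prefetches ahead of a streaming read from C; the prefetch distance is an arbitrary placeholder, not a tuned value:

#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */
#include <stddef.h>
#include <stdint.h>

/* Sum a large array while prefetching a fixed distance ahead of the
   demand loads.  PF_AHEAD_BYTES is a placeholder, not a tuned value. */
#define PF_AHEAD_BYTES (16 * 64)   /* 16 cache lines ahead */

uint64_t sum_with_prefetch(const uint64_t *buf, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if ((i & 7) == 0)   /* one prefetch per 64-byte line (8 x uint64_t) */
            _mm_prefetch((const char *)&buf[i] + PF_AHEAD_BYTES, _MM_HINT_T0);
        total += buf[i];
    }
    return total;
}

Any of the other hints mentioned above (_MM_HINT_T1, _MM_HINT_T2, _MM_HINT_NTA) can be substituted for _MM_HINT_T0 to target a different cache level.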

But I can't find any of this documented anywhere. The closest I've found is a 15-year-old article that mentions that prefetch on the Pentium III uses the Line Fill Buffers. I worry things may have changed since then. And since I think the LFB's are associated with the L1 cache, I'm not sure why a prefetch to L2 or L3 would consume them. And yet, the speeds I measure are consistent with this being the case.

So: Is there any way to initiate a fetch from a new location in memory without using up one of those 10 Line Fill Buffers, thus achieving higher bandwidth by skirting around Little's Law?

Answer

Based on my testing, all types of prefetch instructions consume line fill buffers on the most recent mainstream Intel CPUs.

In particular, I added some load & prefetch tests to uarch-bench, which use large-stride loads over buffers of various sizes. Here are typical results on my Skylake i7-6700HQ:

                     Benchmark   Cycles    Nanos
  16-KiB parallel        loads     0.50     0.19
  16-KiB parallel   prefetcht0     0.50     0.19
  16-KiB parallel   prefetcht1     1.15     0.44
  16-KiB parallel   prefetcht2     1.24     0.48
  16-KiB parallel prefetchtnta     0.50     0.19

  32-KiB parallel        loads     0.50     0.19
  32-KiB parallel   prefetcht0     0.50     0.19
  32-KiB parallel   prefetcht1     1.28     0.49
  32-KiB parallel   prefetcht2     1.28     0.49
  32-KiB parallel prefetchtnta     0.50     0.19

 128-KiB parallel        loads     1.00     0.39
 128-KiB parallel   prefetcht0     2.00     0.77
 128-KiB parallel   prefetcht1     1.31     0.50
 128-KiB parallel   prefetcht2     1.31     0.50
 128-KiB parallel prefetchtnta     4.10     1.58

 256-KiB parallel        loads     1.00     0.39
 256-KiB parallel   prefetcht0     2.00     0.77
 256-KiB parallel   prefetcht1     1.31     0.50
 256-KiB parallel   prefetcht2     1.31     0.50
 256-KiB parallel prefetchtnta     4.10     1.58

 512-KiB parallel        loads     4.09     1.58
 512-KiB parallel   prefetcht0     4.12     1.59
 512-KiB parallel   prefetcht1     3.80     1.46
 512-KiB parallel   prefetcht2     3.80     1.46
 512-KiB parallel prefetchtnta     4.10     1.58

2048-KiB parallel        loads     4.09     1.58
2048-KiB parallel   prefetcht0     4.12     1.59
2048-KiB parallel   prefetcht1     3.80     1.46
2048-KiB parallel   prefetcht2     3.80     1.46
2048-KiB parallel prefetchtnta    16.54     6.38
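
(For context, the access pattern those tests exercise is conceptually along these lines; this is an illustrative sketch, not the actual uarch-bench code, and the stride and hint shown are placeholders for the variants tested.)

#include <xmmintrin.h>
#include <stddef.h>

/* Walk a buffer of the given size one stride at a time, so every access
   goes to whichever cache level the buffer fits in.  The real benchmark
   has variants that use a plain load instead of the prefetch, or
   _MM_HINT_T1 / _MM_HINT_T2 / _MM_HINT_NTA instead of _MM_HINT_T0. */
#define STRIDE 64   /* placeholder; one access per cache line */

void prefetch_pass(const char *buf, size_t bytes)
{
    for (size_t off = 0; off < bytes; off += STRIDE)
        _mm_prefetch(buf + off, _MM_HINT_T0);
}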

The key thing to note is that none of the prefetching techniques are much faster than loads at any buffer size. If any prefetch instruction didn't use the LFB, we would expect it to be very fast for a benchmark that fit into the level of cache it prefetches to. For example prefetcht1 brings lines into the L2, so for the 128-KiB test we might expect it to be faster than the load variant if it doesn't use LFBs.

More conclusively, we can examine the l1d_pend_miss.fb_full counter, whose description is:

Number of times a request needed a FB (Fill Buffer) entry but there was no entry available for it. A request includes cacheable/uncacheable demands that are load, store or SW prefetch instructions.

The description already indicates that SW prefetches need LFB entries and testing confirmed it: for all types of prefetch, this figure was very high for any test where concurrency was a limiting factor. For example, for the 512-KiB prefetcht1 test:

 Performance counter stats for './uarch-bench --test-name 512-KiB parallel   prefetcht1':

        38,345,242      branches                                                    
     1,074,657,384      cycles                                                      
       284,646,019      mem_inst_retired.all_loads                                   
     1,677,347,358      l1d_pend_miss.fb_full                  

The fb_full value is more than the number of cycles, meaning that the LFB was full almost all the time (it can be more than the number of cycles since up to two loads might want an LFB per cycle). This workload is pure prefetches, so there is nothing to fill up the LFBs except prefetch.

The results of this test also contradict the claimed behavior in the section of the manual quoted by Leeor:

There are cases where a PREFETCH will not perform the data prefetch. These include:

  • ...
  • If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.

Clearly this is not the case here: the prefetch requests are not dropped when the LFBs fill up, but are stalled like a normal load until resources are available (this is not an unreasonable behavior: if you asked for a software prefetch, you probably want to get it, perhaps even if it means stalling).

We also note the following interesting behaviors:

  • It seems like there is some small difference between prefetcht1 and prefetcht2 as they report different performance for the 16-KiB test (the difference varies, but is consistently different), but if you repeat the test you'll see that this is more likely just run-to-run variation as those particular values are somewhat unstable (most other values are very stable).
  • For the L2 contained tests, we can sustain 1 load per cycle, but only one prefetcht0 prefetch. This is kind of weird because prefetcht0 should be very similar to a load (and it can issue 2 per cycle in the L1 cases).
  • Even though the L2 has ~12 cycle latency, we are able to fully hide that latency with only 10 LFBs: we get 1.0 cycles per load (limited by L2 throughput), not the 12 / 10 == 1.2 cycles per load that we'd expect (best case) if the LFBs were the limiting factor (and the very low values for fb_full confirm it). That's probably because the 12 cycle latency is the full load-to-use latency all the way to the execution core, which also includes several cycles of additional latency (e.g., L1 latency is 4-5 cycles), so the actual time spent in the LFB is less than 10 cycles (this occupancy arithmetic is written out in the sketch after this list).
  • For the L3 tests, we see values of 3.8-4.1 cycles, very close to the expected 42/10 = 4.2 cycles based on the L3 load-to-use latency. So we are definitely limited by the 10 LFBs when we hit the L3. Here prefetcht1 and prefetcht2 are consistently 0.3 cycles faster than loads or prefetcht0. Given the 10 LFBs, that equals 3 cycles less occupancy, more or less explained by the prefetch stopping at L2 rather than going all the way to L1.
  • prefetchtnta generally has much lower throughput than the others outside of L1. This probably means that prefetchtnta is actually doing what it is supposed to, and appears to bring lines into L1, not into L2, and only "weakly" into L3. So for the L2-contained tests it has concurrency-limited throughput as if it was hitting the L3 cache, and for the 2048-KiB case (1/3 of the L3 cache size) it has the performance of hitting main memory. prefetchnta limits L3 cache pollution (to something like only one way per set), so we seem to be getting evictions.
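
To make the occupancy arithmetic in the two bullets above explicit, here is the same calculation written out; the latencies and LFB count are the figures quoted above, not newly measured values:

#include <stdio.h>

int main(void) {
    /* Figures quoted in the bullets above. */
    const double lfbs       = 10.0;
    const double l2_latency = 12.0;   /* approx. L2 load-to-use latency, cycles */
    const double l3_latency = 42.0;   /* approx. L3 load-to-use latency, cycles */

    /* With N fill buffers each occupied for L cycles, the best case is
       one completed access every L / N cycles. */
    printf("L2 bound: %.1f cycles per access (measured ~1.0)\n", l2_latency / lfbs);
    printf("L3 bound: %.1f cycles per access (measured ~3.8-4.1)\n", l3_latency / lfbs);
    return 0;
}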

Here's an older answer I wrote before testing, speculating on how it could work:

In general, I would expect any prefetch that results in data ending up in L1 to consume a line fill buffer, since I believe that the only path between L1 and the rest of the memory hierarchy is the LFB[1]. So SW and HW prefetches that target the L1 probably both use LFBs.

However, this leaves open the possibility that prefetches that target L2 or higher levels don't consume LFBs. For the case of hardware prefetch, I'm quite sure this is the case: you can find many references that explain that HW prefetch is a mechanism to effectively get more memory parallelism beyond the maximum of 10 offered by the LFB. Furthermore, it doesn't seem like the L2 prefetchers could use the LFBs if they wanted to: they live in/near the L2 and issue requests to higher levels, presumably using the superqueue, and wouldn't need the LFBs.

That leaves software prefetches that target the L2 (or higher), such as prefetcht1 and prefetcht2[2]. Unlike requests generated by the L2, these start in the core, so they need some way to get from the core out, and this could be via the LFB. The Intel Optimization guide has the following interesting quote (emphasis mine):

Generally, software prefetching into the L2 will show more benefit than L1 prefetches. A software prefetch into L1 will consume critical hardware resources (fill buffer) until the cacheline fill completes. A software prefetch into L2 does not hold those resources, and it is less likely to have a negative performance impact. If you do use L1 software prefetches, it is best if the software prefetch is serviced by hits in the L2 cache, so the length of time that the hardware resources are held is minimized.

This would seem to indicate that software prefetches don't consume LFBs - but this quote only applies to the Knights Landing architecture, and I can't find similar language for any of the more mainstream architectures. It appears that the cache design of Knights Landing is significantly different (or the quote is wrong).

[1] In fact, I think that even non-temporal stores use the LFBs to get out of the execution core - but their occupancy time is short because as soon as they get to the L2 they can enter the superqueue (without actually going into L2) and then free up their associated LFB.

[2] I think both of these target the L2 on recent Intel, but this is also unclear - perhaps the t2 hint actually targets LLC on some uarchs?
