Does software prefetching allocate a Line Fill Buffer (LFB)?

Question


I've realized that Little's Law limits how fast data can be transferred at a given latency and with a given level of concurrency. If you want to transfer something faster, you either need larger transfers, more transfers "in flight", or lower latency. For the case of reading from RAM, the concurrency is limited by the number of Line Fill Buffers.


A Line Fill Buffer is allocated when a load misses the L1 cache. Modern Intel chips (Nehalem, Sandy Bridge, Ivy Bridge, Haswell) have 10 LFBs per core, and thus are limited to 10 outstanding cache misses per core. If RAM latency is 70 ns (plausible), and each transfer is 128 bytes (a 64 B cache line plus its hardware-prefetched twin), this limits bandwidth per core to 10 * 128 B / 70 ns = ~18 GB/s. Benchmarks such as single-threaded STREAM confirm that this is reasonably accurate.
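
As a quick sanity check, that bound is just concurrency * bytes-per-request / latency. A minimal sketch in C, using the 10-LFB / 128 B / 70 ns figures assumed above:

#include <stdio.h>

/* Little's Law bound: sustained bandwidth = bytes in flight / latency.
   bytes per ns is numerically equal to GB/s (decimal). */
static double littles_law_gbps(double concurrency, double bytes_per_req,
                               double latency_ns) {
    return concurrency * bytes_per_req / latency_ns;
}

int main(void) {
    /* Figures from the paragraph above: 10 LFBs x 128 B at 70 ns. */
    printf("%.1f GB/s\n", littles_law_gbps(10, 128, 70)); /* ~18.3 GB/s */
    return 0;
}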


The obvious way to reduce the latency would be prefetching the desired data with x64 instructions such as PREFETCHT0, PREFETCHT1, PREFETCHT2, or PREFETCHNTA, so that it doesn't have to be read from RAM. But I haven't been able to speed anything up by using them. The problem seems to be that the _mm_prefetch() instructions themselves consume LFBs, so they too are subject to the same limits. Hardware prefetches don't touch the LFBs, but they also won't cross page boundaries.
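
For concreteness, the kind of loop in question looks roughly like the sketch below. PF_DIST is a hypothetical tuning knob (how many 64 B lines ahead to prefetch), not a value taken from any manual:

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0/T1/T2/NTA */

#define PF_DIST 16  /* hypothetical distance: 16 lines = 1 KiB ahead */

/* Sum a large array while software-prefetching PF_DIST cache lines ahead.
   Prefetching past the end of the array is harmless in practice, since
   PREFETCH never faults. */
uint64_t sum_with_prefetch(const uint64_t *data, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % 8 == 0)  /* one prefetch per 64 B line (8 x uint64_t) */
            _mm_prefetch((const char *)(data + i + 8 * PF_DIST), _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}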


But I can't find any of this documented anywhere. The closest I've found is a 15-year-old article that mentions that prefetch on the Pentium III uses the Line Fill Buffers. I worry things may have changed since then. And since I think the LFBs are associated with the L1 cache, I'm not sure why a prefetch to L2 or L3 would consume them. And yet, the speeds I measure are consistent with this being the case.


So: Is there any way to initiate a fetch from a new location in memory without using up one of those 10 Line Fill Buffers, thus achieving higher bandwidth by skirting around Little's Law?

Answer


First of all, a minor correction: read the optimization guide and you'll note that some HW prefetchers belong to the L2 cache, and as such are not limited by the number of fill buffers but rather by their L2 counterpart.


The "spatial prefetcher" (the colocated-64B line you meantion, completing to 128B chunks) is one of them, so in theory if you fetch every other line you'll be able to get a higher bandwidth (some DCU prefetchers might try to "fill the gaps for you", but in theory they should have lower priority so it might work).


However, the "king" prefetcher is the other guy, the "L2 streamer". Section 2.1.5.4 reads:


Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.

The important part is:


The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.


This 2:1 ratio means that for a stream of accesses that is recognized by this prefetcher, it would always run ahead of your accesses. It's true that you won't see these lines in your L1 automatically, but it does mean that, if all works well, you should always get L2 hit latency for them (once the prefetch stream has had enough time to run ahead and mitigate the L3/memory latency). You may only have 10 LFBs, but as you noted in your calculation, the shorter the access latency becomes, the faster you can recycle them and the higher the bandwidth you can reach. This essentially splits the L1 <-- mem latency into parallel streams of L1 <-- L2 and L2 <-- mem.
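
To put rough numbers on that (the ~4 ns L2 hit latency here is my ballpark assumption, roughly 12 cycles at 3 GHz, not a figure from the guide): the same Little's Law arithmetic then gives 10 LFBs * 64 B / 4 ns = ~160 GB/s, so once the streamer keeps the data in L2, the ten fill buffers stop being the practical bottleneck.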


As for the question in your headline - it stands to reason that prefetches attempting to fill the L1 would require a line fill buffer to hold the retrieved data for that level. This should probably include all L1 prefetches. As for SW prefetches, section 7.4.3 says:


There are cases where a PREFETCH will not perform the data prefetch. These include:



  • PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.
  • An access to the specified address that causes a fault/exception.
  • If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.

...


So I assume you're right and SW prefetches are not a way to artificially increase your number of outstanding requests. However, the same explanation applies here as well: if you know how to use SW prefetching to access your lines well enough in advance, you may be able to mitigate some of the access latency and increase your effective BW. This however won't work for long streams, for two reasons: 1) your cache capacity is limited (even if the prefetch is temporal, like the t0 flavor), and 2) you still need to pay the full L1 <-- mem latency for each prefetch, so you're just moving your stress ahead a bit; if your data manipulation is faster than memory access, you'll eventually catch up with your SW prefetching. So this only works if you can prefetch all you need well enough in advance, and keep it there.
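
As a rough rule of thumb for "well enough in advance": the prefetch distance has to cover the latency you're hiding, i.e. distance_in_lines * time_per_line_of_work >= latency. With the question's 70 ns memory latency and, say, ~2 ns of computation per 64 B line (my assumption), that means staying at least ~35 lines ahead, and it only keeps working while those in-flight lines still fit in the cache level you prefetched into.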
