In which condition does the DCU prefetcher start prefetching?


Question



I am reading about the different prefetchers available in Intel Core i7 systems. I have performed experiments to understand when these prefetchers are invoked.

These are my findings

  1. The L1 IP prefetcher starts prefetching after 3 cache misses. It only prefetches on a cache hit.

  2. The L2 adjacent-line prefetcher starts prefetching after the 1st cache miss and prefetches on a cache miss.

  3. The L2 H/W (stride) prefetcher starts prefetching after the 1st cache miss and prefetches on a cache hit.

I am not able to understand the behavior of the DCU prefetcher. When does it start prefetching, or when is it invoked? Does it prefetch the next cache line on a cache hit or on a miss?

I have explored the Intel document disclosure-of-hw-prefetcher, which mentions that the DCU prefetcher fetches the next cache line into the L1-D cache, but there is no clear information on when it starts prefetching.

Can anyone explain when the DCU prefetcher starts prefetching?

Solution

The DCU prefetcher does not prefetch lines in a deterministic manner. It appears to have a confidence value associated with each potential prefetch request; only if the confidence is larger than some threshold is the prefetch triggered. Moreover, it seems that if both L1 prefetchers are enabled, only one of them can issue a prefetch request in the same cycle; perhaps the prefetch from the one with higher confidence is accepted. The answer below does not take these observations into consideration. (A lot more experimental work needs to be done; I will rewrite the answer in the future.)


The Intel manual tells us a few things about the DCU prefetcher. Section 2.4.5.4 and Section 2.5.4.2 of the optimization manual both say the following:

Data cache unit (DCU) prefetcher -- This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.

Note that Section 2.4.5.4 is part of the section on Sandy Bridge and Section 2.5.4.2 is part of the section on Intel Core. The DCU prefetcher was first supported on the Intel Core microarchitecture and it's also supported on all later microarchitectures. There is no indication, as far as I know, that the DCU prefetcher has changed over time. So I think it works exactly the same on all microarchitectures up to Skylake at least.

That quote doesn't really say much. The "ascending access" part suggests that the prefetcher is triggered by multiple accesses with increasing offsets. The "recently loaded data" part is vague. It may refer to one or more lines that immediately precede the line to be prefetched in the address space. It's also not clear whether that refers to virtual or physical addresses. The "fetches the next line" part suggests that it fetches only a single line every time it's triggered and that line is the line that succeeds the line(s) that triggered the prefetch.

I've conducted some experiments on Haswell with all prefetchers disabled except for the DCU prefetcher. I've also disabled hyperthreading. This enables me to study the DCU prefetcher in isolation. The results show the following:
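For reference, the disclosure-of-hw-prefetcher article mentioned in the question documents that the four prefetchers are controlled by bits 0–3 of MSR 0x1A4 (bit 0: L2 hardware prefetcher, bit 1: L2 adjacent-line prefetcher, bit 2: DCU prefetcher, bit 3: DCU IP prefetcher), where setting a bit disables the corresponding prefetcher. A minimal sketch of computing the value used for this kind of experiment (the actual write requires root and, on Linux, the `msr` kernel module; the helper name is my own):

```python
# Bit positions in MSR 0x1A4, per Intel's "Disclosure of Hardware
# Prefetcher Control" article; setting a bit DISABLES that prefetcher.
PREFETCHER_BITS = {
    "l2_hw": 0,        # L2 hardware (streamer) prefetcher
    "l2_adjacent": 1,  # L2 adjacent-cache-line prefetcher
    "dcu": 2,          # DCU (L1 next-line/streaming) prefetcher
    "dcu_ip": 3,       # DCU IP (stride) prefetcher
}

def disable_mask(keep_enabled):
    """Return the MSR 0x1A4 value that disables every prefetcher
    except the ones named in keep_enabled."""
    mask = 0
    for name, bit in PREFETCHER_BITS.items():
        if name not in keep_enabled:
            mask |= 1 << bit
    return mask

# Leave only the DCU prefetcher enabled: disable bits 0, 1, and 3.
print(hex(disable_mask({"dcu"})))  # 0xb

# The write itself can then be done with msr-tools, e.g.:
#   wrmsr -a 0x1a4 0xb
```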

  • The DCU prefetcher tracks accesses for up to 4 different 4KB (probably physical) pages.
  • The DCU prefetcher gets triggered when there are three or more accesses to one or more lines within the same cache set. The accesses must be either demand loads or software prefetches (any prefetch instruction including prefetchnta) or a combination of both. The accesses can be either hits or misses in the L1D or a combination of both. When it's triggered, for the 4 pages that are currently being tracked, it will prefetch the immediate next line within each of the respective pages. For example, consider the following three demand load misses: 0xF1000, 0xF2008, and 0xF3004. Assume that the 4 pages being tracked are 0xF1000, 0xF2000, 0xF3000, and 0xF4000. Then the DCU prefetcher will prefetch the following lines: 0xF1040, 0xF2040, 0xF3040, and 0xF4040.
  • The DCU prefetcher gets triggered when there are three or more accesses to one or more lines within two consecutive cache sets. Just like before, the accesses must be either demand loads or software prefetches. The accesses can be either hits or misses in the L1D. When it's triggered, for the 4 pages that are currently being tracked, it will prefetch the immediate next line within each of the respective pages with respect to the accessed cache set that has a smaller physical address. For example, consider the following three demand load misses: 0xF1040, 0xF2048, and 0xF3004. Assume that the 4 pages being tracked are 0xF1000, 0xF2000, 0xF3000, and 0xF4000. Then the DCU prefetcher will prefetch the following lines: 0xF3040 and 0xF4040. There is no need to prefetch 0xF1040 or 0xF2040 because there are already requests for them.
  • The prefetcher will not prefetch into the next 4KB page. So if the three accesses are to the last line in the page, the prefetcher will not be triggered.
  • The pages to be tracked are selected as follows. Whenever a demand load or a software prefetch accesses a page, that page will be tracked and it will replace one of the 4 pages currently being tracked. I've not investigated further the algorithm used to decide which of the 4 pages to replace. It's probably simple though.
  • When a new page gets tracked because of an access of the type mentioned in the previous bullet point, at least two more accesses are required to the same page and same line to trigger the prefetcher to prefetch the next line. Otherwise, a subsequent access to the next line will miss in the L1 if the line was not already there. After that, either way, the DCU prefetcher behaves as described in the second and third bullet points. For example, consider the following three demand load misses: 0xF1040, 0xF2048, and 0xF3004. There are two accesses to the same line and the third one is to the same cache set but different line. These accesses will make the DCU prefetcher track the two pages, but it will not trigger it just yet. When the prefetcher sees another three accesses to any line in the same cache set, it will prefetch the next line for those pages that are currently being tracked. As another example, consider the following three demand load misses: 0xF1040, 0xF2048, and 0xF3030. These accesses are all to the same line so they will not only make the prefetcher track the page but also trigger a next line prefetch for that page and any other pages that are already being tracked.
  • It seems to me that the prefetcher is receiving the dirty flag from the page table entry of the page being accessed (from the TLB). The flag indicates whether the page is dirty or not. If it's dirty, the prefetcher will not track the page, and accesses to the page will not be counted towards the three accesses required for the triggering condition to be satisfied. So it seems that the DCU prefetcher simply ignores dirty pages. That said, the page doesn't have to be read-only to be supported by the prefetcher. However, more thorough investigation is required to understand more accurately how stores interact with the DCU prefetcher.
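To make the addresses in the examples above concrete: the L1D on these parts has 64-byte lines and, at 32 KB and 8 ways, 64 sets, so the set index is bits 6–11 of the address. A small helper (the geometry constants are for this L1D; the function names are my own) shows why 0xF1000, 0xF2008, and 0xF3004 all fall in the same cache set, and how the "next line within the page" is computed:

```python
LINE_SIZE = 64    # bytes per L1D cache line
NUM_SETS = 64     # 32 KB, 8-way L1D -> 64 sets
PAGE_SIZE = 4096

def cache_set(addr):
    """L1D set index: bits 6..11 of the (probably physical) address."""
    return (addr // LINE_SIZE) % NUM_SETS

def next_line_same_page(addr):
    """Address of the next cache line, or None at a page boundary
    (the DCU prefetcher never crosses into the next 4 KB page)."""
    nxt = (addr // LINE_SIZE + 1) * LINE_SIZE
    return nxt if nxt // PAGE_SIZE == addr // PAGE_SIZE else None

for a in (0xF1000, 0xF2008, 0xF3004):
    print(hex(a), "-> set", cache_set(a))   # all three map to set 0

print(hex(next_line_same_page(0xF1000)))    # 0xf1040
print(next_line_same_page(0xF1FC0))         # None: last line of the page
```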

So the accesses that trigger the prefetcher don't have to be "ascending" or follow any order. The cache line offset itself seems to be ignored by the prefetcher. Only the physical page number matters.

I think the DCU prefetcher has a fully associative buffer that contains 4 entries. Each entry is tagged with the (probably physical) page number and has a valid bit to indicate whether the entry contains a valid page number. In addition, each cache set of the L1D is associated with a 2-bit saturating counter that is incremented whenever a demand load or a software prefetch request accesses the corresponding cache set and the dirty flag of the accessed page is not set. When the counter reaches a value of 3, the prefetcher is triggered. The prefetcher already has the physical page numbers from which it needs to prefetch; it can obtain them from the buffer entry that corresponds to the counter. So it can immediately issue prefetch requests to the next cache lines for each of the pages being tracked by the buffer. However, if a fill buffer is not available for a triggered prefetch request, the prefetch will be dropped. Then the counter will be reset to zero. Page tables might be modified though. It's possible that the prefetcher flushes its buffer whenever the TLB is flushed.
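The hypothesized mechanism can be sketched as a toy model. This is illustrative only: it implements the 4-entry page buffer and the per-set saturating counters described above, but deliberately ignores the dirty-page filter, the newly-tracked-page subtlety, and fill-buffer availability:

```python
LINE = 64
SETS = 64
PAGE = 4096

class DCUModel:
    """Toy model of the hypothesized DCU prefetcher: a 4-entry buffer of
    tracked page numbers plus a 2-bit saturating counter per L1D set."""

    def __init__(self):
        self.pages = []              # up to 4 tracked page numbers
        self.counters = [0] * SETS   # one 2-bit counter per cache set

    def access(self, addr):
        """Record a demand load / software prefetch to addr; return the
        list of line addresses prefetched (empty if not triggered)."""
        page = addr // PAGE
        if page not in self.pages:
            self.pages.append(page)
            if len(self.pages) > 4:
                self.pages.pop(0)    # the real replacement policy is unknown
        s = (addr // LINE) % SETS
        self.counters[s] = min(self.counters[s] + 1, 3)
        if self.counters[s] == 3:
            self.counters[s] = 0     # simplification: reset after triggering
            line_off = (addr % PAGE) // LINE
            if line_off == PAGE // LINE - 1:
                return []            # last line of the page: never cross 4 KB
            # Prefetch the next line in each tracked page.
            return [p * PAGE + (line_off + 1) * LINE for p in self.pages]
        return []

m = DCUModel()
m.access(0xF1000)
m.access(0xF2008)
print([hex(a) for a in m.access(0xF3004)])
# third access to set 0 triggers: ['0xf1040', '0xf2040', '0xf3040']
```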

It could be the case that there are two DCU prefetchers, one for each logical core. When hyperthreading is disabled, one of the prefetchers would be disabled too. It could also be the case that the 4 buffer entries that contain the page numbers are statically partitioned between the two logical cores and combined when hyperthreading is disabled. I don't know for sure, but such a design makes sense to me. Another possible design would be for each prefetcher to have a dedicated 4-entry buffer. It's not hard to determine how the DCU prefetcher works when hyperthreading is enabled; I just didn't spend the effort to study it.

All in all, the DCU prefetcher is by far the simplest of the 4 data prefetchers available in modern high-performance Intel processors. It seems that it's only effective when sequentially, but slowly, accessing small chunks of read-only data (such as read-only files and statically initialized global arrays), or when simultaneously accessing multiple read-only objects that may contain many small fields and span a few consecutive cache lines within the same page.

Section 2.4.5.4 also provides additional information on L1D prefetching in general, so it applies to the DCU prefetcher.

Data prefetching is triggered by load operations when the following conditions are met:

  • Load is from writeback memory type.

This means that the DCU prefetcher will not track accesses to the WP and WT cacheable memory types.

  • The prefetched data is within the same 4K byte page as the load instruction that triggered it.

This has been verified experimentally.

  • No fence is in progress in the pipeline.

I don't know what this means. See: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/805373.

  • Not many other load misses are in progress.

There are only 10 fill buffers that can hold requests that missed the L1D. This raises the question, though: if only a single fill buffer were available, would the hardware prefetcher use it or leave it for anticipated demand accesses? I don't know.

  • There is not a continuous stream of stores.

This suggests that if there is a stream of a large number of stores intertwined with few loads, the L1 prefetcher will ignore the loads and basically temporarily switch off until the stores become a minority. However, my experimental results show that even a single store to a page will turn the prefetcher off for that page.

All Intel Atom microarchitectures have the DCU prefetcher, although the prefetcher might track fewer than 4 pages in these microarchitectures.

All Xeon Phi microarchitectures up to and including Knights Landing don't have the DCU prefetcher. I don't know about later Xeon Phi microarchitectures.
