Is prefetching triggered by the stream of exact addresses or by the stream of cache lines?

Question

On modern x86 CPUs, hardware prefetching is an important technique to bring cache lines into various levels of the cache hierarchy before they are explicitly requested by the user code.

The basic idea is that when the processor detects a series of accesses to sequential or strided-sequential [1] locations, it will go ahead and fetch further memory locations in the sequence, even before executing the instructions that (may) actually access those locations.

My question is whether the detection of a prefetch sequence is based on the full addresses (the actual addresses requested by user code) or on the cache line addresses, which are essentially the full addresses with the bottom 6 bits [2] stripped off.

For example, on a system with a 64-byte cache line, accesses to full addresses 1, 2, 3, 65, 150 would access cache lines 0, 0, 0, 1, 2.
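
The mapping in this example can be checked with a minimal Python sketch. It assumes a 64-byte line; the names `LINE_SIZE`, `cache_line` and `line_offset` are made up for illustration and do not come from any library.

```python
LINE_SIZE = 64                           # bytes per cache line (assumed)
LINE_SHIFT = LINE_SIZE.bit_length() - 1  # 6 low bits select the byte within a line

def cache_line(addr: int) -> int:
    """Cache line address: the full address with the bottom 6 bits stripped."""
    return addr >> LINE_SHIFT

def line_offset(addr: int) -> int:
    """Byte offset of the access within its cache line."""
    return addr & (LINE_SIZE - 1)

addresses = [1, 2, 3, 65, 150]
lines = [cache_line(a) for a in addresses]
offsets = [line_offset(a) for a in addresses]

print(lines)    # [0, 0, 0, 1, 2]
print(offsets)  # [1, 2, 3, 1, 22]
```

A prefetcher that observes only `lines` loses the information in `offsets`, which is exactly the distinction the question is about.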

The difference could be relevant when a series of accesses is more regular in the cache line addressing than the full addressing. For example, a series of full addresses like:

32, 24, 8, 0, 64 + 32, 64 + 24, 64 + 8, 64 + 0, ..., N*64 + 32, N*64 + 24, N*64 + 8, N*64 + 0

might not look like a strided sequence at the full address level (indeed it might incorrectly trigger the backwards prefetcher, since each subsequence of 4 accesses looks like an 8-byte strided reverse sequence), but at the cache line level it looks like it's going forwards one cache line at a time (just like the simple sequence 0, 8, 16, 24, ...).
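
To make the two views of this sequence concrete, the following sketch generates the first three groups of accesses and prints the deltas at byte granularity and at cache line granularity. It again assumes 64-byte lines; `access_stream` and `deltas` are hypothetical helper names, and collapsing repeated lines with `groupby` is only one possible way to model a line-level stream.

```python
from itertools import groupby

LINE_SIZE = 64              # bytes per cache line (assumed)
OFFSETS = (32, 24, 8, 0)    # per-line offsets, in the order given in the question

def access_stream(num_lines):
    """Full addresses: N*64 + 32, N*64 + 24, N*64 + 8, N*64 + 0 for each N."""
    return [n * LINE_SIZE + off for n in range(num_lines) for off in OFFSETS]

def deltas(seq):
    """Differences between consecutive elements."""
    return [b - a for a, b in zip(seq, seq[1:])]

addrs = access_stream(3)
line_stream = [a // LINE_SIZE for a in addrs]
distinct_lines = [line for line, _ in groupby(line_stream)]  # collapse repeats

print(deltas(addrs))           # [-8, -16, -8, 96, -8, -16, -8, 96, -8, -16, -8]
print(deltas(distinct_lines))  # [1, 1]
```

At byte granularity most deltas are negative, with a large positive jump between groups, while the line-level stream simply advances by one line each time.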

Which system, if either, is in place on modern hardware?

Note: One could also imagine that the answer isn't based on every access, but only on accesses that miss in whichever cache level the prefetcher is observing. In that case the same question still applies to the filtered stream of "miss accesses".

[1] Strided-sequential just means accesses that have the same stride (delta) between them, even if that delta isn't 1. For example, a series of accesses to locations 100, 200, 300, ... could be detected as strided access with a stride of 100, and in principle the CPU will fetch based on this pattern (which would mean that some cache lines might be "skipped" in the prefetch pattern).

[2] This assumes a 64-byte cache line.
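
As a rough illustration of the "skipped" lines mentioned in footnote 1, the sketch below lists which 64-byte lines a stride-100 access stream touches. The `constant_stride` function is a toy check invented for this example and is not a model of any real prefetcher.

```python
LINE_SIZE = 64   # bytes per cache line (assumed)
STRIDE = 100     # byte stride from the footnote's example

def constant_stride(seq):
    """Return the common delta if every consecutive pair has the same one, else None."""
    found = {b - a for a, b in zip(seq, seq[1:])}
    return found.pop() if len(found) == 1 else None

addrs = [STRIDE * i for i in range(1, 9)]    # 100, 200, ..., 800
touched = [a // LINE_SIZE for a in addrs]    # lines hit by the stream
skipped = sorted(set(range(touched[0], touched[-1] + 1)) - set(touched))

print(touched)                   # [1, 3, 4, 6, 7, 9, 10, 12]
print(skipped)                   # [2, 5, 8, 11]
print(constant_stride(addrs))    # 100
print(constant_stride(touched))  # None
```

In this particular case the stream is perfectly regular at byte granularity but alternates between line deltas of 2 and 1, so the two granularities can disagree in either direction.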

Recommended answer

The cache line offsets can be useful, but they can also be misleading, as your example shows. I will discuss how line offsets impact the data prefetchers on modern Intel processors, based on my experiments on Haswell.

The method I followed is simple. First, I disable all the data prefetchers except the one I want to test. Second, I design a sequence of accesses that exhibits a particular pattern of interest. The target prefetcher will see this sequence and learn from it. Then I follow that with an access to a particular line and accurately measure its latency to determine whether the prefetcher has prefetched that line. The loop doesn't contain any other loads, though it does contain one store that saves the latency measurement to a buffer.

There are 4 hardware data prefetchers. The behaviors of the DCU prefetcher and the L2 adjacent line prefetcher are not affected by the pattern of the line offsets, but only by the pattern of 64-byte aligned addresses.

My experiments don't show any evidence that the L2 streaming prefetcher even receives the cache line offset. It seems that it only gets the line-aligned address. For example, by accessing the same line multiple times, the offset pattern by itself does not seem to have an impact on the behavior of the prefetcher.

The DCU IP prefetcher shows interesting behavior. I've tested two cases:

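  • If the offsets of the loads are decreasing, the prefetcher will prefetch one or more lines in both the forward and backward directions.

  • If the offsets of the loads are increasing, the prefetcher will prefetch one or more lines, but only in the forward direction.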
