Cost of a sub-optimal cacheline prefetch


Question

What is the cost of a late prefetch done with a __builtin_prefetch(..., 1) intrinsic (prefetch in preparation for a write)? That is, a prefetch that does not arrive in the L1 cache before the demand load or write that requires it?

For example:

#include <cstdint>
#include <new>  // std::hardware_constructive_interference_size (C++17)

void foo(std::uint8_t* line) {
    __builtin_prefetch(line + std::hardware_constructive_interference_size, 1);
    auto next_line = calculate_address_of_next_line(line);
    auto result = transform(line);
    write(next_line, result);
}

In this case, if the cost of transform is lower than the prefetch, will this code end up being less efficient than if there were no prefetch? The Wikipedia article on cache prefetching discusses the optimal stride for a for loop, but does not mention the impact of a sub-optimal prefetch in that scenario (e.g., what would happen if k were too low?).

Does this get pipelined enough that a suboptimal prefetch does not matter? I am only considering Intel x86 (processors around the time of Broadwell maybe) for the purposes of this question.

Answer

Let's call the type of prefetch you are referring to a late prefetch: where the prefetch does not occur sufficiently before the demand load or store that uses the same cache line to fully hide the latency of the cache miss. This is as opposed to a too-early prefetch, where the prefetch happens so far away from the demand access that it is evicted from at least some levels of the cache before the access occurs.

Compared to not doing the prefetch at all, the cost of such a late prefetch is likely small, zero, or even negative.

Let's focus on the negative part: i.e., the scenario where the prefetch helps even though it is late. If I understand your question correctly, you consider a prefetch that doesn't arrive before the load that needs it "missed" or ineffective. That is not the case, however: as soon as the prefetch request starts, the clock starts ticking on the completion of the memory access, and that work is not lost if the demand load occurs before it completes. For example, if your memory access takes 100 ns, but the demand access occurs only 20 ns after the prefetch, the prefetch is "too late" in the sense that the full 100 ns latency wasn't hidden, but the 20 ns spent on the prefetch is still useful: it reduces the demand access latency to about 80 ns.

That is, late prefetch isn't a binary condition: it ranges from just a little late (e.g., a prefetch issued 90 ns before an access with a latency of 100 ns) to really late (almost immediately before the consuming access). In most scenarios, even fairly late prefetching probably helps, assuming memory latency was a bottleneck for your algorithm in the first place.
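To make the timing concrete, here is a minimal sketch (my own illustration, not from the original answer) of the usual way to give a prefetch time to complete: issue it a fixed number of elements ahead of the element being processed, so the miss latency overlaps with the work on the intervening elements. The lookahead distance is an assumed tuning parameter; too small a value gives exactly the "late but still useful" prefetch discussed above.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed tuning parameter: 64 uint64_t elements is about 8 cache lines
// of lookahead. Too small -> late prefetch; too large -> too-early prefetch.
constexpr std::size_t kLookahead = 64;

std::uint64_t sum_with_prefetch(const std::uint64_t* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Read-intent prefetch of an element we will need kLookahead
        // iterations from now; even if it completes late, the elapsed
        // portion of the miss latency is not wasted.
        if (i + kLookahead < n)
            __builtin_prefetch(&data[i + kLookahead], 0 /* read */);
        total += data[i];
    }
    return total;
}
```

The prefetch is semantically a no-op, so the function computes the same sum with or without it; only the memory timing differs.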

Let's now consider the case of a totally useless prefetch (i.e., one issued immediately before the access, so the access could have been issued in its place had the prefetch not existed): what is the cost? In most realistic scenarios the costs are probably very small: an extra instruction to handle, some small additional pressure on the AGUs, and perhaps a small amount of wasted effort when matching up the subsequent access with the in-flight prefetch².

Since the assumption is that prefetching is employed because of misses to the outer levels of the cache or to DRAM, and that the work in the transform function is significant enough to hide some of the latency, the relative cost of this one additional instruction is likely to be very small.

Of course, this is all under the assumption that the additional prefetch is a single instruction. In some cases, you may have had to organize your code somewhat to allow prefetching or perform some duplicate calculations to allow prefetching at the appropriate place. In that case, the cost side could be correspondingly higher.
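A hypothetical example of that reorganization cost: in a linked-list walk, the next node's address must be read before the current node is processed in order to prefetch it at all, a small restructuring of the loop body of the kind the paragraph describes.

```cpp
#include <cstdint>

struct Node {
    std::uint64_t value;
    Node* next;
};

// Sketch: the loop reads n->next up front so the next node can be
// prefetched while the current node's value is being accumulated.
std::uint64_t sum_list(const Node* head) {
    std::uint64_t total = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        if (n->next != nullptr)
            __builtin_prefetch(n->next, 0);  // read-intent prefetch of the next node
        total += n->value;
    }
    return total;
}
```

Here the extra work is just one compare and one prefetch instruction per node; in more complicated address calculations the duplicated work can be larger, which is the "correspondingly higher" cost mentioned above.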

Finally, there is an additional behavior with respect to write accesses and prefetch with write intent, which means that in some scenarios even a totally useless prefetch (i.e., one issued immediately before the first access) is useful: when that first access is a read.

If a given line is first read and only later written, the core may initially get the line in the E(xclusive) coherence state, and then, on the first write, need to make another roundtrip to some level of the cache to get it in the M(odified) state. Using a prefetch with write intent before the first access would avoid this second roundtrip, because the line would be brought in with the M state the first time. The effect of this optimization is tough to quantify in general, not least because writes are usually buffered and don't form part of a dependence chain (outside of store forwarding).
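The read-then-write pattern above can be sketched as follows (a minimal illustration, not code from the original answer). The write-intent prefetch (second argument 1) can bring the line in already writable, so the later store needs no separate E-to-M coherence upgrade; functionally the prefetch is a no-op and only timing and coherence traffic change.

```cpp
#include <cstdint>

void increment_counter(std::uint64_t* counter) {
    __builtin_prefetch(counter, 1);  // prefetch with write intent
    std::uint64_t v = *counter;      // first access is a read
    *counter = v + 1;                // later write hits an already-writable line
}
```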

² I use the deliberately vague term "wasted effort" here because it isn't really clear whether this has a performance or power cost, or is just some additional work that doesn't add to the operation latency. One possible cost is that a load that triggers the initial L1 miss has a special status and can receive its result without making another roundtrip to L1. In the scenario of a prefetch followed immediately by a load, the load presumably doesn't get that special status, which may slightly increase the cost. However, this question is about stores, not loads.

