Non-temporal loads and the hardware prefetcher, do they work together?

Question

When executing a series of _mm_stream_load_si128() calls (MOVNTDQA) from consecutive memory locations, will the hardware prefetcher still kick in, or should I use explicit software prefetching (with the NTA hint) in order to obtain the benefits of prefetching while still avoiding cache pollution?
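
For concreteness, a minimal sketch of the access pattern being asked about, assuming a 16-byte-aligned source buffer (the function name, buffer, and prefetch distance are made up for illustration):

```c
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */
#include <stddef.h>

/* Sum a buffer of 16-byte vectors using streaming (non-temporal) loads.
 * `src` is assumed 16-byte aligned; `n` is the number of __m128i elements. */
__m128i sum_streaming(const __m128i *src, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i++) {
        /* The question: is an explicit
         *   _mm_prefetch((const char *)&src[i + 8], _MM_HINT_NTA);
         * needed here, or does the hardware prefetcher keep up on its own? */
        __m128i v = _mm_stream_load_si128((__m128i *)&src[i]);
        acc = _mm_add_epi32(acc, v);
    }
    return acc;
}
```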

The reason I ask this is because their objectives seem contradictory to me. A streaming load will fetch data bypassing the cache, while the pre-fetcher attempts to proactively fetch data into the cache.

When sequentially iterating a large data structure (processed data won't be retouched in a long while), it would make sense to me to avoid polluting the cache hierarchy, but I do not want to incur frequent ~100 cycle penalties because the prefetcher is idle.

Target architecture is Intel SandyBridge

Answer

According to Patrick Fay (Intel)'s Nov 2011 post, "On recent Intel processors, prefetchnta brings a line from memory into the L1 data cache (and not into the other cache levels)." He also says you need to make sure you don't prefetch too late (HW prefetch will already have pulled it in to all levels), or too early (evicted by the time you get there).
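
As a rough illustration of that timing window, one common pattern is to prefetch a fixed distance ahead of the current position; the distance of 8 lines below is an arbitrary placeholder that would need tuning per workload, and all names here are hypothetical:

```c
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
#include <stddef.h>

#define CACHE_LINE 64
#define PF_DIST    8     /* cache lines ahead; made-up starting point */

/* Walk `n` bytes of `buf`, prefetching PF_DIST lines ahead with the NTA hint.
 * Too small a distance and the line arrives late; too large and it may be
 * evicted again before the loop reaches it. */
void process(const char *buf, size_t n)
{
    for (size_t i = 0; i < n; i += CACHE_LINE) {
        if (i + PF_DIST * CACHE_LINE < n)
            _mm_prefetch(buf + i + PF_DIST * CACHE_LINE, _MM_HINT_NTA);
        /* ... consume the 64 bytes at buf + i ... */
    }
}
```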

As discussed in comments on the OP, current Intel CPUs have a large shared L3 which is inclusive of all the per-core caches. This means cache-coherency traffic only has to check L3 tags to see if a cache line might be modified somewhere in a per-core L1/L2.

IDK how to reconcile Pat Fay's explanation with my understanding of cache coherency / cache hierarchy. I thought if it does go in L1, it would also have to go in L3. Maybe L1 tags have some kind of flag to say this line is weakly-ordered? My best guess is he was simplifying, and saying L1 when it actually only goes in fill buffers.

This Intel guide about working with video RAM talks about non-temporal moves using load/store buffers, rather than cache lines. (Note that this may only be the case for uncacheable memory.) It doesn't mention prefetch. It's also old, predating SandyBridge. However, it does have this juicy quote:

Ordinary load instructions pull data from USWC memory in units of the same size the instruction requests. By contrast, a streaming load instruction such as MOVNTDQA will commonly pull a full cache line of data to a special "fill buffer" in the CPU. Subsequent streaming loads would read from that fill buffer, incurring much less delay.

And then in another paragraph, it says typical CPUs have 8 to 10 fill buffers. SnB/Haswell still have 10 per core. Again, note that this may only apply to uncacheable memory regions.
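
Because subsequent streaming loads read from the fill buffer, the usual pattern (at least for USWC/WC memory, as in the quote above) is to consume a whole 64-byte line with back-to-back MOVNTDQA loads; a minimal sketch, assuming a 64-byte-aligned pointer:

```c
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */

/* Read one 64-byte cache line as four 16-byte streaming loads.
 * `line` is assumed 64-byte aligned; per the quote above, the first load
 * fills a fill buffer and the other three read from it. */
static inline void load_line_nt(const __m128i *line, __m128i out[4])
{
    out[0] = _mm_stream_load_si128((__m128i *)&line[0]);
    out[1] = _mm_stream_load_si128((__m128i *)&line[1]);
    out[2] = _mm_stream_load_si128((__m128i *)&line[2]);
    out[3] = _mm_stream_load_si128((__m128i *)&line[3]);
}
```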

movntdqa on WB (write-back) memory is not weakly-ordered (see the NT loads section of the linked answer), so it's not allowed to be "stale". Unlike NT stores, neither movntdqa nor prefetchnta change the memory ordering semantics of Write-Back memory.

I have not tested this guess, but prefetchnta / movntdqa on a modern Intel CPU could load a cache line into L3 and L1, but could skip L2 (because L2 isn't inclusive or exclusive of L1). The NT hint could have an effect by placing the cache line in the LRU position of its set, where it's the next line to be evicted. (Normal cache policy inserts new lines at the MRU position, farthest from being evicted. See this article about IvB's adaptive L3 policy for more about cache insertion policy).

Prefetch throughput on IvyBridge is only one per 43 cycles, so be careful not to prefetch too much if you don't want prefetches to slow down your code on IvB. Source: Agner Fog's insn tables and microarch guide. This is a performance bug specific to IvB. On other designs, too much prefetch will just take up uop throughput that could have been useful instructions (other than harm from prefetching useless addresses).
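
One way to keep the prefetch count down is to issue at most one prefetch per cache line rather than one per element; a hypothetical sketch (the 4-line look-ahead is a placeholder):

```c
#include <xmmintrin.h>   /* _mm_prefetch */
#include <stdint.h>
#include <stddef.h>

/* Sum 4-byte elements, issuing one prefetch per 64-byte line (16 elements)
 * instead of one per element, to limit prefetch instruction traffic. */
uint64_t sum_u32(const uint32_t *p, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        if ((i % 16) == 0 && i + 4 * 16 < n)     /* once per cache line */
            _mm_prefetch((const char *)&p[i + 4 * 16], _MM_HINT_NTA);
        s += p[i];
    }
    return s;
}
```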

About SW prefetching in general (not the nt kind): Linus Torvalds posted about how they rarely help in the Linux kernel, and often do more harm than good. Apparently prefetching a NULL pointer at the end of a linked-list can cause a slowdown, because it attempts a TLB fill.
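
For reference, the pattern in question looks roughly like the sketch below (hypothetical struct and function names); guarding against the NULL at the end of the list, or dropping the prefetch entirely, avoids the wasted work:

```c
#include <xmmintrin.h>   /* _mm_prefetch */

struct node {
    struct node *next;
    int payload;
};

/* Linked-list walk with a software prefetch of the next node. An
 * unconditional prefetch of n->next would prefetch NULL on the last
 * iteration, which is the slowdown described above. */
int walk(const struct node *n)
{
    int sum = 0;
    while (n) {
        if (n->next)                              /* avoid prefetching NULL */
            _mm_prefetch((const char *)n->next, _MM_HINT_T0);
        sum += n->payload;
        n = n->next;
    }
    return sum;
}
```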
