Prefetching data at L1 and L2


Problem description


In Agner Fog's manual Optimizing software in C++, section 9.10 "Cache contentions in large data structures", he describes a problem transposing a matrix when the matrix width is equal to something called the critical stride (the cache size divided by its number of ways; addresses that lie a multiple of the critical stride apart map to the same cache set). In his test, the cost for a matrix in L1 is 40% greater when the width is equal to the critical stride. If the matrix is even larger and only fits in L2, the cost is 600%! This is summed up nicely in Table 9.1 in his text. This is essentially the same thing observed at Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

Later he writes:

The reason why this effect is so much stronger for level-2 cache contentions than for level-1 cache contentions is that the level-2 cache cannot prefetch more than one line at a time.

So my questions are related to prefetching data.

From his comment I infer that L1 can prefetch more than one cache line at a time. How many can it prefetch?

From what I understand, trying to write code to prefetch data (e.g. with _mm_prefetch) is rarely ever helpful. The only example I have read of is Prefetching Examples?, and it gives only an O(10%) improvement (on some machines). Agner later explains this:

The reason is that modern processors prefetch data automatically thanks to out-of-order execution and advanced prediction mechanisms. Modern microprocessors are able to automatically prefetch data for regular access patterns containing multiple streams with different strides. Therefore, you don't have to prefetch data explicitly if data access can be arranged in regular patterns with fixed strides.

So how does the CPU decide which data to prefetch and are there ways to help the CPU make better choices for the prefetching (e.g. "regular patterns with fixed strides")?
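For reference, explicit prefetching typically looks like the sketch below. This is my own minimal illustration (not from Agner's text); PREFETCH_DIST is a hypothetical tuning constant that would have to be calibrated per machine, and for a simple sequential loop like this the hardware prefetchers usually make the hint redundant anyway:

#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

// Hypothetical prefetch distance, in elements; needs per-machine tuning.
constexpr std::size_t PREFETCH_DIST = 16;

double sum_with_prefetch(const double* data, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DIST < n) {
            // Request data[i + PREFETCH_DIST] into all cache levels (hint T0).
            _mm_prefetch(reinterpret_cast<const char*>(data + i + PREFETCH_DIST),
                         _MM_HINT_T0);
        }
        sum += data[i];
    }
    return sum;
}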

Edit: Based on a comment by Leeor, let me add to my questions and make them more interesting. Why does the critical stride have so much more of an effect on L2 compared to L1?

Edit: I tried to reproduce Agner Fog's table using the code at Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513? I ran this with MSVC2013 in 64-bit release mode on a Xeon E5 1620 (Ivy Bridge), which has a 32 KB 8-way L1, a 256 KB 8-way L2, and a 10 MB 20-way L3 (so the critical strides are 32 KB / 8 = 4 KB for L1 and 256 KB / 8 = 32 KB for L2). The maximum matrix size is about 90x90 for L1, 256x256 for L2, and about 1619x1619 for L3.

Matrix Size  Average Time
64x64        0.004251 0.004472 0.004412 (three times)
65x65        0.004422 0.004442 0.004632 (three times)
128x128      0.0409
129x129      0.0169
256x256      0.219   //max L2 matrix size
257x257      0.0692
512x512      2.701
513x513      0.649
1024x1024    12.8
1025x1025    10.1

I'm not seeing any performance loss in L1; however, L2 clearly has the critical stride problem, and maybe L3 does as well. I'm not sure yet why L1 does not show the problem. It's possible some other source of background overhead is dominating the L1 times.
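For context, the kernel being timed is essentially a naive out-of-place transpose like the sketch below (my reconstruction for illustration; the element type and loop order in the linked question may differ). The writes to dst are w elements apart, so when w is a power of two the write addresses share their low bits and map to only a few cache sets, producing conflict misses:

#include <vector>
#include <cstddef>

// Naive transpose: dst = transpose(src), both stored row-major as w*w elements.
// The reads of src are sequential; the writes to dst have a fixed stride of
// w elements, which is the access pattern that aliases to the same cache sets
// when w * sizeof(float) lines up with the critical stride.
void transpose(const std::vector<float>& src, std::vector<float>& dst,
               std::size_t w) {
    for (std::size_t i = 0; i < w; ++i)
        for (std::size_t j = 0; j < w; ++j)
            dst[j * w + i] = src[i * w + j];
}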

Solution

This statement:

the level-2 cache cannot prefetch more than one line at a time.

is incorrect

In fact, the L2 prefetchers are often stronger and more aggressive than the L1 prefetchers. It depends on the actual machine you use, but Intel's L2 prefetcher, for example, can trigger two prefetches for each request, while the L1 is usually more limited (several types of prefetches can coexist in the L1, but they're likely to be competing for a more limited bandwidth than the L2 has at its disposal, so there will probably be fewer prefetches coming out of the L1).

The Intel optimization guide, in Section 2.3.5.4 (Data Prefetching), lists the following prefetcher types:

Two hardware prefetchers load data to the L1 DCache:
- Data cache unit (DCU) prefetcher: This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
- Instruction pointer (IP)-based stride prefetcher: This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to 2K bytes.

Data prefetch to the L2 and last level cache:
- Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
- Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.

And a bit further ahead:

... The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.

Of the above, only the IP-based prefetcher can handle strides greater than one cache line (the streaming ones can deal with anything that uses consecutive cache lines, meaning up to a 64-byte stride, or actually up to 128 bytes if you don't mind some extra lines). To use it, make sure that loads/stores at a given instruction address perform strided accesses - that's usually already the case in loops going over arrays. Compiler loop unrolling may split such a loop into multiple streams with different, larger strides - that would work even better (the lookahead would be larger), unless you exceed the number of outstanding tracked IPs - again, this depends on the exact implementation.
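To make that concrete, here is a small sketch of my own (the STRIDE value is an assumption, not from the answer above): the single load instruction in the loop touches addresses a fixed STRIDE * sizeof(float) = 256 bytes apart, a pattern the IP-based stride prefetcher can lock onto because the stride is under its 2K-byte detection limit:

#include <cstddef>

// Sum one element out of every STRIDE. The load in this loop has a fixed
// stride of 256 bytes (64 floats), well under the 2 KB the IP-based stride
// prefetcher can detect, so the accesses are prefetcher-friendly.
constexpr std::size_t STRIDE = 64;  // illustrative value, in elements

float strided_sum(const float* data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; i += STRIDE)
        sum += data[i];
    return sum;
}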

However, if your access pattern does consist of consecutive lines, the L2 streamer is much more efficient than the L1 since it runs ahead faster.
