prefetching data at L1 and L2


Question


In Agner Fog's manual Optimizing software in C++, in section 9.10 "Cache contentions in large data structures", he describes a problem transposing a matrix when the matrix width is equal to something called the critical stride. In his test the cost for a matrix in L1 is 40% greater when the width is equal to the critical stride. If the matrix is even larger and only fits in L2 the cost is 600%! This is summed up nicely in Table 9.1 in his text. This is essentially the same thing observed at Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

Later he writes:




The reason why this effect is so much stronger for level-2 cache contentions than for level-1 cache contentions is that the level-2 cache cannot prefetch more than one line at a time.

So my questions are related to prefetching data.


From his comment I infer that L1 can prefetch more than one cache line at a time. How many can it prefetch?


From what I understand, trying to write code to prefetch data (e.g. with _mm_prefetch) is rarely ever helpful. The only example I have read of is Prefetching Examples? and it gives only an O(10%) improvement (on some machines). Agner later explains this:



The reason is that modern processors prefetch data automatically thanks to out-of-order execution and advanced prediction mechanisms. Modern microprocessors are able to automatically prefetch data for regular access patterns containing multiple streams with different strides. Therefore, you don't have to prefetch data explicitly if data access can be arranged in regular patterns with fixed strides.

So how does the CPU determine which data to prefetch, and are there ways to help the CPU make better choices for prefetching (e.g. regular patterns with fixed strides)?

Edit: Based on a comment by Leeor, let me add to my questions and make them more interesting. Why does the critical stride have so much more of an effect on L2 compared to L1?

Edit: I tried to reproduce Agner Fog's table using the code at Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513? I ran this with MSVC2013 64-bit release mode on a Xeon E5 1620 (Ivy Bridge), which has a 32 KB 8-way L1, a 256 KB 8-way L2, and a 10 MB 20-way L3. The max matrix size is about 90x90 for L1, 256x256 for L2, and 1619x1619 for L3.

Matrix Size  Average Time (seconds)
64x64        0.004251 0.004472 0.004412 (three runs)
65x65        0.004422 0.004442 0.004632 (three runs)
128x128      0.0409
129x129      0.0169
256x256      0.219   //max L2 matrix size
257x257      0.0692
512x512      2.701
513x513      0.649
1024x1024    12.8
1025x1025    10.1


I'm not seeing any performance loss in L1; however, L2 clearly has the critical stride problem, and maybe L3 does too. I'm not sure yet why L1 does not show a problem. It's possible some other source of background overhead dominates the L1 times.

Answer

This statement:




the level-2 cache cannot prefetch more than one line at a time.


is incorrect.


In fact, the L2 prefetchers are often stronger and more aggressive than the L1 prefetchers. It depends on the actual machine you use, but Intel's L2 prefetcher, for example, can trigger 2 prefetches for each request, while the L1 is usually more limited (several types of prefetchers can coexist in the L1, but they're likely competing for more limited bandwidth than the L2 has at its disposal), so there will probably be fewer prefetches coming out of the L1.


The optimization guide, in Section 2.3.5.4 (Data Prefetching), lists the following prefetcher types:

Two hardware prefetchers load data to the L1 DCache:
- Data cache unit (DCU) prefetcher: This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
- Instruction pointer (IP)-based stride prefetcher: This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to 2K bytes.

 Data Prefetch to the L2 and Last Level Cache:
 - Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to  the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
 - Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page. 

Further on:

... The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.


Of the above, only the IP-based prefetcher can handle strides greater than one cache line (the streaming ones can deal with anything that uses consecutive cache lines, meaning up to a 64-byte stride, or actually up to 128 bytes if you don't mind some extra lines). To use it, make sure that loads/stores at a given instruction address perform strided accesses - that's usually already the case in loops going over arrays. Compiler loop unrolling may split that into multiple streams with larger strides - that would work even better (the lookahead would be larger), unless you exceed the number of outstanding tracked IPs - again, that depends on the exact implementation.


However, if your access pattern does consist of consecutive lines, the L2 streamer is much more efficient than the L1 since it runs ahead faster.

