When should we use prefetch?


Problem description

Some CPUs and compilers supply prefetch instructions, e.g. __builtin_prefetch in GCC. Although there is a comment in GCC's documentation, it's too short for me.

I want to know: in practice, when should we use prefetch? Are there some examples? Thanks!

Answer

This question isn't really about compilers, as they're just providing some hooks to insert prefetch instructions into your assembly code / binary. Different compilers may provide different intrinsic formats, but you can ignore all of these and (carefully) add the instruction directly in assembly code.

Now the real question seems to be "when are prefetches useful?", and the answer is: in any scenario where you're bound on memory latency and the access pattern isn't regular and distinguishable enough for the HW prefetcher to capture (organized in a stream or in strides), or when you suspect there are too many different streams for the HW to track simultaneously.
Most compilers will only very seldom insert their own prefetches for you, so it's basically up to you to play with your code and benchmark how prefetches could be useful.

The link by @Mysticial shows a nice example, but here's a more straightforward one, which I think can't be caught by the HW:

#include <stdio.h>
#include <sys/timeb.h>
#include <emmintrin.h>

#define N 4096
#define REP 200
#define ELEM int

int main() {
    int i, j, k, b;
    const int blksize = 64 / sizeof(ELEM);  /* elements per 64-byte cache line */
    /* static: a 64 MB array would overflow the default stack */
    static ELEM __attribute__ ((aligned(4096))) a[N][N];
    for (i = 0; i < N; ++i) {
        for (j = 0; j < N; ++j) {
            a[i][j] = 1;
        }
    }
    unsigned long long int sum = 0;
    struct timeb start, end;
    unsigned long long delta;

    ftime(&start);
    for (k = 0; k < REP; ++k) {
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; j++) {
                sum += a[i][j];
            }
        }
    }
    ftime(&end);
    delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
    printf("Prefetching off: N=%d, sum=%llu, time=%llu\n", N, sum, delta);

    ftime(&start);
    sum = 0;
    for (k = 0; k < REP; ++k) {
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; j += blksize) {
                for (b = 0; b < blksize; ++b) {
                    sum += a[i][j+b];
                }
                /* fetch the same column of the next row, which lives in
                   a different 4 KB page (each row is 16 KB) */
                _mm_prefetch((const char *)&a[i+1][j], _MM_HINT_T2);
            }
        }
    }
    ftime(&end);
    delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
    printf("Prefetching on:  N=%d, sum=%llu, time=%llu\n", N, sum, delta);
    return 0;
}

What I do here is traverse each matrix line (enjoying the HW prefetcher's help with the consecutive accesses), but prefetch ahead the element with the same column index from the next line, which resides in a different page (and which the HW prefetcher should be hard pressed to catch). I sum the data just so it's not optimized away; the important thing is that I basically just loop over a matrix, which should have been pretty straightforward and easy to detect, and yet I still get a speedup.

Built with gcc 4.8.1 and -O3, it gives me an almost 20% boost on an Intel Xeon X5670:

Prefetching off: N=4096, sum=3355443200, time=1839
Prefetching on:  N=4096, sum=3355443200, time=1502

Note that the speedup is received even though I made the control flow more complicated (an extra loop nesting level); the branch predictor should easily catch the pattern of that short block-size loop, and it saves the execution of unneeded prefetches.

Note that Ivy Bridge and onward should have a "next-page prefetcher", so the HW may be able to mitigate this on those CPUs (if anyone has one available and cares to try, I'll be happy to know). In that case I'd modify the benchmark to sum every second line (with the prefetch looking ahead two lines each time), which should thoroughly confuse the HW prefetchers.

Skylake results

Here are some results from a Skylake i7-6700HQ, running at 2.6 GHz (no turbo) with gcc:

Compile flags: -O3 -march=native

Prefetching off: N=4096, sum=28147495993344000, time=896
Prefetching on:  N=4096, sum=28147495993344000, time=1222
Prefetching off: N=4096, sum=28147495993344000, time=886
Prefetching on:  N=4096, sum=28147495993344000, time=1291
Prefetching off: N=4096, sum=28147495993344000, time=890
Prefetching on:  N=4096, sum=28147495993344000, time=1234
Prefetching off: N=4096, sum=28147495993344000, time=848
Prefetching on:  N=4096, sum=28147495993344000, time=1220
Prefetching off: N=4096, sum=28147495993344000, time=852
Prefetching on:  N=4096, sum=28147495993344000, time=1253

Compile flags: -O2 -march=native

Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on:  N=4096, sum=28147495993344000, time=1813
Prefetching off: N=4096, sum=28147495993344000, time=1956
Prefetching on:  N=4096, sum=28147495993344000, time=1814
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on:  N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1961
Prefetching on:  N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1965
Prefetching on:  N=4096, sum=28147495993344000, time=1814

So using prefetch is either about 40% slower or 8% faster, depending on whether you use -O3 or -O2 respectively, for this particular example. The big slowdown at -O3 is actually due to a code-generation quirk: at -O3 the loop without prefetch is vectorized, but the extra complexity of the prefetch-variant loop prevents vectorization, on my version of gcc anyway.

So the -O2 results are probably more apples-to-apples, and the benefit is about half (8% speedup vs 16%) of what we saw on Leeor's Westmere. It's still worth noting that you have to be careful not to change the code generation such that you get a big slowdown.

This test probably isn't ideal in that going int by int implies a lot of CPU overhead rather than stressing the memory subsystem (which is why vectorization helped so much).
