Why does GCC __builtin_prefetch not improve performance?

Question

I'm writing a program to analyze a social network graph, which means the program needs a lot of random memory accesses. It seems to me that prefetching should help. Here is a small piece of the code that reads values from the neighbors of a vertex.

for (size_t i = 0; i < v.get_num_edges(); i++) {
    unsigned int id = v.neighbors[i];
    res += neigh_vals[id];
}

I transformed the code above into the version below, which prefetches the values of a vertex's neighbors.

int *neigh_vals = new int[num_vertices];

for (size_t i = 0; i < v.get_num_edges(); i += 128) {
    size_t this_end = std::min(v.get_num_edges(), i + 128);
    for (size_t j = i; j < this_end; j++) {
        unsigned int id = v.neighbors[j];
        __builtin_prefetch(&neigh_vals[id], 0, 2);
    }
    for (size_t j = i; j < this_end; j++) {
        unsigned int id = v.neighbors[j];
        res += neigh_vals[id];
    }
}

In this C++ code, I didn't override any operators.

Unfortunately, the code doesn't really improve the performance. I wonder why. Apparently, hardware prefetch doesn't work in this case because the hardware can't predict the memory location.

I wonder if it's caused by GCC optimization. When I compile the code, I enable -O3. I really hope prefetch can further improve performance even when -O3 is enabled. Does -O3 optimization fuse the two loops in this case? Can -O3 enable prefetch in this case by default?

I use GCC version 4.6.3, and the program runs on an Intel Xeon E5-4620.

Thanks,

Recommended answer

Yes, some recent versions of GCC (e.g. 4.9, as of March 2015) are able to emit PREFETCH instructions on their own when optimizing with -O3, even without any explicit __builtin_prefetch.

We don't know what get_neighbor is doing, or what the types of v and neigh_vals are.

And prefetching is not always profitable. Adding an explicit __builtin_prefetch can slow down your code. You need to measure.

As Retired Ninja commented, prefetching in one loop and hoping data would be cached in the following loop (further down in your source code) is wrong.

You could try this instead:

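for (size_t i = 0; i < v.get_num_edges(); i++) {
  fg::vertex_id_t id = v.get_neighbor(i);
  // Prefetch the value needed 4 iterations from now. __builtin_prefetch
  // takes an address, and get_neighbor must not be called past the last edge.
  if (i + 4 < v.get_num_edges())
    __builtin_prefetch(&neigh_vals[v.get_neighbor(i + 4)]);
  res += neigh_vals[id];
}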

You could replace the 4 with whichever constant turns out to be best empirically.
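One way to search for that constant is to make the prefetch distance a parameter and time the loop for several values. The following is a minimal sketch under assumptions: the neighbor list and values are plain std::vector objects rather than the question's vertex type, and sum_plain, sum_with_prefetch and seconds are hypothetical helper names introduced only for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Baseline: indirect reads with no software prefetch.
long sum_plain(const std::vector<unsigned>& neighbors,
               const std::vector<int>& neigh_vals) {
    long res = 0;
    for (std::size_t i = 0; i < neighbors.size(); i++)
        res += neigh_vals[neighbors[i]];
    return res;
}

// Same loop, but prefetch the value needed `dist` iterations ahead.
long sum_with_prefetch(const std::vector<unsigned>& neighbors,
                       const std::vector<int>& neigh_vals,
                       std::size_t dist) {
    long res = 0;
    const std::size_t n = neighbors.size();
    for (std::size_t i = 0; i < n; i++) {
        if (i + dist < n) {
            // Second argument 0 = prefetch for reading,
            // third argument 2 = moderate temporal locality.
            __builtin_prefetch(&neigh_vals[neighbors[i + dist]], 0, 2);
        }
        res += neigh_vals[neighbors[i]];
    }
    return res;
}

// Wall-clock time of one call to f, in seconds.
template <class F>
double seconds(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return dt.count();
}
```

To use it, call seconds() around sum_with_prefetch for distances such as 1, 2, 4, 8, 16, 32 and 64, compare against sum_plain on the same data, and keep the returned sums (for example by printing them) so the compiler cannot discard the loops. The data set should be much larger than the last-level cache, otherwise every variant will look the same.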

But I suspect that the __builtin_prefetch above is useless, since the compiler is probably able to add it by itself, and it could even hurt. Calling v.get_neighbor(i+4) past the last edge is undefined and could crash the program, so the index must stay in range. Prefetching an address outside your address space, by contrast, will not fault, though it can slow your program down. Please benchmark.

See this answer to a related question.

Notice that in C++ both [] and get_neighbor could be overloaded and become very complex operations, so we cannot guess!

And there are cases where the hardware is what limits performance, whatever __builtin_prefetch you add (and adding them could hurt performance).

BTW, you could pass -O3 -mtune=native -fdump-tree-ssa -S -fverbose-asm to understand more of what the compiler is doing (then look inside the generated dump files and assembler files). Also, it does happen that -O3 produces slightly slower code than -O2.
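To make that inspection easier, the hot loop can be isolated in a tiny translation unit of its own. This is only a sketch: the file name prefetch_probe.cpp and the function sum_neighbors are hypothetical, and raw pointers stand in for the question's vertex type.

```cpp
// prefetch_probe.cpp: a minimal translation unit for inspecting GCC's output.
//
// Compile to assembly only, then search the result for prefetch instructions:
//   g++ -O3 -mtune=native -S -fverbose-asm prefetch_probe.cpp
//   grep -i prefetch prefetch_probe.s
// Adding -fdump-tree-ssa also writes the SSA dump file next to the output.
#include <cstddef>

// The indirect-access loop from the question, reduced to raw pointers.
long sum_neighbors(const unsigned* neighbors, std::size_t n,
                   const int* neigh_vals) {
    long res = 0;
    for (std::size_t i = 0; i < n; i++)
        res += neigh_vals[neighbors[i]];
    return res;
}
```

If the grep prints nothing, the compiler did not insert any prefetch for this loop with those flags. Repeating the experiment with an explicit __builtin_prefetch in the loop shows which instruction it turns into on your target.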

You could consider explicit multithreading, OpenMP, or OpenCL if you have time to waste on optimization. Remember that premature optimization is evil. Did you benchmark? Did you profile your entire application?
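For the OpenMP route, the neighbor sum maps onto a parallel reduction. The sketch below is an assumption about how that might look, not the asker's code: sum_neighbors_parallel is a hypothetical name and std::vector replaces the vertex type. Whether it pays off for a single vertex depends on how many edges it has, so it needs the same benchmarking as everything else.

```cpp
#include <cstddef>
#include <vector>

// Build with g++ -O3 -fopenmp. Without -fopenmp the pragma is ignored
// and the loop simply runs serially with the same result.
long sum_neighbors_parallel(const std::vector<unsigned>& neighbors,
                            const std::vector<int>& neigh_vals) {
    long res = 0;
    const long n = static_cast<long>(neighbors.size());
    // Each thread accumulates a private partial sum; OpenMP combines
    // the partial sums into res when the loop finishes.
    #pragma omp parallel for reduction(+ : res)
    for (long i = 0; i < n; i++)
        res += neigh_vals[neighbors[i]];
    return res;
}
```

With several threads issuing independent cache misses at once, more memory requests are in flight than a single thread can sustain, which is the same effect software prefetching tries to achieve.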
