__builtin_prefetch, How much does it read?


Problem description


I'm trying to optimize some RK4 C++ code, compiled with GCC, by using __builtin_prefetch.

I'm having trouble figuring out how to prefetch a whole class. I don't understand how much memory is read starting at the const void *addr argument. What I want is to have the next values of from and to already loaded when the next iteration needs them.

#define likely(x)       __builtin_expect((x),1)

for (int i = from; i < to; i++)
{
    double kv = myLinks[i].kv;
    // Note: these two pointers shadow the outer loop bounds of the same names.
    particle* from = con[i].Pfrom;
    particle* to = con[i].Pto;
    // Want to prefetch the values at con[i+1].Pfrom and con[i+1].Pto here
    double pos = to->px - from->px;
    double delta = from->r + to->r - pos;
    double k1 = axcel(kv, delta, from->mass) * dt; // axcel is an inlined function
    double k2 = axcel(kv, delta + 0.5 * k1, from->mass) * dt;
    double k3 = axcel(kv, delta + 0.5 * k2, from->mass) * dt;
    double k4 = axcel(kv, delta + k3, from->mass) * dt;
    if (likely(!from->bc))
    {
        from->x += ((k1 + 2 * k2 + 2 * k3 + k4) / 6);
    }
}

Link: http://www.ibm.com/developerworks/linux/library/l-gcc-hacks/

Solution

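I think it just emits a single prefetch machine instruction (on x86, one of the PREFETCHh family), which basically fetches one cache line, whose size is processor-specific (commonly 64 bytes on current x86-64 processors). So a single call does not load a whole object if that object spans several cache lines.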
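To make the "one call, one cache line" point concrete, here is a minimal sketch of covering a whole object with prefetch hints. The 64-byte line size, the helper names lines_covering and prefetch_object, and the layout of this particle struct are all assumptions made for illustration; they are not taken from the question's code.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines (common on x86-64, not guaranteed elsewhere).
constexpr std::size_t kCacheLine = 64;

// Hypothetical stand-in for the question's particle class; the extra array
// only exists to make the object larger than one cache line.
struct particle {
    double px, x, r, mass;
    bool bc;
    double history[16];
};

// Number of cache lines needed to hold `bytes` bytes that start on a line boundary.
constexpr std::size_t lines_covering(std::size_t bytes) {
    return (bytes + kCacheLine - 1) / kCacheLine;
}

// Issue one prefetch hint per cache line that the object can touch.
template <typename T>
inline void prefetch_object(const T* p) {
    assert(p != nullptr);
    const char* base = reinterpret_cast<const char*>(p);
    for (std::size_t off = 0; off < sizeof(T); off += kCacheLine)
        __builtin_prefetch(base + off);
    // An object that does not start on a line boundary can spill into one
    // more line, so also hint the line holding its last byte.
    __builtin_prefetch(base + sizeof(T) - 1);
}
```

Calling prefetch_object(con[i + 1].Pfrom) would then request every line of that particle rather than only the first. If the loop reads just a few adjacent fields, a single __builtin_prefetch on the first of them is probably enough and cheaper.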

You could, for instance, use __builtin_prefetch (con[i+3].Pfrom). In my (limited) experience, in such a loop it is better to prefetch several elements ahead rather than only the next one, so that the data has time to arrive before it is needed.
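A rough sketch of prefetching a few elements ahead is shown here on a simplified loop. The connection struct, the constant kAhead and the function sum_overlap are hypothetical names used only for this example, and the loop body is reduced to the delta computation from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified versions of the question's types.
struct particle { double px, x, r, mass; bool bc; };
struct connection { particle* Pfrom; particle* Pto; };

// Prefetch distance in iterations; 3 follows the answer's example and
// should be tuned by measurement.
constexpr std::size_t kAhead = 3;

// Sums from->r + to->r - (to->px - from->px) over all connections.
double sum_overlap(const std::vector<connection>& con) {
    double total = 0.0;
    const std::size_t n = con.size();
    for (std::size_t i = 0; i < n; ++i) {
        // The hint itself never faults, but reading con[i + kAhead] is an
        // ordinary array access, so it must stay within bounds.
        if (i + kAhead < n) {
            __builtin_prefetch(con[i + kAhead].Pfrom);
            __builtin_prefetch(con[i + kAhead].Pto);
        }
        const particle* from = con[i].Pfrom;
        const particle* to = con[i].Pto;
        assert(from != nullptr && to != nullptr);
        total += from->r + to->r - (to->px - from->px);
    }
    return total;
}
```

The con array itself is walked sequentially, which hardware prefetchers usually handle well; the hints target the particles behind the pointers, whose addresses the hardware cannot easily predict.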

Don't use __builtin_prefetch too often (i.e. don't put a lot of calls inside one loop). If you do add them, measure the performance gain, and compile with GCC optimization enabled (at least -O2). If you are very lucky, manual __builtin_prefetch could improve the performance of your loop by 10 or 20% (but it could also hurt it).
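One possible way to do that measurement is sketched below: the same gather loop is compiled with and without a prefetch hint and timed with std::chrono. The names gather_sum, seconds_taken and shuffled_indices are invented for this example, and the random-index workload is only a stand-in for pointer chasing through particles.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Sums values[idx[i]] in the order given by idx. Every entry of idx must be
// a valid index into values. Ahead == 0 compiles the loop without any hint.
template <std::size_t Ahead>
double gather_sum(const std::vector<double>& values,
                  const std::vector<std::size_t>& idx) {
    double total = 0.0;
    const std::size_t n = idx.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (Ahead != 0 && i + Ahead < n)
            __builtin_prefetch(&values[idx[i + Ahead]]);
        total += values[idx[i]];
    }
    return total;
}

// Wall-clock seconds spent in one call of f.
template <typename F>
double seconds_taken(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// The indices 0..n-1 in a reproducible pseudo-random order.
std::vector<std::size_t> shuffled_indices(std::size_t n, unsigned seed) {
    assert(n > 0);
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(idx.begin(), idx.end(), rng);
    return idx;
}
```

To compare, time gather_sum<0> and, say, gather_sum<8> on an array much larger than the last-level cache, repeat each run several times, and use the returned sums (for example by printing them) so the optimizer cannot drop the loops. Any difference depends on the processor and on the compiler flags.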

If such a loop is crucial to you, you might consider running it on a GPU with OpenCL or CUDA (but that requires recoding some routines in OpenCL or CUDA, and tuning them to your particular hardware).

Also use a recent GCC compiler (the latest release at the time of writing is 4.6.2), because GCC is making a lot of progress in these areas.
