Using OpenMP stops GCC auto vectorising

Question

I have been working on making my code auto-vectorisable by GCC. However, when I include the -fopenmp flag, it seems to stop all attempts at auto-vectorisation. I am using -ftree-vectorize -ftree-vectorizer-verbose=5 to vectorise and monitor it.

If I do not include the flag, GCC gives me a lot of information about each loop: whether it was vectorised and, if not, why not. The build then stops when I try to use the omp_get_wtime() function, since it cannot be linked. Once the flag is included, GCC simply lists every function and tells me it vectorised 0 loops in it.

I have read a few other places where the issue is mentioned, but they do not really come to any solution: http://software.intel.com/en-us/forums/topic/295858 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032. Does OpenMP have its own way of handling vectorisation? Do I need to explicitly tell it to vectorise?

Recommended answer

There is a shortcoming in the GCC vectoriser which appears to have been resolved in recent GCC versions. In my test case, GCC 4.7.2 successfully vectorises the following simple loop:

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
   a[i] = b[i] + c[i] * d;

At the same time, GCC 4.6.1 does not, and it complains that the loop contains function calls or data references that cannot be analysed. The bug in the vectoriser is triggered by the way GCC implements parallel for loops. When the OpenMP constructs are processed and expanded, the simple loop code is transformed into something akin to this:

struct omp_fn_0_s
{
    int N;
    double *a;
    double *b;
    double *c;
    double d;
};

void omp_fn_0(struct omp_fn_0_s *data)
{
    int start, end;
    int nthreads = omp_get_num_threads();
    int threadid = omp_get_thread_num();

    // This is just to illustrate the case - GCC uses a bit different formulas
    start = (data->N * threadid) / nthreads;
    end = (data->N * (threadid+1)) / nthreads;

    for (int i = start; i < end; i++)
       data->a[i] = data->b[i] + data->c[i] * data->d;
}

...

struct omp_fn_0_s omp_data_o;

omp_data_o.N = N;
omp_data_o.a = a;
omp_data_o.b = b;
omp_data_o.c = c;
omp_data_o.d = d;

GOMP_parallel_start(omp_fn_0, &omp_data_o, 0);
omp_fn_0(&omp_data_o);
GOMP_parallel_end();

N = omp_data_o.N;
a = omp_data_o.a;
b = omp_data_o.b;
c = omp_data_o.c;
d = omp_data_o.d;
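
To see that the chunk bounds in the illustration above tile the iteration space, the outlined function can be mimicked without any OpenMP runtime. The following is only a sketch under stated assumptions: omp_fn_0_sim and run_all_chunks are hypothetical names, the thread id and thread count are passed in explicitly instead of being queried from the runtime, and the chunk formulas are the illustrative ones from above, not the ones GCC actually emits.

```c
/* Hypothetical stand-in for the compiler-generated argument block. */
struct omp_fn_0_s
{
    int N;
    double *a;
    double *b;
    double *c;
    double d;
};

/* Same loop body as the outlined function, but with the thread id and the
   thread count passed in explicitly (an assumption made for this sketch)
   so that it runs without an OpenMP runtime. */
void omp_fn_0_sim(struct omp_fn_0_s *data, int threadid, int nthreads)
{
    /* Illustrative chunk bounds only - GCC uses somewhat different formulas */
    int start = (data->N * threadid) / nthreads;
    int end = (data->N * (threadid + 1)) / nthreads;
    int i;

    for (i = start; i < end; i++)
        data->a[i] = data->b[i] + data->c[i] * data->d;
}

/* Plays the role of the whole thread team by running each chunk in turn. */
void run_all_chunks(struct omp_fn_0_s *data, int nthreads)
{
    int t;

    for (t = 0; t < nthreads; t++)
        omp_fn_0_sim(data, t, nthreads);
}
```

Because the end of chunk t equals the start of chunk t+1, the first chunk starts at 0 and the last one ends at N, every element is written exactly once. Note that every access in the loop still goes through the data pointer, which is the pattern the vectoriser has trouble with.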

The vectoriser in GCC before 4.7 fails to vectorise that loop. This is NOT an OpenMP-specific problem. One can easily reproduce it with no OpenMP code at all. To confirm this, I wrote the following simple test:

struct fun_s
{
   double *restrict a;
   double *restrict b;
   double *restrict c;
   double d;
   int n;
};

void fun1(double *restrict a,
          double *restrict b,
          double *restrict c,
          double d,
          int n)
{
   int i;
   for (i = 0; i < n; i++)
      a[i] = b[i] + c[i] * d;
}

void fun2(struct fun_s *par)
{
   int i;
   for (i = 0; i < par->n; i++)
      par->a[i] = par->b[i] + par->c[i] * par->d;
}

One would expect both versions (notice - no OpenMP here!) to vectorise equally well because of the restrict keywords used to specify that no aliasing can happen. Unfortunately, this is not the case with GCC < 4.7 - it successfully vectorises the loop in fun1 but fails to vectorise the one in fun2, citing the same reason as when it compiles the OpenMP code.
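
For anyone who wants to repeat the comparison, a small driver along the following lines may help. It is a sketch only: fun_results_match is a hypothetical helper that is not part of the original test, and the input values are chosen so that every result is exactly representable. Compiling it with the vectoriser flags from the question (-ftree-vectorize -ftree-vectorizer-verbose=5) should make the compiler report on the loops in fun1 and fun2 separately.

```c
#include <stdlib.h>

struct fun_s
{
   double *restrict a;
   double *restrict b;
   double *restrict c;
   double d;
   int n;
};

void fun1(double *restrict a, double *restrict b, double *restrict c,
          double d, int n)
{
   int i;
   for (i = 0; i < n; i++)
      a[i] = b[i] + c[i] * d;
}

void fun2(struct fun_s *par)
{
   int i;
   for (i = 0; i < par->n; i++)
      par->a[i] = par->b[i] + par->c[i] * par->d;
}

/* Hypothetical helper: returns 1 if fun1 and fun2 produce identical
   results for n elements, 0 if they differ, -1 if allocation fails. */
int fun_results_match(int n)
{
   double *a1 = malloc((size_t)n * sizeof *a1);
   double *a2 = malloc((size_t)n * sizeof *a2);
   double *b = malloc((size_t)n * sizeof *b);
   double *c = malloc((size_t)n * sizeof *c);
   int i, ok = -1;

   if (a1 && a2 && b && c) {
      struct fun_s par;

      for (i = 0; i < n; i++) {
         b[i] = i;
         c[i] = i;
      }
      fun1(a1, b, c, 0.5, n);

      par.a = a2; par.b = b; par.c = c; par.d = 0.5; par.n = n;
      fun2(&par);

      ok = 1;
      for (i = 0; i < n; i++)
         if (a1[i] != a2[i])
            ok = 0;
   }
   free(a1); free(a2); free(b); free(c);
   return ok;
}
```

Both functions compute the same values; the only thing that differs between them is what the vectoriser can prove about d.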

The reason for this is that the vectoriser is unable to prove that par->d does not lie within the memory that par->a, par->b, and par->c point to. With fun1 the same problem does not always arise, because two cases are possible:


  • d is passed as a value argument in a register;
  • d is passed as a value argument on the stack.

On x64 systems, the System V ABI mandates that the first several floating-point arguments are passed in the XMM registers (YMM on AVX-enabled CPUs). That is how d gets passed in this case, hence no pointer can ever point to it and the loop gets vectorised. On x86 systems, the ABI mandates that arguments are passed on the stack, hence d might be aliased by any of the three pointers. Indeed, GCC refuses to vectorise the loop in fun1 if instructed to generate 32-bit x86 code with the -m32 option.

GCC 4.7 gets around this by inserting run-time checks which ensure that neither d nor par->d is aliased.
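
Conceptually, such a run-time check amounts to loop versioning. The hand-written sketch below illustrates the idea and is not the code GCC generates: fun_plain_s, overlaps and fun2_versioned are hypothetical names, the struct deliberately omits restrict so that the aliased case is well defined, and the address comparison through uintptr_t assumes a flat address space, as on common platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of fun_s without restrict qualifiers. */
struct fun_plain_s
{
   double *a;
   double *b;
   double *c;
   double d;
   int n;
};

/* Returns 1 if the size bytes at p overlap the n doubles starting at
   base, 0 otherwise. */
int overlaps(const void *p, size_t size, const double *base, int n)
{
   uintptr_t lo, hi, q;

   if (n <= 0)
      return 0;
   lo = (uintptr_t)base;
   hi = lo + (uintptr_t)n * sizeof(double);
   q = (uintptr_t)p;
   return q < hi && q + size > lo;
}

void fun2_versioned(struct fun_plain_s *par)
{
   int i;

   if (!overlaps(&par->d, sizeof par->d, par->a, par->n)) {
      /* Stores to a[] cannot change par->d, so it is loaded once;
         this version of the loop is the one that can be vectorised. */
      const double d = par->d;
      for (i = 0; i < par->n; i++)
         par->a[i] = par->b[i] + par->c[i] * d;
   } else {
      /* Fallback: par->d may change during the loop, so it is
         reloaded in every iteration. */
      for (i = 0; i < par->n; i++)
         par->a[i] = par->b[i] + par->c[i] * par->d;
   }
}
```

Only the stores through a can modify d, which is why the sketch checks that one pointer; b and c are only read in the loop.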

Getting rid of d removes the unprovable non-aliasing, and the following OpenMP code gets vectorised by GCC 4.6.1:

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
   a[i] = b[i] + c[i];
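
When d cannot simply be dropped, one thing that might be worth trying is to give every thread its own copy of d, so that the loop body no longer reads it through the shared data block. The sketch below is not part of the original answer and has not been tested against GCC 4.6.1: scaled_add is a hypothetical wrapper, and firstprivate(d) is a standard OpenMP clause used here only as an assumption about what may help. Whether a given GCC release then vectorises the loop would need to be checked with the vectoriser's verbose output.

```c
/* Hypothetical wrapper around the loop from the answer. */
void scaled_add(double *a, const double *b, const double *c, double d, int N)
{
   int i;

   /* firstprivate(d) gives each thread a private copy of d that is
      initialised from the original value. Without -fopenmp the pragma
      is skipped and the loop runs serially. */
#ifdef _OPENMP
   #pragma omp parallel for schedule(static) firstprivate(d)
#endif
   for (i = 0; i < N; i++)
      a[i] = b[i] + c[i] * d;
}
```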
