GCC Hinting at Vectorization


Problem Description

I would like GCC to vectorize the code below. -fopt-info tells me that GCC is not doing so currently. I believe the problem is the strided access of W, or possibly the backward incrementing of k. Note that height and width are constants, and index_type is currently set to unsigned long.



I removed some comments

114  for (index_type k=height-1;k+1>0;k--) {
116    for (index_type i=0;i<width;i++) {
117      Yp[k*width + i] = 0.0;
119      for (index_type j=0;j<width;j++) {
121        Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
122      }
123      Yp[k*width + i] *= DF(Ap[k*width + i]);
124    }
125  }

I am compiling with gcc -O3 -ffast-math -fopt-info -std=c11 ./neural.c -o neural -lm



Is there a good way to make this vectorize? Can you refer me to further information?

Is my method for indexing a bad idea (i.e. the k*width*width + ...)? I need to dynamically allocate, and I thought that keeping things close in memory would better enable optimizations.



EDIT: This might be useful



The -fopt-info-missed output for these lines:

./neural.c:114:3: note: not vectorized: multiple nested loops.
./neural.c:114:3: note: bad loop form.
./neural.c:116:5: note: not vectorized: control flow in loop.
./neural.c:116:5: note: bad loop form.
./neural.c:119:7: note: step unknown.
./neural.c:119:7: note: reduction used in loop.
./neural.c:119:7: note: Unknown def-use cycle pattern.
./neural.c:119:7: note: not vectorized: complicated access pattern.
./neural.c:119:7: note: bad data access.
./neural.c:110:21: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:58: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:62: note: not vectorized: not enough data-refs in basic block.
./neural.c:117:18: note: not vectorized: not enough data-refs in basic block.
./neural.c:114:37: note: not vectorized: not enough data-refs in basic block.

EDIT:

Minimal example is here: http://pastebin.com/Qwe0XVix



I am trying it with BLAS. In the minimal example it goes faster, but on the whole code it is slower ... not sure why.

EDIT:

The compiler was optimizing out code. Fixed. BLAS is now faster. The fix was in the whole code, not the minimal example.



EDIT:

Same code as in the link from the previous edit

#include <math.h>
#include <cblas.h>
#include <stdlib.h>
#include <stdio.h>

typedef float value_type;
typedef unsigned long index_type;

static value_type F(value_type v) {
  return 1.0/(1.0 + exp(-v));
}

static value_type DF(value_type v) {
  const value_type Ev = exp(-v);
  return Ev/((1.0 + Ev)*(1.0 + Ev));
}

#ifndef WITH_BLAS

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
                   value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
                   const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    for (index_type i=0;i<width;i++) {
      Yp[k*width + i] = 0.0;
      for (index_type j=0;j<width;j++) {
        Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
      }
      Yp[k*width + i] *= DF(Ap[k*width + i]);
    }
  }
}

#else

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
                   value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
                   const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    cblas_sgemv(CblasRowMajor, CblasTrans, width, width, 1,
                W+k*width*width, width, Yp+(k+1)*width, 1, 0, Yp+k*width, 1);
    for (index_type i=0;i<width;i++)
      Yp[k*width + i] *= DF(Ap[k*width + i]);
  }
}

#endif

int main() {
  const index_type height=10, width=10000;

  value_type *Ap=malloc((height+1)*width*sizeof(value_type)),
    *W=malloc(height*width*width*sizeof(value_type)),
    *Yp=malloc((height+1)*width*sizeof(value_type)),
    *Dp=malloc(width*sizeof(value_type));

  get_Yp(Ap, W, Yp, Dp, height, width);
  printf("Done %f\n", Yp[3]);

  return 0;
}
Solution


1. The j-loop is a well-vectorizable SIMD reduction loop with a constant stride of "width" elements. You can vectorize it using modern compilers. This code is vectorizable with the Intel Compiler, and should be vectorizable by GCC under some conditions.

2. First of all, a reduction is a special case of a "vectorizable" true loop-carried dependency. So you can't safely vectorize it unless the "reduction" pattern is (a) auto-recognized by the compiler (not so easy, and strictly speaking not valid/expected behavior), or (b) communicated by the developer to the compiler explicitly, using OpenMP or similar standards.

To "communicate" to the compiler that there is a reduction, you need to use #pragma omp simd reduction(+:variable_name) before the j-loop.

This is only supported starting from OpenMP 4.0, so you have to use a GCC version which supports OpenMP 4.x. Quote from https://gcc.gnu.org/wiki/openmp: "GCC 4.9 supports OpenMP 4.0 for C/C++, GCC 4.9.1 also for Fortran".

I would also use a temporary local variable for accumulating the reduction (OpenMP 4.0 requires the reduction variable to be used that way):

 tmpSUM = 0;
 #pragma omp simd reduction(+:tmpSUM)
 for (index_type j=0;j<width;j++) {
   tmpSUM += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
 }
 Yp[k*width + i] = tmpSUM;

3. I'd also suggest using a signed int instead of unsigned, because unsigned induction variables are pretty bad for all modern vectorizers, at least by introducing extra overhead. I would not be surprised if using unsigned was one of the main things that "confused" GCC.

4. Now, you may not be satisfied with my reply, because it tells how it should work (and how it works in ICC/ICPC). It doesn't take into account the specific nuances of GCC (which for the reduction seemed to do the opposite), as one can see in the GCC optimization report.

So, if you are still limited to GCC, I would suggest:

• Make sure it's a fresh enough GCC (4.9 or higher).
• Use a signed induction variable, and still try omp simd reduction on a temporary local tmpSUM (because it should enable more advanced vectorization techniques anyway).
• If none of the above helps, take a look at "weird" things like those described here (very similar to your case): What do gcc's auto-vectorization messages mean? Or consider using another compiler/compiler version.

Last comment: is the access pattern in your code, and more generally const-stride, so bad? Answer: for some platforms/compilers, const-stride will not kill your performance. But still, ideally you need a more cache-friendly algorithm. Check "Not sure how to explain some of the performance results of my parallelized matrix multiplication code". One more option is considering MKL or BLAS (as suggested by the other reply) if your code is really memory-bound and you don't have time to deal with memory optimizations yourself.

