GCC Hinting at Vectorization


Problem Description

I would like GCC to vectorize the code below. -fopt-info tells me that GCC is not doing so currently. I believe the problem is the strided access of W, or possibly the backward incrementing of k. Note that height and width are constants, and index_type is currently set to unsigned long.



I removed some comments

114  for (index_type k=height-1;k+1>0;k--) {
116    for (index_type i=0;i<width;i++) {
117      Yp[k*width + i] = 0.0;
119      for (index_type j=0;j<width;j++) {
121        Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
122      }
123      Yp[k*width + i] *= DF(Ap[k*width + i]);
124    }
125  }

I am compiling with gcc -O3 -ffast-math -fopt-info -std=c11 ./neural.c -o neural -lm



Is there a good way to make this vectorize? Can you refer me to further information?

Is my method for indexing a bad idea (i.e. the k*width*width + ...)? I need to dynamically allocate, and I thought that keeping things close in memory would better enable optimizations.



EDIT: This might be useful



The -fopt-info-missed output for these lines:

./neural.c:114:3: note: not vectorized: multiple nested loops.
./neural.c:114:3: note: bad loop form.
./neural.c:116:5: note: not vectorized: control flow in loop.
./neural.c:116:5: note: bad loop form.
./neural.c:119:7: note: step unknown.
./neural.c:119:7: note: reduction used in loop.
./neural.c:119:7: note: Unknown def-use cycle pattern.
./neural.c:119:7: note: not vectorized: complicated access pattern.
./neural.c:119:7: note: bad data access.
./neural.c:110:21: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:58: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:62: note: not vectorized: not enough data-refs in basic block.
./neural.c:117:18: note: not vectorized: not enough data-refs in basic block.
./neural.c:114:37: note: not vectorized: not enough data-refs in basic block.

EDIT:

Minimal example is here: http://pastebin.com/Qwe0XVix



I am trying it with BLAS. In the minimal example it goes faster, but on the whole code it is slower ... not sure why.

EDIT:

The compiler was optimizing out code. Fixed. BLAS is now faster. The fix was in the whole code, not the minimal example.



EDIT:

Same code as in the link from the previous edit

#include <math.h>
#include <cblas.h>
#include <stdlib.h>
#include <stdio.h>

typedef float value_type;
typedef unsigned long index_type;

static value_type F(value_type v) {
  return 1.0/(1.0 + exp(-v));
}

static value_type DF(value_type v) {
  const value_type Ev = exp(-v);
  return Ev/((1.0 + Ev)*(1.0 + Ev));
}

#ifndef WITH_BLAS

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
                   value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
                   const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    for (index_type i=0;i<width;i++) {
      Yp[k*width + i] = 0.0;
      for (index_type j=0;j<width;j++) {
        Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
      }
      Yp[k*width + i] *= DF(Ap[k*width + i]);
    }
  }
}

#else

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
                   value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
                   const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    cblas_sgemv(CblasRowMajor, CblasTrans, width, width, 1,
                W+k*width*width, width, Yp+(k+1)*width, 1, 0, Yp+k*width, 1);
    for (index_type i=0;i<width;i++)
      Yp[k*width + i] *= DF(Ap[k*width + i]);
  }
}

#endif

int main() {
  const index_type height=10, width=10000;

  value_type *Ap=malloc((height+1)*width*sizeof(value_type)),
    *W=malloc(height*width*width*sizeof(value_type)),
    *Yp=malloc((height+1)*width*sizeof(value_type)),
    *Dp=malloc(width*sizeof(value_type));

  get_Yp(Ap, W, Yp, Dp, height, width);
  printf("Done %f\n", Yp[3]);

  return 0;
}
Solution


1. The j-loop is a well-vectorizable SIMD reduction loop with a constant stride of "width" elements. You can vectorize it using modern compilers. This code is vectorizable with the Intel Compiler, and should be vectorizable by GCC under some conditions.

2. First of all, a reduction is a special case of a "vectorizable" true loop-carried dependency. So you can't safely vectorize it unless the "reduction" pattern is (a) auto-recognized by the compiler (not so easy, and strictly speaking not valid/expected behavior), or (b) communicated by the developer to the compiler explicitly, using OpenMP or similar standards.

To "communicate" to the compiler that there is a reduction, you need to use #pragma omp simd reduction(+:variable_name) before the j-loop.

This is only supported starting from OpenMP 4.0, so you have to use a GCC version which supports OpenMP 4.x. Quote from https://gcc.gnu.org/wiki/openmp: "GCC 4.9 supports OpenMP 4.0 for C/C++, GCC 4.9.1 also for Fortran".

I would also use a temporary local variable for accumulating the reduction (OpenMP 4.0 requires the reduction variable to be used that way):

 tmpSUM = 0;
 #pragma omp simd reduction(+:tmpSUM)
 for (index_type j=0;j<width;j++) {
   tmpSUM += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
 }
 Yp[k*width + i] = tmpSUM;

3. I'd also suggest using a signed int instead of unsigned, because unsigned induction variables are pretty bad for all modern vectorizers, at least by introducing extra overhead. I would not be surprised if using unsigned was one of the main things that "confused" GCC.

4. Now, you may not be satisfied with my reply, because it tells how it should work (and how it works in ICC/ICPC). It doesn't take into account the specific nuances of GCC (which for the reduction seemed to do the opposite), as one can see in the GCC optimization report.

So, if you are still limited to GCC, I would suggest:

• Make sure it's a fresh enough GCC (4.9 or higher).
• Use a signed induction variable, and still try omp simd reduction on a temporary local tmpSUM (because it should enable more advanced vectorization techniques anyway).
• If none of the above helps, take a look at "weird" things like those described here (very similar to your case): What do gcc's auto-vectorization messages mean? Or consider using another compiler/compiler version.

Last comment: is the access pattern in your code, and more generally const-stride, so bad? Answer: for some platforms/compilers, const-stride will not kill your performance. But still, ideally you need a more cache-friendly algorithm. Check "Not sure how to explain some of the performance results of my parallelized matrix multiplication code". One more option is considering MKL or BLAS (as suggested by the other reply) if your code is really memory-bound and you don't have time to deal with memory optimizations yourself.

