GCC Hinting at Vectorization
Problem description
I would like GCC to vectorize the code below. -fopt-info tells me that GCC currently is not. I believe the problem is the strided access of W, or possibly the backward incrementing of k. Note that height and width are constants and index_type is currently set to unsigned long.
I removed some comments. I am compiling with the command shown below the code.

Is there a good way to make this vectorize? Can you refer me to further information? Is my method for indexing a bad idea (i.e. the k*width*width + ... expressions)? I need to dynamically allocate, and I thought that keeping things close in memory would better enable optimizations.

EDIT: This might be useful: the -fopt-info-missed output for these lines is reproduced below.

EDIT: A minimal example is here: http://pastebin.com/Qwe0XVix

EDIT: I am trying it with BLAS. In the minimal example it goes faster, but on the whole code it is slower... not sure why.

EDIT: The compiler was optimizing out code. Fixed. BLAS is now faster. The fix was on the whole code, not the minimal example. The minimal example code is reproduced at the end of this post.

Recommended answer
The j-loop is a well-vectorizable SIMD reduction loop with a constant stride of width elements. You can vectorize it using modern compilers. This code is vectorizable with the Intel Compiler, and should be vectorizable by GCC under some conditions.

First of all, a reduction is a special case of a "vectorizable" true loop-carried dependency. So you can't safely vectorize it unless the reduction pattern is (a) auto-recognized by the compiler (not so easy, and strictly speaking not valid/expected behavior) or (b) communicated to the compiler explicitly by the developer, using OpenMP or a similar standard.

To communicate to the compiler that there is a reduction, you need to use

#pragma omp simd reduction(+:variable_name)

before the j-loop. This is only supported starting from OpenMP 4.0, so you have to use a GCC version which supports OpenMP 4.x. Quote from https://gcc.gnu.org/wiki/openmp: "GCC 4.9 supports OpenMP 4.0 for C/C++, GCC 4.9.1 also for Fortran".

I would also use a temporary local variable for accumulating the reduction (OpenMP 4.0 requires the reduction variable to be used that way):

tmpSUM = 0;
#pragma omp simd reduction(+:tmpSUM)
for (index_type j=0;j<width;j++) {
  tmpSUM += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
}
Yp[k*width + i] = tmpSUM;

I'd also suggest using signed int instead of unsigned, because unsigned induction variables are pretty bad for all modern vectorizers, at least by introducing extra overhead. I would not be surprised if using unsigned was one of the main things that "confused" GCC.

Now, you may not be satisfied with my reply, because it describes how things should work (and how they do work in ICC/ICPC). It does not take into account the specific nuances of GCC (which for this reduction seemed to do the opposite), as one can see in the GCC optimization report. So, if you are still limited to GCC, I would suggest trying the pragma and signed indices above; if none of that helps, take a look at "weird" cases like the one described here (very similar to yours):
What do gcc's auto-vectorization messages mean?
or consider using another compiler or compiler version.
Answer: for some platforms/compilers, a constant stride will not kill your performance, but ideally you want a more cache-friendly algorithm; see "Not sure how to explain some performance results of my parallel matrix multiplication code". Another option is to consider MKL or BLAS (as the other reply suggested) if your code is really memory-bound and you do not have time to deal with memory optimizations yourself.
The loop in question (the leading numbers are line numbers in neural.c, matching the diagnostics below):

114   for (index_type k=height-1;k+1>0;k--) {
116 for (index_type i=0;i<width;i++) {
117 Yp[k*width + i] = 0.0;
119 for (index_type j=0;j<width;j++) {
121 Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
122 }
123 Yp[k*width + i] *= DF(Ap[k*width + i]);
124 }
125 }
I am compiling with:

gcc -O3 -ffast-math -fopt-info -std=c11 ./neural.c -o neural -lm

The -fopt-info-missed output for these lines:

./neural.c:114:3: note: not vectorized: multiple nested loops.
./neural.c:114:3: note: bad loop form.
./neural.c:116:5: note: not vectorized: control flow in loop.
./neural.c:116:5: note: bad loop form.
./neural.c:119:7: note: step unknown.
./neural.c:119:7: note: reduction used in loop.
./neural.c:119:7: note: Unknown def-use cycle pattern.
./neural.c:119:7: note: not vectorized: complicated access pattern.
./neural.c:119:7: note: bad data access.
./neural.c:110:21: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:58: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:62: note: not vectorized: not enough data-refs in basic block.
./neural.c:117:18: note: not vectorized: not enough data-refs in basic block.
./neural.c:114:37: note: not vectorized: not enough data-refs in basic block.
The minimal example (same code as in the pastebin link above):

#include <math.h>
#include <cblas.h>
#include <stdlib.h>
#include <stdio.h>
typedef float value_type;
typedef unsigned long index_type;
static value_type F(value_type v) {
return 1.0/(1.0 + exp(-v));
}
static value_type DF(value_type v) {
const value_type Ev = exp(-v);
return Ev/((1.0 + Ev)*(1.0 + Ev));
}
#ifndef WITH_BLAS
static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
const index_type height, const index_type width) {
for (index_type i=0;i<width;i++) {
Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
}
for (index_type k=height-1;k+1>0;k--) {
for (index_type i=0;i<width;i++) {
Yp[k*width + i] = 0.0;
for (index_type j=0;j<width;j++) {
Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
}
Yp[k*width + i] *= DF(Ap[k*width + i]);
}
}
}
#else
static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
const index_type height, const index_type width) {
for (index_type i=0;i<width;i++) {
Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
}
for (index_type k=height-1;k+1>0;k--) {
cblas_sgemv(CblasRowMajor, CblasTrans, width, width, 1,
W+k*width*width, width, Yp+(k+1)*width, 1, 0, Yp+k*width, 1);
for (index_type i=0;i<width;i++)
Yp[k*width + i] *= DF(Ap[k*width + i]);
}
}
#endif
int main() {
const index_type height=10, width=10000;
value_type *Ap=malloc((height+1)*width*sizeof(value_type)),
*W=malloc(height*width*width*sizeof(value_type)),
*Yp=malloc((height+1)*width*sizeof(value_type)),
*Dp=malloc(width*sizeof(value_type));
get_Yp(Ap, W, Yp, Dp, height, width);
printf("Done %f\n", Yp[3]);
return 0;
}