How to vectorize my loop with g++?


Problem description

The introductory links I found while searching:

  1. 6.59.14 Loop-Specific Pragmas
  2. 2.100 Pragma Loop_Optimize
  3. How to give hint to gcc about loop count
  4. Tell gcc to specifically unroll a loop
  5. How to Force Vectorization in C++

As you can see, most of them are for C, but I thought that they might work in C++ as well. Here is my code:

#include <chrono>
#include <iostream>
#include <vector>

template<typename T>
//__attribute__((optimize("unroll-loops")))
//__attribute__ ((pure))
void foo(std::vector<T> &p1, size_t start,
            size_t end, const std::vector<T> &p2) {
  typename std::vector<T>::const_iterator it2 = p2.begin();
  //#pragma simd
  //#pragma omp parallel for
  //#pragma GCC ivdep Unroll Vector
  for (size_t i = start; i < end; ++i, ++it2) {
    p1[i] = p1[i] - *it2;
    p1[i] += 1;
  }
}

int main()
{
    size_t n;
    double x,y;
    n = 12800000;
    std::vector<double> v,u;
    for(size_t i=0; i<n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }
    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(v,0,n,u);
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
    return 0;
}

I used all the hints that can be seen commented out above, but I did not get any speedup, as the sample output shows (the first run had #pragma GCC ivdep Unroll Vector uncommented):

samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -funroll-loops -ftree-vectorize -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.026575 seconds.
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.0252697 seconds.

Is there any hope? Or does the -O3 optimization flag already do the trick? Any suggestions to speed up this code (the foo function) are welcome!

My version of g++:

samaras@samaras-A15:~/Downloads$ g++ --version
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1


Note that the body of the loop is arbitrary. I am not interested in rewriting it in some other form.


EDIT

An answer saying that there is nothing more that can be done is also acceptable!

Solution

The -O3 flag turns on -ftree-vectorize automatically. From https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html:

-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options

So in both cases the compiler is trying to do loop vectorization.

Using g++ 4.8.2 to compile with:

g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test

Gives this:

Analyzing loop at test.cpp:16

Vectorizing loop at test.cpp:16

test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39
test.cpp:16: note: created 1 versioning for alias checks.

test.cpp:16: note: LOOP VECTORIZED.
Analyzing loop at test_old.cpp:29

test.cpp:22: note: vectorized 1 loops in function.

test.cpp:18: note: Unroll loop 7 times

test.cpp:16: note: Unroll loop 7 times

test.cpp:28: note: Unroll loop 1 times

Compiling without the -ftree-vectorize flag:

g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test

Returns only this:

test_old.cpp:16: note: Unroll loop 7 times

test_old.cpp:28: note: Unroll loop 1 times

Line 16 is the start of the loop in the function, so the compiler is definitely vectorizing it. Inspecting the generated assembly confirms this too.
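As a rough way to reproduce this kind of check, the sketch below isolates the same arithmetic in a small translation unit that can be compiled with a vectorization report. The file name and the function name sub_plus_one are made up for illustration. The -fopt-info-vec flag mentioned in the comment is, as far as I know, the report option in newer GCC releases, while 4.8 uses -ftree-vectorizer-verbose as shown above.

```cpp
// report_demo.cpp (hypothetical file name, not from the original post)
//
// Possible compile commands for a vectorization report:
//   newer GCC releases: g++ -O3 -std=c++11 -fopt-info-vec -c report_demo.cpp
//   GCC 4.8 (as above): g++ -O3 -std=c++0x -ftree-vectorizer-verbose=1 -c report_demo.cpp
#include <cstddef>

// Same arithmetic as the loop in the question, written on raw pointers.
// Without __restrict__ the compiler cannot assume p1 and p2 do not overlap,
// so it may emit a runtime alias check before entering the vectorized loop.
void sub_plus_one(double* p1, const double* p2, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        p1[i] = p1[i] - p2[i];
        p1[i] += 1;
    }
}
```

The exact wording of the report differs between compiler versions, so treat the output shown above as specific to g++ 4.8.2.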

I seem to be getting some aggressive caching on the laptop I'm currently using, which makes it very hard to accurately measure how long the function takes to run.

But here are a couple of other things you can try too:

  • Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.

  • Tell the compiler the arrays are aligned with __builtin_assume_aligned (not portable).

Here's my resulting code (I removed the template, since you will want to use different alignments for different data types):

#include <iostream>
#include <chrono>
#include <vector>

void foo( double * __restrict__ p1,
          double * __restrict__ p2,
          size_t start,
          size_t end )
{
  double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
  double* pA2 = static_cast<double*>(__builtin_assume_aligned(p2, 16));

  for (size_t i = start; i < end; ++i)
  {
      pA1[i] = pA1[i] - pA2[i];
      pA1[i] += 1;
  }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    std::vector<double> v,u;

    for(size_t i=0; i<n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }

    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(&v[0], &u[0], 0, n );
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;

    return 0;
}
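One caveat with __builtin_assume_aligned is that it is a promise to the compiler, not a check: if the pointers are not actually 16-byte aligned, the behavior is undefined. A minimal sketch of guarding that promise in debug builds follows. The names is_aligned and foo_checked are hypothetical helpers, not part of the code above, and the sketch assumes GCC (or a compatible compiler) for the builtin and for __restrict__.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Same loop as foo above, with the alignment promise verified by assert
// (the check disappears when NDEBUG is defined).
void foo_checked(double* __restrict__ p1,
                 const double* __restrict__ p2,
                 std::size_t start,
                 std::size_t end)
{
    assert(is_aligned(p1, 16) && is_aligned(p2, 16));

    double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
    const double* pA2 =
        static_cast<const double*>(__builtin_assume_aligned(p2, 16));

    for (std::size_t i = start; i < end; ++i) {
        pA1[i] = pA1[i] - pA2[i];
        pA1[i] += 1;
    }
}
```

It would be called the same way as foo, for example foo_checked(&v[0], &u[0], 0, n). Whether a std::vector's storage happens to be 16-byte aligned depends on the allocator, which is why the check may be worth having.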

Like I said, I've had trouble getting consistent time measurements, so I can't confirm whether these changes will give you a performance increase (or maybe even a decrease!).
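One common way to reduce timing noise is to run the function several times and keep the fastest run, so that cold caches or a busy machine affect fewer of the samples. The sketch below is only an illustration of that idea. The helper name best_of and the use of std::chrono::steady_clock are my own choices, not something from the code above.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical helper: calls work() `repeats` times and returns the
// shortest observed wall-clock time in seconds (infinity if repeats == 0).
template <typename Work>
double best_of(std::size_t repeats, Work work) {
    typedef std::chrono::steady_clock timer_clock;
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t r = 0; r < repeats; ++r) {
        const timer_clock::time_point t1 = timer_clock::now();
        work();
        const timer_clock::time_point t2 = timer_clock::now();
        const double seconds = std::chrono::duration<double>(t2 - t1).count();
        if (seconds < best) {
            best = seconds;
        }
    }
    return best;
}
```

For the program above, something like best_of(10, [&]() { foo(&v[0], &u[0], 0, n); }) could replace the single timed call. foo modifies v in place on every run, which changes the values but not the amount of work done.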
