如何用 g++ 向量化我的循环? [英] How to vectorize my loop with g++?

查看:30
本文介绍了如何用 g++ 向量化我的循环?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在搜索时找到的介绍链接:

The introductory links I found while searching:

  1. 6.59.14 Loop-Specific Pragmas
  2. 2.100 Pragma Loop_Optimize
  3. 如何向 gcc 提供有关循环计数的提示
  4. 告诉 gcc 专门展开循环
  5. 如何在 C++ 中强制向量化

正如您所见,它们中的大多数都是用于 C 的,但我认为它们也可能适用于 C++.这是我的代码:

As you can see most of them are for C, but I thought that they might work at C++ as well. Here is my code:

template<typename T>
//__attribute__((optimize("unroll-loops")))
//__attribute__ ((pure))
void foo(std::vector<T> &p1, size_t start,
            size_t end, const std::vector<T> &p2) {
  typename std::vector<T>::const_iterator it2 = p2.begin();
  //#pragma simd
  //#pragma omp parallel for
  //#pragma GCC ivdep Unroll Vector
  for (size_t i = start; i < end; ++i, ++it2) {
    p1[i] = p1[i] - *it2;
    p1[i] += 1;
  }
}

int main()
{
    size_t n;
    double x,y;
    n = 12800000;
    vector<double> v,u;
    for(size_t i=0; i<n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }
    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(v,0,n,u);
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
    return 0;
}

我使用了上面注释的所有提示,但我没有得到任何加速,如示例输出所示(第一次运行未注释此 #pragma GCC ivdep Unroll Vector:

I used al the hints one can see commented above, but I did not get any speedup, as a sample output shows (with the first run having uncommented this #pragma GCC ivdep Unroll Vector:

samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -funroll-loops -ftree-vectorize -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.026575 seconds.
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.0252697 seconds.

还有希望吗?还是优化标志 O3 就可以解决问题?欢迎提出任何加速此代码(foo 函数)的建议!

Is there any hope? Or the optimization flag O3 just does the trick? Any suggestions to speedup this code (the foo function) are welcome!

我的 g++ 版本:

samaras@samaras-A15:~/Downloads$ g++ --version
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1

<小时>

请注意,循环的主体是随机的.我对以其他形式重写它不感兴趣.


Notice that the body of the loop is random. I am not interesting in re-writing it in some other form.

编辑

回答说无能为力也是可以接受的!

An answer saying that there is nothing more that can be done is also acceptable!

推荐答案

O3 标志自动开启-ftree-vectorize.https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-O3 打开 -O2 指定的所有优化,同时打开 -finline-functions、-funswitch-loops、-fpredictive-commoning、-fgcse-after-reload、-ftree-loop-vectorize、-ftree-loop-distribute-patterns、-ftree-slp-vectorize、-fvect-cost-model、-ftree-partial-pre 和 -fipa-cp-clone 选项

-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options

所以在这两种情况下,编译器都在尝试进行循环向量化.

So in both cases the compiler is trying to do loop vectorization.

使用g++ 4.8.2编译:

Using g++ 4.8.2 to compile with:

# In newer versions of GCC use -fopt-info-vec-missed instead of -ftree-vectorize
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test

给出这个:

Analyzing loop at test.cpp:16                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                
Vectorizing loop at test.cpp:16                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                
test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39                                                                                                                                                                                    
test.cpp:16: note: created 1 versioning for alias checks.                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                
test.cpp:16: note: LOOP VECTORIZED.                                                                                                                                                                                                                                         
Analyzing loop at test_old.cpp:29                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                
test.cpp:22: note: vectorized 1 loops in function.                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                
test.cpp:18: note: Unroll loop 7 times                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                
test.cpp:16: note: Unroll loop 7 times                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                
test.cpp:28: note: Unroll loop 1 times  

不带 -ftree-vectorize 标志编译:

g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test

只返回这个:

test_old.cpp:16: note: Unroll loop 7 times

test_old.cpp:28: note: Unroll loop 1 times

第 16 行是循环函数的开始,因此编译器肯定会对其进行矢量化处理.检查汇编程序也证实了这一点.

Line 16 is the start of the loop function, so the compiler is definitely vectorizing it. Checking the assembler confirms this too.

我目前正在使用的笔记本电脑上似乎有一些激进的缓存,这使得准确测量函数运行所需的时间变得非常困难.

I seem to be getting some aggressive caching on the laptop I'm currently using which is making it very hard to accurately measure how long the function takes to run.

但您也可以尝试以下其他一些方法:

But here's a couple of other things you can try too:

  • 使用 __restrict__ 限定符告诉编译器数组之间没有重叠.

  • Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.

告诉编译器数组与 __builtin_assume_aligned 对齐(不可移植)

Tell the compiler the arrays are aligned with __builtin_assume_aligned (not portable)

这是我的结果代码(我删除了模板,因为您希望对不同的数据类型使用不同的对齐方式)

Here's my resulting code (I removed the template since you will want to use different alignment for different data types)

#include <iostream>
#include <chrono>
#include <vector>

void foo( double * __restrict__ p1,
          double * __restrict__ p2,
          size_t start,
          size_t end )
{
  double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
  double* pA2 = static_cast<double*>(__builtin_assume_aligned(p2, 16));

  for (size_t i = start; i < end; ++i)
  {
      pA1[i] = pA1[i] - pA2[i];
      pA1[i] += 1;
  }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    std::vector<double> v,u;

    for(size_t i=0; i<n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }

    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(&v[0], &u[0], 0, n );
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;

    return 0;
}

就像我说的那样,我无法获得一致的时间测量结果,因此无法确认这是否会提高您的性能(甚至可能会降低!)

Like I said I've had trouble getting consistent time measurements, so can't confirm if this will give you a performance increase (or maybe even decrease!)

这篇关于如何用 g++ 向量化我的循环?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆