C++ OpenMP: nested loops where the inner iterator depends on the outer one


Problem description

Consider the following code:

#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <cmath>
#include <omp.h>

using namespace std;

typedef std::chrono::steady_clock myclock;

double measure_time(myclock::time_point begin, myclock::time_point end)
{
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/(double)1e6;
}

int main()
{
    int n = 20000;
    vector<double> v(n);
    iota(v.begin(), v.end(), 1.5);

    vector< vector<double> > D(n, vector<double>(n,0.0));

    myclock::time_point begin, end;

    begin = myclock::now();

    //#pragma omp parallel for collapse(2)
    //#pragma omp parallel for
    for(size_t i = 0; i < n - 1; i++){
        for(size_t j = i+1; j < n; j++){
            double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
            D[i][j] = d;
            D[j][i] = d;
        }
    }

    end= myclock::now();
    double time = measure_time(begin, end);
    cout<<"Time: "<<time<<" (s)"<<endl;
    return 0;
}

To compile:

g++ -std=c++11 -fopenmp -o main main.cpp

I obtained the following run times:

  • With #pragma omp parallel for collapse(2): 7.9425 (s)
  • With #pragma omp parallel for: 3.73262 (s)
  • Without OpenMP: 11.0935 (s)

System settings: Linux Mint 18.3 64-bit, g++ 5.4.0, quad-core processor.

I would expect the first to be faster than the second (which parallelizes only the outer loop) and much faster than the third.

What did I do wrong? The first and the second both ran on all 8 threads.

Thanks in advance for your help!

Recommended answer

The collapse clause should not be used when the bounds of the inner loop depend on the outer loop. See Understanding the collapse clause in openmp.
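
For contrast, here is a minimal sketch of a loop nest where collapse(2) is appropriate, because the inner bounds do not depend on the outer index. The function name, array, and sizes are illustrative only, not taken from the question:

#include <vector>
#include <omp.h>

// Rectangular iteration space: the bounds of j do not depend on i, so
// collapse(2) can legally fuse both loops into one parallel iteration space.
// Assumes A has already been sized to n rows of m columns.
void fill_rectangular(std::vector< std::vector<double> >& A, int n, int m)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {   // inner bounds independent of i
            A[i][j] = 0.5*i + 0.25*j;
        }
    }
}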

In your case you are running over only one triangle of the matrix (the elements with j > i, excluding the diagonal) because of symmetry. This cuts the number of iterations roughly in half. If you want to fuse/collapse the double loop you can do it by hand like this (see the end of this answer for more details):

// Iterate over all n*(n-1)/2 pairs (i, j) with i < j using a single flat index k.
for(size_t k=0; k<n*(n-1)/2; k++) {
    // First guess: walk row-major through the first rows of the matrix.
    size_t i = k/n, j = k%n;
    // Indices that land on or below the diagonal are reflected onto the
    // not-yet-visited tail of the triangle, so every pair is hit exactly once.
    if(j<=i) i = n - i - 2, j = n - j - 1;
    double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
    D[i][j] = d;
    D[j][i] = d;
}
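
A minimal sketch of how this fused loop might be used in practice, wrapped in a hypothetical helper (fill_symmetric is my name, not from the question) and parallelized with a plain parallel for; since each k maps to a distinct pair (i, j), the iterations are independent:

#include <cmath>
#include <vector>
#include <omp.h>

// Fills the symmetric matrix D from v using the hand-fused single loop.
// Each k writes a distinct pair D[i][j] / D[j][i], so the loop can be
// distributed with an ordinary parallel for -- no collapse needed.
void fill_symmetric(const std::vector<double>& v,
                    std::vector< std::vector<double> >& D)
{
    size_t n = v.size();
    #pragma omp parallel for
    for(size_t k = 0; k < n*(n-1)/2; k++) {
        size_t i = k/n, j = k%n;
        if(j <= i) i = n - i - 2, j = n - j - 1;
        double d = std::sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
        D[i][j] = d;
        D[j][i] = d;
    }
}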

I think most people assume that collapsing a loop is going to give better performance, but this is often not the case. In my experience there is usually no difference in performance, in some cases it is much worse due to cache issues, and in a few cases it is better. You have to test it yourself.
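
One way to test it yourself, sketched here with OpenMP's wall-clock timer; the time_variant helper and the lambda in the usage comment are illustrations, not code from the answer:

#include <cstdio>
#include <omp.h>

// Times one run of a loop variant with OpenMP's wall-clock timer.
template <class F>
double time_variant(F variant)
{
    double t0 = omp_get_wtime();   // seconds of wall-clock time
    variant();
    return omp_get_wtime() - t0;
}

// Usage: time each version on the same data and compare, e.g.
//   double t = time_variant([&]{ fill_symmetric(v, D); });
//   std::printf("Time: %f (s)\n", t);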

As to why your code was twice as slow with the collapse clause, I can only guess: since the effect of collapse is unspecified when the inner loop depends on the outer one, your OpenMP implementation may have run j over [0, n), i.e. the full matrix rather than half of it.

