OpenMP/C++: Parallel for loop with reduction afterwards - best practice?
Problem Description
Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I have read that atomic operations are very slow compared to giving each thread a private variable and doing a reduction at the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<Eigen::Vector3d> velocity_local(velocity.size());
    std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));

    #pragma omp for
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocity_local[j] += f(j); // save results from the previous calculations
    }

    // now each thread can save its results to the global variable
    #pragma omp critical
    {
        for (size_t i = 0; i < velocity_local.size(); ++i)
            velocity[i] += velocity_local[i];
    }
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also, the question of whether this is the best approach still holds.
Edit: As requested in a comment, the reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
    velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
Solution

You're doing an array reduction. I have described this several times (e.g. in reducing an array in openmp and fill histograms array reduction in parallel with openmp without using a critical section). You can do this with and without a critical section.
You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread  = omp_get_thread_num();
    const int vsize    = velocity.size();

    #pragma omp single
    velocitya.resize(vsize*nthreads);
    std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
              Eigen::Vector3d(0,0,0));

    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
    }

    #pragma omp for schedule(static)
    for (int i = 0; i < vsize; i++) {
        for (int t = 0; t < nthreads; t++) {
            velocity[i] += velocitya[vsize*t + i];
        }
    }
}
This method requires extra care/tuning due to false sharing which I have not done.
As to which method is better you will have to test.