OpenMP/C++: Parallel for loop with reduction afterwards - best practice?
Problem Description
Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I have read that atomic operations are very slow compared to giving each thread a private variable and doing a reduction at the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<Eigen::Vector3d> velocity_local(velocity.size());
    std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));

    #pragma omp for
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocity_local[j] += f(j); // save results from the previous calculations
    }

    // now each thread can save its results to the global variable
    #pragma omp critical
    {
        for (size_t i = 0; i < velocity_local.size(); ++i)
            velocity[i] += velocity_local[i];
    }
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also, the question of whether this is the best approach still holds.
Edit: As requested in a comment, the reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
    velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
Solution

You're doing an array reduction. I have described this several times (e.g. in reducing an array in openmp and fill histograms array reduction in parallel with openmp without using a critical section). You can do this with and without a critical section.
You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread  = omp_get_thread_num();
    const int vsize    = velocity.size();

    #pragma omp single
    velocitya.resize(vsize*nthreads);
    std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
              Eigen::Vector3d(0,0,0));

    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
    }

    #pragma omp for schedule(static)
    for (int i = 0; i < vsize; i++) {
        for (int t = 0; t < nthreads; t++) {
            velocity[i] += velocitya[vsize*t + i];
        }
    }
}
This method requires extra care/tuning due to false sharing which I have not done.
As to which method is better you will have to test.