OpenMP parallel thread
Problem description
I need to parallelize this loop. I thought that using OpenMP would be a good idea, but I have never studied it before.
#pragma omp parallel for
for (std::set<size_t>::const_iterator it = mesh->NEList[vid].begin();
     it != mesh->NEList[vid].end(); ++it) {
    worst_q = std::min(worst_q, mesh->element_quality(*it));
}
In this case the loop is not parallelized, because it uses iterators and the compiler cannot figure out how to split it. Can you help me?
Recommended answer
OpenMP requires that the controlling predicate in parallel for loops uses one of the following relational operators: <, <=, > or >=. Only random access iterators provide these operators, and hence OpenMP parallel loops work only with containers that provide random access iterators. std::set provides only bidirectional iterators. You may overcome that limitation by using explicit tasks. The reduction can be performed by first reducing partially into per-thread private variables, followed by a global reduction over the partial values.
double *t_worst_q;
// Cache line size on x86/x64, in number of t_worst_q[] elements
const int cb = 64 / sizeof(*t_worst_q);

#pragma omp parallel
{
    #pragma omp single
    {
        t_worst_q = new double[omp_get_num_threads() * cb];
        for (int i = 0; i < omp_get_num_threads(); i++)
            t_worst_q[i * cb] = worst_q;
    }

    // Perform partial min reduction using tasks
    #pragma omp single
    {
        for (std::set<size_t>::const_iterator it = mesh->NEList[vid].begin();
             it != mesh->NEList[vid].end(); ++it) {
            size_t elem = *it;
            #pragma omp task
            {
                int tid = omp_get_thread_num();
                t_worst_q[tid * cb] = std::min(t_worst_q[tid * cb],
                                               mesh->element_quality(elem));
            }
        }
    }

    // Perform global reduction
    #pragma omp critical
    {
        int tid = omp_get_thread_num();
        worst_q = std::min(worst_q, t_worst_q[tid * cb]);
    }
}
delete [] t_worst_q;
(I assume that mesh->element_quality() returns double.)
Some key points:
- The loop is executed serially by one thread only, but each iteration creates a new task. These are most likely queued for execution by the idle threads.
- Idle threads waiting at the implicit barrier of the single construct begin consuming tasks as soon as they are created.
- The value pointed to by it is dereferenced before the task body. If it were dereferenced inside the task body, it would be firstprivate and a copy of the iterator would be created for each task (i.e. on each iteration). This is not what you want.
- Each thread performs a partial reduction in its private part of t_worst_q[].
- In order to prevent performance degradation due to false sharing, the elements of t_worst_q[] that each thread accesses are spaced out so that they end up in separate cache lines. On x86/x64 the cache line is 64 bytes, therefore the thread number is multiplied by cb = 64 / sizeof(double).
- The global min reduction is performed inside a critical construct to protect worst_q from being accessed by several threads at once. This is for illustrative purposes only, since the reduction could also be performed by a loop in the main thread after the parallel region.
Note that explicit tasks require a compiler that supports OpenMP 3.0 or 3.1. This rules out all versions of the Microsoft C/C++ compiler (it only supports OpenMP 2.0).