How do I parallelize a for loop through a C++ std::list using OpenMP?


Question


I would like to iterate through all elements in an std::list in parallel fashion using OpenMP. The loop should be able to alter the elements of the list. Is there a simple solution for this? It seems that OpenMP 3.0 supports parallel for loops when the iterator is a Random Access Iterator, but not otherwise. In any case, I would prefer to use OpenMP 2.0 as I don't have full control over which compilers are available to me.

If my container were a vector, I might use:

#pragma omp parallel for
for (auto it = v.begin(); it != v.end(); ++it) {
    it->process();
}

I understand that I could copy the list into a vector, do the loop, then copy everything back. However, I would like to avoid this complexity and overhead if possible.

Solution

If you decide to use OpenMP 3.0, you can use the task feature:

#pragma omp parallel
#pragma omp single
{
  for(auto it = l.begin(); it != l.end(); ++it)
     #pragma omp task firstprivate(it)
       it->process();
  #pragma omp taskwait
}

This will execute the loop in one thread, but delegate the processing of elements to others.

Without OpenMP 3.0, the easiest way is to write pointers to all elements of the list (or iterators) into a vector and iterate over that one. This way nothing has to be copied back, and the overhead of copying the elements themselves is avoided, so it shouldn't have too much overhead:

std::vector<my_element*> elements; //my_element is whatever is in list
for(auto it = list.begin(); it != list.end(); ++it)
  elements.push_back(&(*it));

#pragma omp parallel shared(elements)
{
  #pragma omp for
  for(int i = 0; i < (int)elements.size(); ++i) // OpenMP 2.0 requires a signed loop index
      elements[i]->process();
}

If you want to avoid copying even the pointers, you can always create a parallelized for loop by hand. You can either have the threads access interleaved elements of the list (as proposed by KennyTM) or split the range into roughly equal contiguous parts before iterating, and iterate over those. The latter seems preferable, since the threads avoid accessing list nodes currently processed by other threads (even if only the next pointer), which could lead to false sharing. This would look roughly like this:

#pragma omp parallel
{
  int thread_count = omp_get_num_threads();
  int thread_num   = omp_get_thread_num();
  size_t chunk_size= list.size() / thread_count;
  auto begin = list.begin();
  std::advance(begin, thread_num * chunk_size);
  auto end = begin;
  if(thread_num == thread_count - 1) // last thread iterates the remaining sequence
     end = list.end();
  else
     std::advance(end, chunk_size);
  #pragma omp barrier
  for(auto it = begin; it != end; ++it)
    it->process();
}

The barrier is not strictly needed; however, if process mutates the processed element (meaning it is not a const method), there might be some sort of false sharing without it, if threads iterate over a sequence that is already being mutated. This way the sequence will be iterated over 3*n times in total (where n is the number of threads), so scaling might be less than optimal for a high number of threads.

To reduce the overhead, you could move the generation of the ranges outside of the #pragma omp parallel; however, you then need to know how many threads will form the parallel section. So you'd probably have to set num_threads manually, or use omp_get_max_threads() and handle the case that the number of threads actually created is less than omp_get_max_threads() (which is only an upper bound). In that case each thread could be assigned several chunks (using #pragma omp for should do that):

int max_threads = omp_get_max_threads();
std::vector<std::pair<std::list<...>::iterator, std::list<...>::iterator> > chunks;
chunks.reserve(max_threads); 
size_t chunk_size= list.size() / max_threads;
auto cur_iter = list.begin();
for(int i = 0; i < max_threads - 1; ++i)
{
   auto last_iter = cur_iter;
   std::advance(cur_iter, chunk_size);
   chunks.push_back(std::make_pair(last_iter, cur_iter));
}
chunks.push_back(std::make_pair(cur_iter, list.end()));

#pragma omp parallel shared(chunks)
{
  #pragma omp for
  for(int i = 0; i < max_threads; ++i)
    for(auto it = chunks[i].first; it != chunks[i].second; ++it)
      it->process();
}

This will take only three iterations over the list (two, if you can get the size of the list without iterating). I think that is about the best you can do for non-random-access iterators without using tasks or iterating over some out-of-place data structure (like a vector of pointers).
