C++ OpenMP: Split for loop in even chunks static and join data at the end

Problem description

I'm trying to make a for loop multi-threaded in C++ so that the calculation gets divided among multiple threads. However, it contains data that needs to be joined together in its original order.

So the idea is to first join the small bits on many cores (25,000+ iterations) and then join the combined data once more at the end.

std::vector<int> ids;               // mappings
std::map<int, myData> combineData;  // data per id
myData outputData;                  // combined data based on the mappings
myData threadData;                  // data per thread

    #pragma omp parallel for default(none) private(threadData) shared(combineData, ids)
    for (int i=0; i<30000; i++)
    {
        threadData += combineData[ids[i]];
    }

    // Then here I would like to get all the separate thread data and combine them in a similar manner
    // I.e.: for each threadData:  outputData += threadData

What would be an efficient and correct way to approach this?

How can I schedule the OpenMP loop so that the iterations are split evenly into chunks?

For example for 2 threads: [0, 1, 2, 3, 4, .., 14999] & [15000, 15001, 15002, 15003, 15004, .., 29999]

If there's a better way to join the data (which involves joining a lot of std::vectors together and some matrix math) while preserving the order of additions, pointers to that would help as well.

Added information

  • The addition is associative, though not commutative.
  • myData is not an intrinsic type. It's a class containing data as multiple std::vectors (and some data related to the Autodesk Maya API).
  • Each cycle does a similar matrix multiplication on many points and adds these points to a vector (in theory, the calculation time should stay roughly similar per cycle).

Basically it's adding mesh data (consisting of vectors of data) to each other (combining meshes), though the order of the whole thing determines the index values of the vertices. The vertex indices should be consistent and rebuildable.

Solution

This depends on a few properties of the addition operator of myData. If the operator is both associative, (A + B) + C = A + (B + C), and commutative, A + B = B + A, then you can use a critical section or, if the data is plain old data (e.g. a float, an int, ...), a reduction.
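
For completeness: if the addition were also commutative, OpenMP 4.0's user-defined reductions can handle a non-POD type like myData directly. A minimal sketch, assuming OpenMP 4.0+ and that myData is default-constructible with an operator+= (the partial results are combined in an unspecified order, which is exactly why this is only valid in the commutative case):

// Sketch only: requires OpenMP 4.0+; assumes myData() default-constructs
// an empty/identity element and operator+= merges two myData objects.
#pragma omp declare reduction(merge : myData : omp_out += omp_in) \
    initializer(omp_priv = myData())

myData outputData;
#pragma omp parallel for reduction(merge : outputData)
for (int i = 0; i < 30000; i++) {
    outputData += combineData[ids[i]];
}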

However, if it's not commutative, as you say (the order of operations matters), but still associative, you can fill an array that has one element per thread with the partial results in parallel, and then merge them in order serially (see the code below). Using schedule(static) will split the iterations into more or less even chunks, assigned in order of increasing thread number, as you want.

If the operator is neither associative nor commutative then I don't think you can parallelize it efficiently (e.g. try parallelizing a Fibonacci series efficiently).

std::vector<int> ids;               // mappings
std::map<int, myData> combineData;  // data per id
myData outputData;                  // combined data based on the mappings
myData *threadData;                 // one partial result per thread
int nthreads;
#pragma omp parallel
{
    #pragma omp single
    {
        // The implicit barrier at the end of the single construct
        // guarantees threadData is allocated before any thread uses it.
        nthreads = omp_get_num_threads();
        threadData = new myData[nthreads];
    }
    myData tmp;  // each thread accumulates its chunk privately
    #pragma omp for schedule(static)
    for (int i=0; i<30000; i++) {
        tmp += combineData[ids[i]];
    }
    threadData[omp_get_thread_num()] = tmp;
}
// Merge the partial results serially, in thread order,
// so the order of additions is preserved.
for(int i=0; i<nthreads; i++) {
    outputData += threadData[i];
}
delete[] threadData;

Edit: I'm not 100% sure at this point whether the chunks will be assigned in order of increasing thread number with #pragma omp for schedule(static) (though I would be surprised if they were not). There is an ongoing discussion on this issue. Meanwhile, if you want to be 100% sure, then instead of

#pragma omp for schedule(static)
for (int i=0; i<30000; i++) {
    tmp += combineData[ids[i]];
}

you can do

const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int start = ithread*30000/nthreads;
const int finish = (ithread+1)*30000/nthreads;
for(int i = start; i<finish; i++) {
     tmp += combineData[ids[i]];          
}
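
Note that this only works inside a parallel region: omp_get_num_threads() returns 1 in serial code, so each thread has to compute its own range from within the region. A minimal sketch of the surrounding region, reusing the threadData array from the first example:

#pragma omp parallel
{
    myData tmp;
    const int nthreads = omp_get_num_threads();
    const int ithread  = omp_get_thread_num();
    // Contiguous, near-even ranges in increasing thread order --
    // the same partitioning schedule(static) is expected to produce.
    const int start  =  ithread      * 30000 / nthreads;
    const int finish = (ithread + 1) * 30000 / nthreads;
    for (int i = start; i < finish; i++) {
        tmp += combineData[ids[i]];
    }
    threadData[ithread] = tmp;  // merged serially afterwards, as above
}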

Edit:

I found a more elegant way to fill in parallel but merge in order:

#pragma omp parallel
{
    myData tmp;
    // nowait: no barrier after this loop, so each thread moves on to
    // the merge loop as soon as its own chunk is finished.
    #pragma omp for schedule(static) nowait
    for (int i=0; i<30000; i++) {
        tmp += combineData[ids[i]];
    }
    // One iteration per thread; ordered forces the merges to run in
    // iteration (and hence thread) order.
    #pragma omp for schedule(static) ordered
    for(int i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        outputData += tmp;
    }
}

This avoids both allocating an array with an entry for each thread (threadData) and merging outside the parallel region.
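
For reference, a self-contained toy version of this pattern that can be compiled with g++ -fopenmp (or any OpenMP-enabled compiler). The myData here is just a stand-in vector wrapper, not the real Maya-backed class, and the ids/combineData contents are dummy data:

#include <cstdio>
#include <map>
#include <vector>
#include <omp.h>

// Stand-in for the real class: associative but not commutative,
// since += appends rhs's values after the existing ones.
struct myData {
    std::vector<int> values;
    myData& operator+=(const myData& rhs) {
        values.insert(values.end(), rhs.values.begin(), rhs.values.end());
        return *this;
    }
};

int main() {
    const int n = 30000;
    std::vector<int> ids(n);
    std::map<int, myData> combineData;
    for (int i = 0; i < n; i++) {
        ids[i] = i;
        combineData[i].values.push_back(i);
    }
    myData outputData;

    #pragma omp parallel
    {
        myData tmp;
        // .at() instead of operator[]: operator[] may insert and is not
        // safe to call concurrently; .at() only reads existing entries.
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) {
            tmp += combineData.at(ids[i]);
        }
        #pragma omp for schedule(static) ordered
        for (int i = 0; i < omp_get_num_threads(); i++) {
            #pragma omp ordered
            outputData += tmp;
        }
    }

    // If the order was preserved, values is exactly 0, 1, ..., n-1.
    printf("size=%zu first=%d last=%d\n", outputData.values.size(),
           outputData.values.front(), outputData.values.back());
    return 0;
}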
