一个preFIX总和并行(OpenMP的) [英] Parallelization of a prefix sum (Openmp)

查看:158
本文介绍了一个preFIX总和并行(OpenMP的)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个向量,一个[n]的和b [n​​],其中n是一个很大的数字。

  A [0] = B [0];对于(i = 1; I<大小;我++){
        一个由[i] = A [I-1] + B [I];
}

有了这个code,我们努力实现一个[I]包含了所有的数字的b的和[],直到B〔1]。我需要使用OpenMP来parallelise这个循环。

主要的问题是,[I]依赖的[I-1],所以唯一的直接浮现在我的脑海里的办法是等待每一个[I-1]号要准备好,这需要一个大量的时间和没有意义的。是否有解决这个问题的OpenMP中任何方式?


解决方案

您是卡尔·弗里德里希·高斯在18世纪和你的小学老师已决定,需要大量的世俗或反复算术功课问题惩罚类。在previous周老师告诉你要加起来第100计数数字,因为你很聪明,你想出的一个快速的解决方案。你的老师不喜欢这样,所以他想出了他认为不能优化一个新的问题。在用户自己的符号,你重写这样的问题。

  A [0] = B [0];
对(INT I = 1; I&下;大小;我++)一个由[i] = A [I-1] + B [I];

然后你意识到

  A0 = B [0]
A1 =(B [0])+ B [1];
A2 =((B [0])+ B [1])+ B [2]
A_N = B [0] + B [1] + B [2] +。B [n]的

用你的符号再次改写你的问题,因为

  INT总和= 0;
对(INT I = 0; I&下;大小;我++)总和+ = B [I]中,一个由[i] =总和;

如何优化呢?首先定义

  INT总和(N0,N){
    INT总和= 0;
    的for(int i = N0; I< N;我++)之和+ = B [I],一个[I] =总和;
    返回总和;
}

然后你意识到

  A_N + 1 = SUM(0,N)+ SUM(N,N + 1)
A_N + 2 =总和(0,N)+和(N,N + 2)
A_N + M =总和(0,N)+和(N,N + M)
A_N + M + K =总和(0,N)+和(N,N + M)+和(N + M,N + M + k)的

所以,现在你知道该怎么做。找到 T 同学。对数的一个子集每一个的工作。为了保持它的简单,你选择尺寸为100和四个同学 T0,T1,T2,T3 那么每个人做

  T0 T1 T2 T3
 S0 = SUM(0,25)S1 = SUM(25,50)S2 = SUM(50,75)S3 = SUM(75100)

同时。然后定义

 修复(INT N0,整数N,偏移的int){
    的for(int i = N0; I< N;我++)A [I] + = OFFSET
}

,然后将每个同学在同一时间可追溯到其子再次像这样

  T0 T1 T2 T3
修复(0,25,0)修复(25,50,S0)修复(50,75,S0 + S1)的修复(75,100,S0 + S1 + S2)

您认识到,与 T 同学采取大致相同的 K 秒钟做算术,你可以完成任务在 2 * K *尺寸/吨秒,而一个人将采取 K *尺寸秒。很明显你将至少需要两个同学只是收支平衡。因此,有四个同学,他们应该在一半的时间完成为一体的同班同学。

现在你写你的算法在用户自己的符号

 为int * SUMA; //从每个同学的部分结果数组
OMP的#pragma并行
{
    INT ithread = omp_get_thread_num(); //同学标签
    INT来确定nthreads = omp_get_num_threads(); //同学数
    OMP的#pragma单
    SUMA = malloc的(sizeof的* SUMA *(来确定nthreads + 1)),SUMA [0] = 0;    //现在每个同学计算它们的部分结果S = SUM(N0,N)
    INT S = 0;
    OMP的#pragma附表(静态)NOWAIT
    对(INT I = 0; I&下;大小;我++)S + = B [I]中,一个由[i] =总和;
    SUMA [ithread + 1] = S;    //现在等待每个同学完成
    OMP的#pragma屏障    //现在每个同学每个$ P $的pvious同学成绩总和
    INT偏移= 0;
    对(INT I = 0; I&≤(ithread + 1);我++)偏移+ = SUMA [I];    //现在每个同学纠正其结果
    OMP的#pragma附表(静态)
    的for(int i = 0; I<大小;我++)A [I] + =偏移;
}
免费(SUMA)

您知道,您可以优化的部分,每个同学都有添加previous同学的结果,而是因为尺寸>> ŧ你不认为这是值得的。

您的解决方案是不是几乎一样快,您的解决方案添加计数的数字但尽管如此,你的老师是不是高兴的是,他的几个学生完成比其他同学要早得多。所以,现在他决定一个学生都有慢慢阅读 B 数组类,当你报告结果 A 它必须缓慢进行为好。你叫这势必被读取/写入带宽。 <一href=\"https://stackoverflow.com/questions/18159455/why-vectorizing-the-loop-does-not-have-performance-improvement/18159503#18159503\">This严重限制了你的算法的有效性。的什么是你现在打算怎么办?

你能想到的唯一的事情是让同学们多在相同的阅读和记录号码之类的不同的子集时间

I have two vectors, a[n] and b[n], where n is a large number.

a[0] = b[0];

for (i = 1; i < size; i++) {
        a[i] = a[i-1] + b[i];
}

With this code we try to achieve that a[i] contains the sum of all the numbers in b[] until b[i]. I need to parallelise this loop using openmp.

The main problem is that a[i] depends of a[i-1], so the only direct way that comes to my mind would be waiting for each a[i-1] number to be ready, which takes a lot of time and makes no sense. Is there any approach in openmp for solving this problem?

解决方案

You're Carl Friedrich Gauss in the 18 century and your grade school teacher has decided to punish the class with a homework problem that requires a lot or mundane repeated arithmetic. In the previous week your teacher told you to add up the first 100 counting numbers and because you're clever you came up with a quick solution. Your teacher did not like this so he came up with a new problem which he thinks can't be optimized. In your own notation you rewrite the problem like this

a[0] = b[0];   
for (int i = 1; i < size; i++) a[i] = a[i-1] + b[i];

then you realize that

a0  = b[0]
a1  = (b[0]) + b[1];
a2  = ((b[0]) + b[1]) + b[2]
a_n = b[0] + b[1] + b[2] + ... b[n]

using your notation again you rewrite the problem as

int sum = 0;
for (int i = 0; i < size; i++) sum += b[i], a[i] = sum;

How to optimize this? First you define

int sum(n0, n) { 
    int sum = 0;
    for (int i = n0; i < n; i++) sum += b[i], a[i] = sum;
    return sum;
}

Then you realize that

a_n+1   = sum(0, n) + sum(n, n+1)
a_n+2   = sum(0, n) + sum(n, n+2)
a_n+m   = sum(0, n) + sum(n, n+m)
a_n+m+k = sum(0, n) + sum(n, n+m) + sum(n+m, n+m+k)

So now you know what to do. Find t classmates. Have each one work on a subset of the numbers. To keep it simple you choose size is 100 and four classmates t0, t1, t2, t3 then each one does

 t0               t1                t2              t3
 s0 = sum(0,25)   s1 = sum(25,50)   s2 = sum(50,75) s3 = sum(75,100)

at the same time. Then define

fix(int n0, int n, int offset) {
    for(int i=n0; i<n; i++) a[i] += offset
}

and then each classmates goes back over their subset at the same time again like this

t0             t1               t2                  t3 
fix(0, 25, 0)  fix(25, 50, s0)  fix(50, 75, s0+s1)  fix(75, 100, s0+s1+s2)

You realize that with t classmate taking about the same K seconds to do arithmetic that you can finish the job in 2*K*size/t seconds whereas one person would take K*size seconds. It's clear you're going to need at least two classmates just to break even. So with four classmates they should finish in half the time as one classmate.

Now your write up your algorithm in your own notation

int *suma;  // array of partial results from each classmate
#pragma omp parallel
{
    int ithread = omp_get_thread_num();    //label of classmate
    int nthreads = omp_get_num_threads();  //number of classmates
    #pragma omp single
    suma = malloc(sizeof *suma * (nthreads+1)), suma[0] = 0;

    //now have each classmate calculate their partial result s = sum(n0, n)
    int s = 0;
    #pragma omp for schedule(static) nowait
    for (int i=0; i<size; i++) s += b[i], a[i] = sum;
    suma[ithread+1] = s;

    //now wait for each classmate to finish
    #pragma omp barrier

    // now each classmate sums each of the previous classmates results
    int offset = 0;
    for(int i=0; i<(ithread+1); i++) offset += suma[i];

    //now each classmates corrects their result 
    #pragma omp for schedule(static)
    for (int i=0; i<size; i++) a[i] += offset;
}
free(suma)

You realize that you could optimize the part where each classmate has to add the result of the previous classmate but since size >> t you don't think it's worth the effort.

Your solution is not nearly as fast as your solution adding the counting numbers but nevertheless your teacher is not happy that several of his students finished much earlier than the other students. So now he decides that one student has to read the b array slowly to the class and when you report the result a it has to be done slowly as well. You call this being read/write bandwidth bound. This severely limits the effectiveness of your algorithm. What are you going to do now?

The only thing you can think of is to get multiple classmates to read and record different subsets of the numbers to the class at the same time.

这篇关于一个preFIX总和并行(OpenMP的)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆