前缀和的并行化 (Openmp) [英] Parallelization of a prefix sum (Openmp)

查看:25
本文介绍了前缀和的并行化 (Openmp)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个向量,a[n] 和 b[n],其中 n 是一个大数.

I have two vectors, a[n] and b[n], where n is a large number.

a[0] = b[0];

for (i = 1; i < size; i++) {
        a[i] = a[i-1] + b[i];
}

通过这段代码,我们试图实现 a[i] 包含 b[] 中所有数字的总和,直到 b[i].我需要使用 openmp 并行化这个循环.

With this code we try to achieve that a[i] contains the sum of all the numbers in b[] until b[i]. I need to parallelise this loop using openmp.

主要问题是 a[i] 依赖于 a[i-1],所以我想到的唯一直接方法是等待每个 a[i-1] 数准备好,这需要一个很多时间,没有任何意义.openmp 有没有办法解决这个问题?

The main problem is that a[i] depends of a[i-1], so the only direct way that comes to my mind would be waiting for each a[i-1] number to be ready, which takes a lot of time and makes no sense. Is there any approach in openmp for solving this problem?

推荐答案

你是 18 世纪的卡尔弗里德里希高斯,你的小学老师决定用一个需要大量或平凡的重复算术的家庭作业问题来惩罚全班.上周你的老师告诉你把前 100 个数加起来,因为你很聪明,你想出了 快速解决方案.你的老师不喜欢这样,所以他提出了一个他认为无法优化的新问题.用你自己的符号重写这个问题

You're Carl Friedrich Gauss in the 18 century and your grade school teacher has decided to punish the class with a homework problem that requires a lot or mundane repeated arithmetic. In the previous week your teacher told you to add up the first 100 counting numbers and because you're clever you came up with a quick solution. Your teacher did not like this so he came up with a new problem which he thinks can't be optimized. In your own notation you rewrite the problem like this

a[0] = b[0];   
for (int i = 1; i < size; i++) a[i] = a[i-1] + b[i];

然后你意识到

a0  = b[0]
a1  = (b[0]) + b[1];
a2  = ((b[0]) + b[1]) + b[2]
a_n = b[0] + b[1] + b[2] + ... b[n]

再次使用你的符号,你将问题改写为

using your notation again you rewrite the problem as

int sum = 0;
for (int i = 0; i < size; i++) sum += b[i], a[i] = sum;

如何优化?首先你定义

int sum(n0, n) { 
    int sum = 0;
    for (int i = n0; i < n; i++) sum += b[i], a[i] = sum;
    return sum;
}

然后你意识到

a_n+1   = sum(0, n) + sum(n, n+1)
a_n+2   = sum(0, n) + sum(n, n+2)
a_n+m   = sum(0, n) + sum(n, n+m)
a_n+m+k = sum(0, n) + sum(n, n+m) + sum(n+m, n+m+k)

现在你知道该怎么做了.找t 同学.让每个人处理数字的一个子集.为简单起见,您选择 size 为 100,四个同学 t0, t1, t2, t3 然后每个同学都做

So now you know what to do. Find t classmates. Have each one work on a subset of the numbers. To keep it simple you choose size is 100 and four classmates t0, t1, t2, t3 then each one does

 t0               t1                t2              t3
 s0 = sum(0,25)   s1 = sum(25,50)   s2 = sum(50,75) s3 = sum(75,100)

同时.然后定义

fix(int n0, int n, int offset) {
    for(int i=n0; i<n; i++) a[i] += offset
}

然后每个同学像这样再次同时返回他们的子集

and then each classmates goes back over their subset at the same time again like this

t0             t1               t2                  t3 
fix(0, 25, 0)  fix(25, 50, s0)  fix(50, 75, s0+s1)  fix(75, 100, s0+s1+s2)

你意识到,使用 t 的同学用大约相同的 K 秒来做算术,你可以在 2*K*size/t 秒,而一个人需要 K*size 秒.很明显,您至少需要两个同学才能收支平衡.因此,如果有四个同学,他们应该以一个同学的一半时间完成.

You realize that with t classmate taking about the same K seconds to do arithmetic that you can finish the job in 2*K*size/t seconds whereas one person would take K*size seconds. It's clear you're going to need at least two classmates just to break even. So with four classmates they should finish in half the time as one classmate.

现在你用你自己的符号写你的算法

Now your write up your algorithm in your own notation

int *suma;  // array of partial results from each classmate
#pragma omp parallel
{
    int ithread = omp_get_thread_num();    //label of classmate
    int nthreads = omp_get_num_threads();  //number of classmates
    #pragma omp single
    suma = malloc(sizeof *suma * (nthreads+1)), suma[0] = 0;

    //now have each classmate calculate their partial result s = sum(n0, n)
    int s = 0;
    #pragma omp for schedule(static) nowait
    for (int i=0; i<size; i++) s += b[i], a[i] = sum;
    suma[ithread+1] = s;

    //now wait for each classmate to finish
    #pragma omp barrier

    // now each classmate sums each of the previous classmates results
    int offset = 0;
    for(int i=0; i<(ithread+1); i++) offset += suma[i];

    //now each classmates corrects their result 
    #pragma omp for schedule(static)
    for (int i=0; i<size; i++) a[i] += offset;
}
free(suma)

你意识到你可以优化每个同学必须添加前一个同学的结果的部分,但由于size>>t 你认为不值得付出努力.

You realize that you could optimize the part where each classmate has to add the result of the previous classmate but since size >> t you don't think it's worth the effort.

您的解决方案没有您添加计数数字的解决方案快,但是您的老师对他的几个学生比其他学生早得多完成感到不高兴.所以现在他决定一个学生必须慢慢地向全班阅读b数组,当你报告结果a时,它也必须慢慢地完成.你称之为读/写带宽限制.这严重限制了算法的有效性. 你现在要做什么?

Your solution is not nearly as fast as your solution adding the counting numbers but nevertheless your teacher is not happy that several of his students finished much earlier than the other students. So now he decides that one student has to read the b array slowly to the class and when you report the result a it has to be done slowly as well. You call this being read/write bandwidth bound. This severely limits the effectiveness of your algorithm. What are you going to do now?

你唯一能想到的就是让多个同学同时读取不同的数字子集并记录到班级时间.

这篇关于前缀和的并行化 (Openmp)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆