Reductions in parallel in logarithmic time

Problem description

Given n partial sums it's possible to sum all the partial sums in log2(n) parallel steps. For example assume there are eight threads with eight partial sums: s0, s1, s2, s3, s4, s5, s6, s7. This could be reduced in log2(8) = 3 sequential steps like this:

thread0     thread1    thread2    thread3
s0 += s1    s2 += s3   s4 += s5   s6 += s7
s0 += s2    s4 += s6
s0 += s4
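
For reference, here is a minimal serial sketch of this pairwise scheme (sum_tree is a hypothetical helper name, not something from the code below); the OpenMP variants later in this post parallelize the inner loop:

void sum_tree(int *s, int n)   // reduces n partial sums, result ends up in s[0]
{
    for (int stride = 1; stride < n; stride *= 2)         // log2(n) steps
        for (int i = 0; i + stride < n; i += 2 * stride)  // pairwise combine
            s[i] += s[i + stride];
}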

I would like to do this with OpenMP but I don't want to use OpenMP's reduction clause. I have come up with a solution, but I think a better solution can be found, maybe using OpenMP's task clause.

This is more general than scalar addition. Let me choose a more useful case: an array reduction (see http://stackoverflow.com/questions/20413995/reducing-on-array-in-openmp/20421200#20421200, http://stackoverflow.com/questions/16789242/fill-histograms-array-reduction-in-parallel-with-openmp-without-using-a-critic, and http://stackoverflow.com/questions/30943452/openmp-c-parallel-for-loop-with-reduction-afterwards-best-practice/31002943#31002943 for more about array reductions).

Let's say I want to do an array reduction on an array a. Here is some code which fills private arrays in parallel for each thread.

int bins = 20;
int a[bins];
int **at;  // array of pointers to arrays
for(int i = 0; i<bins; i++) a[i] = 0;
#pragma omp parallel
{
    #pragma omp single   
    at = (int**)malloc(sizeof *at * omp_get_num_threads());        
    at[omp_get_thread_num()] = (int*)malloc(sizeof **at * bins);
    int a_private[bins];
    //arbitrary function to fill the arrays for each thread
    for(int i = 0; i<bins; i++) at[omp_get_thread_num()][i] = i + omp_get_thread_num();
}

At this point I have an array of pointers to arrays for each thread. Now I want to add all these arrays together and write the final sum to a. Here is the solution I came up with.

#pragma omp parallel
{
    int n = omp_get_num_threads();
    for(int m=1; n>1; m*=2) {
        int c = n%2;
        n/=2;
        #pragma omp for
        for(int i = 0; i<n; i++) {
            int *p1 = at[2*i*m], *p2 = at[2*i*m+m];
            for(int j = 0; j<bins; j++) p1[j] += p2[j];
        }
        n+=c;
    }
    #pragma omp single
    memcpy(a, at[0], sizeof *a*bins);
    free(at[omp_get_thread_num()]);
    #pragma omp single
    free(at);
}

Let me try to explain what this code does. Let's assume there are eight threads. Let's define the += operator to mean summing over the arrays, e.g. s0 += s1 is

for(int i=0; i<bins; i++) s0[i] += s1[i]

then this code would do

n   thread0     thread1    thread2    thread3
4   s0 += s1    s2 += s3   s4 += s5   s6 += s7
2   s0 += s2    s4 += s6
1   s0 += s4

But this code is not as ideal as I would like.

One problem is that there are a few implicit barriers which require all the threads to sync. These barriers should not be necessary. The first barrier is between filling the arrays and doing the reduction. The second barrier is at the #pragma omp for construct in the reduction. But I can't use the nowait clause with this method to remove that barrier.

Another problem is that several threads don't need to be used. For example, with eight threads the first step in the reduction only needs four threads, the second step two threads, and the last step only one thread. However, this method involves all eight threads in the reduction. That said, the other threads don't do much anyway; they should go right to the barrier and wait, so it's probably not much of an issue.

My instinct is that a better method can be found using the omp task clause. Unfortunately I have little experience with the task clause, and all my efforts so far to do a better reduction with it than what I have now have failed.

Can someone suggest a better solution to do the reduction in logarithmic time using e.g. OpenMP's task clause?

I found a method which solves the barrier problem. It reduces asynchronously. The only remaining problem is that it still puts threads which don't participate in the reduction into a busy loop. This method uses something like a stack: pointers are pushed onto the stack (but never popped) inside critical sections (this was one of the keys, since critical sections have no implicit barriers). The stack is operated on serially, but the reduction runs in parallel.

Here is a working example.

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <string.h>

void foo6() {
    int nthreads = 13;
    omp_set_num_threads(nthreads);
    int bins= 21;
    int a[bins];
    int **at;
    int m = 0;
    int nsums = 0;
    for(int i = 0; i<bins; i++) a[i] = 0;
    #pragma omp parallel
    {
        int n = omp_get_num_threads();
        int ithread = omp_get_thread_num();
        #pragma omp single
        at = (int**)malloc(sizeof *at * n * 2);
        int* a_private = (int*)malloc(sizeof *a_private * bins);

        //arbitrary fill function
        for(int i = 0; i<bins; i++) a_private[i] = i + omp_get_thread_num();

        #pragma omp critical (stack_section)
        at[nsums++] = a_private;

        while(nsums<2*n-2) {
            int *p1, *p2;
            char pop = 0;
            #pragma omp critical (stack_section)
            if((nsums-m)>1) p1 = at[m], p2 = at[m+1], m +=2, pop = 1;
            if(pop) {
                for(int i = 0; i<bins; i++) p1[i] += p2[i];
                #pragma omp critical (stack_section)
                at[nsums++] = p1;
            }
        }

        #pragma omp barrier
        #pragma omp single
        memcpy(a, at[2*n-2], sizeof **at *bins);
        free(a_private);
        #pragma omp single
        free(at);
    }
    for(int i = 0; i<bins; i++) printf("%d ", a[i]); puts("");
    for(int i = 0; i<bins; i++) printf("%d ", (nthreads-1)*nthreads/2 +nthreads*i); puts("");
}

int main(void) {
    foo6();
}

I still feel a better method may be found using tasks, one which does not put the unused threads in a busy loop.

Recommended answer

Actually, it is quite simple to implement that cleanly with tasks using a recursive divide-and-conquer approach. This is almost textbook code.

void operation(int* p1, int* p2, size_t bins)
{
    for (int i = 0; i < bins; i++)
        p1[i] += p2[i];
}

void reduce(int** arrs, size_t bins, int begin, int end)
{
    assert(begin < end);
    if (end - begin == 1) {
        return;
    }
    int pivot = (begin + end) / 2;
    /* Moving the termination condition here will avoid very short tasks,
     * but make the code less nice. */
#pragma omp task
    reduce(arrs, bins, begin, pivot);
#pragma omp task
    reduce(arrs, bins, pivot, end);
#pragma omp taskwait
    /* now begin and pivot contain the partial sums. */
    operation(arrs[begin], arrs[pivot], bins);
}

/* call this within a parallel region */
#pragma omp single
reduce(at, bins, 0, n);

As far as I can tell, there are no unnecessary synchronizations and there is no weird polling on critical sections. It also works naturally with a data size different from your number of ranks. I find it very clean and easy to understand. So I do indeed think this is better than both of your solutions.
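
For completeness, here is one way the call could be wired into the question's setup. This driver is only a sketch of my own (it reuses a, at, and bins from the question's first snippet and assumes <omp.h>, <stdlib.h>, and <string.h> are included); it is not part of the original answer:

#pragma omp parallel
{
    int n = omp_get_num_threads();
    int tid = omp_get_thread_num();
    #pragma omp single
    at = (int**)malloc(sizeof *at * n);
    int *mine = at[tid] = (int*)malloc(sizeof **at * bins);
    for(int i = 0; i<bins; i++) mine[i] = i + tid;  // arbitrary fill, as in the question

    #pragma omp barrier        // every partial array must be filled before reducing
    #pragma omp single
    reduce(at, bins, 0, n);    // one thread spawns the task tree; the others can pick
                               // up tasks while waiting at the implicit barrier

    #pragma omp single
    memcpy(a, at[0], sizeof *a * bins);  // the result was accumulated into at[0]
    free(mine);
    #pragma omp single
    free(at);
}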

But let's look at how it performs in practice*. For that we can use Score-P and Vampir:

*bins=10000 so the reduction actually takes a little bit of time. Executed on a 24-core Haswell system w/o turbo. gcc 4.8.4, -O3. I added some buffer around the actual execution to hide initialization/post-processing.

[Figure: execution of the three variants]

The picture reveals what is happening at any thread within the application on a horizontal time-axis. The three implementations from top to bottom:

  1. omp for loop
  2. omp critical kind of tasking.
  3. omp task

This shows nicely how the specific implementations actually execute. Now it seems that the for loop is actually the fastest, despite the unnecessary synchronizations. But there are still a number of flaws in this performance analysis. For example, I didn't pin the threads. In practice NUMA (non-uniform memory access) matters a lot: does the core have this data in its own cache / the memory of its own socket? This is where the task solution becomes non-deterministic. The very significant variance among repetitions is not considered in the simple comparison.
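
As a side note, pinning is usually requested from the runtime rather than in code. Here is a minimal sketch for checking whether a binding policy is in effect, assuming an OpenMP 4.0 runtime; the environment settings shown in the comment are the usual way to request pinning and are not taken from the measurements above:

/* Pinning is typically requested via the environment, e.g.
 *   OMP_PLACES=cores OMP_PROC_BIND=close ./a.out
 * This program only reports which binding policy the runtime will apply. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_proc_bind_t policy = omp_get_proc_bind();  // policy for the next parallel region
    printf("threads pinned: %s\n",
           policy != omp_proc_bind_false ? "yes" : "no");
    return 0;
}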

If the reduction operation becomes variable in runtime, then the task solution will become better than your synchronized for loop.

The critical solution has one interesting aspect: the passive threads are not waiting idly, so they are more likely to consume CPU resources. This can be bad for performance, e.g. in case of turbo mode.

Remember that the task solution has more optimization potential by avoiding spawning tasks that immediately return. How these solutions perform also highly depends on the specific OpenMP runtime. Intel's runtime seems to do much worse for tasks.
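
Here is a sketch of that optimization, following the hint in the comment inside reduce() above; this variant is my own and not code from the answer. It only spawns a task for a half that still contains something to combine, so single-element ranges never become tasks:

void reduce_cutoff(int** arrs, size_t bins, int begin, int end)
{
    assert(begin < end);
    if (end - begin == 1)
        return;
    int pivot = (begin + end) / 2;
    if (pivot - begin > 1) {     // a one-element half is already reduced
        #pragma omp task
        reduce_cutoff(arrs, bins, begin, pivot);
    }
    if (end - pivot > 1) {
        #pragma omp task
        reduce_cutoff(arrs, bins, pivot, end);
    }
    #pragma omp taskwait
    operation(arrs[begin], arrs[pivot], bins);
}

Using OpenMP's final clause on the task constructs (e.g. final(end - begin < cutoff) with some hypothetical cutoff) would be another way to run small subtrees sequentially instead of spawning tasks for them.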

My recommendation is:

  • Implement the most maintainable solution with optimal algorithmic complexity

  • Measure which parts of the code actually matter for run-time

  • Analyze, based on actual measurements, what the bottleneck is. In my experience it is more about NUMA and scheduling than about some unnecessary barrier.

  • Do the micro-optimization based on your actual measurements

Here is the linear proccess_data_v1 from this question:

[Figure: parallel timelines]

So I thought about OpenMP reduction. The tricky part seems to be getting the data from the at array inside the loop without a copy. I do initialize the worker array with NULL and simply move the pointer the first time:

void meta_op(int** pp1, int* p2, size_t bins)
{
    if (*pp1 == NULL) {
        *pp1 = p2;
        return;
    }
    operation(*pp1, p2, bins);
}

// ...

// declare before parallel region as global
int* awork = NULL;

#pragma omp declare reduction(merge : int* : meta_op(&omp_out, omp_in, 100000)) initializer (omp_priv=NULL)

#pragma omp for reduction(merge : awork)
        for (int t = 0; t < n; t++) {
            meta_op(&awork, at[t], bins);
        }

Surprisingly, this doesn't look too good:

[Figure: timelines of the OpenMP 4 reduction]

top is icc 16.0.2, bottom is gcc 5.3.0, both with -O3.

Both seem to implement the reduction serialized. I tried to look into gcc / libgomp, but it's not immediately apparent to me what is happening. From intermediate code / disassembly, they seem to be wrapping the final merge in a GOMP_atomic_start/end - and that seems to be a global mutex. Similarly icc wraps the call to the operation in a kmpc_critical. I suppose there wasn't much optimization going into costly custom reduction operations. A traditional reduction can be done with a hardware-supported atomic operation.
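
For contrast, a built-in scalar reduction like the following leaves the runtime free to combine the per-thread partial sums with hardware-supported atomic operations; this is a generic illustration, not code from the answer:

#include <stdio.h>

long sum_up_to(long n)
{
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)   // built-in reduction, no user-defined merge
    for (long i = 1; i <= n; i++)
        sum += i;
    return sum;
}

int main(void) {
    printf("%ld\n", sum_up_to(1000000));  // 500000500000
    return 0;
}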

Notice how each operation is faster because the input is cached locally, but due to the serialization it is overall slower. Again, this is not a perfect comparison due to high variances, and the earlier screenshots were with a different gcc version. But the trend is clear, and I also have data on the cache effects.
