Reductions in parallel in logarithmic time
Question
Given n partial sums, it's possible to sum all the partial sums in log2(n) parallel steps. For example, assume there are eight threads with eight partial sums: s0, s1, s2, s3, s4, s5, s6, s7. This could be reduced in log2(8) = 3 sequential steps like this:
thread0     thread1     thread2     thread3
s0 += s1    s2 += s3    s4 += s5    s6 += s7
s0 += s2                s4 += s6
s0 += s4
I would like to do this with OpenMP but I don't want to use OpenMP's reduction clause. I have come up with a solution, but I think a better solution can be found, maybe using OpenMP's task clause.
This is more general than scalar addition. Let me choose a more useful case: an array reduction (see here, here, and here for more about array reductions).
Let's say I want to do an array reduction on an array a. Here is some code which fills private arrays in parallel for each thread.
int bins = 20;
int a[bins];
int **at; // array of pointers to arrays
for(int i = 0; i<bins; i++) a[i] = 0;
#pragma omp parallel
{
    #pragma omp single
    at = (int**)malloc(sizeof *at * omp_get_num_threads());
    at[omp_get_thread_num()] = (int*)malloc(sizeof **at * bins);
    //arbitrary function to fill the arrays for each thread
    for(int i = 0; i<bins; i++) at[omp_get_thread_num()][i] = i + omp_get_thread_num();
}
At this point I have an array of pointers to arrays, one per thread. Now I want to add all these arrays together and write the final sum to a. Here is the solution I came up with.
#pragma omp parallel
{
    int n = omp_get_num_threads();
    for(int m=1; n>1; m*=2) {
        int c = n%2;
        n/=2;
        #pragma omp for
        for(int i = 0; i<n; i++) {
            int *p1 = at[2*i*m], *p2 = at[2*i*m+m];
            for(int j = 0; j<bins; j++) p1[j] += p2[j];
        }
        n+=c;
    }
    #pragma omp single
    memcpy(a, at[0], sizeof *a * bins);
    free(at[omp_get_thread_num()]);
    #pragma omp single
    free(at);
}
Let me try and explain what this code does. Let's assume there are eight threads. Let's define the += operator to mean to sum over the array, e.g. s0 += s1 is
for(int i=0; i<bins; i++) s0[i] += s1[i]
then this code would do
n    thread0     thread1     thread2     thread3
4    s0 += s1    s2 += s3    s4 += s5    s6 += s7
2    s0 += s2                s4 += s6
1    s0 += s4
But this code is not as ideal as I would like it to be.
One problem is that there are a few implicit barriers which require all the threads to sync. These barriers should not be necessary. The first barrier is between filling the arrays and doing the reduction. The second barrier is in the #pragma omp for declaration in the reduction. But I can't use the nowait clause with this method to remove the barrier.
Another problem is that several threads are used which don't need to be. For example with eight threads: the first step in the reduction only needs four threads, the second step two threads, and the last step only one thread. However, this method involves all eight threads in the reduction. Although the other threads don't do much anyway and should go right to the barrier and wait, it's probably not much of an issue.
My instinct is that a better method can be found using the omp task clause. Unfortunately I have little experience with the task clause, and all my efforts so far to do a reduction better than what I have now with it have failed.
Can someone suggest a better solution to do the reduction in logarithmic time using e.g. OpenMP's task clause?
I found a method which solves the barrier problem. This reduces asynchronously. The only remaining problem is that it still puts threads which don't participate in the reduction into a busy loop. This method uses something like a stack to push pointers to the stack (but never pops them) in critical sections (this was one of the keys, as critical sections don't have implicit barriers). The stack is operated on serially, but the reduction runs in parallel.
Here is a working example.
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <string.h>

void foo6() {
    int nthreads = 13;
    omp_set_num_threads(nthreads);
    int bins = 21;
    int a[bins];
    int **at;
    int m = 0;
    int nsums = 0;
    for(int i = 0; i<bins; i++) a[i] = 0;
    #pragma omp parallel
    {
        int n = omp_get_num_threads();
        int ithread = omp_get_thread_num();
        #pragma omp single
        at = (int**)malloc(sizeof *at * n * 2);
        int* a_private = (int*)malloc(sizeof *a_private * bins);

        //arbitrary fill function
        for(int i = 0; i<bins; i++) a_private[i] = i + ithread;

        #pragma omp critical (stack_section)
        at[nsums++] = a_private;

        while(nsums<2*n-2) {
            int *p1, *p2;
            char pop = 0;
            #pragma omp critical (stack_section)
            if((nsums-m)>1) p1 = at[m], p2 = at[m+1], m += 2, pop = 1;
            if(pop) {
                for(int i = 0; i<bins; i++) p1[i] += p2[i];
                #pragma omp critical (stack_section)
                at[nsums++] = p1;
            }
        }
        #pragma omp barrier
        #pragma omp single
        memcpy(a, at[2*n-2], sizeof **at * bins);
        free(a_private);
        #pragma omp single
        free(at);
    }
    for(int i = 0; i<bins; i++) printf("%d ", a[i]); puts("");
    for(int i = 0; i<bins; i++) printf("%d ", (nthreads-1)*nthreads/2 + nthreads*i); puts("");
}

int main(void) {
    foo6();
}
I still feel a better method may be found using tasks which does not put the unused threads in a busy loop.
Answer
Actually, it is quite simple to implement that cleanly with tasks using a recursive divide-and-conquer approach. This is almost textbook code.
void operation(int* p1, int* p2, size_t bins)
{
    for (int i = 0; i < bins; i++)
        p1[i] += p2[i];
}

void reduce(int** arrs, size_t bins, int begin, int end)
{
    assert(begin < end);
    if (end - begin == 1) {
        return;
    }
    int pivot = (begin + end) / 2;
    /* Moving the termination condition here will avoid very short tasks,
     * but make the code less nice. */
    #pragma omp task
    reduce(arrs, bins, begin, pivot);
    #pragma omp task
    reduce(arrs, bins, pivot, end);
    #pragma omp taskwait
    /* now begin and pivot contain the partial sums. */
    operation(arrs[begin], arrs[pivot], bins);
}

/* call this within a parallel region */
#pragma omp single
reduce(at, bins, 0, n);
As far as I can tell, there are no unnecessary synchronizations and there is no weird polling on critical sections. It also works naturally with a data size different from your number of ranks. I find it very clean and easy to understand. So I do indeed think this is better than both of your solutions.
But let's look at how it performs in practice*. For that we can use Score-P and Vampir:
* bins=10000, so the reduction actually takes a little bit of time. Executed on a 24-core Haswell system w/o turbo, gcc 4.8.4, -O3. I added some buffer around the actual execution to hide initialization/post-processing.
The picture reveals what is happening at any thread within the application on a horizontal time-axis. The three implementations from top to bottom:
- omp for loop
- omp critical kind of tasking
- omp task
This shows nicely how the specific implementations actually execute. Now it seems that the for loop is actually the fastest, despite the unnecessary synchronizations. But there are still a number of flaws in this performance analysis. For example, I didn't pin the threads. In practice NUMA (non-uniform memory access) matters a lot: does the core actually have the data in its own cache / memory of its own socket? This is where the task solution becomes non-deterministic. The very significant variance among repetitions is not considered in the simple comparison.
If the runtime of the reduction operation becomes variable, then the task solution will become better than your synchronized for loop.
The critical solution has one interesting aspect: the passive threads are not continuously waiting, so they will more likely consume CPU resources. This can be bad for performance, e.g. in case of turbo mode.
Remember that the task solution has more optimization potential by avoiding spawning tasks that immediately return. How these solutions perform also highly depends on the specific OpenMP runtime. Intel's runtime seems to do much worse for tasks.
My recommendation is:

- Implement the most maintainable solution with optimal algorithmic complexity
- Measure which parts of the code actually matter for run-time
- Analyze, based on actual measurements, what the bottleneck is. In my experience it is more about NUMA and scheduling rather than some unnecessary barrier.
- Perform the micro-optimization based on your actual measurements
Below is a timeline for the linear proccess_data_v1 from this question.
So I thought about OpenMP reduction. The tricky part seems to be getting the data from the at array inside the loop without a copy. I do initialize the worker array with NULL and simply move the pointer the first time:
void meta_op(int** pp1, int* p2, size_t bins)
{
    if (*pp1 == NULL) {
        *pp1 = p2;
        return;
    }
    operation(*pp1, p2, bins);
}

// ...

// declare before parallel region as global
int* awork = NULL;

#pragma omp declare reduction(merge : int* : meta_op(&omp_out, omp_in, 100000)) initializer (omp_priv=NULL)

#pragma omp for reduction(merge : awork)
for (int t = 0; t < n; t++) {
    meta_op(&awork, at[t], bins);
}
Surprisingly, this doesn't look too good:
Top is icc 16.0.2, bottom is gcc 5.3.0, both with -O3.
Both seem to implement the reduction serialized. I tried to look into gcc/libgomp, but it's not immediately apparent to me what is happening. From intermediate code / disassembly, they seem to be wrapping the final merge in a GOMP_atomic_start/end - and that seems to be a global mutex. Similarly icc wraps the call to operation in kmpc_critical. I suppose there wasn't much optimization going into costly custom reduction operations. A traditional reduction can be done with a hardware-supported atomic operation.
Notice how each operation is faster because the input is cached locally, but due to the serialization it is overall slower. Again this is not a perfect comparison due to the high variances, and the earlier screenshots were with a different gcc version. But the trend is clear, and I also have data on the cache effects.