How does newly introduced Arrays.parallelPrefix(...) in Java 8 work?

Problem Description

I came across this overloaded method, which performs an operation on each element of the input array in a cumulative fashion. For example, from the docs:

Cumulates, in parallel, each element of the given array in place, using the supplied function. For example if the array initially holds [2, 1, 0, 3] and the operation performs addition, then upon return the array holds [2, 3, 3, 6]. Parallel prefix computation is usually more efficient than sequential loops for large arrays.
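
For illustration, a minimal usage sketch of the method with the addition example from the docs (class and variable names are just for this demo):

  import java.util.Arrays;

  public class ParallelPrefixDemo {
      public static void main(String[] args) {
          int[] data = {2, 1, 0, 3};
          // cumulate in place, using addition as the (associative) operator
          Arrays.parallelPrefix(data, (left, right) -> left + right);
          System.out.println(Arrays.toString(data)); // prints [2, 3, 3, 6]
      }
  }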

So, how does Java achieve this task in parallel when the operation on a term depends on the result of the operation on the previous term, and so on?

I tried going through the code myself and they do use ForkJoinTasks, but it's not so straightforward how they merge the results to get the final array.

Recommended Answer

As explained in Eran’s answer, this operation utilizes the associativity property of the function.

Then, there are two fundamental steps. The first one is an actual prefix operation (in the sense of requiring the previous element(s) for the evaluation), applied to parts of the array in parallel. The result of each partial operation (identical to its resulting last element) is the offset for the remaining array.

E.g. for the following array, using sum as the prefix operation, and four processors

  4    9    5    1    0    5    1    6    6    4    6    5    1    6    9    3  

we get

  4 → 13 → 18 → 19    0 →  5 →  6 → 12    6 → 10 → 16 → 21    1 →  7 → 16 → 19  
                 ↓                   ↓                   ↓                   ↓  
                19                  12                  21                  19  

now, we utilize the associativity to apply the prefix operation to the offsets first

                 ↓                   ↓                   ↓                   ↓  
                19         →        31         →        52         →        71  

Then, we get to the second phase, which is to apply these offsets to each element of the next chunk. This is a perfectly parallelizable operation, as there is no dependency on the previous element(s) anymore

                     19   19   19   19   31   31   31   31   52   52   52   52  
                      ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓  
  4   13   18   19   19   24   25   31   37   41   47   52   53   59   68   71  
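
To make the two phases concrete, here is a simplified sketch of the scheme described above, specialized to int addition. This is only an illustration under the assumptions of this answer (a fixed number of chunks, parallel streams, and invented method and variable names such as twoPhasePrefixSum); the real JDK implementation behind Arrays.parallelPrefix is built on ForkJoinTasks and is considerably more elaborate.

  import java.util.Arrays;
  import java.util.stream.IntStream;

  public class TwoPhasePrefixSketch {
      // Illustrative two-phase prefix sum; assumes a.length > 0 and chunks >= 1.
      // Not the JDK's algorithm, just the scheme described in the answer above.
      static void twoPhasePrefixSum(int[] a, int chunks) {
          int len = a.length;
          int chunkSize = (len + chunks - 1) / chunks;

          // Phase 1: prefix operation within each chunk, chunks processed in parallel
          IntStream.range(0, chunks).parallel().forEach(c -> {
              int from = c * chunkSize, to = Math.min(from + chunkSize, len);
              for (int i = from + 1; i < to; i++)
                  a[i] += a[i - 1];
          });

          // Combine the per-chunk results (the last element of each chunk) into
          // offsets; only one operation per chunk, negligible for large arrays
          int[] offset = new int[chunks];
          for (int c = 1; c < chunks; c++)
              offset[c] = offset[c - 1] + a[Math.min(c * chunkSize, len) - 1];

          // Phase 2: add each chunk's offset to all of its elements; no dependency
          // between elements anymore, so freely parallelizable
          IntStream.range(1, chunks).parallel().forEach(c -> {
              int from = c * chunkSize, to = Math.min(from + chunkSize, len);
              for (int i = from; i < to; i++)
                  a[i] += offset[c];
          });
      }

      public static void main(String[] args) {
          int[] data = {4, 9, 5, 1, 0, 5, 1, 6, 6, 4, 6, 5, 1, 6, 9, 3};
          twoPhasePrefixSum(data, 4);
          // prints [4, 13, 18, 19, 19, 24, 25, 31, 37, 41, 47, 52, 53, 59, 68, 71]
          System.out.println(Arrays.toString(data));
      }
  }

Running it with four chunks reproduces the four-processor example above; note that the middle combination step is sequential, but it performs only one operation per chunk, so it does not matter for large arrays.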

When we use the same example for eight threads,

  4    9    5    1    0    5    1    6    6    4    6    5    1    6    9    3  

  4 → 13    5 →  6    0 →  5    1 →  7    6 → 10    6 → 11    1 →  7    9 → 12  
       ↓         ↓         ↓         ↓         ↓         ↓         ↓         ↓  
      13         6         5         7        10        11         7        12  

       ↓         ↓         ↓         ↓         ↓         ↓         ↓         ↓  
      13    →   19    →   24    →   31    →   41    →   52    →   59    →   71  

           13   13   19   19   24   24   31   31   41   41   52   52   59   59  
            ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓  
  4   13   18   19   19   24   25   31   37   41   47   52   53   59   68   71  

we see that there will be a clear benefit, even when we use the simpler strategy of keeping the work chunks the same for both steps, in other words, accepting one idle worker thread in the second phase. We will need about ⅛n for the first phase and ⅛n for the second, needing ¼n total for the operation (where n is the cost of the sequential prefix evaluation of the entire array). Of course, only roughly and in the best case.

In contrast, when we have only two processors

  4    9    5    1    0    5    1    6    6    4    6    5    1    6    9    3  


  4 → 13 → 18 → 19 → 19 → 24 → 25 → 31    6 → 10 → 16 → 21 → 22 → 28 → 37 → 40  
                                     ↓                                       ↓  
                                    31                                      40  

                                     ↓                                       ↓  
                                    31                   →                  71  

                                         31   31   31   31   31   31   31   31  
                                          ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓  
  4   13   18   19   19   24   25   31   37   41   47   52   53   59   68   71  

we can only gain a benefit when we re-assign the work of the second phase. This is possible, as said, because the second phase's work has no dependencies between the elements anymore. So we can split this operation arbitrarily, though it complicates the implementation and may introduce additional overhead.
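
Because of this independence, the second phase does not have to keep the original chunk boundaries; it could, for instance, be split per element. A hypothetical fragment, reusing the a, offset and chunkSize names from the sketch above:

  import java.util.stream.IntStream;

  // Hypothetical per-element variant of phase 2: each element only needs the
  // offset of its chunk, so the work can be split arbitrarily across workers.
  static void applyOffsetsPerElement(int[] a, int[] offset, int chunkSize) {
      IntStream.range(chunkSize, a.length).parallel()
               .forEach(i -> a[i] += offset[i / chunkSize]);
  }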

When we split the work of the second phase between both processors, the first phase needs about ½n and the second will need ¼n, yielding ¾n total, which is still a benefit if the array is large enough.

As an additional note, you might notice that the offsets calculated in preparation of the second phase are identical to the result for the last element of the chunk. So, you could reduce the required number of operations by one per chunk by simply assigning that value. But the typical scenario is to have only a few chunks (scaling with the number of processors) with a large number of elements, so saving one operation per chunk is not relevant.
