Understanding treeReduce() in Spark
Question
You can see the implementation here: https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L804
How does it differ from the 'normal' reduce function? What does depth = 2 mean?
I don't want the reducer function to pass linearly over the partitions; instead it should reduce each available pair first, then keep iterating like that until only one pair remains and reduce it to 1, as shown in the picture:
Is that what treeReduce does?
Answer
Standard reduce takes a wrapped version of the function and uses it to mapPartitions. After that, the results are collected and reduced locally on the driver. If the number of partitions is large and/or the function you use is expensive, this places a significant load on a single machine.
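To make the difference concrete, here is a minimal pure-Python sketch (not Spark code; all names are illustrative) of what plain reduce does: each partition is reduced locally, then every partial result is folded sequentially in one place, standing in for the driver.

```python
from functools import reduce as fold

def simulate_reduce(partitions, f):
    """Sketch of plain reduce: each partition is reduced locally
    (the mapPartitions step), then ALL partial results are folded
    sequentially in one place, like the driver in Spark."""
    partials = [fold(f, part) for part in partitions]  # distributed step
    return fold(f, partials)                           # single-machine step

partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(simulate_reduce(partitions, lambda a, b: a + b))  # 36
```

With thousands of partitions, that second `fold` call is where the driver becomes the bottleneck.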
The first phase of treeReduce is pretty much the same as above, but after that the partial results are merged in parallel, and only the final aggregation is performed on the driver.
depth is the suggested depth of the reduction tree; in some cases the process can stop early.
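The level-by-level merging can be sketched in pure Python as well. This is a simplification, not the actual Spark implementation: it borrows only the idea that the fan-in per level is derived from the partition count and the requested depth (roughly `ceil(n ** (1/depth))`, following the heuristic in Spark's treeAggregate).

```python
import math
from functools import reduce as fold

def simulate_tree_reduce(partitions, f, depth=2):
    """Sketch of tree-style reduction: reduce each partition locally,
    then merge the partials level by level in groups of ~scale; in a
    real cluster the merges within one level run in parallel."""
    partials = [fold(f, p) for p in partitions]
    n = len(partials)
    scale = max(math.ceil(n ** (1.0 / depth)), 2)  # fan-in per level
    while len(partials) > scale:
        partials = [fold(f, partials[i:i + scale])
                    for i in range(0, len(partials), scale)]
    return fold(f, partials)  # small final aggregation on the driver

partitions = [[i] for i in range(16)]
print(simulate_tree_reduce(partitions, lambda a, b: a + b, depth=2))  # 120
```

With 16 partitions and depth=2 the fan-in is 4, so the driver only ever folds 4 partials instead of 16.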
It is worth noting that what you get with treeReduce is not a binary tree. The number of partitions is adjusted at each level, and most likely more than two partitions will be merged at once.
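You can see this non-binary shape directly from the fan-in heuristic. The helper below is illustrative only (the name and the exact level-count logic are assumptions, simplified from the scale computation mentioned above): it reports the fan-in and how many partials survive each merge level.

```python
import math

def merge_plan(num_partitions, depth=2):
    """Illustrative only: fan-in per level and the number of partials
    remaining after each merge level, for a given depth."""
    scale = max(math.ceil(num_partitions ** (1.0 / depth)), 2)
    levels, n = [], num_partitions
    while n > scale:
        n = math.ceil(n / scale)
        levels.append(n)
    return scale, levels

print(merge_plan(64, depth=2))  # (8, [8]) -- 8-way merges, not binary
```

So with 64 partitions and depth=2, eight partials are combined at a time, and a single intermediate level of 8 partials remains before the final aggregation.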
Compared to the standard reduce, the tree-based version performs a reduceByKey on each iteration, which means a lot of data shuffling. If the number of partitions is relatively small, it will be much cheaper to use plain reduce. If you suspect that the final phase of reduce is a bottleneck, the tree* version could be worth trying.