Understanding treeReduce() in Spark

Question
You can see the implementation here: https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L804
How does it differ from the 'normal' reduce function? What does depth = 2 mean?
I don't want the reducer function to pass over the partitions linearly; instead, it should reduce each available pair first, then keep iterating like that until only one pair is left and reduce it to a single value, as shown in the picture:

Does treeReduce do this?
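The pairwise pattern the question describes can be sketched in plain Python (this is just an illustration of the picture, not PySpark's actual implementation):

```python
def pairwise_tree_reduce(f, values):
    """Repeatedly merge adjacent pairs until one value remains.

    A plain-Python sketch of the pattern in the picture; PySpark's
    treeReduce works differently (see the answer below).
    """
    while len(values) > 1:
        # merge each adjacent pair in this round
        merged = [f(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            # carry an unpaired trailing element into the next round
            merged.append(values[-1])
        values = merged
    return values[0]

print(pairwise_tree_reduce(lambda a, b: a + b, [1, 2, 3, 4, 5]))  # 15
```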
Answer
The standard reduce takes a wrapped version of the function and uses it with mapPartitions (https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L799). After that, the partial results are collected and reduced locally on the driver (https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L801). If the number of partitions is large and/or the function you use is expensive, this places a significant load on a single machine.
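The two stages of the standard reduce can be simulated in plain Python, with each inner list standing in for one partition (a sketch of the idea only; Spark's real code path is in the links above):

```python
from functools import reduce

def standard_reduce(f, partitions):
    """Simulate Spark's plain reduce over a list of partitions.

    Stage 1 runs once per partition (mapPartitions in Spark);
    stage 2 runs entirely on one machine, the driver -- the step
    that becomes a bottleneck when there are many partitions.
    """
    # Stage 1: reduce each partition independently.
    partials = [reduce(f, part) for part in partitions]
    # Stage 2: reduce all partial results on the driver.
    return reduce(f, partials)

print(standard_reduce(lambda a, b: a + b, [[1, 2], [3, 4], [5]]))  # 15
```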
The first phase of treeReduce is pretty much the same as above, but after that the partial results are merged in parallel, and only the final aggregation is performed on the driver.
depth is the suggested depth of the tree (https://spark.apache.org/docs/1.4.1/api/python/pyspark.html?highlight=treereduce#pyspark.RDD.treeReduce). Since the depth of a node in a tree is defined as the number of edges between the root and the node, it should give you more or less the expected pattern, although it looks like the distributed aggregation can be stopped early in some cases (https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L941).
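A rough pure-Python sketch of that scheme: reduce each partition, then merge partial results in rounds whose width is derived from depth. The scale factor mirrors Spark's choice of numPartitions ** (1/depth); the real implementation differs in the details:

```python
import math
from functools import reduce

def tree_reduce_sketch(f, partitions, depth=2):
    """Sketch of the treeReduce idea, not PySpark's actual code.

    Partial results are merged in groups of `scale` per round, so
    groups could be combined in parallel; only the final, small
    reduction would run on the driver.
    """
    partials = [reduce(f, part) for part in partitions]
    # group size per round, derived from the suggested tree depth
    scale = max(int(math.ceil(len(partials) ** (1.0 / depth))), 2)
    while len(partials) > scale:
        groups = [partials[i:i + scale]
                  for i in range(0, len(partials), scale)]
        # each group would be merged in parallel on the executors
        partials = [reduce(f, g) for g in groups]
    # final aggregation on the driver is now cheap
    return reduce(f, partials)

print(tree_reduce_sketch(lambda a, b: a + b,
                         [[1, 2], [3, 4], [5, 6], [7, 8]], depth=2))  # 36
```

With depth=2 and four partitions, scale is 2, so the four partial results are merged pairwise in one parallel round before the driver combines the last two.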
It is worth noting that what you get with treeReduce is not a binary tree. The number of partitions is adjusted on each level, and most likely more than two partitions will be merged at once.
Compared to the standard reduce, the tree-based version performs reduceByKey on each iteration (https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L953), which means a lot of data shuffling. If the number of partitions is relatively small, it will be much cheaper to use the plain reduce. If you suspect that the final phase of the reduce is a bottleneck, the tree* version could be worth trying.