Understanding treeReduce() in Spark


Question


You can see the implementation here: https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L804


How is it different from the 'normal' reduce function?
What does depth = 2 mean?


I don't want the reducer function to pass linearly over the partitions. Instead, I want it to reduce each available pair first, and then iterate like that until only one pair is left and it is reduced to a single value, as shown in the picture:

[image: diagram of pairwise, tree-style reduction]

Does treeReduce do this?

Answer


The standard reduce takes a wrapped version of the function and uses it with mapPartitions (https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L799). After that, the results are collected and reduced locally on the driver (https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L801). If the number of partitions is large and/or the function you use is expensive, this places a significant load on a single machine.
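As a rough illustration (not the actual implementation; it assumes a local SparkContext and uses addition as the reduce function), plain reduce behaves roughly like this:

```python
from functools import reduce as local_reduce
from operator import add

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(range(1, 101), 8)   # 8 partitions

# 1. Fold each partition on the executors (this is what mapPartitions is used for) ...
partials = rdd.mapPartitions(lambda it: [local_reduce(add, it)]).collect()

# 2. ... then fold the per-partition results locally on the driver.
total = local_reduce(add, partials)

print(total)             # 5050
print(rdd.reduce(add))   # 5050 -- the real thing, for comparison
```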


The first phase of treeReduce is pretty much the same as above, but after that the partial results are merged in parallel, and only the final aggregation is performed on the driver.

depth is the suggested depth of the tree (https://spark.apache.org/docs/1.4.1/api/python/pyspark.html?highlight=treereduce#pyspark.RDD.treeReduce), and since the depth of a node in a tree is defined as the number of edges between the root and that node, it should give you more or less the pattern you expect, although it looks like the distributed aggregation can be stopped early in some cases (https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L941).
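A minimal usage sketch (assuming the same sc as above and addition as the function; the depth values are just examples):

```python
from operator import add

rdd = sc.parallelize(range(1, 101), 64)   # 64 partitions

# Same result as rdd.reduce(add), but the partial results are combined
# in a tree of the suggested depth before the final value reaches the driver.
print(rdd.treeReduce(add, depth=2))   # 5050
print(rdd.treeReduce(add, depth=3))   # 5050, one more intermediate level
```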


It is worth noting that what you get with treeReduce is not a binary tree. The number of partitions is adjusted at each level, and most likely more than two partitions will be merged at once.
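To see why it is not binary, here is a small helper that paraphrases the partition-shrinking logic from the rdd.py code linked above (an approximation for illustration, not the exact Spark code):

```python
import math

def tree_levels(num_partitions, depth=2):
    """Roughly how the partition count shrinks per level in treeReduce/treeAggregate
    (a paraphrase of the logic in rdd.py, not the exact implementation)."""
    scale = max(int(math.ceil(num_partitions ** (1.0 / depth))), 2)
    levels = [num_partitions]
    while num_partitions > scale + num_partitions / scale:
        num_partitions //= scale
        levels.append(num_partitions)
    return levels

print(tree_levels(64, depth=2))   # [64, 8]      -- about 8 partitions merged per node, not 2
print(tree_levels(64, depth=3))   # [64, 16, 4]  -- a deeper tree, still not binary
```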


Compared to the standard reduce, the tree-based version performs reduceByKey (https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L953) with each iteration, which means a lot of data shuffling. If the number of partitions is relatively small, it will be much cheaper to use plain reduce. If you suspect that the final phase of the reduce is a bottleneck, the tree* version could be worth trying.
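As a rule of thumb, again assuming the same sc and addition (the partition counts here are only illustrative):

```python
from operator import add

few = sc.parallelize(range(10000), 4)     # few partitions: the driver-side merge is cheap
many = sc.parallelize(range(10000), 400)  # many partitions: the driver-side merge may bottleneck

few.reduce(add)                  # only 4 partial results reach the driver
many.treeReduce(add, depth=2)    # the 400 partials are merged in parallel first
```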
