Understanding treeReduce() in Spark


Problem description

You can see the implementation here: https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L804

How is it different from the 'normal' reduce function?
What does depth = 2 mean?

I don't want the reducer function to pass over the partitions linearly; instead it should reduce each available pair first, and then keep iterating like that until only one pair remains and reduce it to a single result, as shown in the picture:
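
In other words, roughly this pairwise pattern (a plain-Python sketch just to make the intent concrete, not Spark code):

```python
# Illustrative only: the pairwise pattern described above, in plain Python.
def pairwise_reduce(items, f):
    # Merge neighbouring pairs until a single value remains.
    while len(items) > 1:
        merged = [f(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
        if len(items) % 2 == 1:      # carry an unpaired element to the next round
            merged.append(items[-1])
        items = merged
    return items[0]

print(pairwise_reduce([1, 2, 3, 4, 5], lambda a, b: a + b))  # 15
```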

Does treeReduce achieve that?

Solution

Standard reduce takes a wrapped version of the function and uses it with mapPartitions. After that, the results are collected and reduced locally on the driver. If the number of partitions is large and/or the function you use is expensive, this places a significant load on a single machine.
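
To make that concrete, here is a minimal PySpark sketch of the same flow. The local SparkContext, the toy data and add are just illustrative choices, and it skips details (like empty partitions) that the real reduce handles:

```python
from functools import reduce as local_reduce
from operator import add

from pyspark import SparkContext

sc = SparkContext("local[4]", "reduce-sketch")
rdd = sc.parallelize(range(100), 8)

# Roughly what rdd.reduce(add) does:
# 1. apply the function inside each partition ...
partials = rdd.mapPartitions(lambda it: [local_reduce(add, it)])
# 2. ... then collect every partial result and fold them on the driver.
result = local_reduce(add, partials.collect())

print(result)           # 4950
print(rdd.reduce(add))  # 4950, same value
```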

The first phase of treeReduce is pretty much the same as above, but after that the partial results are merged in parallel, and only the final aggregation is performed on the driver.
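
A quick usage sketch of the difference (again with an illustrative local setup; both calls should return the same value, only the execution plan differs):

```python
from operator import add

from pyspark import SparkContext

sc = SparkContext("local[4]", "treeReduce-sketch")
rdd = sc.parallelize(range(1000), 64)

# Same answer, different plan: treeReduce merges partial results in
# extra cluster-side stages before the driver performs the final step.
print(rdd.reduce(add))               # 499500, all partials collected on the driver
print(rdd.treeReduce(add, depth=2))  # 499500, with an intermediate merge level
```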

depth is the suggested depth of the tree, and since the depth of a node in a tree is defined as the number of edges between the root and the node, it should give you more or less the expected pattern, although it looks like the distributed aggregation can be stopped early in some cases.

It is worth noting that what you get with treeReduce is not a binary tree. The number of partitions is adjusted at each level, and most likely more than two partitions will be merged at once.
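
The per-level fan-in comes from the partition-shrinking loop in the linked rdd.py. The sketch below paraphrases that logic (so treat it as a sketch rather than the exact code); it also shows how depth controls the number of intermediate levels:

```python
from math import ceil

def merge_plan(num_partitions, depth=2):
    # The fan-in per level is roughly num_partitions ** (1/depth), never below 2,
    # so with many partitions far more than two of them are merged at once.
    scale = max(int(ceil(num_partitions ** (1.0 / depth))), 2)
    levels = []
    # Keep adding levels only while shrinking the partition count still pays off.
    while num_partitions > scale + num_partitions / scale:
        num_partitions = int(num_partitions / scale)
        levels.append(num_partitions)
    return scale, levels

print(merge_plan(64, depth=2))  # (8, [8])     -> one intermediate level, ~8-way merges
print(merge_plan(64, depth=3))  # (4, [16, 4]) -> two intermediate levels, ~4-way merges
```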

Compared to the standard reduce, the tree-based version performs a reduceByKey on each iteration, which means a lot of data shuffling. If the number of partitions is relatively small, it will be much cheaper to use plain reduce. If you suspect that the final phase of reduce is a bottleneck, the tree* version could be worth trying.
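
If you want a starting point for choosing between the two, something like the heuristic below can work; the cutoff value is entirely made up for illustration and should be tuned for your data and cluster:

```python
from operator import add

from pyspark import SparkContext

sc = SparkContext("local[4]", "pick-reduce-sketch")
rdd = sc.parallelize(range(10_000), 200)

# Purely illustrative heuristic: pay treeReduce's extra shuffles only when
# there are enough partitions for the driver-side merge to become a problem.
MANY_PARTITIONS = 100  # arbitrary cutoff, not a recommendation
if rdd.getNumPartitions() > MANY_PARTITIONS:
    result = rdd.treeReduce(add, depth=2)
else:
    result = rdd.reduce(add)

print(result)  # 49995000
```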
