Understanding treeReduce() in Spark

Question

You can see the implementation here: https://github.com/apache/spark/blob/ffa05c84fe75663fc33f3d954d1cb1e084ab3280/python/pyspark/rdd.py#L804

How is it different from the 'normal' reduce function?
What does depth = 2 mean?

I don't want the reducer function to pass linearly over the partitions; I want it to reduce each available pair first, and then iterate like that until only one pair is left and reduce it to 1, as shown in the picture:

Can treeReduce do that?
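
For reference, both operations are called the same way; here is a minimal PySpark sketch (the local setup and the numbers are illustrative, not from the original question):

```python
from operator import add
from pyspark import SparkContext

sc = SparkContext("local[4]", "treeReduceDemo")     # illustrative local context
rdd = sc.parallelize(range(1, 101), numSlices=8)    # 8 partitions of 1..100

print(rdd.reduce(add))               # partition results merged one by one on the driver
print(rdd.treeReduce(add, depth=2))  # partition results merged level by level; same value, 5050
```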

Answer

The standard reduce takes a wrapped version of the function and uses it in mapPartitions. After that, the per-partition results are collected and reduced locally on the driver. If the number of partitions is large and/or the function you use is expensive, this places a significant load on a single machine.
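
A rough sketch of that two-phase flow (simplified; the actual reduce in the linked rdd.py wraps the function differently and handles more edge cases):

```python
from functools import reduce as local_reduce

def plain_reduce(rdd, f):
    # Phase 1: each partition is folded down to at most one value, in parallel.
    def fold_partition(iterator):
        items = list(iterator)
        if items:                      # empty partitions contribute nothing
            yield local_reduce(f, items)

    partials = rdd.mapPartitions(fold_partition).collect()

    # Phase 2: ALL per-partition results are merged sequentially on the
    # driver; this is the single-machine load described above.
    if not partials:
        raise ValueError("Cannot reduce() an empty RDD")
    return local_reduce(f, partials)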

The first phase of treeReduce is pretty much the same as above, but after that the partial results are merged in parallel, and only the final aggregation is performed on the driver.
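
The merge phase works by keying each partial result and shuffling it into ever fewer partitions, one level at a time. Below is a simplified paraphrase of the treeAggregate logic in the linked rdd.py (the names are mine; error handling and the zero-value plumbing are omitted):

```python
from math import ceil

def tree_merge(partials, comb, depth=2):
    # `partials` is the RDD of per-partition results from the first phase.
    num_partitions = partials.getNumPartitions()
    # Fan-in per level, chosen so roughly `depth` levels reach a single value.
    scale = max(int(ceil(num_partitions ** (1.0 / depth))), 2)

    while num_partitions > scale + num_partitions // scale:
        num_partitions //= scale
        cur = num_partitions
        partials = (
            partials
            # Key each value by (partition index mod cur) ...
            .mapPartitionsWithIndex(
                lambda i, it, cur=cur: ((i % cur, v) for v in it))
            # ... so that reduceByKey merges roughly `scale` partitions'
            # worth of values per key, in parallel on the executors.
            .reduceByKey(comb, cur)
            .values()
        )

    # Only the last few values ever reach the driver.
    return partials.reduce(comb)
```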

depth (default 2) is only the suggested depth of the merge tree; in some cases the merging can stop earlier than the requested depth, as the worked example below shows.

It is worth noting that what you get with treeReduce is not a binary tree. The number of partitions is adjusted at each level, and most likely more than two partitions will be merged at once.
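
A worked example of both points, using the loop sketched above (the partition count is made up): with 64 partials and depth = 2, the fan-in is 8 rather than 2, and only one shuffled level actually happens before the driver takes over, which is the early stop mentioned for depth:

```python
from math import ceil

num_partitions, depth = 64, 2                                # illustrative numbers
scale = max(int(ceil(num_partitions ** (1.0 / depth))), 2)   # ceil(64 ** 0.5) = 8
levels = 0
while num_partitions > scale + num_partitions // scale:      # 64 > 8 + 8 -> merge
    num_partitions //= scale
    levels += 1
print(scale, levels, num_partitions)  # 8 1 8: one 8-way level, then 8 values on the driver
```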

Compared to the standard reduce, the tree-based version performs a reduceByKey on each iteration, which means a lot of data shuffling. If the number of partitions is relatively small, it will be much cheaper to use the plain reduce. If you suspect that the final phase of reduce is a bottleneck, the tree* version could be worth trying.
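
A hedged sketch of the kind of workload where the tree* variant tends to pay off: many partitions whose partial results are expensive to merge (all names and sizes below are illustrative):

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[8]", "treeVsPlain")  # illustrative local context

# 200 partitions, each contributing a large dense vector. With plain reduce
# the driver alone merges all 200 vectors; with treeReduce the executors
# pre-merge them and the driver only ever sees the last few.
vecs = sc.parallelize(range(200), 200).map(lambda i: np.ones(1_000_000))

total_plain = vecs.reduce(lambda a, b: a + b)
total_tree = vecs.treeReduce(lambda a, b: a + b, depth=2)
```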
