Apache Spark Transformations: groupByKey vs reduceByKey vs aggregateByKey
Question
These three Apache Spark transformations are a little confusing. Is there any way I can determine when to use which one, and when to avoid one?
Answer
I think the official guide explains it well enough.
I will highlight the differences (assume your RDD is of type (K, V)):

- If you need to keep the values, then use groupByKey.
- If you don't need to keep the values, but you need to get some aggregated info about each group (the items of the original RDD which have the same K), you have two choices: reduceByKey or aggregateByKey (reduceByKey is a special case of aggregateByKey):
  - 2.1 If you can provide an operation which takes (V, V) as input and returns V, so that all the values of a group can be reduced to one single value of the same type, then use reduceByKey. As a result you will have an RDD of the same (K, V) type.
  - 2.2 If you cannot provide such an aggregation operation, then use aggregateByKey. That is the case when you reduce the values to another type, so you will get (K, V2) as a result.
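The distinction above can be sketched in plain Python. This is not Spark itself — the function names merely mirror the RDD API, and everything runs on a single "partition" — but it shows why reduceByKey needs a (V, V) -> V operation while aggregateByKey can change the value type:

```python
from collections import defaultdict
from functools import reduce

def group_by_key(pairs):
    """groupByKey semantics: keep every value, grouped per key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, op):
    """reduceByKey semantics: op is (V, V) -> V, so the result stays type V."""
    return {k: reduce(op, vs) for k, vs in group_by_key(pairs).items()}

def aggregate_by_key(pairs, zero, seq_op, comb_op):
    """aggregateByKey semantics: seq_op is (U, V) -> U, so the result type U
    may differ from the value type V.  comb_op merges partial U accumulators;
    on a real cluster it combines per-partition results (unused in this
    single-partition sketch)."""
    out = {}
    for k, vs in group_by_key(pairs).items():
        out[k] = reduce(seq_op, vs, zero)
    return out

data = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]

# groupByKey keeps all the values:
grouped = group_by_key(data)                      # {"a": [1, 2, 4], "b": [3]}

# reduceByKey: (V, V) -> V, output is the same (K, V) type:
sums = reduce_by_key(data, lambda x, y: x + y)    # {"a": 7, "b": 3}

# aggregateByKey: reduce to another type V2 = (sum, count),
# from which per-key averages fall out in a second pass:
sum_count = aggregate_by_key(
    data,
    zero=(0, 0),
    seq_op=lambda acc, v: (acc[0] + v, acc[1] + 1),
    comb_op=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
averages = {k: s / c for k, (s, c) in sum_count.items()}
```

The average computation is the classic case for aggregateByKey: no single (V, V) -> V operation can produce an average directly, but reducing into a (sum, count) pair of a different type can.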