Apache Spark Transformations: groupByKey vs reduceByKey vs aggregateByKey


Question

These three Apache Spark transformations are a little confusing. Is there any way I can determine when to use which one, and when to avoid one?

Answer

I think the official guide explains it well enough.

I will highlight the differences (assuming you have an RDD of type (K, V)):

  1. If you need to keep the values, use groupByKey.
  2. If you do not need to keep the values, but you need some aggregated information about each group (the items of the original RDD that have the same K), you have two choices: reduceByKey or aggregateByKey (reduceByKey is a special case of aggregateByKey).
    • 2.1 If you can provide an operation that takes (V, V) as input and returns V, so that all the values of the group can be reduced to one single value of the same type, then use reduceByKey. As a result, you will have an RDD of the same (K, V) type.
    • 2.2 If you cannot provide such an aggregation operation, use aggregateByKey. This happens when you reduce the values to another type, so you will have (K, V2) as a result.
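To make the distinction concrete, here is a small sketch of what each transformation computes, written in plain Python over a list of (K, V) pairs rather than a real RDD (the sample data and function names are illustrative, not Spark APIs; the sketch is single-partition, so aggregateByKey's combine step for merging per-partition accumulators is omitted):

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# groupByKey: keep all the values of each key
def group_by_key(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

# reduceByKey: fold each key's values with an op of type (V, V) -> V,
# so the result stays of type (K, V)
def reduce_by_key(pairs, op):
    out = {}
    for k, v in pairs:
        out[k] = op(out[k], v) if k in out else v
    return out

# aggregateByKey: fold into a different type V2, starting from a zero
# value, with a seq op of type (V2, V) -> V2 (combine op omitted in
# this single-partition sketch)
def aggregate_by_key(pairs, zero, seq_op):
    out = {}
    for k, v in pairs:
        out[k] = seq_op(out.get(k, zero), v)
    return out

print(group_by_key(pairs))                       # {'a': [1, 3, 5], 'b': [2, 4]}
print(reduce_by_key(pairs, lambda a, b: a + b))  # {'a': 9, 'b': 6}
# (sum, count) per key -- the result type (V2) differs from the value type (V)
print(aggregate_by_key(pairs, (0, 0), lambda acc, v: (acc[0] + v, acc[1] + 1)))
# {'a': (9, 3), 'b': (6, 2)}
```

Note that case 2.2 is exactly the last call: the input values are ints, but each key aggregates to a (sum, count) tuple, which reduceByKey's (V, V) -> V signature cannot express.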

