Is groupByKey ever preferred over reduceByKey
Question
I always use reduceByKey when I need to group data in RDDs, because it performs a map-side reduce before shuffling data, which often means that less data gets shuffled around and I thus get better performance. Even when the map-side reduce function collects all values and does not actually reduce the data amount, I still use reduceByKey, because I'm assuming that the performance of reduceByKey will never be worse than that of groupByKey. However, I'm wondering whether this assumption is correct, or whether there are indeed situations where groupByKey should be preferred?
Answer
I believe that there are other aspects of the problem, ignored by climbage and eliasah:
- code readability
- code maintainability
- codebase size
If an operation doesn't reduce the amount of data, it has to be, one way or another, semantically equivalent to groupByKey. Let's assume we have an RDD[(Int, String)]:
import scala.util.Random
Random.setSeed(1)
def randomString = Random.alphanumeric.take(Random.nextInt(10)).mkString("")
val rdd = sc.parallelize((1 to 20).map(_ => (Random.nextInt(5), randomString)))
and we want to concatenate all strings for a given key. With groupByKey
it is pretty simple:
rdd.groupByKey.mapValues(_.mkString(""))
A naive solution with reduceByKey looks like this:
rdd.reduceByKey(_ + _)
It is short and arguably easy to understand but suffers from two issues:
- It is terribly inefficient, because it creates a new String object every time*
- It suggests that the operation you perform is cheaper than it really is, especially if you analyze only the DAG or debug string
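The first issue is easy to reproduce without Spark. Here is a minimal, cluster-free sketch (the sample sequence `parts` is made up for illustration): each `+` on immutable Strings copies both operands, so folding n strings this way does O(n²) work, while a single mutable StringBuilder appends in amortized O(n).

```scala
// Plain-Scala illustration of what each merge strategy does per key.
val parts = Seq.fill(4)("ab")

// What reduceByKey(_ + _) does: allocates a fresh String on every merge.
val naive = parts.reduce(_ + _)

// What the combineByKey version below does: one buffer, appended in place.
val buffered = parts.foldLeft(new StringBuilder)(_ ++= _).toString

assert(naive == buffered) // same result, very different allocation profile
```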
To deal with the first problem we need a mutable data structure:
import scala.collection.mutable.StringBuilder

rdd.combineByKey[StringBuilder](
  (s: String) => new StringBuilder(s),
  (sb: StringBuilder, s: String) => sb ++= s,
  (sb1: StringBuilder, sb2: StringBuilder) => sb1.append(sb2)
).mapValues(_.toString)
It still suggests that something else is really going on, and it is quite verbose, especially if repeated multiple times in your script. You can of course extract the anonymous functions:
val createStringCombiner = (s: String) => new StringBuilder(s)
val mergeStringValue = (sb: StringBuilder, s: String) => sb ++= s
val mergeStringCombiners = (sb1: StringBuilder, sb2: StringBuilder) =>
  sb1.append(sb2)

rdd.combineByKey(createStringCombiner, mergeStringValue, mergeStringCombiners)
but at the end of the day it still means additional effort to understand this code, increased complexity, and no real added value. One thing I find particularly troubling is the explicit inclusion of mutable data structures. Even if Spark handles almost all of the complexity, it means we no longer have elegant, referentially transparent code.
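For completeness, if you want to keep the mutable buffer slightly more out of sight without the full combineByKey ceremony, aggregateByKey is a possible middle ground. This is a sketch, not a recommendation, and the same readability objections apply:

```scala
// Sketch: aggregateByKey takes a zero value plus two merge functions.
// The zero value is serialized once and deserialized per key, so a
// mutable StringBuilder is safe to use here.
rdd.aggregateByKey(new StringBuilder)(_ ++= _, _ append _)
  .mapValues(_.toString)
```

It is shorter, but a reader still has to know why a StringBuilder appears where reduceByKey would need nothing at all.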
My point is: if you really need to reduce the amount of data, by all means use reduceByKey. Otherwise you make your code harder to write, harder to analyze, and gain nothing in return.
Note: This answer is focused on the Scala RDD API. The current Python implementation is quite different from its JVM counterpart and includes optimizations which provide a significant advantage over a naive reduceByKey implementation in the case of groupBy-like operations.
For the Dataset API, see DataFrame / Dataset groupBy behaviour/optimization.
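To make that pointer concrete, here is a hypothetical DataFrame equivalent of the concatenation example, where Catalyst chooses the aggregation strategy instead of you. The column names "k" and "v" are assumptions for this sketch, and it requires an active SparkSession:

```scala
// Sketch only: `spark` is an assumed active SparkSession.
import org.apache.spark.sql.functions.{collect_list, concat_ws}
import spark.implicits._

val df = rdd.toDF("k", "v")
df.groupBy("k").agg(concat_ws("", collect_list("v")).as("concatenated"))
```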
* See Spark performance for Scala vs Python for a convincing example.