Is groupByKey ever preferred over reduceByKey?


Question

I always use reduceByKey when I need to group data in RDDs, because it performs a map-side reduce before shuffling data, which often means that less data gets shuffled and performance improves. Even when the map-side reduce function collects all values and does not actually reduce the amount of data, I still use reduceByKey, because I assume that reduceByKey will never perform worse than groupByKey. However, I wonder whether this assumption is correct, or whether there are situations where groupByKey should be preferred?
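
To make the map-side reduce concrete, here is a minimal sketch in plain Scala collections. It is a model, not Spark code: the name `partitions` and the sample data are made up for illustration, each inner Seq stands for one partition, and Scala 2.13 or later is assumed for `groupMapReduce`.

```scala
// Hypothetical model, not Spark code: each inner Seq stands for one partition.
val partitions: Seq[Seq[(Int, Int)]] = Seq(
  Seq((1, 10), (1, 20), (2, 5)),
  Seq((1, 30), (2, 15), (2, 25))
)

// Without a map-side combine, every record crosses the shuffle.
val shuffledWithoutCombine = partitions.map(_.size).sum

// With a map-side combine, each partition first merges its own values per key,
// so at most one record per key leaves each partition.
val combined: Seq[Map[Int, Int]] =
  partitions.map(_.groupMapReduce(_._1)(_._2)(_ + _))
val shuffledWithCombine = combined.map(_.size).sum

// Reduce side: merge the partial results per key.
val totals: Map[Int, Int] =
  combined.flatten.groupMapReduce(_._1)(_._2)(_ + _)
```

In this model six records shrink to four before the shuffle. The saving comes from the merged value (an Int sum) being smaller than its inputs; if the merge function only collected the values, the same number of elements would still have to move.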

Recommended answer

I believe there are other aspects of the problem that climbage and eliasah ignored:

  • code readability
  • code maintainability
  • codebase size

If an operation doesn't reduce the amount of data, it has to be, one way or another, semantically equivalent to groupByKey. Let's assume we have an RDD[(Int, String)]:

import scala.util.Random
Random.setSeed(1)

def randomString = Random.alphanumeric.take(Random.nextInt(10)).mkString("")

val rdd = sc.parallelize((1 to 20).map(_ => (Random.nextInt(5), randomString)))

and we want to concatenate all strings for a given key. With groupByKey it is pretty simple:

rdd.groupByKey.mapValues(_.mkString(""))

A naive solution with reduceByKey looks like this:

rdd.reduceByKey(_ + _)

It is short and arguably easy to understand, but it suffers from two issues:

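  • It is extremely inefficient, since it creates a new String object every time*
  • It suggests that the operation you perform is less expensive than it really is, especially if you analyze only the DAG or the debug string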
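
The first issue can be illustrated with a rough cost model in plain Scala. This is a sketch under stated assumptions, not a benchmark: the helper names `charsCopiedByConcat` and `charsCopiedByBuilder` are hypothetical, values are assumed to be merged one at a time into an accumulated string, and buffer growth inside StringBuilder is ignored.

```scala
// Hypothetical cost model, not a benchmark: count the characters copied
// while merging all values of one key into a single string.

// `acc + s` allocates a new String and copies acc.length + s.length chars.
// The running lengths are 0, len1, len1 + len2, ...; every length from the
// second merge result onward corresponds to one freshly copied String.
def charsCopiedByConcat(values: Seq[String]): Long =
  values.scanLeft(0L)((acc, s) => acc + s.length).drop(2).sum

// Appending to a StringBuilder copies each value once
// (buffer growth is ignored in this model).
def charsCopiedByBuilder(values: Seq[String]): Long =
  values.map(_.length.toLong).sum

val values = Seq.fill(1000)("abcde")
val concatCost = charsCopiedByConcat(values)   // 2502495
val builderCost = charsCopiedByBuilder(values) // 5000
```

Under this model, repeated concatenation grows quadratically with the number of values per key, while appending grows linearly.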

To deal with the first problem we need a mutable data structure:

import scala.collection.mutable.StringBuilder

rdd.combineByKey[StringBuilder](
    (s: String) => new StringBuilder(s),
    (sb: StringBuilder, s: String) => sb ++= s,
    (sb1: StringBuilder, sb2: StringBuilder) => sb1.append(sb2)
).mapValues(_.toString)

It still suggests something other than what is really going on, and it is quite verbose, especially if repeated multiple times in your script. You can of course extract the anonymous functions

val createStringCombiner = (s: String) => new StringBuilder(s)
val mergeStringValue = (sb: StringBuilder, s: String) => sb ++= s
val mergeStringCombiners = (sb1: StringBuilder, sb2: StringBuilder) => 
  sb1.append(sb2)

rdd.combineByKey(createStringCombiner, mergeStringValue, mergeStringCombiners)

but at the end of the day it still means additional effort to understand this code, increased complexity, and no real added value. One thing I find particularly troubling is the explicit inclusion of mutable data structures. Even if Spark handles almost all of the complexity, it means we no longer have elegant, referentially transparent code.

My point is: if you really reduce the amount of data, by all means use reduceByKey. Otherwise you make your code harder to write and harder to analyze, and you gain nothing in return.
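
As a side-by-side sketch of that rule of thumb, the snippet below uses reduceByKey where the merged value is smaller than its inputs and groupByKey where it is not. It assumes Spark is on the classpath and that a local master is acceptable; the names `words`, `totalLength`, and `concatenated` and the sample data are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: Spark is on the classpath and a local master is acceptable.
val conf = new SparkConf().setMaster("local[2]").setAppName("reduce-vs-group-sketch")
val sc = SparkContext.getOrCreate(conf)

val words = sc.parallelize(
  Seq((1, "spark"), (2, "rdd"), (1, "scala"), (2, "key"), (1, "api"))
)

// The merged value is a single Int per key, so map-side combining
// really shrinks what has to be shuffled.
val totalLength = words.mapValues(_.length).reduceByKey(_ + _).collect().toMap

// The merged value is as large as its inputs, so groupByKey states the
// intent directly. The order of values within a key is not guaranteed.
val concatenated = words.groupByKey().mapValues(_.mkString("")).collect().toMap
```

In spark-shell the `sc` returned by `getOrCreate` should be the context that already exists, so the first three lines can be skipped there.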

Note:

This answer focuses on the Scala RDD API. The current Python implementation is quite different from its JVM counterpart and includes optimizations that give it a significant advantage over a naive reduceByKey implementation for groupBy-like operations.

For the Dataset API, see DataFrame / Dataset groupBy behaviour/optimization.

* See Spark performance for Scala vs Python for a convincing example.
