Is groupByKey ever preferred over reduceByKey


Problem description

I always use reduceByKey when I need to group data in RDDs, because it performs a map-side reduce before shuffling the data, which often means that less data gets shuffled around and I thus get better performance. Even when the map-side reduce function collects all values and does not actually reduce the amount of data, I still use reduceByKey, because I'm assuming that the performance of reduceByKey will never be worse than that of groupByKey. However, I'm wondering if this assumption is correct, or if there are indeed situations where groupByKey should be preferred?
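For context, a minimal sketch of the two patterns being compared (assuming an existing SparkContext sc; the data is made up for illustration):

// Hypothetical pair RDD for illustration
val words = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

// reduceByKey computes partial sums per partition (map-side combine)
// before shuffling, so less data crosses the network.
val countsReduce = words.reduceByKey(_ + _)

// groupByKey shuffles every (key, value) pair and aggregates afterwards.
val countsGroup = words.groupByKey.mapValues(_.sum)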

Recommended answer

I believe there are other aspects of the problem, ignored by climbage and eliasah:

  • code readability

  • code maintainability

  • codebase size

If the operation doesn't reduce the amount of data, it has to be, one way or another, semantically equivalent to groupByKey. Let's assume we have RDD[(Int, String)]:

import scala.util.Random
Random.setSeed(1)

// random alphanumeric string of length 0 to 9
def randomString = Random.alphanumeric.take(Random.nextInt(10)).mkString("")

// 20 pairs with keys in 0..4 and random string values
val rdd = sc.parallelize((1 to 20).map(_ => (Random.nextInt(5), randomString)))

and we want to concatenate all strings for a given key. With groupByKey it is pretty simple:

rdd.groupByKey.mapValues(_.mkString(""))

A naive solution with reduceByKey looks like this:

rdd.reduceByKey(_ + _)

It is short and arguably easy to understand, but it suffers from two issues:

  • it is extremely inefficient, since it creates a new String object every time* (see the sketch after this list)

  • it suggests that the operation you perform is less expensive than it is in reality, especially if you analyze only the DAG or debug string
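To make the first point concrete, here is a small, non-Spark sketch of why repeated String concatenation is costly compared to a StringBuilder (the element count is arbitrary):

// Concatenating n strings with + copies all previously accumulated characters
// on every step, so total work grows roughly quadratically.
val parts = Seq.fill(1000)("abc")
val viaPlus = parts.reduce(_ + _)

// A StringBuilder appends in amortized constant time per character.
val viaBuilder = parts.foldLeft(new StringBuilder)(_ ++= _).toString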

To deal with the first problem we need a mutable data structure:

import scala.collection.mutable.StringBuilder

rdd.combineByKey[StringBuilder](
    (s: String) => new StringBuilder(s),                          // createCombiner
    (sb: StringBuilder, s: String) => sb ++= s,                   // mergeValue
    (sb1: StringBuilder, sb2: StringBuilder) => sb1.append(sb2)   // mergeCombiners
).mapValues(_.toString)

It still suggests something other than what is really going on and is quite verbose, especially if repeated multiple times in your script. You can of course extract the anonymous functions:

val createStringCombiner = (s: String) => new StringBuilder(s)
val mergeStringValue = (sb: StringBuilder, s: String) => sb ++= s
val mergeStringCombiners = (sb1: StringBuilder, sb2: StringBuilder) => 
  sb1.append(sb2)

rdd.combineByKey(createStringCombiner, mergeStringValue, mergeStringCombiners)
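Either variant still needs the final conversion from StringBuilder back to String; a minimal usage sketch (collect is only reasonable here because the example RDD is tiny):

val concatenated = rdd
  .combineByKey(createStringCombiner, mergeStringValue, mergeStringCombiners)
  .mapValues(_.toString)

concatenated.collect().foreach { case (k, s) => println(s"$k -> $s") }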

but at the end of the day it still means additional effort to understand this code, increased complexity, and no real added value. One thing I find particularly troubling is the explicit inclusion of mutable data structures. Even if Spark handles almost all of the complexity, it means we no longer have elegant, referentially transparent code.

My point is: if you really reduce the amount of data, by all means use reduceByKey. Otherwise you make your code harder to write, harder to analyze, and gain nothing in return.
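As a concrete instance of the "really reduce" case, a hypothetical sketch where keeping one running maximum per key collapses the data before the shuffle:

// Made-up measurements keyed by sensor id
val measurements = sc.parallelize(Seq((1, 3.0), (1, 7.5), (2, 1.2), (2, 9.9)))

// Each partition keeps at most one value per key, so the shuffle moves
// at most (number of keys x number of partitions) records.
val maxPerKey = measurements.reduceByKey((a, b) => math.max(a, b))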

* See Spark performance for Scala vs Python for a convincing example.
