Efficient grouping of data in Spark

Problem Description

I need to perform simple grouping of data in Spark (Scala). In particular, this is my initial data:

1, a, X
1, b, Y
2, a, Y
1, a, Y

val seqs = Seq((1, "a", "X"),(1, "b", "Y"),(2, "a", "Y"),(1, "a", "Y"))

I need to group it by the first key as follows:

1, (a, X), (b, Y), (a, Y)
2, (a, Y)

My initial idea was to use DataFrame and groupBy, but I have read that this operation is very expensive and requires a full shuffle of all the data.
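For concreteness, a minimal sketch of that DataFrame/groupBy idea might look like the following (this assumes a SparkSession named spark is in scope; collect_list and struct come from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._

// same sample rows, with named columns
val df = Seq((1, "a", "X"), (1, "b", "Y"), (2, "a", "Y"), (1, "a", "Y")).toDF("k", "v1", "v2")

// groupBy shuffles the data by key; collect_list gathers the (v1, v2) pairs for each key
val grouped = df.groupBy("k").agg(collect_list(struct("v1", "v2")).as("pairs"))
grouped.show(false)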

So, what is the least expensive way to perform this grouping? A concrete example would be appreciated.

Recommended Answer

You could probably do it like this:

  val rdd = sc.parallelize(List((1, "a", "X"), (1, "b", "Y"), (2, "a", "Y"), (1, "a", "Y")))
  val mapping = rdd.map(x => (x._1, List((x._2, x._3))))
  val result = mapping.reduceByKey((x, y) => x ++ y)

This uses reduceByKey, but the problem with any reduce is that each group must end up as a single key-value pair. So in this case you need to explicitly wrap each value in a List, so that the reduce step can concatenate them.
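For example, collecting the result should print something along these lines (a sketch only; the order of keys and of the elements within each list is not guaranteed):

result.collect().foreach(println)
// (1,List((a,X), (b,Y), (a,Y)))
// (2,List((a,Y)))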

You may also consider looking at combineByKey, which uses an internal reduce process.
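A minimal combineByKey sketch for the same data could look like the following (names are illustrative, and it assumes sc is a SparkContext; it builds a List per key within each partition, then merges the partition-level lists):

val pairs = sc.parallelize(List((1, "a", "X"), (1, "b", "Y"), (2, "a", "Y"), (1, "a", "Y")))
  .map { case (k, v1, v2) => (k, (v1, v2)) }

val combined = pairs.combineByKey(
  (v: (String, String)) => List(v),                                   // createCombiner: start a list for a new key
  (acc: List[(String, String)], v: (String, String)) => v :: acc,     // mergeValue: add a value within a partition
  (a: List[(String, String)], b: List[(String, String)]) => a ::: b   // mergeCombiners: merge lists across partitions
)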

====== EDIT ======

As zero323 pointed out, reducing here will not improve efficiency; on the contrary, this approach loses the optimizations that groupByKey provides.
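In other words, when the per-key values cannot actually be reduced, the plain groupByKey version is the more appropriate choice. A sketch, under the same assumption that sc is a SparkContext:

val grouped = sc.parallelize(List((1, "a", "X"), (1, "b", "Y"), (2, "a", "Y"), (1, "a", "Y")))
  .map { case (k, v1, v2) => (k, (v1, v2)) }
  .groupByKey()   // RDD[(Int, Iterable[(String, String)])]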
