Replace groupByKey with reduceByKey in Spark

Problem description

Hello, I often need to use groupByKey in my code, but I know it is a very heavy operation. Since I'm working on improving performance, I was wondering whether my approach of removing all groupByKey calls is efficient.

I used to create an RDD from another RDD, producing pairs of type (Int, Int):

rdd1 = [(1,2),(1,3),(2,3),(2,4),(3,5)]

and since I needed to obtain something like this:

[(1,[2,3]),(2,[3,4]),(3,[5])]

what I used was out = rdd1.groupByKey, but this approach can be very problematic with huge datasets.
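For context, a minimal sketch of what that original version might look like, assuming a SparkContext named sc and the small sample data above:

// Sketch only: sc is assumed to be an existing SparkContext.
val rdd1 = sc.parallelize(Seq((1, 2), (1, 3), (2, 3), (2, 4), (3, 5)))

// groupByKey shuffles every (key, value) pair and collects all values per key.
val out = rdd1.groupByKey()   // RDD[(Int, Iterable[Int])]
out.collect()                 // e.g. Array((1,CompactBuffer(2, 3)), (2,CompactBuffer(3, 4)), (3,CompactBuffer(5)))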

So instead, rather than building my RDD rdd1 with pairs of type (Int, Int), I now build it with pairs of type (Int, List[Int]), so my rdd1 looks like this:

rdd1 = [(1,[2]),(1,[3]),(2,[3]),(2,[4]),(3,[5])]

but this time, to reach the same result, I use reduceByKey(_ ::: _) to concatenate all the values by key, which is supposed to be faster. Do you think this approach might improve performance? And isn't the (Int, List[Int]) type a bit silly, since it creates pairs whose value is a list containing only one element?
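A minimal sketch of that alternative, under the same assumptions (sc is a SparkContext, same sample data):

// Wrap each value in a single-element List, then concatenate the lists per key.
val rdd1 = sc.parallelize(Seq((1, 2), (1, 3), (2, 3), (2, 4), (3, 5)))
  .mapValues(List(_))                  // RDD[(Int, List[Int])]

val out = rdd1.reduceByKey(_ ::: _)    // concatenate the per-key lists
out.collect()                          // e.g. Array((1,List(2, 3)), (2,List(3, 4)), (3,List(5)))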

Do you think there is a faster way to reach the same result, using some other method? Thank you.

Recommended answer

I don't think you should use reduceByKey if your end result is to be

[(1, [2, 3]), (2 , [3, 4]), (3, [5])]

Why? Because this is exactly what groupByKey is made for, so it probably does it best.

The problem with groupByKey is that you usually don't need the list (or array) of all values with the same key, but rather something you can compute from that list. If you don't really need the list, you can often perform the reduction in the same step as the shuffle, using reduceByKey.
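For example, if the end result were something like the sum of the values per key rather than the list itself, the reduction could happen during the shuffle. A sketch, again assuming a SparkContext sc and the same sample data:

val rdd1 = sc.parallelize(Seq((1, 2), (1, 3), (2, 3), (2, 4), (3, 5)))

// The + is applied map-side before the shuffle, so only partial sums travel over the network.
val sums = rdd1.reduceByKey(_ + _)
sums.collect()                         // e.g. Array((1,5), (2,7), (3,5))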

Two advantages of reduceByKey:

  • It can start reducing before the shuffle (combining values on the same executor first, to avoid unnecessary network traffic)
  • It never loads the whole array of values for a single key into memory. This matters for huge datasets, where that array can be several GB in size.

In your case, as you presented it, the first point is not very important (since there is no real reduction of the data, only concatenation), and the second point does not apply, because you need the whole list.

However, I strongly suggest that you think about whether you really need the whole list, or whether it is just a step in your computation, especially when you are working with large datasets.
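To illustrate with a hypothetical example: if the list were only an intermediate step, say to count the values per key, the groupByKey pipeline could be collapsed into a single reduceByKey (a sketch, assuming sc as before):

val rdd1 = sc.parallelize(Seq((1, 2), (1, 3), (2, 3), (2, 4), (3, 5)))

// Materialises the full per-key collection first, only to throw it away:
val countsViaGroup  = rdd1.groupByKey().mapValues(_.size)

// Never materialises the collection; counts while shuffling:
val countsViaReduce = rdd1.mapValues(_ => 1).reduceByKey(_ + _)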
