Spark groupByKey alternative
Question
According to Databricks best practices, Spark groupByKey should be avoided, because groupByKey works by first shuffling all the data across the workers and only then doing the processing.
So, my question is: what are the alternatives to groupByKey that will return the following in a distributed and fast way?
// want this
{"key1": "1", "key1": "2", "key1": "3", "key2": "55", "key2": "66"}
// to become this
{"key1": ["1","2","3"], "key2": ["55","66"]}
It seems to me that maybe aggregateByKey or glom could do it first within each partition (map) and then join all the lists together (reduce).
Answer
groupByKey is fine for the case when we want a "smallish" collection of values per key, as in the question.
The "do not use" warning on groupByKey applies to two general cases:
1) You want to aggregate over the values:
- DON'T:
rdd.groupByKey().mapValues(_.sum)
- DO:
rdd.reduceByKey(_ + _)
In this case, groupByKey will waste resources materializing a collection, while all we want as the answer is a single element.
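A short sketch of the two variants side by side (assumes a SparkContext `sc` is already in scope, e.g. in spark-shell; the sample data is made up):

```scala
val rdd = sc.parallelize(Seq("a" -> 1, "a" -> 2, "b" -> 10))

// DON'T: ships every value across the network, then sums per key
val sums1 = rdd.groupByKey().mapValues(_.sum)

// DO: sums within each partition first (map-side combine),
// so only partial sums are shuffled
val sums2 = rdd.reduceByKey(_ + _)

sums2.collect().foreach(println)  // (a,3) and (b,10), in some order
```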
2) You want to group very large collections over low-cardinality keys:
- DON'T:
allFacebookUsersRDD.map(user => (user.likesCats, user)).groupByKey()
- JUST DON'T
In this case, groupByKey will potentially result in an OOM error.
groupByKey materializes a collection with all the values for the same key in one executor. As mentioned, this has memory limitations, so depending on the case other options are better.
All the grouping functions, like groupByKey, aggregateByKey and reduceByKey, rely on the same base primitive: combineByKey. Therefore no other alternative will be better for the use case in the question; they all rely on the same common process.
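To make that last point concrete, here is a hedged sketch (again assuming a SparkContext `sc`) of the question's grouping expressed directly on combineByKey, the primitive the others build on:

```scala
val rdd = sc.parallelize(Seq("key1" -> "1", "key1" -> "2", "key1" -> "3",
                             "key2" -> "55", "key2" -> "66"))

val grouped = rdd.combineByKey(
  (v: String) => List(v),                        // createCombiner: first value for a key in a partition
  (acc: List[String], v: String) => v :: acc,    // mergeValue: fold in a value within a partition
  (a: List[String], b: List[String]) => a ::: b  // mergeCombiners: merge lists across partitions
)

grouped.collect().foreach(println)  // e.g. (key1,List(...)), (key2,List(...))
```

This is essentially what groupByKey does internally, which is why no combination of these operators can beat it for this use case: when the output must contain every value, the full data set has to be shuffled regardless.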