Spark groupByKey alternative
Question
According to Databricks best practices, Spark groupByKey should be avoided, because groupByKey processing works in a way that the information is first shuffled across workers and only then is the processing done. Explanation
So, my question is: what are the alternatives to groupByKey that will return the following in a distributed and fast way?
// want this
{"key1": "1", "key1": "2", "key1": "3", "key2": "55", "key2": "66"}
// to become this
{"key1": ["1","2","3"], "key2": ["55","66"]}
It seems to me that maybe aggregateByKey or glom could do it first within each partition (map) and then join all the lists together (reduce).
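A minimal sketch of that idea with aggregateByKey (assuming a SparkContext named sc and the string values from the example above):

// Build the per-key lists with aggregateByKey: values are combined inside each
// partition first, then the partial lists are merged across partitions.
val pairs = sc.parallelize(Seq(("key1", "1"), ("key1", "2"), ("key1", "3"),
                               ("key2", "55"), ("key2", "66")))

val grouped = pairs.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,   // seqOp: add a value to the partition-local list
  (l1, l2) => l1 ::: l2   // combOp: merge lists coming from different partitions
)

grouped.collect().foreach(println)
// e.g. (key1,List(3, 2, 1)), (key2,List(66, 55)) -- order inside a list is not guaranteed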
Answer
groupByKey is fine for the case when we want a "smallish" collection of values per key, as in the question.
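For the case in the question, a minimal sketch (assuming a SparkContext named sc) could look like this:

// For a small number of values per key, groupByKey plus mapValues gives the
// desired shape directly.
val pairs = sc.parallelize(Seq(("key1", "1"), ("key1", "2"), ("key1", "3"),
                               ("key2", "55"), ("key2", "66")))

val grouped = pairs.groupByKey().mapValues(_.toList)

grouped.collect().foreach(println)
// e.g. (key1,List(1, 2, 3)), (key2,List(55, 66))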
TL;DR
The "do not use" warning on groupByKey
applies for two general cases:
1) You want to aggregate over the values:
- Don't:
rdd.groupByKey().mapValues(_.sum)
- Do:
rdd.reduceByKey(_ + _)
In this case, groupByKey will waste resources materializing a collection, while what we want is a single element as the answer.
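A small, self-contained sketch of the difference (assuming a SparkContext named sc; the data is made up for illustration):

val nums = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 10)))

// Preferred: reduceByKey combines values map-side before the shuffle,
// so no per-key collection is ever materialized.
val sums = nums.reduceByKey(_ + _)          // (a,3), (b,10)

// Same result, but groupByKey first builds an Iterable of all values per key,
// only to collapse it to a single number afterwards.
val sumsViaGroup = nums.groupByKey().mapValues(_.sum)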
2) You want to group very large collections over low cardinality keys:
- Don't:
allFacebookUsersRDD.map(user => (user.likesCats, user)).groupByKey()
- Just don't.
In this case, groupByKey will potentially result in an OOM error.
groupByKey materializes a collection with all values for the same key in one executor. As mentioned, it has memory limitations and therefore, other options are better depending on the case.
All the grouping functions, like groupByKey, aggregateByKey and reduceByKey, rely on the same base: combineByKey. Therefore, no other alternative will be better for the use case in the question; they all rely on the same common process.
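As an illustration, here is the grouping from the question expressed directly with combineByKey, the primitive the higher-level functions are built on (a sketch, assuming a SparkContext named sc):

val pairs = sc.parallelize(Seq(("key1", "1"), ("key1", "2"), ("key1", "3"),
                               ("key2", "55"), ("key2", "66")))

val grouped = pairs.combineByKey(
  (v: String) => List(v),                             // createCombiner: start a list for a new key
  (acc: List[String], v: String) => v :: acc,          // mergeValue: add a value within a partition
  (l1: List[String], l2: List[String]) => l1 ::: l2    // mergeCombiners: merge lists across partitions
)

grouped.collect().foreach(println)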