Spark groupByKey alternative


Question

According to Databricks best practices, Spark groupByKey should be avoided as Spark groupByKey processing works in a way that the information will be first shuffled across workers and then the processing will occur. Explanation

So, my question is, what are the alternatives for groupByKey in a way that it will return the following in a distributed and fast way?

// want this
{"key1": "1", "key1": "2", "key1": "3", "key2": "55", "key2": "66"}
// to become this
{"key1": ["1","2","3"], "key2": ["55","66"]}

Seems to me that maybe aggregateByKey or glom could do it first in the partition (map) and then join all the lists together (reduce).
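
Below is a minimal sketch (Scala) of that aggregateByKey idea; the SparkContext sc and the RDD name pairs are illustrative assumptions, not part of the original post.

// Assumes a SparkContext `sc` (e.g. the spark-shell); `pairs` is an illustrative name.
val pairs = sc.parallelize(Seq(
  ("key1", "1"), ("key1", "2"), ("key1", "3"),
  ("key2", "55"), ("key2", "66")
))

val grouped = pairs.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,   // within a partition: prepend each value to the list
  (a, b)   => a ::: b     // across partitions: concatenate the partial lists
).mapValues(_.reverse)    // reverse to roughly restore insertion order, if that matters

grouped.collect().foreach(println)
// e.g. (key1,List(1, 2, 3))
//      (key2,List(55, 66))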

Answer

groupByKey is fine for the case when we want a "smallish" collection of values per key, as in the question.
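
For illustration, a hedged sketch of producing exactly the grouping asked for, reusing the illustrative pairs RDD defined in the sketch under the question:

// groupByKey gathers every value of a key into a single Iterable on one executor.
val byKey = pairs.groupByKey().mapValues(_.toList)

byKey.collect().foreach(println)
// (key1,List(1, 2, 3))
// (key2,List(55, 66))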


TL;DR

The "do not use" warning on groupByKey applies for two general cases:

1) You want to aggregate over the values:

  • Don't: rdd.groupByKey().mapValues(_.sum)
  • Do: rdd.reduceByKey(_ + _)

In this case, groupByKey will waste resources materializing a collection while what we want is a single element as the answer.
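
As a hedged sketch of why the reduceByKey form is preferred (the RDD[(String, Int)] named counts is an assumption, not from the original answer): reduceByKey combines values on each partition before the shuffle, so only one partial sum per key and partition crosses the network.

// Assumed example data; only per-partition partial sums are shuffled.
val counts = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 5)))
val totals = counts.reduceByKey(_ + _)   // collect() gives Array((a,3), (b,5))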

2) You want to group very large collections over low cardinality keys:

  • Don't: allFacebookUsersRDD.map(user => (user.likesCats, user)).groupByKey()
  • Just don't.

In this case, groupByKey will potentially result in an OOM error.

groupByKey materializes a collection with all values for the same key in one executor. As mentioned, it has memory limitations and therefore, other options are better depending on the case.

All the grouping functions, like groupByKey, aggregateByKey and reduceByKey, rely on the same base: combineByKey. Therefore no other alternative will be better for the use case in the question; they all rely on the same common process.
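
To make that concrete, here is a hedged sketch of the same grouping written directly against the combineByKey primitive; the variable names are illustrative, and pairs is the RDD assumed in the earlier sketches.

import scala.collection.mutable.ArrayBuffer

// The grouped-list result expressed with combineByKey, which the other functions build on.
val viaCombine = pairs.combineByKey(
  (v: String) => ArrayBuffer(v),                                   // start a buffer for a key's first value
  (buf: ArrayBuffer[String], v: String) => buf += v,               // fold further values into the buffer
  (b1: ArrayBuffer[String], b2: ArrayBuffer[String]) => b1 ++= b2  // merge buffers from different partitions
).mapValues(_.toList)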

