Spark groupByKey alternative

Problem description

According to Databricks best practices, Spark groupByKey should be avoided, because groupByKey first shuffles the data across the workers and only then performs the processing (explanation).

So, my question is: what are the alternatives to groupByKey that will return the following in a distributed and fast way?

// want this
{"key1": "1", "key1": "2", "key1": "3", "key2": "55", "key2": "66"}
// to become this
{"key1": ["1","2","3"], "key2": ["55","66"]}

It seems to me that maybe aggregateByKey or glom could do it first within each partition (map) and then join all the lists together (reduce), as in the sketch below.
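
A minimal sketch of that idea, using aggregateByKey so the shuffle moves partially built per-key lists instead of every individual value (the sample data, the app name, and the local master are illustrative assumptions, not part of the question):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("group-lists").setMaster("local[*]"))

val rdd = sc.parallelize(Seq(
  ("key1", "1"), ("key1", "2"), ("key1", "3"),
  ("key2", "55"), ("key2", "66")
))

// Within a partition, prepend each value to the per-key list;
// across partitions, concatenate the partial lists.
val grouped = rdd.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,
  (left, right) => left ::: right
)

grouped.collect().foreach(println)
// e.g. (key1,List(3, 2, 1)), (key2,List(66, 55)) -- element order within a list is not guaranteed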

Solution

groupByKey is fine for the case when we want a "smallish" collection of values per key, as in the question.

The "do not use" warning on groupByKey applies for two general cases:

1) You want to aggregate over the values:

  • DON'T: rdd.groupByKey().mapValues(_.sum)
  • DO: rdd.reduceByKey(_ + _)

In this case, groupByKey will waste resources materializing a collection, while what we want is just a single element as the answer.
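
As a small illustration of this case (a sketch; the counts RDD is made up and sc is the SparkContext from the sketch above): reduceByKey computes partial sums per partition before the shuffle, whereas groupByKey ships every individual value across the network just to sum it afterwards.

val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

// DON'T: materializes the full collection of values per key just to sum it
val viaGroup = counts.groupByKey().mapValues(_.sum)

// DO: partial sums are computed map-side per partition, then merged after the shuffle
val viaReduce = counts.reduceByKey(_ + _)

viaReduce.collect().foreach(println)   // (a,3), (b,1)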

2) You want to group very large collections over low cardinality keys:

  • DON'T: allFacebookUsersRDD.map(user => (user.likesCats, user)).groupByKey()
  • JUST DON'T

In this case, groupByKey will potentially result in an OOM error.

groupByKey materializes a collection with all the values for the same key in one executor. As mentioned, it has memory limitations, so depending on the case other options are better.

All the grouping functions, like groupByKey, aggregateByKey and reduceByKey, build on the same primitive, combineByKey, so no other alternative will be better for the use case in the question: they all rely on the same common process.
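
For illustration only, here is the per-key list grouping from the question expressed directly against that primitive (a sketch, reusing the hypothetical rdd defined in the question section above):

val viaCombine = rdd.combineByKey(
  (v: String) => List(v),                        // create a combiner from the first value seen for a key
  (acc: List[String], v: String) => v :: acc,    // merge another value into a per-partition combiner
  (a: List[String], b: List[String]) => a ::: b  // merge combiners coming from different partitions
)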
