Should I Avoid groupby() in Dataset/Dataframe?


Question

I know that with RDDs we were discouraged from using groupByKey and encouraged to use alternatives such as reduceByKey() and aggregateByKey(), since these methods reduce first on each partition and only then shuffle and merge, which cuts the amount of data being shuffled.
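As a minimal sketch of that difference (illustrative data, spark-shell style, where a SparkSession named `spark` is already in scope):

```scala
// Illustrative pair RDD; in spark-shell a SparkSession `spark` exists.
val pairs = spark.sparkContext
  .parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// Preferred: the map-side combine runs within each partition before
// the shuffle, so only one (key, partialSum) pair per key per
// partition crosses the network.
val sums = pairs.reduceByKey(_ + _)

// Discouraged for aggregation: every (key, value) pair is shuffled
// first, and the summing happens only afterwards.
val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

sums.collect().foreach(println)         // (a,4), (b,6) in some order
sumsViaGroup.collect().foreach(println) // same result, more shuffle
```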

Now, my question is whether this still applies to Dataset/Dataframe. I was thinking that, since the Catalyst engine does a lot of optimization, Catalyst would automatically know that it should reduce on each partition and then perform the groupBy. Am I correct? Or do we still need to take steps to ensure that the reduction on each partition is performed before the groupBy?

Answer

groupBy should be used with DataFrames and Datasets. Your thinking is exactly right: the Catalyst Optimizer will build the plan and optimize everything that feeds into the groupBy, along with any other aggregations you want to perform.
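One way to see this for yourself (a minimal sketch, assuming spark-shell or an equivalent SparkSession named `spark`) is to ask for the physical plan: Catalyst inserts a partial aggregation before the shuffle without being told to.

```scala
import spark.implicits._

// Illustrative DataFrame with a key column and a value column.
val df = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)).toDF("key", "value")

// A plain relational groupBy; no manual per-partition reduction.
val agg = df.groupBy("key").sum("value")

agg.explain()
// The printed physical plan contains something like:
//   HashAggregate(keys=[key], functions=[sum(value)])               <- final merge
//   +- Exchange hashpartitioning(key, ...)                          <- shuffle
//      +- HashAggregate(keys=[key], functions=[partial_sum(value)]) <- map-side
// The partial_sum before the Exchange is exactly the per-partition
// reduction the question asks about, planned automatically.
```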

There is a good example from Spark 1.4 at this link that shows a comparison of reduceByKey on an RDD versus groupBy on a DataFrame.
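As a rough sketch of what that comparison looks like in code (illustrative data, not the original benchmark from the link; spark-shell style with a SparkSession named `spark`):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.sum

val data = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4))

// RDD version: the map-side combine must be requested explicitly
// by choosing reduceByKey over groupByKey.
val rddResult = spark.sparkContext
  .parallelize(data)
  .reduceByKey(_ + _)
  .collect()

// DataFrame version: Catalyst plans the partial aggregation itself,
// so a plain groupBy is the idiomatic and efficient choice.
val dfResult = data
  .toDF("key", "value")
  .groupBy("key")
  .agg(sum("value"))
  .collect()
```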

You can see that the DataFrame version is really much faster than the RDD one; groupBy lets Catalyst optimize the whole execution. For more details, see the official Databricks post introducing DataFrames.
