Should I Avoid groupby() in Dataset/Dataframe?


Question

I know that with RDDs we were discouraged from using groupByKey and encouraged to use alternatives such as reduceByKey() and aggregateByKey(), since these methods reduce first on each partition and only then shuffle and merge, which cuts the amount of data being shuffled.
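As a minimal sketch of that difference (illustrative data, spark-shell style, where a SparkSession named `spark` is already in scope):

```scala
// Illustrative pair RDD; in spark-shell a SparkSession `spark` exists.
val pairs = spark.sparkContext
  .parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// Preferred: the map-side combine runs within each partition before
// the shuffle, so only one (key, partialSum) pair per key per
// partition crosses the network.
val sums = pairs.reduceByKey(_ + _)

// Discouraged for aggregation: every (key, value) pair is shuffled
// first, and the summing happens only afterwards.
val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

sums.collect().foreach(println)         // (a,4), (b,6) in some order
sumsViaGroup.collect().foreach(println) // same result, more shuffle
```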

Now, my question is whether this still applies to Dataset/Dataframe. I was thinking that, since the Catalyst engine does a lot of optimization, Catalyst would automatically know that it should reduce on each partition and then perform the groupBy. Am I correct? Or do we still need to take steps to ensure that the reduction on each partition is performed before the groupBy?

Answer

groupBy should be used with DataFrames and Datasets. Your thinking is exactly right: the Catalyst Optimizer will build the plan and optimize everything that feeds into the groupBy, along with any other aggregations you want to perform.
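One way to see this for yourself (a minimal sketch, assuming spark-shell or an equivalent SparkSession named `spark`) is to ask for the physical plan: Catalyst inserts a partial aggregation before the shuffle without being told to.

```scala
import spark.implicits._

// Illustrative DataFrame with a key column and a value column.
val df = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)).toDF("key", "value")

// A plain relational groupBy; no manual per-partition reduction.
val agg = df.groupBy("key").sum("value")

agg.explain()
// The printed physical plan contains something like:
//   HashAggregate(keys=[key], functions=[sum(value)])               <- final merge
//   +- Exchange hashpartitioning(key, ...)                          <- shuffle
//      +- HashAggregate(keys=[key], functions=[partial_sum(value)]) <- map-side
// The partial_sum before the Exchange is exactly the per-partition
// reduction the question asks about, planned automatically.
```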

There is a good example from Spark 1.4 at this link that shows a comparison of reduceByKey on an RDD versus groupBy on a DataFrame.
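As a rough sketch of what that comparison looks like in code (illustrative data, not the original benchmark from the link; spark-shell style with a SparkSession named `spark`):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.sum

val data = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4))

// RDD version: the map-side combine must be requested explicitly
// by choosing reduceByKey over groupByKey.
val rddResult = spark.sparkContext
  .parallelize(data)
  .reduceByKey(_ + _)
  .collect()

// DataFrame version: Catalyst plans the partial aggregation itself,
// so a plain groupBy is the idiomatic and efficient choice.
val dfResult = data
  .toDF("key", "value")
  .groupBy("key")
  .agg(sum("value"))
  .collect()
```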

You can see that the DataFrame version is really much faster than the RDD one; groupBy lets Catalyst optimize the whole execution. For more details, see the official Databricks post introducing DataFrames.
