What is the difference between partitioning and bucketing in Spark?


Question

I am trying to optimize a join query between two Spark DataFrames, call them df1 and df2 (joined on the common column "SaleId"). df1 is very small (5M), so I broadcast it among the nodes of the Spark cluster. df2 is very large (200M rows), so I tried to bucket/repartition it by "SaleId".

In Spark, what is the difference between partitioning the data by column and bucketing the data by column?

For example:

Partitioning:

df2 = df2.repartition(10, "SaleId")


Bucketing:

df2.write.format('parquet').bucketBy(10, 'SaleId').mode("overwrite").saveAsTable('bucketed_table')

After each of those techniques I just joined df2 with df1.

I can't figure out which of those is the right technique to use. Thank you.

Answer

repartition is used within the same Spark job: the shuffle it triggers happens as part of the current computation, and the resulting partitioning is not persisted for later applications.

bucketBy is for output, i.e. writing. It is therefore used to avoid shuffling in the next Spark application, typically as part of an ETL pipeline. Think of JOINs. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4861715144695760/2994977456373837/5701837197372837/latest.html for an excellent, concise read. Note that bucketed tables can currently only be read back by Spark.

