What is the difference between partitioning and bucketing in Spark?


Question

I am trying to optimize a join query between two Spark DataFrames, call them df1 and df2 (joined on the common column "SaleId"). df1 is very small (5M), so I broadcast it among the nodes of the Spark cluster. df2 is very large (200M rows), so I tried to bucket/repartition it by "SaleId".

In Spark, what is the difference between partitioning the data by column and bucketing the data by column?

For example:

Partitioning:

df2 = df2.repartition(10, "SaleId")


Bucketing:

df2.write.format('parquet').bucketBy(10, 'SaleId').mode("overwrite").saveAsTable('bucketed_table')

After each of those techniques I just joined df2 with df1.

I can't figure out which of those is the right technique to use. Thank you.

Answer

repartition is used within the same Spark job: the shuffle it triggers happens as part of the current computation, and the resulting partitioning is not persisted for later applications.

bucketBy is for output, i.e. writing. It is therefore used to avoid shuffling in the next Spark application, typically as part of an ETL pipeline. Think of JOINs. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4861715144695760/2994977456373837/5701837197372837/latest.html for an excellent, concise read. Note that bucketed tables can currently only be read back by Spark.

