Difference between df.repartition and DataFrameWriter partitionBy?


Question

What is the difference between the DataFrame repartition() and DataFrameWriter partitionBy() methods?

I assume both are used to "partition data based on a dataframe column"? Or is there any difference?

Answer

If you run repartition(COL) you change the partitioning during the computation - you will get spark.sql.shuffle.partitions (default: 200) partitions. If you then call .write you will get one directory with many files.
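A minimal Scala sketch of that first case (the input path and the "country" column are assumptions for illustration, not part of the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("repartition-example")
  .master("local[*]") // local master only so the sketch runs standalone
  .getOrCreate()

// Hypothetical input; any DataFrame with a "country" column behaves the same way.
val df = spark.read.parquet("/tmp/input")

// repartition(COL) hash-partitions the rows by "country" in memory, producing
// spark.sql.shuffle.partitions partitions (200 by default).
val repartitioned = df.repartition(col("country"))

// The subsequent write creates ONE output directory containing up to 200
// part files; the directory layout itself is flat.
repartitioned.write.parquet("/tmp/out_repartition")
```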

If you run .write.partitionBy(COL) you will instead get as many directories as there are unique values in COL. This speeds up further data reading (if you filter by the partitioning column) and saves some storage space (the partitioning column is removed from the data files).
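And a sketch of the partitionBy case, under the same assumptions (paths and column name are placeholders):

```scala
// partitionBy belongs to DataFrameWriter: it controls the on-disk layout,
// not the in-memory partitioning used during the computation.
df.write.partitionBy("country").parquet("/tmp/out_partitionby")

// Resulting layout: one subdirectory per distinct value of "country", e.g.
//   /tmp/out_partitionby/country=US/part-....parquet
//   /tmp/out_partitionby/country=DE/part-....parquet
// Note that the "country" column is no longer stored inside the data files.

// A later read that filters on the partition column can prune whole
// directories instead of scanning everything.
val usOnly = spark.read
  .parquet("/tmp/out_partitionby")
  .filter(col("country") === "US")
```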

UPDATE: See @conradlee's answer. He explains in detail not only what the directory structure will look like after applying the different methods, but also what the resulting number of files will be in both scenarios.

