Overwrite only some partitions in a partitioned Spark Dataset

Problem description

How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job and overwriting only last week's data.

Spark's default behaviour is to overwrite the whole table, even if only some partitions are going to be written.

Recommended answer

Since Spark 2.3.0 this is supported when overwriting a table. To use it, set the new spark.sql.sources.partitionOverwriteMode setting to dynamic, make sure the dataset is partitioned, and write with mode overwrite. Example in Scala:

// Only the partitions present in the incoming data are overwritten;
// all other partitions of the table are left untouched.
spark.conf.set(
  "spark.sql.sources.partitionOverwriteMode", "dynamic"
)
data.write.mode("overwrite").insertInto("partitioned_table")

I recommend repartitioning on your partition column before writing, so you don't end up with 400 files per folder.
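
A minimal sketch of that repartition, assuming the target table is partitioned by a date column named dt (both dt and partitioned_table are placeholder names, not from the original answer):

import org.apache.spark.sql.functions.col

// Sketch only: "dt" and "partitioned_table" are assumed names.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

data
  .repartition(col("dt"))   // all rows for a given dt land in the same task,
                            // so each partition folder gets a few files, not hundreds
  .write
  .mode("overwrite")
  .insertInto("partitioned_table")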

Before Spark 2.3.0, the best solution was to launch SQL statements deleting those partitions and then write them with mode append.
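
A sketch of that delete-then-append approach, again assuming a Hive table named partitioned_table partitioned by a dt column (both are placeholder names):

// Sketch for Spark < 2.3.0: drop the partitions to be recomputed, then append.
val datesToOverwrite = Seq("2018-01-01", "2018-01-02")  // hypothetical example values
datesToOverwrite.foreach { d =>
  spark.sql(s"ALTER TABLE partitioned_table DROP IF EXISTS PARTITION (dt='$d')")
}

// Appending rewrites the dropped partitions and leaves all others untouched.
data.write.mode("append").insertInto("partitioned_table")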
