Overwrite specific partitions in Spark dataframe write method

Problem description

I want to overwrite specific partitions instead of all of them in Spark. I am trying the following command:

df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')

where df is a dataframe holding the incremental data to be overwritten.

hdfs-base-path contains the master data.

When I try the above command, it deletes all the partitions and inserts only those present in df at the HDFS path.

My requirement is to overwrite only those partitions present in df at the specified HDFS path. Can someone please help me with this?

Recommended answer

This is a common problem. The only solution with Spark up to 2.0 is to write directly into the partition directory, e.g.,

df.write.mode(SaveMode.Overwrite).save("/root/path/to/data/partition_col=value")
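
If it helps to see the same idea against the question's setup, below is a rough PySpark sketch. It is not part of the original answer: the loop over partition values, the drop of col4, and the directory layout are assumptions about how the asker's data is organized (df and col4 come from the question).

base = 'maprfs:///hdfs-base-path'

# partition values present in the incremental data
values = [row['col4'] for row in df.select('col4').distinct().collect()]

for v in values:
    (df.filter(df.col4 == v)
       .drop('col4')                # the value is already encoded in the directory name
       .write
       .mode('overwrite')           # overwrites only this partition's directory
       .orc('{}/col4={}'.format(base, v)))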

If you are using Spark prior to 2.0, you'll need to stop Spark from emitting metadata files (because they will break automatic partition discovery) using:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
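
Since the question's code is PySpark, roughly the same Hadoop setting can be applied from Python. Note this goes through the context's private _jsc handle, which is an internal detail rather than a public API, so treat it as a hedged equivalent:

sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")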

If you are using Spark prior to 1.6.2, you will also need to delete the _SUCCESS file in /root/path/to/data/partition_col=value or its presence will break automatic partition discovery. (I strongly recommend using 1.6.2 or later.)
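
For completeness, one way to remove that _SUCCESS marker from PySpark is through the Hadoop FileSystem API over the JVM gateway. This is only a sketch under the same assumptions (private sc._jvm / sc._jsc handles, and the example path from the answer):

# access Hadoop classes through the JVM gateway (internal handles, not a public PySpark API)
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
# non-recursive delete of the stray _SUCCESS marker so partition discovery is not broken
fs.delete(hadoop.fs.Path('/root/path/to/data/partition_col=value/_SUCCESS'), False)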
