Overwrite Hive partitions using Spark


Problem description

I am working with AWS and my workflows use Spark and Hive. My data is partitioned by date, so every day a new partition is added to my S3 storage. My problem is that when the load fails for a given day, I have to re-execute that partition. The code that performs the write is the following:

df                              // My data in a DataFrame
  .write
  .format(getFormat(target))    // csv by default, but could be Parquet, ORC...
  .mode(getSaveMode("overwrite")) // Append by default, but in the future it should be Overwrite
  .partitionBy(partitionName)   // Column to partition by, the date
  .options(target.options)      // header, separator...
  .option("path", target.path)  // the path where it will be stored
  .saveAsTable(target.tableName)  // the table name

What happens in my flow? If I use SaveMode.Overwrite, the complete table is deleted and only the new partition is saved. If I use SaveMode.Append, I can end up with duplicate data.

Searching around, I found that Hive supports this kind of partition-only overwrite, but only through HQL statements, which I am not using here.

We need the solution on Hive, so we can't use the alternative option of writing directly to CSV.

I found a Jira ticket that is supposed to solve the problem I'm having, but when I tried it with the latest version of Spark (2.3.0), the behavior was the same: the whole table was deleted and only the partition was saved, instead of overwriting just the partition contained in my data.

To make this clearer, here is an example:

Partitioned by A

Data:

| A | B | C | 
|---|---|---| 
| b | 1 | 2 | 
| c | 1 | 2 |

Table:

| A | B | C | 
|---|---|---| 
| a | 1 | 2 | 
| b | 5 | 2 | 

What I want is: in Table, partition a stays in the table, partition b is overwritten with the Data, and partition c is added. Is there any solution using Spark that can do this?

My last resort is to first delete the partition that is about to be saved and then use SaveMode.Append, but I would only try this if there is no other solution.
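For reference, a minimal sketch of that fallback might look like the following. The table name "my_db.my_table", the partition column "a", and the date value are placeholders added here for illustration; this is not the approach recommended in the answer below.

spark.sql(
  // Drop only the partition that is about to be rewritten (no-op if it does not exist yet).
  // Assumed names: my_db.my_table is the Hive table, 'a' is the partition column.
  "ALTER TABLE my_db.my_table DROP IF EXISTS PARTITION (a='2018-05-01')"
)

// Then append only the freshly computed data for that date
df.write
  .mode("append")
  .format("parquet")          // must match the storage format the table was created with
  .partitionBy("a")
  .saveAsTable("my_db.my_table")

Note that for an external table, dropping the partition only removes the metastore entry; the underlying S3 files would also have to be deleted separately, which is part of why the dynamic-overwrite approach below is preferable.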

Recommended answer

If you are on Spark 2.3.0, try setting spark.sql.sources.partitionOverwriteMode to dynamic; the dataset needs to be partitioned, and the write mode needs to be overwrite.

// Overwrite only the partitions present in the incoming data, not the whole table
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
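One thing to be aware of, beyond what the answer states: insertInto resolves columns by position rather than by name, so the DataFrame must list the table's columns in the order they were defined, with the partition column last (Hive places partition columns after the data columns). A minimal sketch, assuming the example table above with data columns B, C and partition column A:

// Enable dynamic partition overwrite for this session (Spark 2.3.0+)
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// insertInto is positional: data columns first, the partition column (A) last,
// matching the order in which the Hive table was defined.
df.select("B", "C", "A")
  .write
  .mode("overwrite")
  .insertInto(target.tableName)  // only partitions present in df are overwritten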
