Spark HiveContext: Insert Overwrite the same table it is read from


Problem Description

I want to apply SCD1 and SCD2 using PySpark with a HiveContext. In my approach, I read the incremental data and the target table, then join them for the upsert. I call registerTempTable on all the source DataFrames. When I try to write the final dataset into the target table, I hit the issue that Insert Overwrite is not possible on the table the data is read from. A sketch of the flow is below.
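For context, here is a minimal sketch of that flow. All table and column names (stage_db.customer_delta, target_db.dim_customer, customer_id, name) are hypothetical placeholders rather than names from the original post, and the merge is a simplified single-column SCD1 upsert:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hiveContext = HiveContext(sc)

    # Read the incremental (delta) data and the current target table.
    delta_df = hiveContext.table("stage_db.customer_delta")    # hypothetical name
    target_df = hiveContext.table("target_db.dim_customer")    # hypothetical name

    # Register both as temp tables so the upsert can be expressed in SQL.
    delta_df.registerTempTable("delta")
    target_df.registerTempTable("target")

    # SCD1-style upsert: where the key matches, the delta values win;
    # otherwise keep the existing target row or bring in the new delta row.
    merged_df = hiveContext.sql("""
        SELECT COALESCE(d.customer_id, t.customer_id) AS customer_id,
               COALESCE(d.name, t.name)               AS name
        FROM target t
        FULL OUTER JOIN delta d
          ON t.customer_id = d.customer_id
    """)

    # This write fails with an AnalysisException along the lines of
    # "Cannot insert overwrite into table that is also being read from".
    merged_df.write.insertInto("target_db.dim_customer", overwrite=True)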

Please suggest a solution. I do not want to write the intermediate data to a physical table and read it back.

Is there any property or way to store the final dataset without keeping the dependency on the table it is read from? That way, it might be possible to overwrite the table.

Please advise.

Recommended Answer

I was going through the Spark documentation and an idea clicked when I checked one of the properties there.

Since my table was Parquet, I set this property to false so that the data is read through the Hive metastore (the Hive SerDe) rather than Spark's built-in Parquet reader:

    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

This solution is working fine for me.
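Putting it together, a minimal sketch of the workaround, continuing from the sketch in the question (the table names are the same hypothetical ones used above; setConf is the standard HiveContext API, while exact behavior can vary across Spark versions):

    # With convertMetastoreParquet set to false, Spark reads the Parquet-backed
    # Hive table through the Hive SerDe instead of its native Parquet reader,
    # which breaks the read/write conflict on the same table.
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

    target_df = hiveContext.table("target_db.dim_customer")   # hypothetical name
    target_df.registerTempTable("target")

    # ... build merged_df exactly as in the sketch in the question ...

    # The INSERT OVERWRITE of the table the data was read from now succeeds.
    merged_df.write.insertInto("target_db.dim_customer", overwrite=True)

Note that the property must be set before the source table is read, so that the scan itself is planned through the Hive SerDe.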
