Spark HiveContext: Insert Overwrite the same table it is read from


Question


I want to apply SCD1 and SCD2 using PySpark with a HiveContext. In my approach, I read the incremental data and the target table, then join them for an upsert. I call registerTempTable on all the source DataFrames. When I try to write the final dataset into the target table, I hit the issue that an insert overwrite is not possible into the same table the data is read from.
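
For context, a minimal sketch of the pattern described above (the table names, the join key, and the upsert SQL are hypothetical placeholders, not my actual job):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hiveContext = HiveContext(sc)

# Read the target table and the incremental data (names are hypothetical)
target_df = hiveContext.table("db.target_table")
incremental_df = hiveContext.table("db.incremental_table")

target_df.registerTempTable("target")
incremental_df.registerTempTable("incremental")

# SCD1-style upsert: take every incremental row, plus the target rows
# that have no matching incremental row
final_df = hiveContext.sql("""
    SELECT i.* FROM incremental i
    UNION ALL
    SELECT t.* FROM target t LEFT JOIN incremental i ON t.key = i.key
    WHERE i.key IS NULL
""")

# This is where it fails: Spark raises an AnalysisException because the
# target table is both read from and overwritten in the same plan
final_df.write.insertInto("db.target_table", overwrite=True)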


Please suggest some solution for this. I do not want to write intermediate data into a physical table and read it again.


Is there any property or way to store the final dataset without keeping the dependency on the table it is read from? That way, it might be possible to overwrite the table.

Please suggest.

Answer


I was going through the Spark documentation, and an idea clicked when I came across one property there.


As my table was Parquet, I used the Hive metastore to read the data by setting this property to false.

hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
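
For context, a minimal sketch of how the setting fits into the job (the table name is a hypothetical placeholder). Presumably this works because, with spark.sql.hive.convertMetastoreParquet set to false, Spark reads the Parquet table through the Hive SerDe instead of its own native Parquet reader, so the overwrite no longer trips over the self-read dependency:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hiveContext = HiveContext(sc)

# Fall back to the Hive SerDe for metastore Parquet tables instead of
# Spark's built-in Parquet reader
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

target_df = hiveContext.table("db.target_table")  # hypothetical name
final_df = target_df  # placeholder for the real upsert result

# The insert overwrite into the same table now goes through
final_df.write.insertInto("db.target_table", overwrite=True)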


This solution is working fine for me.

