How can you perform an insert overwrite using Spark?
Question
I'm trying to transition one of our Hive ETL scripts to Spark. The Hive script maintains a table from which part of the data must be deleted every night before the new sync. It removes data older than 3 days from the main table using insert overwrite: it creates a temp table containing only data from the last three days and then overwrites the main table with it.
With Spark (using Scala) I keep getting an error saying that I cannot write to the same source. Here's my code:
spark.sql ("Select * from mytbl_hive where dt > date_sub(current_date, 3)").registerTempTable("tmp_mytbl")
val mytbl = sqlContext.table("tmp_mytbl")
mytbl.write.mode("overwrite").saveAsTable("tmp_mytbl")
//writing back to Hive ...
mytbl.write.mode("overwrite").insertInto("mytbl_hive")
I get an error saying that I cannot write to the table I'm reading from.
Does anyone know of a better way of doing this?
Recommended answer
You cannot. As you've learned, Spark explicitly prohibits overwriting a table that is used as a source for the query. While some workarounds exist, depending on the technicalities, they are not reliable and should be avoided.
Instead:
- Write the data to a temporary table.
- Drop the old table.
- Rename the temporary table.
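
The three steps above could be sketched roughly as follows. This is only an illustration under assumptions: the staging table name mytbl_hive_staging is hypothetical, while mytbl_hive and dt are taken from the question. Note that saveAsTable creates the staging table in Spark's default data source format, so the renamed table may not keep the original table's storage format or partitioning, and the table is briefly absent between the drop and the rename.

```scala
import org.apache.spark.sql.SparkSession

object RewriteViaStagingTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rewrite-via-staging-table")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical staging table name (an assumption, not from the question).
    val staging = "mytbl_hive_staging"

    // 1. Write the rows to keep into a separate table, so the source table
    //    is never read and overwritten by the same query.
    spark.sql("SELECT * FROM mytbl_hive WHERE dt > date_sub(current_date, 3)")
      .write
      .mode("overwrite")
      .saveAsTable(staging)

    // 2. Drop the old table.
    spark.sql("DROP TABLE mytbl_hive")

    // 3. Rename the staging table so it takes the old table's place.
    spark.sql(s"ALTER TABLE $staging RENAME TO mytbl_hive")

    spark.stop()
  }
}
```

If mytbl_hive is an external table, DROP TABLE removes only the metadata and leaves the files in place, so the old data would need to be cleaned up separately.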
"The Hive ETL takes the main table and deletes data older than 3 days using insert overwrite."
It might be a better idea to partition the data by date and just drop partitions, without even looking at the data.
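
A minimal sketch of the partition-dropping idea, assuming mytbl_hive is partitioned by a single column dt whose values are ISO dates such as 2024-01-15 (the question does not say how the table is partitioned). The helper expiredDates and the object name are hypothetical. The partitions are listed first and dropped one by one, and the cutoff mirrors the question's filter: partitions with dt > today - 3 days are kept.

```scala
import java.time.LocalDate
import scala.util.Try
import org.apache.spark.sql.SparkSession

object DropOldPartitions {
  // Hypothetical helper: given partition strings as printed by SHOW PARTITIONS
  // (e.g. "dt=2024-01-15"), return the dt values on or before the cutoff.
  // Values that are not ISO dates are skipped rather than dropped.
  def expiredDates(partitions: Seq[String], today: LocalDate, keepDays: Int): Seq[String] = {
    val cutoff = today.minusDays(keepDays)
    partitions
      .filter(_.startsWith("dt="))
      .map(_.stripPrefix("dt="))
      .filter(v => Try(LocalDate.parse(v)).toOption.exists(d => !d.isAfter(cutoff)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("drop-old-partitions")
      .enableHiveSupport()
      .getOrCreate()

    // SHOW PARTITIONS returns one string column, one row per partition.
    val partitions = spark.sql("SHOW PARTITIONS mytbl_hive")
      .collect()
      .map(_.getString(0))
      .toSeq

    // Drop each expired partition; no rows are read or rewritten.
    expiredDates(partitions, LocalDate.now(), keepDays = 3).foreach { dt =>
      spark.sql(s"ALTER TABLE mytbl_hive DROP IF EXISTS PARTITION (dt='$dt')")
    }

    spark.stop()
  }
}
```

Because only partition metadata (and, for managed tables, the matching directories) is touched, the job never reads from and writes to the same table at once, which sidesteps the error in the question.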