When to execute REFRESH TABLE my_table in Spark?


Problem description

Consider the following code:

 import org.apache.spark.sql.hive.orc._
 import org.apache.spark.sql._

 val path = ...
 val dataFrame: DataFrame = ...

 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
 dataFrame.createOrReplaceTempView("my_table")
 val results = hiveContext.sql(s"select * from my_table")
 results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)
 hiveContext.sql("REFRESH TABLE my_table")

This code is executed twice with the same path but different DataFrames. The first run succeeds, but subsequent runs raise an error:

Caused by: java.io.FileNotFoundException: File does not exist: hdfs://somepath/somefile.snappy.orc
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

I have tried to clear the cache and to invoke hiveContext.dropTempTable("tableName"), neither of which had any effect. When should REFRESH TABLE tableName be called (before the write, after it, or at some other point) to fix this error?

Recommended answer

You can run spark.catalog.refreshTable(tableName) or spark.sql(s"REFRESH TABLE $tableName") just before the write operation. I had the same problem and this fixed it.

spark.catalog.refreshTable(tableName)
df.write.mode(SaveMode.Overwrite).insertInto(tableName)
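Applied to the snippet from the question, the idea is to invalidate Spark's cached file listing before re-reading and re-writing the same path on each run. A minimal sketch (an assumption, not the asker's exact setup: it uses the Spark 2.x SparkSession API, with a session named spark in place of the deprecated HiveContext, and reuses the path, dataFrame, and my_table names from the question):

```scala
// Sketch: refresh the cached metadata for the view's underlying files
// before the read/write cycle, so a second run does not see stale
// references to ORC files replaced by the previous run.
// Assumes `spark: SparkSession`, `path: String`, and `dataFrame: DataFrame`
// are already defined, as in the question.
import org.apache.spark.sql.SaveMode

dataFrame.createOrReplaceTempView("my_table")

// Invalidate any cached file listing held for this table/path.
spark.catalog.refreshTable("my_table")

val results = spark.sql("select * from my_table")
results.write
  .mode(SaveMode.Append)
  .partitionBy("my_column")
  .format("orc")
  .save(path)
```

The key point is the ordering: the refresh must happen before the action that re-reads the path, not after the write, because it is the cached listing from the previous run that points at files which no longer exist.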

