When to execute REFRESH TABLE my_table in Spark?


Problem description

Consider this code:

 import org.apache.spark.sql.hive.orc._
 import org.apache.spark.sql._

 val path = ...
 val dataFrame: DataFrame = ...

 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
 dataFrame.createOrReplaceTempView("my_table")
 val results = hiveContext.sql(s"select * from my_table")
 results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)
 hiveContext.sql("REFRESH TABLE my_table")

This code is executed twice with the same path but different DataFrames. The first run succeeds, but subsequent runs raise the following error:

Caused by: java.io.FileNotFoundException: File does not exist: hdfs://somepath/somefile.snappy.orc
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

I have tried clearing the cache and invoking hiveContext.dropTempTable("tableName"), but neither had any effect. When should REFRESH TABLE tableName be called (before the write, after it, or in some other arrangement) to fix this error?

Answer

For the Googlers:

You can run spark.catalog.refreshTable(tableName) or spark.sql(s"REFRESH TABLE $tableName") just before the write operation. I had the same problem and this fixed it:

// Invalidate the cached metadata and file listing for the table, then write
spark.catalog.refreshTable(tableName)
df.write.mode(SaveMode.Overwrite).insertInto(tableName)
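
For context, here is a minimal self-contained sketch of the same pattern, assuming Spark 2.x with Hive support; the session setup, table name, and source path are hypothetical:

 import org.apache.spark.sql.{SaveMode, SparkSession}

 // Hypothetical session setup; in spark-shell a `spark` session already exists.
 val spark = SparkSession.builder()
   .appName("refresh-table-example")
   .enableHiveSupport()
   .getOrCreate()

 val tableName = "my_table"                        // hypothetical target table
 val df = spark.read.orc("hdfs://somepath/input")  // hypothetical source data

 // Drop Spark's cached metadata and file listing for the table, so files
 // deleted or replaced by an earlier run are no longer referenced.
 spark.catalog.refreshTable(tableName)

 // With the cache invalidated, the write resolves the files that actually
 // exist instead of failing with java.io.FileNotFoundException.
 df.write.mode(SaveMode.Overwrite).insertInto(tableName)

The key point is that refreshTable invalidates whatever cached data and metadata Spark holds for the table, so the subsequent write works against the current state of the underlying storage.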
