Skip missing files from hive table in spark to avoid FileNotFoundException
Question
I'm reading a table using spark.sql() and then trying to print the count. But some of the files are missing or have been removed from HDFS directly.
Spark is failing with the below error:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/some path.../data
Hive is able to give me the count without error for the same query. The table is external and partitioned.
I want to ignore the missing files and prevent my Spark job from failing. I have searched the internet and tried setting the below config parameters while creating the Spark session, but with no luck.
SparkSession.builder
.config("spark.sql.hive.verifyPartitionPath", "false")
.config("spark.sql.files.ignoreMissingFiles", true)
.config("spark.sql.files.ignoreCorruptFiles", true)
.enableHiveSupport()
.getOrCreate()
I referred to https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-properties.html for the above config parameters.
val sql = "SELECT count(*) FROM db.table WHERE date=20190710"
val df = spark.sql(sql)
println(df.count)
I'm expecting the Spark code to complete successfully without a FileNotFoundException even if some of the files listed in the partition information are missing.
I'm wondering why spark.sql.files.ignoreMissingFiles has no effect.
The Spark version is 2.2.0.cloudera1. Kindly suggest. Thanks in advance.
Answer
Setting the below config parameter resolved the issue:
For Hive:
mapred.input.dir.recursive=true
For the Spark session:
SparkSession.builder
.config("mapred.input.dir.recursive", true)
.enableHiveSupport()
.getOrCreate()
On further analysis, I found that a parent of the partition directories is registered as the partition location in the table; underneath it there are many different folders, and the actual data files sit inside each of those folders. So we need to turn on recursive directory discovery in Spark to read the data.
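To make the layout concrete, here is a sketch of the nested directory structure described above (the folder names are hypothetical) together with the session setting that resolved the issue:

```scala
// Registered partition location (hypothetical names):
//   hdfs://nameservice1/warehouse/db.table/date=20190710/         <- location in metastore
//   hdfs://nameservice1/warehouse/db.table/date=20190710/batch1/part-00000
//   hdfs://nameservice1/warehouse/db.table/date=20190710/batch2/part-00000
//
// Without recursive discovery, Spark lists only the top level of the partition
// directory; with it, the reader walks down into batch1/ and batch2/.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("mapred.input.dir.recursive", true)
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("SELECT count(*) FROM db.table WHERE date=20190710")
println(df.count)
```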