Skip missing files from hive table in spark to avoid FileNotFoundException


Problem description

I'm reading a table using spark.sql() and then trying to print the count, but some of the files are missing or have been removed from HDFS directly.

Spark is failing with the error below:

Caused by: java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/some path.../data

Hive is able to give me the count without error for the same query. The table is an external, partitioned table.

I want to ignore the missing files and prevent my Spark job from failing. I searched the internet and tried setting the config parameters below while creating the Spark session, but with no luck.

    import org.apache.spark.sql.SparkSession

    // Hive-enabled session with the "ignore missing/corrupt files" settings I tried.
    val spark = SparkSession.builder
      .config("spark.sql.hive.verifyPartitionPath", "false")
      .config("spark.sql.files.ignoreMissingFiles", true)
      .config("spark.sql.files.ignoreCorruptFiles", true)
      .enableHiveSupport()
      .getOrCreate()

I referred to https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-properties.html for the above config parameters.

    val sql = "SELECT count(*) FROM db.table WHERE date=20190710"
    val df = spark.sql(sql)
    println(df.count)

I'm expecting the Spark code to complete successfully, without a FileNotFoundException, even if some of the files listed in the partition information are missing.

I'm wondering why spark.sql.files.ignoreMissingFiles has no effect.

The Spark version is 2.2.0.cloudera1. Kindly suggest. Thanks in advance.

Recommended answer

Setting the config parameter below resolved the issue:

For Hive:

mapred.input.dir.recursive=true

For the Spark session:

    import org.apache.spark.sql.SparkSession

    // Turn on recursive input-directory discovery for the Hive-enabled session.
    val spark = SparkSession.builder
      .config("mapred.input.dir.recursive", true)
      .enableHiveSupport()
      .getOrCreate()
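
For reference, a usage sketch under the same assumptions as the question (a table db.table partitioned by date; the table name and date literal are the question's placeholders, not values from my environment):

    // Hypothetical usage with the session built above; the table name and
    // partition value are the placeholders taken from the question.
    val df = spark.sql("SELECT count(*) FROM db.table WHERE date=20190710")
    df.show()  // displays the count, now computed over files found recursively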

On further analysis I found that only part of the partition directory is registered as the partition location in the table; under that location there are many different folders, and inside each folder sit the actual data files. So we need to turn on recursive discovery in Spark to read the data.
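
To double-check a layout like this, the registered partition location can be listed recursively with the Hadoop FileSystem API. A minimal sketch, assuming the HDFS path below is a placeholder for whatever location is actually registered in the Hive metastore:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Placeholder partition location; substitute the location registered for the partition.
    val location = new Path("hdfs://nameservice1/warehouse/db.db/table/date=20190710")

    val fs = location.getFileSystem(spark.sparkContext.hadoopConfiguration)
    val files = fs.listFiles(location, true)  // true = recurse into nested sub-folders
    while (files.hasNext) {
      println(files.next().getPath)  // prints every data file under the partition directory
    }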
