Skip missing files from hive table in spark to avoid FileNotFoundException


Problem Description

I'm reading a table using spark.sql() and then trying to print the count. But some of the files are missing or have been removed from HDFS directly.

Spark is failing with the error below:

Caused by: java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/some path.../data

Hive is able to give me the count without error for the same query. The table is an external, partitioned table.

I want to ignore the missing files and prevent my Spark job from failing. I searched the internet and tried setting the config parameters below while creating the Spark session, but with no luck.

    // Settings tried while building the session (none of them helped):
    SparkSession.builder
    .config("spark.sql.hive.verifyPartitionPath", "false")
    .config("spark.sql.files.ignoreMissingFiles", true)
    .config("spark.sql.files.ignoreCorruptFiles", true)
    .enableHiveSupport()
    .getOrCreate()

I referred to https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-properties.html for the config parameters above.

    val sql = "SELECT count(*) FROM db.table WHERE date=20190710"
    val df = spark.sql(sql)
    println(df.count)

I expect the Spark code to complete successfully without a FileNotFoundException, even if some of the files referenced by the partition information are missing.

I'm wondering why spark.sql.files.ignoreMissingFiles has no effect.

The Spark version is 2.2.0.cloudera1. Kindly suggest. Thanks in advance.

Recommended Answer

Setting the config parameters below resolved the issue:

For Hive:

mapred.input.dir.recursive=true

For the Spark session:

    SparkSession.builder
    .config("mapred.input.dir.recursive", true)
    .enableHiveSupport()
    .getOrCreate()
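
Putting the pieces together, here is a minimal end-to-end sketch. It assumes the table and partition value from the question (db.table, date=20190710) and is not a verified setup:

    import org.apache.spark.sql.SparkSession

    // Sketch only: enable recursive input-directory discovery before reading
    // the Hive table. db.table and date=20190710 are the placeholders used
    // in the question above.
    val spark = SparkSession.builder
      .config("mapred.input.dir.recursive", true)
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.sql("SELECT count(*) FROM db.table WHERE date=20190710")
    df.show() // displays the count value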

On further analysis I found that only part of the partition directory tree is registered as the partition location in the table; under that location there are many different folders, and the actual data files sit inside each of those folders. So we need to turn on recursive file discovery in Spark to read the data.
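
To check whether a table has this kind of layout, one option (a sketch, not part of the original answer; the HDFS path below is only a placeholder for the real partition location) is to compare the location registered in the metastore with what actually sits underneath it:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Print the metastore's view of the partition, including its Location row.
    spark.sql("DESCRIBE FORMATTED db.table PARTITION (date=20190710)").show(100, false)

    // List what is physically under that location. Replace the placeholder
    // path with the Location value printed above.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.listStatus(new Path("/placeholder/partition/location"))
      .foreach(status => println(status.getPath))

If the listing shows intermediate folders rather than data files directly under the partition location, the recursive setting above is what lets Spark reach the actual files.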

