Spark: Read file only if the path exists

Views: 250 | Published: 2020/9/4 2:01:03 | Tags: scala, apache-spark, parquet

Problem description

I am trying to read the files located at a sequence of paths in Scala. Below is the sample (pseudo) code:

val paths: Seq[String] = ??? // Seq of paths, supplied by the caller
val dataframe = spark.read.parquet(paths: _*)

Some of the paths in this sequence exist and some don't. Is there any way to ignore the missing paths while reading Parquet files (to avoid org.apache.spark.sql.AnalysisException: Path does not exist)?

I tried the following, and it seems to work, but I end up reading the same path twice, which I would like to avoid:

import scala.util.Try
val filteredPaths = paths.filter(p => Try(spark.read.parquet(p)).isSuccess)

I checked the options method of DataFrameReader, but it does not seem to offer anything like ignore_if_missing.

Also, these paths can be on HDFS or S3 (the Seq is passed in as a method argument), and at read time I don't know which of the two a given path is on, so I can't use an S3-specific or HDFS-specific API to check for existence.

Recommended answer

You can filter out the missing paths, as in @Psidom's answer. In Spark, the best way to do this is to use the Hadoop configuration that Spark already holds internally. Assuming the Spark session variable is called spark, you can do:

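import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

// FileSystem for the default scheme (fs.defaultFS) in Spark's Hadoop configuration
val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// True only if the path exists and is a directory
def testDirExists(path: String): Boolean = {
  val p = new Path(path)
  hadoopfs.exists(p) && hadoopfs.getFileStatus(p).isDirectory
}
// Keep only the existing directories, then read them in a single call
val filteredPaths = paths.filter(p => testDirExists(p))
val dataframe = spark.read.parquet(filteredPaths: _*)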
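
One caveat with the snippet above: FileSystem.get(conf) returns the filesystem named by fs.defaultFS, so a path with a different scheme (for example an s3a:// path when the default is HDFS) may be rejected with a "Wrong FS" error. Since the question mixes HDFS and S3 paths, here is a minimal sketch of a scheme-aware variant that resolves the filesystem per path with Path.getFileSystem. The names pathExists and readExisting are hypothetical helpers, not part of Spark or of the original answer. The sketch also drops the isDirectory check so that single Parquet files pass, and it assumes plain paths without glob patterns.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: true if `path` exists on the filesystem its scheme names.
def pathExists(path: String, conf: Configuration): Boolean = {
  val p = new Path(path)
  // Path.getFileSystem chooses the FileSystem implementation from the URI scheme
  // (hdfs://, s3a://, file://, ...) and falls back to fs.defaultFS when the
  // path has no scheme.
  p.getFileSystem(conf).exists(p)
}

// Hypothetical helper: read only the paths that exist.
// Returns None when no path exists, so the caller decides how to handle that case.
def readExisting(spark: SparkSession, paths: Seq[String]): Option[DataFrame] = {
  val conf = spark.sparkContext.hadoopConfiguration
  val existing = paths.filter(p => pathExists(p, conf))
  if (existing.isEmpty) None
  else Some(spark.read.parquet(existing: _*))
}
```

Returning an Option is a design choice in this sketch: calling spark.read.parquet with no paths typically fails because no schema can be inferred, so the empty case is surfaced to the caller instead. Each existence check is a metadata call only, so no path's data is read twice.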
