Spark: Reading data frame from list of paths with empty path
Problem Description
I am trying to load a dataframe from a list of paths in Spark. If files exist at all of the mentioned paths, the code works fine. If at least one path is empty, it throws an error.
Here is my code:
val paths = List("path1", "path2")
val df = spark.read.json(paths: _*)
I looked at other options:
- Build a single regex string covering all the paths.
- Build the list from the master list of paths by checking whether Spark can read each one.
import scala.util.Try

for (path <- paths) {
  if (Try(spark.read.json(path)).isSuccess) {
    // add path to list
  }
}
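A minimal, self-contained sketch of that second approach. Here readJson is a hypothetical stand-in for spark.read.json (so the snippet runs without a Spark session); the probe-then-collect logic is the same:

```scala
import scala.util.Try

// Hypothetical stand-in for spark.read.json: throws for paths that
// have no files, returns data otherwise.
def readJson(path: String): Seq[String] =
  if (path.startsWith("bad")) throw new RuntimeException(s"no files at $path")
  else Seq(s"records from $path")

// Probe each path and keep only the ones that can be read.
val candidates = List("good/a", "bad/b", "good/c")
val readable   = candidates.filter(p => Try(readJson(p)).isSuccess)
```

The drawback stays the same: each successful probe already reads the data once, so every usable path ends up being read twice.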
The first approach won't work for my case because I can't create a regex out of the paths I have to read. The second approach works, but I feel it will degrade performance because it has to read from all the paths twice.
Please suggest an approach to solve this issue.
Note:
- All the paths are in HDFS.
- Each path is itself a regex string and will read from multiple files.
Recommended Answer
As mentioned in the comments, you can use the HDFS FileSystem API to get a list of the paths that actually exist, based on your patterns (as long as each is valid; note that globStatus accepts glob syntax rather than full regex).
import org.apache.hadoop.fs._
val path = Array("path_prefix/folder1[2-8]/*", "path_prefix/folder2[2-8]/*")
val fs: FileSystem = FileSystem.get(sc.hadoopConfiguration) // sc = SparkContext
val paths = path.flatMap(p => fs.globStatus(new Path(p)).map(_.getPath.toString))
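As a quick local illustration of the bracket syntax in those patterns: java.nio's glob PathMatcher handles character classes like [2-8] and * similarly to Hadoop's globStatus (an assumption for illustration only — the two implementations differ in some edge cases):

```scala
import java.nio.file.{FileSystems, Paths}

// "glob:" patterns support character classes like [2-8];
// * matches within a single path segment.
val matcher = FileSystems.getDefault
  .getPathMatcher("glob:path_prefix/folder1[2-8]/*")

val hit  = matcher.matches(Paths.get("path_prefix/folder13/part-0.json")) // 3 is in [2-8]
val miss = matcher.matches(Paths.get("path_prefix/folder19/part-0.json")) // 9 is outside [2-8]
```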
This way, even if, say, /path_prefix/folder13 is empty, its contents will not get listed in the variable paths, which will be an Array[String] containing all the files actually available under the patterns.
Finally, you can do:
spark.read.json(paths : _*)