Spark: Reading data frame from list of paths with empty path


Problem description

I am trying to load a DataFrame from a list of paths in Spark. If files exist at all the given paths, the code works fine. If at least one of the paths is empty, it throws an error.

Here is my code:

val paths = List("path1", "path2")
val df = spark.read.json(paths: _*)

I have looked at other options:

  1. Build a single regex string that covers all the paths.
  2. Build the list from the master list of paths by checking whether Spark can read each one, as in the loop below.


import scala.util.Try
import scala.collection.mutable.ListBuffer

val readablePaths = ListBuffer[String]()
for (path <- paths) {
  if (Try(spark.read.json(path)).isSuccess) {
    readablePaths += path  // add the path to the list of readable paths
  }
}
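
The same check reads more compactly as a filter. This is a sketch under the same assumptions (an active spark session and the paths list above):

import scala.util.Try

// Single pass over the candidate paths; each probe still performs a
// full read with schema inference, so surviving paths are read twice
val readablePaths = paths.filter(path => Try(spark.read.json(path)).isSuccess)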

The first approach won't work for my case because I can't build a regex out of the paths I have to read. The second approach works, but I feel it is going to degrade performance since it has to read from all the paths twice.

Please suggest an approach to solve this issue.

Note:

  1. All the paths are in HDFS.
  2. Each path is itself a regex string and will read from multiple files.

Answer

As mentioned in the comments, you can use the HDFS FileSystem API to get the list of paths that actually exist for your regex (as long as it is a valid regex).

import org.apache.hadoop.fs._

val path = Array("path_prefix/folder1[2-8]/*", "path_prefix/folder2[2-8]/*")

val fs: FileSystem = FileSystem.get(sc.hadoopConfiguration)  // sc = SparkContext

// Expand each glob pattern to the concrete files that exist on HDFS
val paths = path.flatMap(p => fs.globStatus(new Path(p)).map(_.getPath.toString))

This way, even if, say, /path_prefix/folder13 is empty, its contents will not be listed in the variable paths, which will be an Array[String] containing all the files available under the regex.
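
One detail worth noting, as an assumption based on the Hadoop FileSystem contract rather than anything stated in the original answer: globStatus may return null (rather than an empty array) when a pattern's parent path does not exist, so a null-safe variant is slightly more robust:

// Null-safe variant: wrap the globStatus result in an Option so a
// null result (pattern whose parent directory is missing) becomes empty
val paths = path.flatMap { p =>
  Option(fs.globStatus(new Path(p))).getOrElse(Array.empty[FileStatus])
    .map(_.getPath.toString)
}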

Finally, you can do:

spark.read.json(paths : _*)
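
One edge case worth guarding against, which the answer above does not cover: if none of the patterns matched any file, paths is empty, and spark.read.json with zero paths fails because there is nothing to infer a schema from. A minimal sketch, assuming the spark session and the paths array built above:

// Fall back to an empty DataFrame when no files matched at all
val df = if (paths.nonEmpty) spark.read.json(paths: _*)
         else spark.emptyDataFrame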
