Spark: Reading data frame from list of paths with empty path


Problem description

I am trying to load a DataFrame from a list of paths in Spark. If files exist at all the given paths, the code works fine. If at least one of the paths is empty, it throws an error.

Here is my code:

val paths = List("path1", "path2")
val df = spark.read.json(paths: _*)

I have looked at other options:

  1. Build a single regex string that covers all the paths.
  2. Build the list from the master list of paths by checking whether Spark can read each one, as in the loop below.


import scala.util.Try
import scala.collection.mutable.ListBuffer

val readablePaths = ListBuffer[String]()
for (path <- paths) {
  if (Try(spark.read.json(path)).isSuccess) {
    readablePaths += path  // add the path to the list of readable paths
  }
}
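
The same check reads more compactly as a filter. This is a sketch under the same assumptions (an active spark session and the paths list above):

import scala.util.Try

// Single pass over the candidate paths; each probe still performs a
// full read with schema inference, so surviving paths are read twice
val readablePaths = paths.filter(path => Try(spark.read.json(path)).isSuccess)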

The first approach won't work for my case because I can't build a regex out of the paths I have to read. The second approach works, but I feel it is going to degrade performance since it has to read from all the paths twice.

Please suggest an approach to solve this issue.

Note:

  1. All the paths are in HDFS.
  2. Each path is itself a regex string and will read from multiple files.

Answer

As mentioned in the comments, you can use the HDFS FileSystem API to get the list of paths that actually exist for your regex (as long as it is a valid regex).

import org.apache.hadoop.fs._

val path = Array("path_prefix/folder1[2-8]/*", "path_prefix/folder2[2-8]/*")

val fs: FileSystem = FileSystem.get(sc.hadoopConfiguration)  // sc = SparkContext

// Expand each glob pattern to the concrete files that exist on HDFS
val paths = path.flatMap(p => fs.globStatus(new Path(p)).map(_.getPath.toString))

This way, even if, say, /path_prefix/folder13 is empty, its contents will not be listed in the variable paths, which will be an Array[String] containing all the files available under the regex.
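
One detail worth noting, as an assumption based on the Hadoop FileSystem contract rather than anything stated in the original answer: globStatus may return null (rather than an empty array) when a pattern's parent path does not exist, so a null-safe variant is slightly more robust:

// Null-safe variant: wrap the globStatus result in an Option so a
// null result (pattern whose parent directory is missing) becomes empty
val paths = path.flatMap { p =>
  Option(fs.globStatus(new Path(p))).getOrElse(Array.empty[FileStatus])
    .map(_.getPath.toString)
}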

Finally, you can do:

spark.read.json(paths : _*)
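
One edge case worth guarding against, which the answer above does not cover: if none of the patterns matched any file, paths is empty, and spark.read.json with zero paths fails because there is nothing to infer a schema from. A minimal sketch, assuming the spark session and the paths array built above:

// Fall back to an empty DataFrame when no files matched at all
val df = if (paths.nonEmpty) spark.read.json(paths: _*)
         else spark.emptyDataFrame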
