Reading multiple csv files at different folder depths
Question
I want to recursively read all csv files in a given folder into a Spark SQL DataFrame using a single path, if possible.
My folder structure looks something like this and I want to include all of the files with one path:
1. resources/first.csv
2. resources/subfolder/second.csv
3. resources/subfolder/third.csv
Here is my code:
def read: DataFrame =
  sparkSession
    .read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("charset", "UTF-8")
    .csv(path)
Setting path to .../resource/*/*.csv omits 1., while .../resource/*.csv omits 2. and 3.
I know csv() also takes multiple strings as path arguments, but I want to avoid that, if possible.
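For reference, a minimal sketch of that multi-string form, assuming the two-level layout above (each additional folder depth would need its own glob):

val df: DataFrame = sparkSession
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("charset", "UTF-8")
  .csv("resources/*.csv", "resources/*/*.csv") // one glob argument per folder depth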
Note: I know my question is similar to "How to import multiple csv files in a single load?", except that I want to include the files of all contained folders, independent of their location within the main folder.
Answer
If there are only csv files and only one level of subfolders in your resources directory, then you can use resources/**.
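A minimal sketch of that glob in use, reusing the read options from the question:

val df: DataFrame = sparkSession
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("charset", "UTF-8")
  .csv("resources/**") // matches the top-level files and the subfolders, whose files Spark then reads

Note this relies on there being exactly one level of subfolders; deeper nesting needs the approach below.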
EDIT
Otherwise, you can use the Hadoop FileSystem class to recursively list every csv file in your resources directory and then pass the list to .csv():
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

// Recursively list everything under resources/ (the second argument
// of listFiles enables recursion) and collect the csv file paths.
val fs = FileSystem.get(new Configuration())
val files = fs.listFiles(new Path("resources/"), true)
val filePaths = new ListBuffer[String]
while (files.hasNext()) {
  val file = files.next()
  val path = file.getPath.toString
  if (path.endsWith(".csv")) filePaths += path // skip any non-csv files
}

val df: DataFrame = spark
  .read
  .options(...) // header, inferSchema, etc. as above
  .csv(filePaths: _*)
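Note that listFiles returns a Hadoop RemoteIterator rather than a Scala collection, hence the explicit while loop; passing true as the second argument turns on recursive listing, so csv files at any depth under resources/ are picked up.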