Reading multiple csv files at different folder depths
Question
I want to recursively read all csv files in a given folder into a Spark SQL DataFrame using a single path, if possible.
My folder structure looks something like this and I want to include all of the files with one path:
1. resources/first.csv
2. resources/subfolder/second.csv
3. resources/subfolder/third.csv
Here is my code:
def read: DataFrame =
  sparkSession
    .read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("charset", "UTF-8")
    .csv(path)
Setting path to .../resource/*/*.csv omits 1., while .../resource/*.csv omits 2. and 3.
I know csv() also takes multiple strings as path arguments, but I want to avoid that, if possible.
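For reference, a minimal sketch of that multi-string form, assuming the two-level layout above (each additional folder depth would need its own glob):

val df: DataFrame = sparkSession
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("charset", "UTF-8")
  .csv("resources/*.csv", "resources/*/*.csv") // one glob argument per folder depth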
Note: I know my question is similar to "How to import multiple csv files in a single load?", except that I want to include the files of all contained folders, independent of their location within the main folder.
Answer
If there are only csv files and only one level of subfolders in your resources directory, then you can use resources/**.
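A minimal sketch of that glob in use, reusing the read options from the question:

val df: DataFrame = sparkSession
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("charset", "UTF-8")
  .csv("resources/**") // matches the top-level files and the subfolders, whose files Spark then reads

Note this relies on there being exactly one level of subfolders; deeper nesting needs the approach below.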
EDIT
Otherwise, you can use the Hadoop FileSystem class to recursively list every csv file in your resources directory and then pass the list to .csv():
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

// Recursively list everything under resources/ (the second argument
// of listFiles enables recursion) and collect the csv file paths.
val fs = FileSystem.get(new Configuration())
val files = fs.listFiles(new Path("resources/"), true)
val filePaths = new ListBuffer[String]
while (files.hasNext()) {
  val file = files.next()
  val path = file.getPath.toString
  if (path.endsWith(".csv")) filePaths += path // skip any non-csv files
}

val df: DataFrame = spark
  .read
  .options(...) // header, inferSchema, etc. as above
  .csv(filePaths: _*)
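Note that listFiles returns a Hadoop RemoteIterator rather than a Scala collection, hence the explicit while loop; passing true as the second argument turns on recursive listing, so csv files at any depth under resources/ are picked up.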