Read all files in a nested folder in Spark
Question
If we have a folder folder containing all .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if I have a folder folder containing even more folders named date-wise, like 03, 04, ..., which in turn contain some .log files? How do I read these in Spark?
In my case, the structure is even more nested and complex, so a general answer is preferred.
Answer
If the directory structure is regular, let's say something like this:
folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt
you can use a * wildcard for each level of nesting, as shown below:
>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
[u'file:/folder/a/a/aa.txt',
u'file:/folder/a/b/ab.txt',
u'file:/folder/b/a/ba.txt',
u'file:/folder/b/b/bb.txt']
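Hadoop-style path globs, which Spark uses here, behave much like ordinary shell globs for this kind of pattern. As an illustrative sketch (plain Python, no Spark required), we can rebuild the example tree in a temporary directory and preview which files "*/*/*.txt" would match:

```python
import glob
import os
import tempfile

# Recreate the example directory tree from the answer in a temp folder.
root = tempfile.mkdtemp()
for path in ["a/a/aa.txt", "a/b/ab.txt", "b/a/ba.txt", "b/b/bb.txt"]:
    full = os.path.join(root, path)
    os.makedirs(os.path.dirname(full), exist_ok=True)
    open(full, "w").close()

# One "*" per directory level, mirroring the Spark call above.
matches = sorted(glob.glob(os.path.join(root, "*", "*", "*.txt")))
print([os.path.relpath(m, root) for m in matches])
```

For the general case where the nesting depth is unknown or irregular, note that Spark's DataFrame reader (Spark 3.0+) accepts a recursiveFileLookup option, e.g. spark.read.option("recursiveFileLookup", "true").text("/folder"), which walks all subdirectories regardless of depth.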