How to use regex to include/exclude some input files in sc.textFile?
Question
I am trying to filter files by date in Apache Spark, inside the file-to-RDD function sc.textFile().
I tried the following:
sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*")
This should match the following:
/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz
Any idea how to achieve this?
Recommended answer
Looking at the accepted answer, it seems to use some form of glob syntax. It also reveals that the API is an exposure of Hadoop's FileInputFormat.
Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPath "may represent a file, a directory, or, by using glob, a collection of files and directories". Perhaps SparkContext also uses those APIs to set the path.
The glob syntax includes:
* (matches 0 or more characters)
? (matches a single character)
[ab] (character class)
[^ab] (negated character class)
[a-b] (character range)
{a,b} (alternation)
\c (escape character)
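To get a feel for the character-range and wildcard rules before running a Spark job, a pattern can be tried against sample file names locally. The following is only a rough sketch using Python's standard fnmatch module, which is not Hadoop's glob implementation: it spells negation as [!ab] rather than [^ab], has no {a,b} alternation, and its * also matches "/". The third file name and the helper name `matching` are hypothetical, added for illustration.

```python
from fnmatch import fnmatchcase

# Two names from the question plus one hypothetical name that should be excluded.
names = [
    "/user/Orders/201507270010033.gz",
    "/user/Orders/201507300060052.gz",
    "/user/Orders/201507150010001.gz",  # hypothetical: 15 July, outside the range
]

def matching(pattern, candidates):
    """Return the candidates that match a shell-style pattern (case-sensitive)."""
    return [n for n in candidates if fnmatchcase(n, pattern)]

# [7-9] is a character range; * matches the rest of the file name.
late_twenties = matching("/user/Orders/2015072[7-9]*", names)
thirties = matching("/user/Orders/2015073[0-1]*", names)

print(late_twenties)
print(thirties)
```

This only approximates what Hadoop would select, so the final check should still be made against the real file system.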
Following the example in the accepted answer, it is possible to write your path as:
sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")
It's not clear how alternation syntax can be used here, since the comma is used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:
sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")
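One way to see why the alternation form and the comma-separated form should select the same files is to expand the braces by hand. The sketch below is a plain-Python illustration, not how Hadoop or Spark parse paths; the helper `expand_braces` is hypothetical and ignores escaped braces.

```python
import re

def expand_braces(pattern):
    """Expand {a,b,...} groups into separate patterns (illustration only, no escapes)."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if m is None:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    out = []
    for alt in m.group(1).split(","):
        # Substitute one alternative, then expand any groups that remain.
        out.extend(expand_braces(head + alt + tail))
    return out

alternation = "/user/Orders/201507{2[7-9],3[0-1]}*"
expanded = expand_braces(alternation)

# Joining the expanded patterns with commas gives the path-list form.
comma_list = ",".join(expanded)
print(comma_list)
```

Under these assumptions the expansion reproduces the two-path string used earlier, which is consistent with both calls reading the same set of files.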