How to use regex to include/exclude some input files in sc.textFile?
Question
I am trying to filter files by date when loading them into an RDD with Apache Spark's sc.textFile().
I attempted the following:
sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*")
This should match files such as:
/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz
Any idea how to achieve this?
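Setting Spark aside for a moment, the intended date range (27–31 July 2015) is straightforward to express as an ordinary regular expression. A minimal Python sketch, using the file names from the question (the third name is a made-up counterexample), shows what the pattern is meant to select:

```python
import re

# Order files dated 2015-07-27 through 2015-07-31
# (the question's pattern, minus the redundant {1} quantifiers).
pattern = re.compile(r"/user/Orders/201507(2[7-9]|3[0-1])\d*\.gz$")

files = [
    "/user/Orders/201507270010033.gz",  # day 27 -> should match
    "/user/Orders/201507300060052.gz",  # day 30 -> should match
    "/user/Orders/201507260010033.gz",  # day 26 -> should NOT match
]
matched = [f for f in files if pattern.search(f)]
print(matched)
```

The catch, as the answer below explains, is that sc.textFile() does not interpret its path argument as a regex at all.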
Recommended answer
Looking at the accepted answer, it seems to use some form of glob syntax. It also reveals that the API is an exposure of Hadoop's FileInputFormat.
Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPath "may represent a file, a directory, or, by using glob, a collection of files and directories". Perhaps SparkContext also uses those APIs to set the path.
The glob syntax includes:
- * (matches 0 or more characters)
- ? (matches a single character)
- [ab] (character class)
- [^ab] (negated character class)
- [a-b] (character range)
- {a,b} (alternation)
- \c (escape character)
Following the example in the accepted answer, it is possible to write the path as:
sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")
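As a sanity check, the two globs in that comma-separated list can be tried against the sample file names using Python's fnmatch, which supports the same *, ?, and [ ] constructs as Hadoop's glob (though not brace alternation). This is only an illustration of the matching semantics, not of Spark itself:

```python
from fnmatch import fnmatch

paths = "/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*"
patterns = paths.split(",")  # Hadoop treats the comma as a path-list separator

files = [
    "/user/Orders/201507270010033.gz",
    "/user/Orders/201507300060052.gz",
]
# Each sample file should match at least one of the two globs.
ok = all(any(fnmatch(f, p) for p in patterns) for f in files)
print(ok)
```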
It's not clear how the alternation syntax can be used here, since the comma is used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:
sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")
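The brace alternation can be understood as expanding into exactly the comma-separated list shown earlier. The following small Python helper is purely illustrative (it handles only a single, non-nested {a,b} group and is not Hadoop's actual implementation):

```python
import re
from fnmatch import fnmatch

def expand_braces(pattern):
    """Expand one {a,b,...} group into a list of plain globs.
    Illustrative only: handles a single, non-nested group."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if not m:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    return [head + alt + tail for alt in m.group(1).split(",")]

globs = expand_braces("/user/Orders/201507{2[7-9],3[0-1]}*")
print(globs)

# Both sample files from the question match one of the expanded globs.
files = ["/user/Orders/201507270010033.gz",
         "/user/Orders/201507300060052.gz"]
print(all(any(fnmatch(f, g) for g in globs) for f in files))
```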