Spark filtering with regex

Views: 91 | Published: 2020/9/4 3:41:24 | Tags: scala, apache-spark, rdd

Question

I am trying to split file data into good and bad records based on the date field, which should produce 2 result files. From the test file, the first 4 lines should go into the good data and the last 2 lines into the bad data.

I have 2 problems:

I am not getting any good data; the result file is empty.
The bad data result looks like the following, picking up only characters of the name:

(,C,h) (,J,u) (,T,h) (,J,o) (,N,e) (,B,i)

Test file

Christopher|Jan 11, 2017|5 
Justin|11 Jan, 2017|5 
Thomas|6/17/2017|5 
John|11-08-2017|5 
Neli|2016|5 
Bilu||5

Load the file into an RDD

scala> val file = sc.textFile("test/data.txt")
scala> val fileRDD = file.map(x => x.split("|"))

RegEx

scala> val singleReg = """(\w(3))\s(\d+)(,)\s(\d(4))|(\d+)\s(\w(3))(,)\s(\d(4))|(\d+)(\/)(\d+)(\/)(\d(4))|(\d+)(-)(\d+)(-)(\d(4))""".r

Are the three double quotes (""") at the beginning and end, and the .r suffix, important here?

Filter (problem area)

scala> val validSingleRecords = fileRDD.filter(x => (singleReg.pattern.matcher(x(1)).matches))
scala> val badSingleRecords = fileRDD.filter(x => !(singleReg.pattern.matcher(x(1)).matches))

Convert array to string

scala> val validSingle = validSingleRecords.map(x => (x(0),x(1),x(2)))
scala> val badSingle = badSingleRecords.map(x => (x(0),x(1),x(2)))

Write to file

scala> validSingle.repartition(1).saveAsTextFile("data/singValid")
scala> badSingle.repartition(1).saveAsTextFile("data/singBad")

Update 1: My regex above was wrong; I have updated it to the version shown next. In Scala the backslash is an escape character, so it needs to be doubled:

val singleReg = """\\w{3}\\s\\d+,\\s\\d{4}|\\d+\\s\\w{3},\\s\\d{4}|\\d+\/\\d+\/\\d{4}|\\d+-\\d+-\\d{4}""".r

I checked the regex on regex101, and the dates in the first 4 lines pass.

I ran the test again and I am still getting the same result.

Recommended answer

There are 2 problems with the code:

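The delimiter used to split the lines of data.txt is wrong. It should be '|' (a Char) instead of "|" (a String). split with a String argument interprets it as a regular expression, and in a regex | is the alternation operator, so "|" matches the empty string between every pair of characters. Each line is therefore split into single characters, which is why the bad output contains only the first letters of each name.
The regex singleReg is wrong. In the first version, (3) and (4) are capture groups that match the literal digits 3 and 4, not the quantifiers {3} and {4}. In the updated version, the backslashes are doubled inside a triple-quoted string, which does not process backslash escapes, so \\w matches a literal backslash followed by the letter w.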
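
To see the delimiter problem in isolation, here is a minimal sketch in plain Scala (no Spark required). The sample line is taken from the test file; the value names byRegex, byChar and byEscaped are made up for illustration.

```scala
// One line from the test file.
val line = "Thomas|6/17/2017|5"

// split(String) treats its argument as a regular expression. "|" is an
// alternation of two empty patterns, so it matches between every character.
val byRegex = line.split("|")

// split(Char) splits on the literal character.
val byChar = line.split('|')

// Escaping the pipe in a String argument also gives a literal split.
val byEscaped = line.split("\\|")

println(byRegex.mkString("[", "; ", "]")) // one element per character
println(byChar.mkString("[", "; ", "]"))  // [Thomas; 6/17/2017; 5]
```

Either split('|') or split("\\|") should work in the Spark map call; the Char form is simply harder to get wrong.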

The corrected code is as follows:

Load the file into an RDD

scala> val file = sc.textFile("test/data.txt")
scala> val fileRDD = file.map(x => x.split('|'))

RegEx

scala> val singleReg = """\w{3}\s\d{2},\s\d{4}|\d{2}\s\w{3},\s\d{4}|\d{1}\/\d{2}\/\d{4}|\d{2}-\d{2}-\d{4}""".r
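
On the question of whether the triple quotes and .r matter: the following sketch, using a shortened pattern for just the dd-dd-yyyy case, compares the three ways of writing it. The helper fullMatch and the value names are hypothetical, introduced only for this illustration.

```scala
import scala.util.matching.Regex

// Triple-quoted string: backslashes are kept as written, so \d reaches
// the regex engine unchanged. The .r call turns the String into a Regex.
val raw: Regex = """\d{2}-\d{2}-\d{4}""".r

// Ordinary string literal: the backslash is an escape character here,
// so it has to be doubled to produce the same pattern.
val escaped: Regex = "\\d{2}-\\d{2}-\\d{4}".r

// Doubling inside triple quotes yields the regex \\d{2}..., which means
// "a literal backslash followed by two letters d", not "two digits".
val doubled: Regex = """\\d{2}-\\d{2}-\\d{4}""".r

// Same whole-string check that the filter uses.
def fullMatch(re: Regex, s: String): Boolean = re.pattern.matcher(s).matches

println(fullMatch(raw, "11-08-2017"))     // true
println(fullMatch(escaped, "11-08-2017")) // true
println(fullMatch(doubled, "11-08-2017")) // false
```

So the triple quotes are a convenience that avoids doubling backslashes, and .r is what produces a Regex whose underlying pattern the filter can use.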

Filter

scala> val validSingleRecords = fileRDD.filter(x => (singleReg.pattern.matcher(x(1)).matches))
scala> val badSingleRecords = fileRDD.filter(x => !(singleReg.pattern.matcher(x(1)).matches))
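
The filter predicate can be checked without a Spark context. The sketch below applies the same regex and the same matcher call to the six test lines held in an ordinary Seq (trailing spaces omitted); lines, rows, valid and bad are illustrative names, and partition on a local collection stands in for the two RDD filter calls.

```scala
val singleReg = """\w{3}\s\d{2},\s\d{4}|\d{2}\s\w{3},\s\d{4}|\d{1}\/\d{2}\/\d{4}|\d{2}-\d{2}-\d{4}""".r

// The test file as an in-memory collection.
val lines = Seq(
  "Christopher|Jan 11, 2017|5",
  "Justin|11 Jan, 2017|5",
  "Thomas|6/17/2017|5",
  "John|11-08-2017|5",
  "Neli|2016|5",
  "Bilu||5"
)

// Split each line on the literal pipe character.
val rows = lines.map(_.split('|'))

// matches requires the whole date field to match one of the alternatives.
val (valid, bad) = rows.partition(r => singleReg.pattern.matcher(r(1)).matches)

println(valid.map(_(0))) // List(Christopher, Justin, Thomas, John)
println(bad.map(_(0)))   // List(Neli, Bilu)
```

Note that the pattern is deliberately strict about digit counts (for example exactly one digit for the month in the slash format), so it fits this test file rather than dates in general.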

Convert array to string

scala> val validSingle = validSingleRecords.map(x => (x(0),x(1),x(2)))
scala> val badSingle = badSingleRecords.map(x => (x(0),x(1),x(2)))
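
The map above turns each array into a Tuple3, and saveAsTextFile writes the tuple's toString. As a possible alternative, not part of the original answer, the sketch below contrasts that with joining the fields back with the original delimiter, which may be easier to parse later because the date itself can contain a comma. The names row, asTuple and asLine are illustrative.

```scala
val row = Array("Christopher", "Jan 11, 2017", "5")

// What the tuple version writes: fields joined by commas inside parentheses.
val asTuple = (row(0), row(1), row(2)).toString

// Alternative: keep the file's own pipe-delimited layout.
val asLine = row.mkString("|")

println(asTuple) // (Christopher,Jan 11, 2017,5)
println(asLine)  // Christopher|Jan 11, 2017|5
```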

Write to file

scala> validSingle.repartition(1).saveAsTextFile("data/singValid")
scala> badSingle.repartition(1).saveAsTextFile("data/singBad")

The above code will give you the following output:

data/singValid

(Christopher,Jan 11, 2017,5 )
(Justin,11 Jan, 2017,5 )
(Thomas,6/17/2017,5 )
(John,11-08-2017,5 )

data/singBad

(Neli,2016,5 )
(Bilu,,5)
