Regexp matching in Pig

Problem description

Using Apache Pig and the text

hahahah.  my brother just didnt do anything wrong. He cheated on a test? no way!

I'm trying to match "my brother just didnt do anything wrong."

Ideally, I'd want to match anything beginning with "my brother just" and ending with either punctuation (end of sentence) or EOL.

Looking at the pig docs, and then following the link to java.util.regex.Pattern, I figure I should be able to use

extrctd = FOREACH fltr GENERATE FLATTEN(EXTRACT(txt,'(my brother just .*\\p{Punct})')) as (txt:chararray);

But that seems to match until the end of the line. Any suggestions for performing this match? I'm ready to pull my hair out, and by pull my hair out, I mean switch to Python streaming.

Solution

By default, quantifiers are greedy. This means they match as much as possible. In this case you want to match only up to the first punctuation mark. In other words, you want to match as little as possible.

So to solve your problem you should make the quantifier non-greedy by adding a ? immediately after it:

my brother just .*?\\p{Punct}
                  ^

Note that the use of ? here is different from its use as a quantifier where it means 'match zero or one'.
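
Since Pig hands the pattern to java.util.regex, the difference can be checked in plain Java outside of Pig. The following is only a minimal sketch: the class name GreedyVsLazy and the helper firstMatch are made up for illustration, and the third pattern, which also accepts end of input via `$`, is an assumption about how the "or EOL" part of the question might be handled, not something tested in Pig itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Pig or of the original answer.
public class GreedyVsLazy {

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "hahahah.  my brother just didnt do anything wrong. "
                    + "He cheated on a test? no way!";

        // Greedy: .* grabs as much as it can, so the match ends at the LAST punctuation mark.
        System.out.println(firstMatch("(my brother just .*\\p{Punct})", text));

        // Lazy: .*? grabs as little as it can, so the match ends at the FIRST punctuation mark.
        System.out.println(firstMatch("(my brother just .*?\\p{Punct})", text));

        // Assumed variant: stop at the first punctuation mark OR at the end of the input.
        System.out.println(firstMatch("(my brother just .*?(?:\\p{Punct}|$))",
                                      "my brother just left"));
    }
}
```

In the Pig script, the same change would presumably amount to replacing `.*` with `.*?` inside the quoted pattern passed to EXTRACT.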
