Regexp matching in pig

Problem description

Using Apache Pig and the text

hahahah.  my brother just didnt do anything wrong. He cheated on a test? no way!

I'm trying to match "my brother just didnt do anything wrong."

Ideally, I'd want to match anything beginning with "my brother just" and ending with either punctuation (end of sentence) or EOL.

Looking at the Pig docs, and then following the link to java.util.regex.Pattern, I figure I should be able to use

extrctd = FOREACH fltr GENERATE FLATTEN(EXTRACT(txt,'(my brother just .*\\p{Punct})')) as (txt:chararray);

But that seems to match until the end of the line. Any suggestions for performing this match? I'm ready to pull my hair out, and by pull my hair out, I mean switch to Python streaming.

Solution

By default quantifiers are greedy. This means they match as much as possible. In this case you want to match only up to the first punctuation mark. In other words you want to match as little as possible.

So to solve your problem you should make the quantifier non-greedy by adding a ? immediately after it:

my brother just .*?\\p{Punct}
                  ^

Note that the use of ? here is different from its use as a quantifier where it means 'match zero or one'.
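
Since Pig hands these patterns to java.util.regex, the difference can be checked in plain Java outside of Pig. The following is only a sketch: the class name GreedyVsReluctant and the helper firstGroup are made up for illustration, and the third pattern, with its `(?:\\p{Punct}|$)` alternation, is one assumed way to cover the "or EOL" case from the question, not something the answer above proposes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String txt = "hahahah.  my brother just didnt do anything wrong. He cheated on a test? no way!";

        // Greedy: .* consumes the rest of the line, then backtracks to the LAST punctuation mark.
        // prints: my brother just didnt do anything wrong. He cheated on a test? no way!
        System.out.println(firstGroup("(my brother just .*\\p{Punct})", txt));

        // Reluctant: .*? expands one character at a time and stops at the FIRST punctuation mark.
        // prints: my brother just didnt do anything wrong.
        System.out.println(firstGroup("(my brother just .*?\\p{Punct})", txt));

        // Assumed extension: fall back to end of input when the sentence has no closing punctuation.
        // prints: my brother just left
        System.out.println(firstGroup("(my brother just .*?(?:\\p{Punct}|$))", "my brother just left"));
    }
}
```

The doubled backslash is there because the pattern sits inside a Java string literal; the Pig statement in the question doubles it for the same reason.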
