Non-greedy string regular expression matching


Question

I'm pretty sure I'm missing something obvious here, but I cannot make R use non-greedy regular expressions:

> library(stringr)
> str_match('xxx aaaab yyy', "a.*?b")                                         
     [,1]   
[1,] "aaaab"

Base functions behave the same way:

> regexpr('a.*?b', 'xxx aaaab yyy')
[1] 5
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE

I would expect the match to be just ab, as per the 'greedy' comment in http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html:

By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)

Could someone please explain to me what's going on?

Update. What's crazy is that in some other cases non-greedy patterns behave as expected:

> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*>')
     [,1]                                          
[1,] "<a href=\"abc\">link</a> yyy <h1>Header</h1>"
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*?>')
     [,1]              
[1,] "<a href=\"abc\">"

Recommended answer

This is a difficult concept, so I'll try my best. Feel free to edit and explain it better if this is a bit confusing.

A regex engine scans the string from left to right and returns the match at the first position where the pattern can match. All of the strings aaaab, aaab, aab, and ab match your pattern, but aaaab starts furthest to the left, so it is the one that is returned. The non-greedy quantifier only controls how far the match extends from that starting position; it does not move the start to the right.
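The leftmost-start rule can be checked with a small base R sketch. It is only an illustration: the variable names are made up here, `perl = TRUE` is chosen so the behaviour follows PCRE, and the `^` anchor is added purely to force the match to begin at each candidate start position.

```r
x <- "xxx aaaab yyy"

# Unanchored search: the engine stops at the first start position that works.
first_match <- regmatches(x, regexpr("a.*?b", x, perl = TRUE))

# Force the match to begin at each 'a' in turn (characters 5 to 8 of x)
# to list every candidate the pattern could match.
candidates <- sapply(5:8, function(i) {
  s <- substring(x, i)
  regmatches(s, regexpr("^a.*?b", s, perl = TRUE))
})

print(first_match)  # "aaaab"
print(candidates)   # "aaaab" "aaab" "aab" "ab"
```

The unanchored search returns the first candidate in the list, because that is the one with the leftmost start.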

So here, your non-greedy pattern is not very useful. Maybe this other example will help you better understand when a non-greedy pattern kicks in:

str_match('xxx aaaab yyy', "a.*?y") 
#      [,1]     
# [1,] "aaaab y"

Here the strings aaaab y, aaaab yy, and aaaab yyy all match the pattern and all start at the same position, but the shortest one is returned because the pattern is non-greedy.
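For comparison, here is a minimal base R sketch that runs the greedy and non-greedy versions of that pattern side by side. The names `lazy` and `greedy` are illustrative, and `perl = TRUE` is an assumption made so that PCRE semantics apply.

```r
x <- "xxx aaaab yyy"

# Both matches start at the first 'a'; only the end position differs.
lazy   <- regmatches(x, regexpr("a.*?y", x, perl = TRUE))  # stops at the first 'y'
greedy <- regmatches(x, regexpr("a.*y",  x, perl = TRUE))  # runs to the last 'y'

print(lazy)    # "aaaab y"
print(greedy)  # "aaaab yyy"
```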

So what can you do to catch that last ab? Use this:

str_match('xxx aaaab yyy', ".*(a.*b)")
#      [,1]        [,2]
# [1,] "xxx aaaab" "ab"

How does it work? The greedy .* at the front consumes as much of the string as it can and only backtracks far enough for the rest of the pattern to match, which forces the last possible a into the capture group.
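The same idea can be sketched in base R with `regexec`, which returns capture groups. The second pattern, `a[^a]*?b`, is not from the answer; it is an alternative shown here on the assumption that no further a should appear between the a and the b. Variable names are illustrative, and `perl = TRUE` is used so that PCRE semantics apply.

```r
x <- "xxx aaaab yyy"

# Greedy prefix: element 1 is the whole match, element 2 is the capture group.
groups  <- regmatches(x, regexec(".*(a.*b)", x, perl = TRUE))[[1]]
last_ab <- groups[2]

# Alternative (an assumption, not the answer's method): forbid another 'a'
# between the 'a' and the 'b', so earlier start positions fail to match.
alt <- regmatches(x, regexpr("a[^a]*?b", x, perl = TRUE))

print(groups)   # "xxx aaaab" "ab"
print(last_ab)  # "ab"
print(alt)      # "ab"
```

The greedy-prefix version needs a capture group because the overall match still includes everything before the last a; the negated character class returns ab directly as the whole match.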
