Non-greedy string regular expression matching
Problem description
I'm pretty sure I'm missing something obvious here, but I cannot make R use non-greedy regular expressions:
> library(stringr)
> str_match('xxx aaaab yyy', "a.*?b")
     [,1]
[1,] "aaaab"
Base functions behave the same way:
> regexpr('a.*?b', 'xxx aaaab yyy')
[1] 5
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE
I would expect the match to be just ab, as per the 'greedy' comment in http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html:
By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)
Could someone please explain to me what's going on?
Update. What's crazy is that in some other cases non-greedy patterns behave as expected:
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*>')
     [,1]
[1,] "<a href=\"abc\">link</a> yyy <h1>Header</h1>"
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*?>')
     [,1]
[1,] "<a href=\"abc\">"
Recommended answer
This is a difficult concept, so I'll try my best. Feel free to edit and explain it better if this is confusing.
Matches are searched for from left to right. Yes, all of the strings aaaab, aaab, aab, and ab match your pattern, but aaaab starts furthest to the left, so it is the one that is returned.
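The leftmost-match rule can be checked by trying the pattern at every start position in turn. This is only a minimal sketch: it uses base R with `perl = TRUE` (an assumption; the calls in the question use the default engine), and the variable names are illustrative.

```r
x <- "xxx aaaab yyy"

# Try the lazy pattern anchored at each start position of the string.
starts <- seq_len(nchar(x))
candidates <- vapply(starts, function(i) {
  tail_str <- substring(x, i)
  m <- regexpr("^a.*?b", tail_str, perl = TRUE)
  if (m == -1) NA_character_ else regmatches(tail_str, m)
}, character(1))

# Candidate matches, ordered by start position (5, 6, 7 and 8).
print(candidates[!is.na(candidates)])

# An unanchored search reports only the candidate that starts first.
first <- regexpr("a.*?b", x, perl = TRUE)
print(regmatches(x, first))  # "aaaab"
```

The lazy quantifier only shortens a match once its start position is fixed; it never moves the start to the right.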
So here, your non-greedy pattern is not very useful. Maybe this other example will help you understand when a non-greedy pattern kicks in:
str_match('xxx aaaab yyy', "a.*?y")
#      [,1]
# [1,] "aaaab y"
Here, the strings aaaab y, aaaab yy, and aaaab yyy all match the pattern and start at the same position, but the first one is returned because the pattern is non-greedy.
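To see the difference side by side, the lazy and greedy forms of the same pattern can be run on the same string. This is a rough sketch in base R; `first_match` is a hypothetical helper introduced for illustration, and `perl = TRUE` is an assumption rather than something the question used.

```r
x <- "xxx aaaab yyy"

# Hypothetical helper: return the first match of `pattern` in `text`.
first_match <- function(pattern, text) {
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

lazy   <- first_match("a.*?y", x)  # "aaaab y": stops at the first reachable y
greedy <- first_match("a.*y", x)   # "aaaab yyy": runs on to the last y

print(c(lazy = lazy, greedy = greedy))
```

Both matches start at the same a; only the end point changes with greediness.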
So what can you do to catch that last ab? Use this:
str_match('xxx aaaab yyy', ".*(a.*b)")
#      [,1]        [,2]
# [1,] "xxx aaaab" "ab"
How does it work? By adding a greedy pattern .* in front, you force the engine to put the last possible a into the captured group.
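The same capture can be pulled out with base R alone. The sketch below assumes R 3.3.0 or later, where `regexec` accepts `perl = TRUE`. It also shows one alternative that is not part of the answer above: excluding further a characters from the match with a negated character class, which gives the same result for this input.

```r
x <- "xxx aaaab yyy"

# Greedy prefix: ".*" consumes as much as it can, leaving the last "a" to the group.
m <- regmatches(x, regexec(".*(a.*b)", x, perl = TRUE))[[1]]
whole <- m[1]  # full match: "xxx aaaab"
group <- m[2]  # captured group: "ab"

# Alternative (an assumption, not from the answer): allow no "a" between a and b.
short <- regmatches(x, regexpr("a[^a]*?b", x, perl = TRUE))  # "ab"

print(c(whole = whole, group = group, short = short))
```

The negated-class version avoids the capture group, but it only works when the wanted match must not contain another a.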