Regex not being greedy enough
I've got the following regex that was working perfectly until a new situation arose
^.*[?&]U(?:RL)?=(?<URL>.*)$
Basically, it is run against URLs to grab everything after U= or URL= and return it in the URL group.
So, for the following
http://localhost?a=b&u=http://otherhost?foo=bar
URL = http://otherhost?foo=bar
Unfortunately an odd case came up
http://localhost?a=b&u=http://otherhost?foo=bar&url=http://someotherhost
Ideally, I want URL to be "http://otherhost?foo=bar&url=http://someotherhost"; instead, it is just "http://someotherhost".
EDIT: I think this fixed it...though it's not pretty
^.*[?&](?<![?&]U(?:RL)?=.*)U(?:RL)?=(?<URL>.*)$
The issue
The problem is not that .* is not being greedy enough; it's that the other .* that appears earlier is also greedy.
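To make this concrete, here is a minimal sketch in Python. It is not the asker's environment: Python spells named groups as (?P<name>...), the re.IGNORECASE flag is an assumption made so that U matches the lowercase u= and url= in the sample URL, and the extra prefix group is added purely to expose what the leading .* consumed.

```python
import re

# Same shape as the original pattern, with two illustrative changes:
# (?P<...>) is Python's named-group syntax, and the leading .* is wrapped
# in a hypothetical "prefix" group so we can inspect what it swallowed.
pattern = re.compile(r"^(?P<prefix>.*)[?&]U(?:RL)?=(?P<URL>.*)$", re.IGNORECASE)

text = "http://localhost?a=b&u=http://otherhost?foo=bar&url=http://someotherhost"
m = pattern.match(text)

prefix = m.group("prefix")
url = m.group("URL")

print("prefix:", prefix)
print("URL:   ", url)
```

The greedy prefix runs all the way up to the last &url=, which leaves only the final host for the URL group.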
To illustrate the issue, let's consider a different example. Consider the following two patterns; they're identical, except that \1 is reluctant in the second pattern:
\1 greedy, \2 greedy     \1 reluctant, \2 greedy
^([0-5]*)([5-9]*)$       ^([0-5]*?)([5-9]*)$
Here we have two capturing groups. \1 captures [0-5]*, and \2 captures [5-9]*. Here's a side-by-side comparison of what these patterns match and capture:
                 \1 greedy, \2 greedy     \1 reluctant, \2 greedy
                 ^([0-5]*)([5-9]*)$       ^([0-5]*?)([5-9]*)$
Input            Group 1     Group 2      Group 1     Group 2
54321098765      543210      98765        543210      98765
007              00          7            00          7
0123456789       012345      6789         01234       56789
0506             050         6            050         6
555              555         <empty>      <empty>     555
5550555          5550555     <empty>      5550        555
Note that as greedy as \2 is, it can only grab what \1 didn't already grab first! Thus, if you want \2 to grab as many 5s as possible, you have to make \1 reluctant, so that the 5s are actually up for grabs by \2.
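The comparison can be reproduced with a short script. This is only a sketch using Python's re module; the two patterns are taken verbatim from the example, and the variable names are illustrative.

```python
import re

# The two patterns from the comparison: only the quantifier in group 1 differs.
greedy = re.compile(r"^([0-5]*)([5-9]*)$")
reluctant = re.compile(r"^([0-5]*?)([5-9]*)$")

inputs = ["54321098765", "007", "0123456789", "0506", "555", "5550555"]

# Map each input to ((group1, group2) greedy, (group1, group2) reluctant).
results = {}
for s in inputs:
    results[s] = (greedy.match(s).groups(), reluctant.match(s).groups())
    print(f"{s:<12} {results[s][0]!s:<24} {results[s][1]!s}")
```

An empty string in the output corresponds to <empty> in the comparison. The two patterns disagree only on inputs where a 5 could belong to either group.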
The fix
So, applying this to your problem, there are two ways to fix it. You can make the first .* reluctant (see on rubular.com):
^.*?[?&]U(?:RL)?=(?<URL>.*)$
Alternatively you can just get rid of the prefix matching part altogether (see on rubular.com):
[?&]U(?:RL)?=(?<URL>.*)$
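Both fixes can be checked against the URLs from the question. The sketch below uses Python, so it makes two assumptions not stated in the original: named groups are written (?P<URL>...), and re.IGNORECASE is passed so that U matches the lowercase u= and url= parameters. Note that the prefix-free pattern has to be applied with search() rather than match(), since it no longer starts at the beginning of the string.

```python
import re

simple = "http://localhost?a=b&u=http://otherhost?foo=bar"
nested = "http://localhost?a=b&u=http://otherhost?foo=bar&url=http://someotherhost"

# Fix 1: reluctant prefix, so the first u= / url= parameter wins.
fix1 = re.compile(r"^.*?[?&]U(?:RL)?=(?P<URL>.*)$", re.IGNORECASE)

# Fix 2: no prefix at all; search() returns the leftmost match.
fix2 = re.compile(r"[?&]U(?:RL)?=(?P<URL>.*)$", re.IGNORECASE)

simple1 = fix1.match(simple).group("URL")
simple2 = fix2.search(simple).group("URL")
nested1 = fix1.match(nested).group("URL")
nested2 = fix2.search(nested).group("URL")

print(simple1)
print(nested1)
```

Either variant leaves the single greedy .* inside the URL group free to consume the rest of the string, including any later url= parameter.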