Why does a simple .*? non-greedy regex greedily include additional characters before a match?
Question
I have a very simple regex similar to this:
HOHO.*?_HO_
With this test string...
fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev
- I expect it to match just _HOHO___HO_ (shortest match, non-greedy).
- Instead it matches _HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_ (longest match, looks greedy).
Why? How can I make it match the shortest match?
Adding and removing the ? gives the same result.
Edit - better test string that shows why [^HOHO] doesn't work:
fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye
All I can think of is that maybe it is matching multiple times, but there's only one match for _HO_, so I don't understand why it isn't taking the shortest match that ends at the _HO_ and discarding the rest.
I've browsed all the questions I can find with titles like "Non-greedy regex acts greedy", but they all seem to have some other problem.
Recommended answer
In regex engines like the one used by JavaScript (NFA engines, I believe), a non-greedy quantifier only gives you the match that is shortest going left to right: from the first left-hand match that fits to the nearest right-hand match.
Where there are many left-hand matches for one right-hand match, it will always start from the first one it reaches (which will actually give the longest match).
Essentially, it goes through the string one character at a time asking "Are there matches starting from this character? If so, match the shortest and finish. If not, move to the next character and repeat". I expected it to be "Are there matches anywhere in this string? If so, match the shortest of all of them".
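The difference between those two strategies can be sketched in a few lines. This is only an illustration, not how the engine is implemented: shortestOverall is a hypothetical helper that uses the sticky flag (y) to force an attempt at every start position and keeps the shortest result. Note that with the pattern exactly as written in the question, the match begins at HOHO rather than at the preceding underscore.

```javascript
const text = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";

// What the engine does: stop at the first start position that can match,
// then let .*? expand only as far as needed from there.
const firstMatch = /HOHO.*?_HO_/.exec(text);
console.log(firstMatch.index, firstMatch[0]);
// 6 HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_

// What the question expected: try every start position, keep the shortest.
// (Hypothetical helper, for illustration only.)
function shortestOverall(source, input) {
  const sticky = new RegExp(source, "y"); // "y" pins the attempt to lastIndex
  let best = null;
  for (let i = 0; i < input.length; i++) {
    sticky.lastIndex = i;
    const m = sticky.exec(input);
    if (m && (best === null || m[0].length < best[0].length)) {
      best = m;
    }
  }
  return best;
}

const shortestMatch = shortestOverall("HOHO.*?_HO_", text);
console.log(shortestMatch.index, shortestMatch[0]);
// 33 HOHO___HO_
```

The brute-force scan is quadratic in the worst case, so it is mainly useful for seeing why the two answers differ.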
You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning "not the left-side match". Negating a string like this requires a negative lookahead and a non-capturing group, but it's as simple as dropping the string into the empty lookahead of (?:(?!).), for example (?:(?!HOHO).)
For example, the equivalent of HOHO.*?_HO_ that is non-greedy on both the left and the right would be:
HOHO(?:(?!HOHO).)*?_HO_
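As a quick check, the pattern above can be run against both test strings from the question, alongside the [^HOHO] attempt. The variable names here are just for illustration. [^HOHO] is a character class that excludes the single characters H and O, so it fails as soon as a lone H or O appears between the two delimiters.

```javascript
const tempered = /HOHO(?:(?!HOHO).)*?_HO_/;
const charClass = /HOHO[^HOHO]*?_HO_/; // same as [^HO]: bans single characters

const firstString = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev";
const secondString = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye";

console.log(tempered.exec(firstString)[0]);  // HOHO___HO_
console.log(tempered.exec(secondString)[0]); // HOHO_H_O_H_O_HO_

// The character class happens to work on the first string only.
console.log(charClass.exec(firstString)[0]); // HOHO___HO_
console.log(charClass.exec(secondString));   // null
```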
So the regex engine is essentially going through each character like this:
- HOHO - Does this match the left side?
- (?:(?!HOHO).)* - If so, can I reach the right-hand side without any repeats of the left side?
- _HO_ - If so, grab everything until the right-hand match
- ? modifier on * or + - If there are multiple right-hand matches, choose the nearest one
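The same construction can be wrapped up for arbitrary left and right delimiters. The sketch below assumes both delimiters are literal strings; escapeRegExp and buildTempered are hypothetical helper names, not part of any library. Because . does not match line terminators by default, the s flag would be needed if the text between the delimiters can span lines.

```javascript
// Escape characters that have special meaning in a regex (hypothetical helper).
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build left(?:(?!left).)*?right so no second "left" can appear inside a match.
function buildTempered(left, right) {
  const l = escapeRegExp(left);
  const r = escapeRegExp(right);
  return new RegExp(`${l}(?:(?!${l}).)*?${r}`, "g");
}

const pattern = buildTempered("HOHO", "_HO_");
console.log(pattern.source); // HOHO(?:(?!HOHO).)*?_HO_

const sample = "HOHO_a_HOHO_b_HO_ x HOHO_c_HO_";
const found = sample.match(pattern);
console.log(found); // [ 'HOHO_b_HO_', 'HOHO_c_HO_' ]
```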