Why does a simple .*? non-greedy regex greedily include additional characters before a match?

Question

I have a very simple regex similar to this:

HOHO.*?_HO_

With this test string...

fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev

  • I expect it to match just _HOHO___HO_ (shortest match, non-greedy)
  • Instead it matches _HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_ (longest match, looks greedy).

Why? How can I make it match the shortest match?

Adding and removing the ? gives the same result.
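The behaviour described above can be reproduced with a minimal sketch. Python's `re` module is used here only as an assumed stand-in (the question does not name a language); it is also a backtracking engine and gives the same result for this pattern. Note that the pattern as written starts at HOHO, so the leading underscore is not part of the match.

```python
import re

text = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev"

# Lazy and greedy versions of the same pattern.
lazy = re.search(r"HOHO.*?_HO_", text)
greedy = re.search(r"HOHO.*_HO_", text)

print(lazy.group(0))    # HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_
print(greedy.group(0))  # HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_

# Both start at the first HOHO and end at the only _HO_ in the string,
# so the ? makes no visible difference here.
print(lazy.span() == greedy.span())  # True
```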

Edit - better test string that shows why [^HOHO] doesn't work: fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye
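A short sketch of why the character-class attempt fails, again assuming Python's `re` as the engine: `[^HOHO]` is a character class, so it means "any single character other than H or O", not "anything other than the string HOHO". It happens to work on the first test string and breaks as soon as a stray H or O appears between the delimiters.

```python
import re

first = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev"
second = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye"

# [^HOHO] is the same set as [^HO]: one character that is neither H nor O.
char_class = re.compile(r"HOHO[^HOHO]*?_HO_")

# Only underscores sit between the last HOHO and _HO_, so this matches.
print(char_class.search(first).group(0))  # HOHO___HO_

# Here the gap contains H and O characters, which the class refuses to
# consume, so there is no match at all.
print(char_class.search(second))  # None
```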

All I can think of is that maybe it is matching multiple times - but there's only one match for _HO_, so I don't understand why it isn't taking the shortest match that ends at the _HO_, discarding the rest.

I've browsed all the questions I can find with titles like "Non-greedy regex acts greedy", but they all seem to have some other problem.

Recommended answer

In regex engines like the one used by JavaScript (backtracking NFA engines, I believe), non-greedy matching only gives you the match that is shortest going left to right: from the first left-hand match that fits to the nearest right-hand match.

Where there are many left-hand matches for one right-hand match, it will always start from the first one it reaches (which actually gives the longest match).

Essentially, it goes through the string one character at a time asking "Are there matches from this character? If so, match the shortest and finish. If no, move to next character, repeat". I expected it to be "Are there matches anywhere in this string? If so, match the shortest of all of them".
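The two procedures quoted above can be sketched side by side. This is only an illustration, assuming Python's `re`; the function names `leftmost_match` and `shortest_match_anywhere` are hypothetical helpers, not part of any library, and the second one is a naive loop rather than something a regex engine offers.

```python
import re

def leftmost_match(pattern, text):
    """What the engine does: the first start position that matches wins."""
    regex = re.compile(pattern)
    for start in range(len(text) + 1):
        m = regex.match(text, start)  # anchored attempt at this position
        if m:
            return m.group(0)
    return None

def shortest_match_anywhere(pattern, text):
    """What the question expected: try every start, keep the shortest."""
    regex = re.compile(pattern)
    candidates = []
    for start in range(len(text) + 1):
        m = regex.match(text, start)
        if m:
            candidates.append(m.group(0))
    return min(candidates, key=len) if candidates else None

text = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev"
print(leftmost_match(r"HOHO.*?_HO_", text))
# HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_
print(shortest_match_anywhere(r"HOHO.*?_HO_", text))
# HOHO___HO_
```

The lazy quantifier still matters in the second function: it picks the shortest match from each start position, and the loop then picks the shortest across start positions.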

You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning "not the left-side match". To negate a string like this requires negative lookaheads and non-capturing groups, but it's as simple as dropping the string into (?:(?!).). For example, (?:(?!HOHO).)

For example, the equivalent of HOHO.*?_HO_ which is non-greedy on the left and right would be:

HOHO(?:(?!HOHO).)*?_HO_
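As a quick check, here is the pattern above run against both test strings from the question. Python's `re` is assumed as the engine; the lookahead syntax is the same in JavaScript.

```python
import re

first = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev"
second = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye"

# Each consumed character must not be the start of another HOHO, so a match
# can only begin at the last HOHO before _HO_.
pattern = re.compile(r"HOHO(?:(?!HOHO).)*?_HO_")

print(pattern.search(first).group(0))   # HOHO___HO_
print(pattern.search(second).group(0))  # HOHO_H_O_H_O_HO_
```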

So the regex engine is essentially going through each character like this:

  • HOHO - Does this match the left side?
  • (?:(?!HOHO).)* - If so, can I reach the right-hand side without any repeats of the left side?
  • _HO_ - If so, grab everything until the right-hand match
  • ? modifier on * or + - If there are multiple right-hand matches, choose the nearest one
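The steps above generalize to other delimiter pairs. A minimal sketch, assuming Python's `re`; `between_nearest` is a hypothetical helper name, and `re.escape` is used so that delimiters containing regex metacharacters are treated literally.

```python
import re

def between_nearest(left, right):
    """Build a pattern matching from the nearest `left` to the nearest `right`.

    Hypothetical helper for illustration only. Because . does not match a
    newline by default, pass flags such as re.DOTALL yourself if the text
    between the delimiters can span lines.
    """
    l, r = re.escape(left), re.escape(right)
    # left, then lazily any characters that do not start another left, then right
    return re.compile(l + r"(?:(?!" + l + r").)*?" + r)

text = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev"
print(between_nearest("HOHO", "_HO_").search(text).group(0))  # HOHO___HO_

# Delimiters with metacharacters are escaped before being combined.
print(between_nearest("[[", "]]").search("x [[a [[b]] c").group(0))  # [[b]]
```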
