Tempered Greedy Token - What is different about placing the dot before the negative lookahead
Problem description
<table((?!</table>).)*</table>
matches all my table tags, however,
<table(.(?!</table>))*</table>
does not. The second one seems to make sense if I try to write out the expression in words, but I can't make sense of the first.
Can someone explain the difference to me?
For reference, I got the term "Tempered Greedy Token" from here: http://www.rexegg.com/regex-quantifiers.html#tempered_greed
Since Google returns this SO question on top of the results for the tempered greedy token, I feel obliged to provide a more comprehensive answer.
What is a Tempered Greedy Token?
The rexegg.com tempered greedy token reference is quite concise:
In `(?:(?!{END}).)*`, the `*` quantifier applies to a dot, but it is now a tempered dot. The negative lookahead `(?!{END})` asserts that what follows the current position is not the string `{END}`. Therefore, the dot can never match the opening brace of `{END}`, guaranteeing that we won't jump over the `{END}` delimiter.
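The quoted behaviour can be tried out with a minimal Python sketch. The sample text and the `{START}` opening delimiter are assumptions made for illustration, and the braces are escaped so they are always read as literals.

```python
import re

# Made-up sample text with two delimited blocks.
text = "{START}first{END} filler {START}second{END}"

# Tempered dot: before each character is consumed, the lookahead
# asserts that {END} does not start at the current position.
tempered = re.compile(r"\{START\}((?:(?!\{END\}).)*)\{END\}")

# Plain greedy dot for comparison: it runs to the last {END}.
greedy = re.compile(r"\{START\}(.*)\{END\}")

print(tempered.findall(text))  # ['first', 'second']
print(greedy.findall(text))    # ['first{END} filler {START}second']
```

The tempered version stops at the first `{END}` after each `{START}`, while the plain greedy dot jumps over it.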
That is it: a tempered greedy token is a kind of negated character class for a character sequence (cf. a negated character class for a single character).
NOTE: The difference between a tempered greedy token and a negated character class is that the former does not really match "any text other than the sequence"; it matches a single character that does not start that sequence. I.e. `(?:(?!abc|xyz).)+` applied to `defabc` won't match just `def`: it will find two matches, `def` and `bc`, because `a` starts the forbidden `abc` sequence (so the `a` itself is skipped), and `bc` does not.
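A short Python check of the note above, offered as a sketch. The second input, `cab`, is an added illustration that is not in the original answer, and `[^abcxyz]+` is only an assumed stand-in for "a negated character class" built from the same letters.

```python
import re

token = re.compile(r"(?:(?!abc|xyz).)+")       # tempered greedy token
char_class = re.compile(r"[^abcxyz]+")          # negated character class

# The 'a' that starts "abc" is skipped; "bc" alone is not forbidden.
print(token.findall("defabc"))       # ['def', 'bc']
print(char_class.findall("defabc"))  # ['def']

# "cab" contains the letters a, b, c but not the sequence "abc".
print(token.findall("cab"))          # ['cab']
print(char_class.findall("cab"))     # []
```

The token rejects a position only when a whole forbidden sequence starts there, whereas the character class rejects every listed letter wherever it appears.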
It consists of:
- `(?:...)*` - a quantified non-capturing group (it may be a capturing group, but it makes no sense to capture each individual character); the `*` can be `+`, depending on whether an empty string match is expected
- `(?!...)` - a negative lookahead that actually imposes a restriction on the text to the right of the current location
- `.` - (or any (usually single) character) a consuming pattern.
However, we can always further temper the token by using alternations in the negative lookahead (e.g. `(?!{(?:END|START|MID)})`) or by replacing the all-matching dot with a negated character class (e.g. `(?:(?!START|END|MID)[^<>])` when trying to match text only inside tags).
Consuming part placement
Note that there is no mention of a construction where the consuming part (the dot in the original tempered greedy token) is placed before the lookahead. Avinash's answer explains that part clearly: `(.(?!</table>))*` first matches any character (except a newline, unless the DOTALL modifier is set) and then checks that it is not followed by `</table>`. The character immediately before the closing tag can therefore never be consumed, which results in a failure to match `e` in `<table>table</table>`, so the trailing `</table>` in the pattern cannot match either. The consuming part (the `.`) MUST be placed after the tempering lookahead.
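The two patterns from the question can be compared directly on the sample string mentioned above. This is a minimal Python sketch; the variable names are illustrative only.

```python
import re

html = "<table>table</table>"

# Lookahead first, then the dot: the tempered greedy token.
lookahead_first = re.compile(r"<table((?!</table>).)*</table>")
# Dot first, then the lookahead: the variant from the question.
dot_first = re.compile(r"<table(.(?!</table>))*</table>")

print(bool(lookahead_first.search(html)))  # True
print(bool(dot_first.search(html)))        # False

# Where the dot-first loop gets stuck: it cannot consume the "e"
# because that "e" is immediately followed by "</table>".
stuck_at = re.match(r"<table(?:.(?!</table>))*", html).group()
print(stuck_at)  # <table>tabl
```

With the dot first, the loop ends one character too early, and the literal `</table>` that follows in the pattern has nothing to match against except `e</table>`.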
When to use a tempered greedy token?
Rexegg.com gives an idea:
- When we want to match a block of text between Delimiter 1 and Delimiter 2 with no Substring 3 in between (e.g. `{START}(?:(?!{(?:MID|RESTART)}).)*?{END}`).
- When we want to match a block of text containing a specific pattern inside without overflowing subsequent blocks (e.g. instead of lazy dot matching as in `<table>.*?chair.*?</table>`, we'd use something like `<table>(?:(?!chair|</?table>).)*chair(?:(?!<table>).)*</table>`).
- When we want to match the shortest window possible between 2 strings. Lazy matching won't help when you need to get `abc 2 xyz` from `abc 1 abc 2 xyz` (see `abc.*?xyz` and `abc(?:(?!abc).)*?xyz`).
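The shortest-window case from the last bullet can be checked with a small Python sketch that uses the same sample string and the same two patterns.

```python
import re

text = "abc 1 abc 2 xyz"

# Lazy dot: laziness only shortens the right-hand side; the match
# still starts at the leftmost "abc".
lazy = re.search(r"abc.*?xyz", text).group()

# Tempered token: the body may not run across another "abc", so the
# attempt from the first "abc" fails and the second one is used.
tempered = re.search(r"abc(?:(?!abc).)*?xyz", text).group()

print(lazy)      # abc 1 abc 2 xyz
print(tempered)  # abc 2 xyz
```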
Performance Issue
A tempered greedy token is resource-consuming because a lookahead check is performed at every position before the consuming pattern matches a character. The unrolling-the-loop technique can significantly improve tempered greedy token performance.
Say we want to match `abc 2 xyz` in `abc 1 abc 2 xyz 3 xyz`. Instead of checking each character between `abc` and `xyz` with `abc(?:(?!abc|xyz).)*xyz`, we can skip all characters that are not `a` or `x` with `[^ax]*`, and then match all `a` that are not followed by `bc` (with `a(?!bc)`) and all `x` that are not followed by `yz` (with `x(?!yz)`): `abc[^ax]*(?:a(?!bc)[^ax]*|x(?!yz)[^ax]*)*xyz`.
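As a sanity check, the tempered and unrolled patterns can be run side by side in Python. This is only a sketch: the second test string is made up to exercise stray `a` and `x` characters, and no timing is measured here.

```python
import re

tempered = re.compile(r"abc(?:(?!abc|xyz).)*xyz")
unrolled = re.compile(r"abc[^ax]*(?:a(?!bc)[^ax]*|x(?!yz)[^ax]*)*xyz")

samples = [
    "abc 1 abc 2 xyz 3 xyz",  # sample from the text
    "abc a axe x xyz",        # made-up: lone 'a' and 'x' inside the window
]

results = []
for sample in samples:
    t = tempered.search(sample).group()
    u = unrolled.search(sample).group()
    results.append((t, u))
    print(repr(t), repr(u))
# 'abc 2 xyz' 'abc 2 xyz'
# 'abc a axe x xyz' 'abc a axe x xyz'
```

One difference to keep in mind: `[^ax]` also matches newlines, while `.` does not unless `re.DOTALL` is set, so the two patterns are only interchangeable on multi-line input when the tempered version is compiled with that flag. The actual speed-up depends on the input and can be measured with the standard `timeit` module.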