Tempered Greedy Token - What is different about placing the dot before the negative lookahead?


Problem description

<table((?!</table>).)*</table>

matches all my table tags. However,

<table(.(?!</table>))*</table>

does not. The second one seems to make sense if I try to write out the expression in words, but I can't make sense of the first.

What is the difference?

For reference, I got the term "Tempered Greedy Token" from here: Tempered Greedy Token Solution

Recommended answer

What is a Tempered Greedy Token?

The rexegg.com tempered greedy token reference is quite concise:

In (?:(?!{END}).)*, the * quantifier applies to a dot, but it is now a tempered dot. The negative lookahead (?!{END}) asserts that what follows the current position is not the string {END}. Therefore, the dot can never match the opening brace of {END}, guaranteeing that we won't jump over the {END} delimiter.
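
As a minimal Python sketch of the quoted behaviour, the snippet below compares a plain greedy dot with the tempered dot. The sample string is made up for illustration, and the braces are escaped only to make sure they are read as literals.

```python
import re

text = "foo{END}bar{END}"  # hypothetical sample input

# A plain greedy dot runs over both delimiters to the end of the string.
plain = re.match(r".*", text).group()

# The tempered dot stops right before the first {END}.
tempered = re.match(r"(?:(?!\{END\}).)*", text).group()

print(plain)     # foo{END}bar{END}
print(tempered)  # foo
```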

That is it: a tempered greedy token is a kind of negated character class for a character sequence (compare with a negated character class for a single character).

NOTE: The difference between a tempered greedy token and a negated character class is that the former does not really match "any text other than the sequence"; it matches a single character that does not start that sequence. For example, (?:(?!abc|xyz).)+ will not match only def in defabc: it matches def and then bc, because a starts the forbidden abc sequence, and bc does not.
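
The note above can be checked with a short Python sketch; `re.findall` is used here only as a convenient way to list every match.

```python
import re

# The token is tested character by character, so after the lookahead
# fails at "a" the engine resumes the search from the following "b".
matches = re.findall(r"(?:(?!abc|xyz).)+", "defabc")

print(matches)  # ['def', 'bc']
```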

It consists of:

  • (?:...)* - a quantified non-capturing group (it may be a capturing group, but it makes no sense to capture each individual character) (a * can be +, it depends on whether an empty string match is expected)
  • (?!...) - a negative lookahead that actually imposes a restriction on the value to the right of the current location
  • . - a consuming pattern (the dot, or any other pattern that usually matches a single character).

However, we can always further temper the token by using alternations in the negative lookahead (e.g. (?!{(?:END|START|MID)})) or by replacing the all-matching dot with a negated character class (e.g. (?:(?!START|END|MID)[^<>]) when trying to match text only inside tags).

Note that there is no mention of a construction where the consuming part (the dot in the original tempered greedy token) is placed before the lookahead. Avinash's answer explains that part clearly: (.(?!</table>))* first matches any character (except a newline, unless a DOTALL modifier is used) and then checks that it is not followed by </table>, which results in a failure to match e in <table>table</table>. The consuming part (the .) must be placed after the tempering lookahead.
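
A minimal Python sketch of the two patterns from the question, run against the same one-line input used in the explanation above. The variable names are chosen here for illustration only.

```python
import re

html = "<table>table</table>"

# Lookahead first: each position is checked before a character is consumed.
lookahead_first = re.compile(r"<table((?!</table>).)*</table>")

# Dot first: a character is consumed and then must not be followed by </table>,
# so the "e" right before the closing tag can never be consumed.
dot_first = re.compile(r"<table(.(?!</table>))*</table>")

m1 = lookahead_first.search(html)
m2 = dot_first.search(html)

print(m1.group())  # <table>table</table>
print(m2)          # None
```

With the dot first, the group stops after "l", and the literal </table> that follows cannot match at "e", so the whole pattern fails.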

Rexegg.com gives an idea:

  • When we want to match a block of text between Delimiter 1 and Delimiter 2 with no Substring 3 in-between (e.g. {START}(?:(?!{(?:MID|RESTART)}).)*?{END}).
  • When we want to match a block of text containing a specific pattern inside without overflowing subsequent blocks (e.g. instead of lazy dot matching as in <table>.*?chair.*?</table>, we'd use something like <table>(?:(?!chair|</?table>).)*chair(?:(?!<table>).)*</table>).
  • When we want to match the shortest window possible between 2 strings. Lazy matching won't help when you need to get abc 2 xyz from abc 1 abc 2 xyz (see abc.*?xyz and abc(?:(?!abc).)*?xyz).
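
Two of the use cases listed above can be sketched in Python as follows. The input strings are made up for illustration, and the braces are escaped so that they are treated as literals.

```python
import re

# Shortest window: the lazy dot still starts at the first "abc".
s = "abc 1 abc 2 xyz"
lazy = re.search(r"abc.*?xyz", s).group()
tempered = re.search(r"abc(?:(?!abc).)*?xyz", s).group()
print(lazy)      # abc 1 abc 2 xyz
print(tempered)  # abc 2 xyz

# Block between {START} and {END} with no {MID} or {RESTART} inside.
blocks = "{START} a {MID} b {END} {START} c {END}"
clean = re.findall(r"\{START\}(?:(?!\{(?:MID|RESTART)\}).)*?\{END\}", blocks)
print(clean)     # ['{START} c {END}']
```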

A tempered greedy token is resource-consuming because the lookahead check is performed at every position, before each character is matched by the consuming pattern. The unroll-the-loop technique can significantly improve tempered greedy token performance.

Say we want to match abc 2 xyz in abc 1 abc 2 xyz 3 xyz. Instead of checking each character between abc and xyz with abc(?:(?!abc|xyz).)*xyz, we can skip all characters that are not a or x with [^ax]*, and then match every a that is not followed by bc (with a(?!bc)) and every x that is not followed by yz (with x(?!yz)): abc[^ax]*(?:a(?!bc)[^ax]*|x(?!yz)[^ax]*)*xyz.
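
A short Python sketch showing that the tempered and the unrolled patterns from the paragraph above find the same match on this input. It checks equivalence only and makes no timing claim.

```python
import re

s = "abc 1 abc 2 xyz 3 xyz"

# Tempered greedy token: one lookahead per character.
tempered = re.compile(r"abc(?:(?!abc|xyz).)*xyz")

# Unrolled version: lookaheads run only at "a" and "x".
unrolled = re.compile(r"abc[^ax]*(?:a(?!bc)[^ax]*|x(?!yz)[^ax]*)*xyz")

tempered_match = tempered.search(s).group()
unrolled_match = unrolled.search(s).group()

print(tempered_match)  # abc 2 xyz
print(unrolled_match)  # abc 2 xyz
```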
