JavaScript中的贪婪表现不同? [英] Greediness behaving differently in JavaScript?

查看:87
本文介绍了JavaScript中的贪婪表现不同?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这个问题让我意识到量词的贪婪在某些正则表达式引擎中并不总是相同。从该问题中取出正则表达式并对其进行修改:

There was this question which made me realise that greediness of quantifiers is not always the same in certain regex engines. Taking the regex from that question and modifying it a bit:

!\[(.*?)*\]

(我知道 * 是在这里多余,但我发现以下内容是一个非常有趣的行为。)

(I know that * is redundant here, but I found what's following to be quite an interesting behaviour).

如果我们尝试匹配:

![][][]

我希望第一个捕获组为空,因为(。*?)是懒惰的并且会在第一个] <停止/ code>它遇到了。这确实发生在以下情况:

I expected to get the first capture group to be empty, because (.*?) is lazy and will stop at the first ] it comes across. This is indeed what happens in:

  • PCRE
  • Python
  • but not Javascript where it matches the whole ][][. (jsfiddle)

我用其他一些语言环顾四周,例如 ruby​​ java C#但所有行为都像我预期的那样(即返回空捕获组)。

I looked around with some other languages, for instance ruby, java, C# but all behave like I expected them to (i.e. return empty capture groups).

(regexplanet的 golang 风味显然也会得到非空的捕获组)

(regexplanet's golang flavour apparently also gets non-empty capture groups)

似乎JavaScript的正则表达式引擎正在解释第二个 * 。*?从懒惰转换为贪婪。请注意,将第二个 * 转换为 *?似乎使正则表达式按预期工作(删除量词)完全,因为我知道在这种情况下它是多余的,但这不是重点。)

It seems that JavaScript's regex engine is interpreting the second * to convert .*? from lazy to greedy. Note that converting the second * to *? seems to make the regex work as I expected (as does removing the quantifier completely, because I know that it's being redundant in this situation, but that's not the point).

* 是在正则表达式中使用,但此行为类似于 + {m, n} 并将它们转换为懒惰版本会产生与 *?相同的结果。

* was used in the regex, but this behaviour is similar with +, ? or {m,n} and converting them to their lazy version gives the same results as with *?.

有谁知道发生了什么事?

Does anyone know what's really happening?

推荐答案

简短回答



行为在 ECMA-262规范中定义 15.10.2模式语义部分,特别是在 15.10.2.5 中,它讨论了生产 Term :: Atom Quantifier 的语义。

Short answer

The behavior is defined in ECMA-262 specs in section 15.10.2 Pattern Semantics, especially in 15.10.2.5 where it discusses the semantics of the production Term :: Atom Quantifier.

略微概括:Le t E 可以匹配空字符串的模式。如果存在输入字符串S,其中空字符串是E中的第一个匹配选项,则包含贪婪重复模式E的模式会受到影响。例如:

As a slight generalization: Let E be a pattern that can match empty string. If there exists an input string S where empty string is the first matchable choice in E, patterns which contain a greedy repetition of pattern E are affected. For example:

(a*?)*              against     aaaa
!\[(.*?)*\]         against     ![][][]
(.*?){1,4}          against     asdf
(|a)*               against     aaaa
(a|()|b){0,4}       against     aaabbb

Firefox和其他基于Webkit的浏览器似乎都遵循这些规范,而当受影响的模式没有续集时,IE的规格不符合规格。

Firefox and other Webkit-based browsers seems to follow the specs closely, while IE is off the specs in the case when there is no sequel to the affected pattern.

下面引用规范的相关部分。省略了某些部分规范([...])以保持讨论的相关性。我将通过缩小规范中的信息进行解释,同时保持简单。

The relevant part of the specs is quoted below. Some part of the specs is omitted ([...]) to keep the discussion relevant. I will explain by condensing the information from the specs while keeping it simple.



  • State 是一个有序对( endIndex 捕获),其中 endIndex 是一个整数,捕获 NcapturingParens 值的内部数组。 States 用于表示正则表达式匹配算法中的部分匹配状态。 endIndex 是一个加上到目前为止由模式匹配的最后一个输入字符的索引,而捕获保存捕获括号的结果。 [...]。由于回溯,许多可能在匹配过程中随时使用。

  • MatchResult 状态或特殊标记失败,表示匹配失败。

  • A State is an ordered pair (endIndex, captures) where endIndex is an integer and captures is an internal array of NcapturingParens values. States are used to represent partial match states in the regular expression matching algorithms. The endIndex is one plus the index of the last input character matched so far by the pattern, while captures holds the results of capturing parentheses. [...]. Due to backtracking, many States may be in use at any time during the matching process.
  • A MatchResult is either a State or the special token failure that indicates that the match failed.

这里应该没有混淆。 State 跟踪到目前为止已处理的位置(以及我们目前不感兴趣的捕获)。 MatchResult ,嗯,是匹配结果(呃!)。

There should be no confusion here. State keeps track of the position that has been processed so far (and also the captures, which we are not interested in for the moment). MatchResult, well, is the match result (duh!).



  • A Continuation 过程是一个内部闭包(即一个内部过程,其中一些参数已绑定到值),它接受一个 State 参数并返回 MatchResult 结果。如果内部闭包引用了在创建闭包的函数中绑定的变量,则闭包使用这些变量在创建闭包时具有的值。 Continuation 尝试将模式的剩余部分(由闭包的已绑定参数指定)与输入String匹配,从其 State 参数给出的中间状态开始。如果匹配成功, Continuation 将返回它到达的最终 State ;如果匹配失败, Continuation 将返回失败

  • 匹配过程是内部关闭采用两个参数 - State Continuation - 并返回 MatchResult 结果。 Matcher 尝试将模式的中间子模式(由闭包的已绑定参数指定)与输入String匹配,从其 State 参数给出的中间状态开始。 Continuation 参数应该是一个与模式的其余部分匹配的闭包。在匹配模式的子模式以获得新的状态之后,匹配器然后在新的状态上调用 Continuation 测试模式的其余部分是否也匹配。如果可以,匹配器将返回 Continuation 返回的 State ;如果没有, Matcher 可以在其选择点尝试不同的选择,反复调用 Continuation 直到它成功或所有可能性都已用尽。

  • A Continuation procedure is an internal closure (i.e. an internal procedure with some arguments already bound to values) that takes one State argument and returns a MatchResult result. If an internal closure references variables bound in the function that creates the closure, the closure uses the values that these variables had at the time the closure was created. The Continuation attempts to match the remaining portion (specified by the closure's already-bound arguments) of the pattern against the input String, starting at the intermediate state given by its State argument. If the match succeeds, the Continuation returns the final State that it reached; if the match fails, the Continuation returns failure.
  • A Matcher procedure is an internal closure that takes two arguments -- a State and a Continuation -- and returns a MatchResult result. A Matcher attempts to match a middle subpattern (specified by the closure's already-bound arguments) of the pattern against the input String, starting at the intermediate state given by its State argument. The Continuation argument should be a closure that matches the rest of the pattern. After matching the subpattern of a pattern to obtain a new State, the Matcher then calls Continuation on that new State to test if the rest of the pattern can match as well. If it can, the Matcher returns the State returned by Continuation; if not, the Matcher may try different choices at its choice points, repeatedly calling Continuation until it either succeeds or all possibilities have been exhausted.

匹配器包含与其代表的子模式匹配的代码,它将调用 Continuation 继续比赛。并且 Continuation 包含用于继续 Matcher 停止的匹配的代码。他们都接受 State 作为参数来知道从哪里开始匹配。

A Matcher contains code to match a subpattern that it represents, and it will call Continuation to continue the match. And Continuation contains code to continue the match from where Matcher left off. They both accepts State as an argument to know where to start matching.


生产 Term :: Atom Quantifier 评估如下:


  1. 评估 Atom 以获取匹配器 m

  2. 评估量词以获得三个结果:整数 min ,整数(或∞) max ,以及Boolean greedy

  3. 如果 max 是有限且小于 min ,则抛出语法错误异常。

  4. parenIndex 成为此生产扩展左侧的整个正则表达式中左捕获括号的数量>期限。 [...]

  5. parenCount 成为此制作 Atom 扩展中左侧捕获括号的数量。 [...]

  6. 返回一个内部匹配器闭包,它带有两个参数,一个状态 x 和一个Continuation c ,并执行以下内容:

  1. Evaluate Atom to obtain a Matcher m.
  2. Evaluate Quantifier to obtain the three results: an integer min, an integer (or ∞) max, and Boolean greedy.
  3. If max is finite and less than min, then throw a SyntaxError exception.
  4. Let parenIndex be the number of left capturing parentheses in the entire regular expression that occur to the left of this production expansion's Term. [...]
  5. Let parenCount be the number of left capturing parentheses in the expansion of this production's Atom. [...]
  6. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following:

  1. 调用RepeatMatcher( m,min,max,greedy,x,c,parenIndex,parenCount )并返回结果。

  1. Call RepeatMatcher(m, min, max, greedy, x, c, parenIndex, parenCount) and return its result.



请注意 m 是正在重复的 Atom 的匹配器,并且从更高级别生成的代码传入Continuation c 生产规则。

Take note that m is the Matcher for the Atom that is being repeated, and Continuation c is passed in from the code generated from higher level production rules.


抽象操作 RepeatMatcher 需要八个参数,一个匹配器 m ,整数 min ,整数(或∞) max ,布尔贪婪,状态 x ,a继续 c ,整数 parenIndex 和整数 parenCount ,并执行以下操作:

The abstract operation RepeatMatcher takes eight parameters, a Matcher m, an integer min, an integer (or ∞) max, a Boolean greedy, a State x, a Continuation c, an integer parenIndex, and an integer parenCount, and performs the following:


  1. 如果 max 为零,则调用 c(x) d返回其结果。

  2. 创建一个内部延续闭包 d ,它接受一个State参数 y 并执行以下操作:

  1. If max is zero, then call c(x) and return its result.
  2. Create an internal Continuation closure d that takes one State argument y and performs the following:

  1. 如果 min 为零且 y endIndex 等于 x endIndex ,然后返回失败

  2. 如果 min 为零然后让 min2 为零;否则让 min2 min - 1。

  3. 如果 max 是∞,那么让 max2 是∞;否则让 max2 max - 1。

  4. 调用RepeatMatcher( m,min2,max2,greedy,y,c ,parenIndex,parenCount )并返回其结果。

  1. If min is zero and y's endIndex is equal to x's endIndex, then return failure.
  2. If min is zero then let min2 be zero; otherwise let min2 be min - 1.
  3. If max is ∞, then let max2 be ∞; otherwise let max2 be max - 1.
  4. Call RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount) and return its result.


  • cap 成为新副本 x 捕获内部数组。

  • 对于满足 parenIndex的每个 k 整数< k k parenIndex + parenCount ,将 cap [ k ]设置为 undefined

  • e 成为 x endIndex

  • xr 是状态( e,cap )。

  • 如果 min 不为零,则调用 m(xr,d)并返回其结果。

  • 如果 greedy false ,则

  • Let cap be a fresh copy of x's captures internal array.
  • For every integer k that satisfies parenIndex < k and k ? parenIndex + parenCount, set cap[k] to undefined.
  • Let e be x's endIndex.
  • Let xr be the State (e, cap).
  • If min is not zero, then call m(xr, d) and return its result.
  • If greedy is false, then


    1. 致电 c(x)并让 z 成为其结果。

    2. 如果 z 不是失败,请返回 z

    3. 致电 m( xr,d)并返回其结果。

    1. Call c(x) and let z be its result.
    2. If z is not failure, return z.
    3. Call m(xr, d) and return its result.


  • 致电 m(xr,d)并让 z 成为其结果。

  • 如果 z 不是失败,请返回 z

  • 致电 c(x)并返回结果。

  • Call m(xr, d) and let z be its result.
  • If z is not failure, return z.
  • Call c(x) and return its result.

  • 第2步定义了一个Continuation d ,我们尝试匹配另一个repea实例ted Atom,后来在第7步( min > 0 case),步骤8.3(懒惰的情况)和第9步(贪婪的情况)通过Matcher m

    Step 2 defines a Continuation d where we try to match another instance of the repeated Atom, which is later called in step 7 (min > 0 case), step 8.3 (lazy case) and step 9 (greedy case) via the Matcher m.

    步骤5和6创建当前状态的副本,用于回溯目的,并在步骤2中检测空字符串匹配。

    Step 5 and 6 creates a copy of the current State, for backtracking purpose, and to detect empty string match in step 2.

    此处的步骤描述了3个不同的案例:

    The steps from here on describes 3 separate cases:


    • 第7步涵盖量词具有非零 min 值的情况。贪婪无论如何,在允许我们调用Continuation c 之前,我们需要至少匹配Atom的 min 实例。

    • Step 7 covers the case where the quantifier has non-zero min value. Greediness regardless, we need to match at least min instances of Atom, before we are allowed to call the Continuation c.

    由于步骤7中的条件,此时 min 为0。步骤8处理量词是惰性的情况。步骤9,10,11处理量词贪婪的情况。请注意相反的调用顺序。

    Due to the condition in step 7, min is 0 from this point on. Step 8 deals with the case where the quantifier is lazy. Step 9, 10, 11 deals with the case where the quantifier is greedy. Note the reversed order of calling.

    调用 m(xr,d)表示匹配一个Atom实例,然后调用Continuation d 继续重复。

    Calling m(xr, d) means matching one instance of the Atom, then calling Continuation d to continue the repetition.

    调用Continuation c(x)意味着摆脱重复并匹配下一步。注意Continuation c 如何传递给Continuation d 中的RepeatMatcher,以便它可以用于重复的所有后续迭代。

    Calling Continuation c(x) means getting out of the repetition and matching whatever comes next. Note how Continuation c is passed to the RepeatMatcher inside Continuation d, so that it is available to all subsequent iteration of the repetition.

    步骤2.1 是导致PCRE和JavaScript之间结果出现差异的罪魁祸首。

    Step 2.1 is the culprit that causes the discrepancy in the result between PCRE and JavaScript.



    1. 如果 min 为零且 y endIndex 等于 x endIndex ,然后返回失败

    1. If min is zero and y's endIndex is equal to x's endIndex, then return failure.


    当量词最初为0 min <时,达到条件 min = 0 / em>( * {0,n} )或通过第7步,必须调用 min > 0(原始量词是 + {n,} {n,m} )。

    The condition min = 0 is reached when the quantifier originally has 0 as min (* or {0,n}) or via step 7, which must be called as long as min > 0 (original quantifier is + or {n,} or {n,m}).

    min = 0 量词是贪心的,Matcher m wi将被调用(在步骤9中),它尝试匹配Atom的实例并调用在步骤2中设置的Continuation d 。继续 d 将递归调用RepeatMatcher,然后再次调用Matcher m (步骤9)。

    When min = 0 and the quantifier is greedy, Matcher m will be called (in step 9), which attempts to match an instance of Atom and call Continuation d that is set up in step 2. Continuation d will recursively call RepeatMatcher, which in turn will call Matcher m (step 9) again.

    如果没有步骤2.1,如果匹配 m 空字符串作为其在输入中向前推进的第一个可能选择,迭代将(理论上)永远地继续前进。鉴于JavaScript RegExp支持的功能有限(没有前向声明的反向引用或其他花哨功能),当前迭代匹配空字符串时,无需继续进行另一次迭代 - 无论如何,空字符串将再次匹配。步骤2.1是JavaScript处理空字符串重复的方法。

    Without step 2.1, if Matcher m has empty string as its first possible choice to advance ahead in the input, the iteration will (theoretically) continue forever without advance ahead. Given the limited features that JavaScript RegExp supports (no forward-declared back-reference or other fancy features), there is no need to go ahead with another iteration when the current iteration matches empty string - an empty string is going to be matched again anyway. Step 2.1 is JavaScript's method of dealing with repetition of empty string.

    如步骤2.1中所设置的,当 min = 0时, m (返回失败)匹配空字符串时,将修剪JavaScript 正则表达式引擎。这个有效地拒绝有限多次重复空字符串是一个空字符串

    As set up in step 2.1, when min = 0, the JavaScript regex engine will prune when an empty string is matched by the Matcher m (return failure). This effectively rejects "empty string repeated finitely many times is an empty string".

    步骤2.1的另一个副作用是从那时起 min = 0,只要有一个实例,其中Matcher m 匹配非空字符串, Atom的最后一次重复保证非空。

    Another side effect of step 2.1 is that from the point when min = 0, as long as there is one instance where Matcher m matches non-empty string, the last repetition of Atom is guaranteed to be non-empty.

    相比之下,似乎 PCRE (以及其他引擎)接受有限多次重复的空字符串是一个空字符串,并继续使用该模式的其余部分。这解释了上面列出的案例中PCRE的行为。至于算法,PCRE连续两次匹配空字符串后停止重复。

    In contrast, it seems PCRE (and other engines) accepts "empty string repeated finitely many times is an empty string", and continue on with the rest of the pattern. This explains the behavior of PCRE in the cases listed above. As for the algorithm, PCRE stops the repetition after matching empty string twice in a row.

    这篇关于JavaScript中的贪婪表现不同?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆