为什么仅当字符重复2 ^ n-1次时,此递归正则表达式才匹配? [英] Why will this recursive regex only match when a character repeats 2^n - 1 times?

查看:83
本文介绍了为什么仅当字符重复2 ^ n-1次时,此递归正则表达式才匹配?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在阅读 polygenelubricants 的有关高级正则表达式技术的系列文章之后(尤其是 www.ideone.com/上看到的那样a9T3F .相反,只有2 n -1的字符串(即空字符串,aaaaaaaaaaa a 15 )重复的字符与正则表达式匹配.

奇怪的是,如果我修改模式以使递归是可选的(即^(([a-z])(?1)?\2|[a-z]?)$,请参见 www. ideone.com/D6lJR ,它仅匹配字符重复2 n 次的字符串(即空字符串,aaaaaaaaaaaaaaa a 16 ).

为什么我的正则表达式不能按我期望的方式工作?

为那些发痒的人的建议,建议不要使用正则表达式:
这个问题的重点是学习如何正确使用递归正则表达式.我知道这不是确定字符串是否为回文式的有效方法,并且由于某种原因我必须在生产代码中确定回文式,因此我不会使用递归正则表达式.我只是想对正则表达式的高级方面有更多了解.

解决方案

您观察到的现象是由于与Perl不同,PCRE子模式递归是原子的. 手册页实际上非常详细地介绍了此问题:

在PCRE中(类似于Python,但与Perl不同),递归子模式调用是 总是被视为原子团.也就是说,一旦匹配了 主题字符串,即使它包含未使用的字符串,也不会重新输入 替代方案,并且随后会出现匹配失败.

这可以通过以下模式来说明,该模式旨在匹配包含奇数个字符的回文字符串(例如, "a""aba""abcba""abcdcba"):

    ^(.|(.)(?1)\2)$

这个想法是,它要么匹配一个字符,要么匹配两个相同的字符 次回文周围的字符. 在Perl中,这种模式有效; 在PCRE中,如果图案的长度超过三个字符,则不会.

考虑主题字符串"abcba":

在顶层,第一个字符被匹配,但是在第一个字符不匹配 字符串的末尾,第一个替代方法失败;第二种选择 并执行递归.对子模式1的递归调用 成功匹配下一个字符("b"). (请注意,开头 和行尾测试不属于递归的一部分.

返回顶层,将下一个字符("c")与 匹配的子模式2为"a".这失败了.因为递归 被视为一个原子团,现在没有回溯点, 因此整个比赛失败了. (此时,Perl可以重新 输入递归并尝试第二种选择.)但是,如果 模式是用其他顺序写的,其他的是 不同:

    ^((.)(?1)\2|.)$

这一次,首先尝试递归替代,然后继续 递归直到字符用完,此时递归 失败.但这一次,我们确实有另一种方法可以尝试 更高层次.那是最大的区别:在以前的案例中, 剩下的选择是在更深层次的递归级别上,PCRE无法使用.

更改模式以匹配所有回文字符串,而不仅仅是 那些字符数为奇数的字符,很容易改变 模式:

    ^((.)(?1)\2|.?)$

同样,这在Perl中有效,但在PCRE中无效,并且出于相同的原因. 当更深层次的递归匹配单个字符时,就不能 再次输入以匹配空字符串.解决方案是 分离这两种情况,并写出奇数和偶数情况作为替代 在更高级别:

    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$


警告!!!

只有在主题字符串开头的回文不短于以下内容的回文时,以上回文匹配模式才起作用 整个字符串.例如,尽管"abcba"正确匹配,但如果 受试者为"ababa",PCRE在开始时发现回文"aba", 然后在顶级失败,因为字符串的结尾不跟在后面. 再一次,它无法跳回递归以尝试其他替代方法, 所以整个比赛失败了.

其他参考

  • regular-expressions.info/原子分组
    • (?>…)在某种程度上是原子分组语法
    • 周围的环境(?=…)(?!…)(?<=…)(?<!…)都是原子的
    • 所有量词(例如a*+)也是原子
    • PCRE递归子模式和子例程调用也是原子的

仔细研究模式

atomicity参数是正确的,但它如何解释模式观察到的行为的方式也许并不明显.让我们仔细看看,看看这一切如何适合:

我们将使用第一种模式:

^(([a-z])(?1)\2|[a-z]?)$

我将使用以下符号表示递归:

  • 1表示该角色在第一个候补中被捕获到第2组中
  • 2表示该字符已与第二个备用字符匹配
    • 或者如果2不在字符上方,则会执行?的零重复选项
  • \表示该字符已在第一个备用行中被反向引用匹配到第2组
  • _表示递归分支的底部
    • 即使有其他选择,也不会重新进入此分支!

现在让我们考虑"aaa"作为输入:

      _
1 1 1 2 
a a a   # This is the first bottom of the recursion,
        # now we go back to the third 1 and try to match \.
        # This fails, so the third 1 becomes 2.
    _
1 1 2
a a a   # Now we go back to the second 1 and try to match \.
        # This fails, so the second 1 becomes 2.
  _
1 2
a a a   # The second level matched! now we go back to the first level...

_____
1 2 \
a a a   # Now the first 1 can match \, and entire pattern matches!!


现在考虑"aaaaa":

          _
1 1 1 1 1 2
a a a a a  # Fifth 1 can't match \, so it becomes 2. 
        _
1 1 1 1 2
a a a a a  # Fourth 1 can't match \, so it becomes 2.
    _____
1 1 1 2 /
a a a a a  # Here's a crucial point. The third 1 successfully matched.
           # Now we're back to the second 1 and try to match \, but this fails.
           # However, since PCRE recursion is atomic, the third 1 will NOT be
           # reentered to try 2. Instead, we try 2 on the second 1.
_____
1 2 \
a a a a a  # Anchors don't match, so the first 1 becomes 2, and then also the
           # anchors don't match, so the pattern fails to match.

请注意,一旦第一个替代级别上的递归级别匹配,以后就不会再尝试第二个替代(即使这样做可能会导致可能匹配),因为PCRE子模式递归是原子的.


现在考虑"aa":

    _
1 1 2 
a a
  _
1 2
a a  # The second level matched by taking the one repetition option on ?.
     # We now go back to the first level, and we can't match \.
     # Since PCRE recursion is atomic, we can't go back to the second level
     # to try the zero repetition option on ?.
_    
2
a a  # Anchors don't match, trying zero option on ? also doesn't help,
     # so the pattern fails to match!

请注意,一旦递归级别与第二个替代项上的?的一次重复匹配,将来就不会尝试零重复选项(即使这样做可能会导致可能匹配),因为PCRE子模式递归是原子的.


现在让我们考虑aaaaaaa

              _
1 1 1 1 1 1 1 2  
a a a a a a a 
            _
1 1 1 1 1 1 2  
a a a a a a a 
        _____
1 1 1 1 1 2 \  
a a a a a a a  # A crucial point: the fifth level matched and now the fourth
               # level can't match \, but it does NOT reenter the fifth level to
               # try 2. Instead, the fourth level tries 2.
    _____    
1 1 1 2 \  
a a a a a a a 
  _________    
1 1 1 2 \ \ 
a a a a a a a 
_____________    
1 1 1 2 \ \ \
a a a a a a a  # Entire pattern is a match! 

请注意,即使PCRE子模式递归是原子的,它仍然可以成功地匹配由重复2 n -1次的字符组成的回文.


现在,为了好玩,让我们尝试"abcba":

          _
1 1 1 1 1 2
a b c b a
        _
1 1 1 1 2
a b c b a

1 1 1 2 
a b c b a   # Third level attempts \, but c does not match a!
            # So we go back to third 1 and try 2.
  _____
1 1 2 \ 
a b c b a 
_________
1 1 2 \ \
a b c b a   # Entire pattern is a match!

也就是说,模式不仅仅与仅当字符重复2 n -1次时匹配".它确实可以匹配"abcba"(如在ideone.com上看到的 ).但是,它不能匹配"ababa",也不能匹配"aaaaa"(请参见手册页上的警告!),因为PCRE中的子模式递归是原子的.

您可以应用相同的跟踪过程来解释模式在任何输入上的行为.

After reading polygenelubricants's series of articles on advanced regular expressions techniques (particularly How does this Java regex detect palindromes?), I decided to attempt to create my own PCRE regex to parse a palindrome, using recursion (in PHP).

What I came up with was:

^(([a-z])(?1)\2|[a-z]?)$

My understanding of this expression is that it should either match zero or one characters (every string of less than 2 characters is implicitly a palindrome, as well as to account for palindromes with odd lengths in the recursion), or two of the same character separated by a recursion of the pattern.

Unfortunately, it does not seem to be working that way, as you can see at www.ideone.com/a9T3F. Instead, only the strings of 2n - 1 (ie. empty string, a, aaa, aaaaaaa, a15) repeated characters match the regular expression.

Oddly, if I modify my pattern so that the recursion is optional (ie. ^(([a-z])(?1)?\2|[a-z]?)$, see www.ideone.com/D6lJR, it only matches strings with a character repeated 2n times (ie. empty string, a, aa, aaaa, aaaaaaaa, a16).

Why is my regex not working the way I expect it to?

Note for the people who are itching to suggest not to use regex:
The point of this question is to learn how to use recursive regular expressions properly. I know that this is not an effective way to determine if a string is a palindrome, and I wouldn't use a recursive regex if I for some reason had to determine palindromes in production code; I am just interested in learning more about the advanced aspects of regex.

解决方案

The phenomenon you're observing is due to the fact that PCRE subpattern recursion is atomic, unlike Perl. The man page actually covers this problem in great detail:

In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure.

This can be illustrated by the following pattern, which purports to match a palindromic string that contains an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):

    ^(.|(.)(?1)\2)$

The idea is that it either matches a single character, or two identical characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE it does not if the pattern is longer than three characters.

Consider the subject string "abcba":

At the top level, the first character is matched, but as it is not at the end of the string, the first alternative fails; the second alternative is taken and the recursion kicks in. The recursive call to subpattern 1 successfully matches the next character ("b"). (Note that the beginning and end of line tests are not part of the recursion).

Back at the top level, the next character ("c") is compared with what subpattern 2 matched, which was "a". This fails. Because the recursion is treated as an atomic group, there are now no backtracking points, and so the entire match fails. (Perl is able, at this point, to re- enter the recursion and try the second alternative.) However, if the pattern is written with the alternatives in the other order, things are different:

    ^((.)(?1)\2|.)$

This time, the recursing alternative is tried first, and continues to recurse until it runs out of characters, at which point the recursion fails. But this time we do have another alternative to try at the higher level. That is the big difference: in the previous case the remaining alternative is at a deeper recursion level, which PCRE cannot use.

To change the pattern so that matches all palindromic strings, not just those with an odd number of characters, it is tempting to change the pattern to this:

    ^((.)(?1)\2|.?)$

Again, this works in Perl, but not in PCRE, and for the same reason. When a deeper recursion has matched a single character, it cannot be entered again in order to match an empty string. The solution is to separate the two cases, and write out the odd and even cases as alternatives at the higher level:

    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$


WARNING!!!

The palindrome-matching patterns above work only if the subject string does not start with a palindrome that is shorter than the entire string. For example, although "abcba" is correctly matched, if the subject is "ababa", PCRE finds the palindrome "aba" at the start, then fails at top level because the end of the string does not follow. Once again, it cannot jump back into the recursion to try other alternatives, so the entire match fails.

Additional references

  • regular-expressions.info/Atomic grouping
    • (?>…) in some flavor is atomic grouping syntax
    • Lookarounds (?=…), (?!…), (?<=…), (?<!…), are all atomic
    • Possessive quantifier (e.g. a*+) is also atomic
    • PCRE recursive subpattern and subroutine calls are also atomic

A closer look at the pattern

The atomicity argument is correct, but perhaps it's not obvious how it explains why the pattern behaves as observed. Let's take a closer look and see how this all fits:

We will use the first pattern:

^(([a-z])(?1)\2|[a-z]?)$

I will use the following notation to denote the recursion:

  • 1 means the character was captured into group 2 in the first alternate
  • 2 means the character was matched by the second alternate
    • Or if the 2 is not above a character, the zero repetition option of ? is exercised
  • \ means the character was matched by the backreference to group 2 in first alternate
  • _ denotes the bottom of a recursive branch
    • This branch will NOT be reentered even if there are other alternatives!

Now let's consider "aaa" as input:

      _
1 1 1 2 
a a a   # This is the first bottom of the recursion,
        # now we go back to the third 1 and try to match \.
        # This fails, so the third 1 becomes 2.
    _
1 1 2
a a a   # Now we go back to the second 1 and try to match \.
        # This fails, so the second 1 becomes 2.
  _
1 2
a a a   # The second level matched! now we go back to the first level...

_____
1 2 \
a a a   # Now the first 1 can match \, and entire pattern matches!!


Now consider "aaaaa":

          _
1 1 1 1 1 2
a a a a a  # Fifth 1 can't match \, so it becomes 2. 
        _
1 1 1 1 2
a a a a a  # Fourth 1 can't match \, so it becomes 2.
    _____
1 1 1 2 /
a a a a a  # Here's a crucial point. The third 1 successfully matched.
           # Now we're back to the second 1 and try to match \, but this fails.
           # However, since PCRE recursion is atomic, the third 1 will NOT be
           # reentered to try 2. Instead, we try 2 on the second 1.
_____
1 2 \
a a a a a  # Anchors don't match, so the first 1 becomes 2, and then also the
           # anchors don't match, so the pattern fails to match.

Note that once a recursion level matches on the first alternative, the second alternative will not be attempted in the future (even if doing so may result in a may match), because PCRE subpattern recursion is atomic.


Now consider "aa":

    _
1 1 2 
a a
  _
1 2
a a  # The second level matched by taking the one repetition option on ?.
     # We now go back to the first level, and we can't match \.
     # Since PCRE recursion is atomic, we can't go back to the second level
     # to try the zero repetition option on ?.
_    
2
a a  # Anchors don't match, trying zero option on ? also doesn't help,
     # so the pattern fails to match!

Note that once a recursion level matches on the one repetition of the ? on the second alternative, the zero repetition option will not be attempted in the future (even if doing so may result in a may match), because PCRE subpattern recursion is atomic.


Now let's consider aaaaaaa

              _
1 1 1 1 1 1 1 2  
a a a a a a a 
            _
1 1 1 1 1 1 2  
a a a a a a a 
        _____
1 1 1 1 1 2 \  
a a a a a a a  # A crucial point: the fifth level matched and now the fourth
               # level can't match \, but it does NOT reenter the fifth level to
               # try 2. Instead, the fourth level tries 2.
    _____    
1 1 1 2 \  
a a a a a a a 
  _________    
1 1 1 2 \ \ 
a a a a a a a 
_____________    
1 1 1 2 \ \ \
a a a a a a a  # Entire pattern is a match! 

Note that even though PCRE subpattern recursion is atomic, it can still successfully match a palindrome consisting of a character repeating 2n-1 times.


Now, just for fun, let's try "abcba":

          _
1 1 1 1 1 2
a b c b a
        _
1 1 1 1 2
a b c b a

1 1 1 2 
a b c b a   # Third level attempts \, but c does not match a!
            # So we go back to third 1 and try 2.
  _____
1 1 2 \ 
a b c b a 
_________
1 1 2 \ \
a b c b a   # Entire pattern is a match!

That is, the pattern doesn't just match "only when a character repeats 2n-1 times". It can indeed match "abcba" (as seen on ideone.com). It can NOT, however, match "ababa", nor can it match "aaaaa" (see the WARNING on the man page!), because subpattern recursion in PCRE is atomic.

You can apply this same tracing process to explain the behavior of the pattern on any input.

这篇关于为什么仅当字符重复2 ^ n-1次时,此递归正则表达式才匹配?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆