为什么不能在断言后面的零宽度外观中使用重复量词? [英] Why can't you use repetition quantifiers in zero-width look behind assertions?

查看:43
本文介绍了为什么不能在断言后面的零宽度外观中使用重复量词?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直觉得你不能在零宽度断言(Perl Compatible Regular Expressions [PCRE])中使用重复量词.但是,最近我发现您可以在前瞻断言中使用它们.

I was always under the impression that you couldn't use repetition quantifiers in zero-width assertions (Perl Compatible Regular Expressions [PCRE]). However, it has recently transpired to me that you can use them in look ahead assertions.

PCRE 正则表达式引擎在使用零宽度后视搜索时如何工作,从而阻止使用重复量词?

这是 R 中 PCRE 的一个简单示例:

Here is a simple example from a PCRE in R:

# Our string
x <- 'MaaabcccM'

##  Does it contain a 'b', preceeded by an 'a' and followed by zero or more 'c',
##  then an 'M'?
grepl( '(?<=a)b(?=c*M)' , x , perl=T )
# [1] TRUE

##  Does it contain a 'b': (1) preceeded by an 'M' and then zero or more 'a' and
##                         (2) followed by zero or more 'c' then an 'M'?
grepl( '(?<=Ma*)b(?=c*M)' , x , perl = TRUE )
# Error in grepl("(?<=Ma*)b(?=c*M)", x, perl = TRUE) :
#   invalid regular expression '(?<M=a*)b(?=c*M)'
# In addition: Warning message:
# In grepl("(?<=Ma*)b(?=c*M)", x, perl = TRUE) : PCRE pattern compilation error
#         'lookbehind assertion is not fixed length'
#         at ')b(?=c*M)'

推荐答案

此类问题的最终答案在引擎代码中,在答案的底部,您将能够深入到 PCRE 部分引擎的代码负责确保后视中的固定长度 - 如果您有兴趣了解最细微的细节.同时,让我们从更高的层面逐步放大问题.

The ultimate answer to such a question is in the engine's code, and at the bottom of the answer you'll be able to dive into the section of the PCRE engine's code responsible for ensuring fixed-length in lookbehinds—if you're interested in knowing the finest details. In the meantime, let's gradually zoom into the question from higher levels.

可变宽度后视与无限宽度后视

首先,快速澄清条款.越来越多的引擎(包括 PCRE)支持某种形式的可变宽度后视,其中变化落在确定的范围内,例如:

First off, a quick clarification on terms. A growing number of engines (including PCRE) support some form of variable-width lookbehind, where the variation falls within a determined range, for instance:

  • 引擎知道前面的宽度必须 5到10个字符(PCRE不支持)
  • 引擎知道前面的宽度必须 5 10个字符(PCRE支持)
  • the engine knows that the width of what precedes must be within 5 to ten characters (not supported in PCRE)
  • the engine knows that the width of what precedes must be either 5 or ten character (supported in PCRE)

相反,在无限宽度的lookbehind中,你可以使用量化的token,比如a+

In contrast, in infinite-width lookbehind, you can use quantified tokens such as a+

支持无限宽度后视的引擎

根据记录,这些引擎支持无限后视:

For the record, these engines support infinite lookbehind:

  • .NET(C#、VB.NET 等)
  • Matthew Barnett 的 regex Python 模块
  • JGSoft(EditPad 等;不适用于编程语言).

据我所知,他们是唯一的.

As far as I know, they are the only ones.

PCRE 中的变量回溯

在 PCRE 中,文档中最相关的部分是这样的:

In PCRE, the most relevant section in the documentation is this:

后视断言的内容受到限制,使得所有它匹配的字符串必须具有固定长度.但是,如果有几个顶级替代品,它们不必都具有相同的固定长度.

The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several top-level alternatives, they do not all have to have the same fixed length.

因此,以下回顾是有效的:

Therefore, the following lookbehind is valid:

(?<=a |big )cat

但是,这些都不是:

  • (?<=a\s?|big )cat(交替的边没有固定的宽度)
  • (?<=@{1,10})cat(可变宽度)
  • (?<=\R)cat (\R 没有固定宽度,因为它可以匹配 \n,\r\n 等)
  • (?<=\X)cat (\X 没有固定宽度,因为 Unicode 字素簇可以包含可变数量的字节.)
  • (?<=a+)cat(显然不是固定的)
  • (?<=a\s?|big )cat (the sides of the alternation do not have a fixed width)
  • (?<=@{1,10})cat (variable width)
  • (?<=\R)cat (\R does not have a fixed-width as it can match \n, \r\n, etc.)
  • (?<=\X)cat (\X does not have a fixed-width as a Unicode grapheme cluster can contain a variable number of bytes.)
  • (?<=a+)cat (clearly not fixed)

零宽度匹配但无限重复的回顾

现在考虑:

(?<=(?=@+))(cat#+)

从表面上看,这是一个固定宽度的lookbehind,因为它只能找到零宽度匹配(由lookahead (?=@++)定义).这是绕过无限后视限制的技巧吗?

On the face of it, this is a fixed-width lookbehind, because it can only ever find a zero-width match (defined by the lookahead (?=@++)). Is that a trick to get around the infinite lookbehind limitation?

没有.PCRE 会因此而窒息.即使后视的内容为零宽度,PCRE 也不允许在后视中无限重复.任何地方.当文档说它匹配的所有字符串必须具有固定长度时,它应该是:

No. PCRE will choke on this. Even though the content of the lookbehind is zero-width, PCRE will not allow infinite repetition in the lookbehind. Anywhere. When the documentation says all the strings it matches must have a fixed length, it should really be:

与其任何组件匹配的所有字符串都必须具有固定的长度.

All the strings that any of its components matches must have a fixed length.

变通方法:没有无限回溯的生活

在 PCRE 中,用于解决无限后视问题的两个主要解决方案是 \K 和捕获组.

In PCRE, the two main solutions to problems where infinite lookbehinds would help are \K and capture Groups.

解决方法 #1:\K

\K 断言告诉引擎从它返回的最终匹配中删除到目前为止匹配的内容.

The \K assertion tells the engine to drop what was matched so far from the final match it returns.

假设您想要 (?<=@+)cat#+,这在 PCRE 中是不合法的.相反,您可以使用:

Suppose you want (?<=@+)cat#+, which is not legal in PCRE. Instead, you can use:

@+\Kcat#+

解决方法 2:捕获组

另一种继续进行的方法是匹配您在后视中放置的任何内容,并在捕获组中捕获感兴趣的内容.然后从捕获组中检索匹配项.

Another way to proceed is to match whatever you would have placed in a lookbehind, and to capture the content of interest in a capture group. You then retrieve the match from the capture group.

例如,代替非法的 (?<=@+)cat#+,您可以使用:

For instance, instead of the illegal (?<=@+)cat#+, you would use:

@+(cat#+)

在 R 中,这可能是这样的:

In R, this could look like this:

matches <- regexpr("@+(cat#+)", subject, perl=TRUE);
result <- attr(matches, "capture.start")[,1]
attr(result, "match.length") <- attr(matches, "capture.length")[,1]
regmatches(subject, result)

在不支持 \K 的语言中,这通常是唯一的解决方案.

In languages that don't support \K, this is often the only solution.

引擎内部:PCRE 代码说明了什么?

最终答案可以在 pcre_compile.c 中找到.如果您检查以该注释开头的代码块:

The ultimate answer is to be found in pcre_compile.c. If you examine the code block that starts with this comment:

如果回溯,检查这个分支是否匹配一个固定长度的字符串

If lookbehind, check that this branch matches a fixed-length string

您发现繁重的工作是由 find_fixedlength() 函数完成的.

You find that the grunt work is done by the find_fixedlength() function.

我在这里复制它给任何想要深入了解更多细节的人.

I reproduce it here for anyone who would like to dive into further details.

static int
find_fixedlength(pcre_uchar *code, BOOL utf, BOOL atend, compile_data *cd)
{
int length = -1;

register int branchlength = 0;
register pcre_uchar *cc = code + 1 + LINK_SIZE;

/* Scan along the opcodes for this branch. If we get to the end of the
branch, check the length against that of the other branches. */

for (;;)
  {
  int d;
  pcre_uchar *ce, *cs;
  register pcre_uchar op = *cc;

  switch (op)
    {
    /* We only need to continue for OP_CBRA (normal capturing bracket) and
    OP_BRA (normal non-capturing bracket) because the other variants of these
    opcodes are all concerned with unlimited repeated groups, which of course
    are not of fixed length. */

    case OP_CBRA:
    case OP_BRA:
    case OP_ONCE:
    case OP_ONCE_NC:
    case OP_COND:
    d = find_fixedlength(cc + ((op == OP_CBRA)? IMM2_SIZE : 0), utf, atend, cd);
    if (d < 0) return d;
    branchlength += d;
    do cc += GET(cc, 1); while (*cc == OP_ALT);
    cc += 1 + LINK_SIZE;
    break;

    /* Reached end of a branch; if it's a ket it is the end of a nested call.
    If it's ALT it is an alternation in a nested call. An ACCEPT is effectively
    an ALT. If it is END it's the end of the outer call. All can be handled by
    the same code. Note that we must not include the OP_KETRxxx opcodes here,
    because they all imply an unlimited repeat. */

    case OP_ALT:
    case OP_KET:
    case OP_END:
    case OP_ACCEPT:
    case OP_ASSERT_ACCEPT:
    if (length < 0) length = branchlength;
      else if (length != branchlength) return -1;
    if (*cc != OP_ALT) return length;
    cc += 1 + LINK_SIZE;
    branchlength = 0;
    break;

    /* A true recursion implies not fixed length, but a subroutine call may
    be OK. If the subroutine is a forward reference, we can't deal with
    it until the end of the pattern, so return -3. */

    case OP_RECURSE:
    if (!atend) return -3;
    cs = ce = (pcre_uchar *)cd->start_code + GET(cc, 1);  /* Start subpattern */
    do ce += GET(ce, 1); while (*ce == OP_ALT);           /* End subpattern */
    if (cc > cs && cc < ce) return -1;                    /* Recursion */
    d = find_fixedlength(cs + IMM2_SIZE, utf, atend, cd);
    if (d < 0) return d;
    branchlength += d;
    cc += 1 + LINK_SIZE;
    break;

    /* Skip over assertive subpatterns */

    case OP_ASSERT:
    case OP_ASSERT_NOT:
    case OP_ASSERTBACK:
    case OP_ASSERTBACK_NOT:
    do cc += GET(cc, 1); while (*cc == OP_ALT);
    cc += PRIV(OP_lengths)[*cc];
    break;

    /* Skip over things that don't match chars */

    case OP_MARK:
    case OP_PRUNE_ARG:
    case OP_SKIP_ARG:
    case OP_THEN_ARG:
    cc += cc[1] + PRIV(OP_lengths)[*cc];
    break;

    case OP_CALLOUT:
    case OP_CIRC:
    case OP_CIRCM:
    case OP_CLOSE:
    case OP_COMMIT:
    case OP_CREF:
    case OP_DEF:
    case OP_DNCREF:
    case OP_DNRREF:
    case OP_DOLL:
    case OP_DOLLM:
    case OP_EOD:
    case OP_EODN:
    case OP_FAIL:
    case OP_NOT_WORD_BOUNDARY:
    case OP_PRUNE:
    case OP_REVERSE:
    case OP_RREF:
    case OP_SET_SOM:
    case OP_SKIP:
    case OP_SOD:
    case OP_SOM:
    case OP_THEN:
    case OP_WORD_BOUNDARY:
    cc += PRIV(OP_lengths)[*cc];
    break;

    /* Handle literal characters */

    case OP_CHAR:
    case OP_CHARI:
    case OP_NOT:
    case OP_NOTI:
    branchlength++;
    cc += 2;
#ifdef SUPPORT_UTF
    if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
    break;

    /* Handle exact repetitions. The count is already in characters, but we
    need to skip over a multibyte character in UTF8 mode.  */

    case OP_EXACT:
    case OP_EXACTI:
    case OP_NOTEXACT:
    case OP_NOTEXACTI:
    branchlength += (int)GET2(cc,1);
    cc += 2 + IMM2_SIZE;
#ifdef SUPPORT_UTF
    if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
    break;

    case OP_TYPEEXACT:
    branchlength += GET2(cc,1);
    if (cc[1 + IMM2_SIZE] == OP_PROP || cc[1 + IMM2_SIZE] == OP_NOTPROP)
      cc += 2;
    cc += 1 + IMM2_SIZE + 1;
    break;

    /* Handle single-char matchers */

    case OP_PROP:
    case OP_NOTPROP:
    cc += 2;
    /* Fall through */

    case OP_HSPACE:
    case OP_VSPACE:
    case OP_NOT_HSPACE:
    case OP_NOT_VSPACE:
    case OP_NOT_DIGIT:
    case OP_DIGIT:
    case OP_NOT_WHITESPACE:
    case OP_WHITESPACE:
    case OP_NOT_WORDCHAR:
    case OP_WORDCHAR:
    case OP_ANY:
    case OP_ALLANY:
    branchlength++;
    cc++;
    break;

    /* The single-byte matcher isn't allowed. This only happens in UTF-8 mode;
    otherwise \C is coded as OP_ALLANY. */

    case OP_ANYBYTE:
    return -2;

    /* Check a class for variable quantification */

    case OP_CLASS:
    case OP_NCLASS:
#if defined SUPPORT_UTF || defined COMPILE_PCRE16 || defined COMPILE_PCRE32
    case OP_XCLASS:
    /* The original code caused an unsigned overflow in 64 bit systems,
    so now we use a conditional statement. */
    if (op == OP_XCLASS)
      cc += GET(cc, 1);
    else
      cc += PRIV(OP_lengths)[OP_CLASS];
#else
    cc += PRIV(OP_lengths)[OP_CLASS];
#endif

    switch (*cc)
      {
      case OP_CRSTAR:
      case OP_CRMINSTAR:
      case OP_CRPLUS:
      case OP_CRMINPLUS:
      case OP_CRQUERY:
      case OP_CRMINQUERY:
      case OP_CRPOSSTAR:
      case OP_CRPOSPLUS:
      case OP_CRPOSQUERY:
      return -1;

      case OP_CRRANGE:
      case OP_CRMINRANGE:
      case OP_CRPOSRANGE:
      if (GET2(cc,1) != GET2(cc,1+IMM2_SIZE)) return -1;
      branchlength += (int)GET2(cc,1);
      cc += 1 + 2 * IMM2_SIZE;
      break;

      default:
      branchlength++;
      }
    break;

    /* Anything else is variable length */

    case OP_ANYNL:
    case OP_BRAMINZERO:
    case OP_BRAPOS:
    case OP_BRAPOSZERO:
    case OP_BRAZERO:
    case OP_CBRAPOS:
    case OP_EXTUNI:
    case OP_KETRMAX:
    case OP_KETRMIN:
    case OP_KETRPOS:
    case OP_MINPLUS:
    case OP_MINPLUSI:
    case OP_MINQUERY:
    case OP_MINQUERYI:
    case OP_MINSTAR:
    case OP_MINSTARI:
    case OP_MINUPTO:
    case OP_MINUPTOI:
    case OP_NOTMINPLUS:
    case OP_NOTMINPLUSI:
    case OP_NOTMINQUERY:
    case OP_NOTMINQUERYI:
    case OP_NOTMINSTAR:
    case OP_NOTMINSTARI:
    case OP_NOTMINUPTO:
    case OP_NOTMINUPTOI:
    case OP_NOTPLUS:
    case OP_NOTPLUSI:
    case OP_NOTPOSPLUS:
    case OP_NOTPOSPLUSI:
    case OP_NOTPOSQUERY:
    case OP_NOTPOSQUERYI:
    case OP_NOTPOSSTAR:
    case OP_NOTPOSSTARI:
    case OP_NOTPOSUPTO:
    case OP_NOTPOSUPTOI:
    case OP_NOTQUERY:
    case OP_NOTQUERYI:
    case OP_NOTSTAR:
    case OP_NOTSTARI:
    case OP_NOTUPTO:
    case OP_NOTUPTOI:
    case OP_PLUS:
    case OP_PLUSI:
    case OP_POSPLUS:
    case OP_POSPLUSI:
    case OP_POSQUERY:
    case OP_POSQUERYI:
    case OP_POSSTAR:
    case OP_POSSTARI:
    case OP_POSUPTO:
    case OP_POSUPTOI:
    case OP_QUERY:
    case OP_QUERYI:
    case OP_REF:
    case OP_REFI:
    case OP_DNREF:
    case OP_DNREFI:
    case OP_SBRA:
    case OP_SBRAPOS:
    case OP_SCBRA:
    case OP_SCBRAPOS:
    case OP_SCOND:
    case OP_SKIPZERO:
    case OP_STAR:
    case OP_STARI:
    case OP_TYPEMINPLUS:
    case OP_TYPEMINQUERY:
    case OP_TYPEMINSTAR:
    case OP_TYPEMINUPTO:
    case OP_TYPEPLUS:
    case OP_TYPEPOSPLUS:
    case OP_TYPEPOSQUERY:
    case OP_TYPEPOSSTAR:
    case OP_TYPEPOSUPTO:
    case OP_TYPEQUERY:
    case OP_TYPESTAR:
    case OP_TYPEUPTO:
    case OP_UPTO:
    case OP_UPTOI:
    return -1;

    /* Catch unrecognized opcodes so that when new ones are added they
    are not forgotten, as has happened in the past. */

    default:
    return -4;
    }
  }
/* Control never gets here */
}

这篇关于为什么不能在断言后面的零宽度外观中使用重复量词?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆