Capturing Quantifiers and Quantifier Arithmetic

Question

At the outset, let me explain that this question is neither about how to capture groups, nor about how to use quantifiers, two features of regex I am perfectly familiar with. It is more of an advanced question for regex lovers who may be familiar with unusual syntax in exotic engines.

Capturing Quantifiers

Does anyone know if a regex flavor allows you to capture quantifiers? By this, I mean that the number of characters matched by quantifiers such as + and * would be counted, and that this number could be used again in another quantifier.

For instance, suppose you wanted to make sure you have the same number of Ls and Rs in this kind of string: LLLRRRRR

You could imagine a syntax such as

L(+)R{\q1}

where the + quantifier for the L is captured, and where the captured number is referred to in the quantifier for the R as {\q1}
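Since the syntax above is hypothetical, one rough way to get the same effect today is to capture the count in the host language and feed it into a second pattern. The Python sketch below assumes the whole string must be a run of Ls followed by a run of Rs; the helper name `same_count_l_r` is made up for illustration.

```python
import re

def same_count_l_r(text):
    # Step 1: measure the run matched by L+ (this plays the role of "L(+)").
    head = re.match(r"L+", text)
    if head is None:
        return False
    count = len(head.group())
    # Step 2: reuse the measured count as a fixed quantifier for R
    # (this plays the role of "R{\q1}").
    return re.fullmatch(r"L{%d}R{%d}" % (count, count), text) is not None

print(same_count_l_r("LLLRRR"))    # three Ls, three Rs
print(same_count_l_r("LLLRRRRR"))  # three Ls, five Rs
```

The two-step approach works, but it needs code outside the pattern, which is exactly what a captured quantifier would avoid.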

This would be useful to balance the number of {@,=,-,/} in strings such as @@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"
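For comparison, here is a minimal plain-code sketch of that balance check in Python. It assumes each delimiter must appear as exactly one run and it ignores the order of the runs; `delimiter_runs_balanced` is a hypothetical helper, not part of any library.

```python
import re

def delimiter_runs_balanced(text, delimiters="@=-/"):
    lengths = []
    for ch in delimiters:
        # Find every maximal run of this delimiter character.
        runs = re.findall(re.escape(ch) + "+", text)
        if len(runs) != 1:
            # Assumption of this sketch: exactly one run per delimiter.
            return False
        lengths.append(len(runs[0]))
    # Balanced when all runs have the same length.
    return len(set(lengths)) == 1

sample = '@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'
print(delimiter_runs_balanced(sample))
```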

Relation to Recursion

In some cases quantifier capture would elegantly replace recursion, for instance for a piece of text framed by the same number of Ls and Rs, as in

L(+) some_content R{\q1} 

The idea is presented in some detail on the following page: Captured Quantifiers

It also discusses a natural extension to captured quantifiers: quantifier arithmetic, for occasions when you want to match (3*x + 1) characters, where x is the number of characters matched earlier.
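As a hedged illustration of what quantifier arithmetic would express, the same two-step emulation can apply a formula to the measured count before building the second quantifier. The function name `matches_scaled_count` and its `scale` parameter are assumptions made for this sketch.

```python
import re

def matches_scaled_count(text, scale=lambda x: 3 * x + 1):
    # Measure x, the number of Ls matched by L+.
    head = re.match(r"L+", text)
    if head is None:
        return False
    x = len(head.group())
    # Require exactly scale(x) Rs after the Ls; by default 3*x + 1.
    pattern = r"L{%d}R{%d}" % (x, scale(x))
    return re.fullmatch(pattern, text) is not None

print(matches_scaled_count("LL" + "R" * 7))  # x = 2, so 3*2 + 1 = 7 Rs are required
```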

I am trying to find out if anything like this exists.

Thanks in advance for your insights!!!

Update

Casimir gave a fantastic answer that shows two methods to validate that various parts of a pattern have the same length. However, I wouldn't want to rely on either of those for everyday work. These are really tricks that demonstrate great showmanship. In my mind, these beautiful but complex methods confirm the premise of the question: a regex feature to capture the number of characters that quantifiers (such as + or *) are able to match would make such balancing patterns very simple and extend the syntax in a pleasingly expressive way.

Update 2 (later)

I found out that .NET has a feature that comes close to what I was asking about. Added an answer to demonstrate the feature.

Recommended answer

I don't know a regex engine that can capture a quantifier. However, it is possible with PCRE or Perl to use some tricks to check if you have the same number of characters. With your example:

@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"


You can check whether @, =, - and / are balanced with this pattern, which uses the famous Qtax trick (are you ready?): the "possessive-optional self-referencing group"

~(?<!@)((?:@(?=[^=]*(\2?+=)[^-]*(\3?+-)[^/]*(\4?+/)))+)(?!@)(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))~

Pattern details:

~                          # pattern delimiter
(?<!@)                     # negative lookbehind used as an @ boundary
(                          # first capturing group for the @
    (?:
        @                  # one @
        (?=                # checks that each @ is followed by the same number
                           # of = - /  
            [^=]*          # all that is not an =
            (\2?+=)        # The possessive optional self-referencing group:
                           # capture group 2: backreference to itself + one = 
            [^-]*(\3?+-)   # the same for -
            [^/]*(\4?+/)   # the same for /
        )                  # close the lookahead
    )+                     # close the non-capturing group and repeat
)                          # close the first capturing group
(?!@)                      # negative lookahead used as an @ boundary too.

# this checks the boundaries for all groups
(?=[^=]*\2(?!=)[^-]*\3(?!-)[^/]*\4(?!/))
~

Main idea

The non-capturing group contains only one @. Each time this group is repeated a new character is added in capture groups 2, 3 and 4.

The possessive-optional self-referencing group

How does it work?

( (?: @ (?= [^=]* (\2?+ = ) .....) )+ )

At the first occurrence of the @ character, capture group 2 is not yet defined, so you cannot write something like (\2 =): the backreference to an unset group would make the pattern fail. To avoid the problem, make the backreference optional: \2?

The second aspect of this group is that the number of = characters matched is incremented at each repetition of the non-capturing group, since one = is added each time. To ensure that this number always increases (or the pattern fails), the possessive quantifier forces the backreference to be matched first, with no backtracking into it, before a new = character is added.

Note that this group can be read as a conditional: if group 2 exists, match it, then match the next =

( (?(2)\2) = )
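To make the growth of the capture group concrete, here is a hand-written Python model of the loop for a single delimiter pair. It is not the regex itself (Python's standard re module rejects a backreference to a group that is still open), only a sketch of the same bookkeeping. It assumes the @ run sits at the very start of the string, and `grow_capture` is a name invented for this example.

```python
def grow_capture(text, opener="@", closer="="):
    # Length of the leading run of '@' (the part matched by the outer group).
    n_open = len(text) - len(text.lstrip(opener))
    if n_open == 0:
        return False
    # Position reached by [^=]*: the first '=' after the '@' run.
    first = text.find(closer, n_open)
    if first == -1:
        return False
    group = ""  # plays the role of capture group 2, empty before the first pass
    for _ in range(n_open):  # one pass per '@', like (?: ... )+
        candidate = group + closer  # previous capture, then one more '='
        if not text.startswith(candidate, first):
            return False  # the lookahead fails: not enough '=' characters
        group = candidate  # group 2 has grown by one character
    # Final boundary check, like \2(?!=): no extra '=' after the capture.
    return not text.startswith(closer, first + len(group))

print(grow_capture('@@ "x" == "y"'))
print(grow_capture('@@ "x" === "y"'))
```

Each pass requires one more = than the previous one, which is the effect the self-referencing group achieves inside the lookahead.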

The recursive way

~(?<!@)(?=(@(?>[^@=]+|(?-1))*=)(?!=))(?=(@(?>[^@-]+|(?-1))*-)(?!-))(?=(@(?>[^@/]+|(?-1))*/)(?!/))~

You need to use overlapped matches, since the @ part is used several times; this is why the whole pattern is inside lookarounds.

Pattern details:

(?<!@)                # left @ boundary
(?=                   # open a lookahead (to allow overlapped matches)
    (                 # open a capturing group
        @
        (?>           # open an atomic group
            [^@=]+    # all that is not an @ or an =, one or more times
          |           # OR
            (?-1)     # recursion: the last defined capturing group (the current here)
        )*            # repeat zero or more the atomic group
        =             #
    )                 # close the capture group
    (?!=)             # checks the = boundary
)                     # close the lookahead
(?=(@(?>[^@-]+|(?-1))*-)(?!-))  # the same for -
(?=(@(?>[^@/]+|(?-1))*/)(?!/))  # the same for /

The main difference from the previous pattern is that this one doesn't care about the order of the =, - and / groups. (However, you can easily make some changes to the first pattern to deal with that, with character classes and negative lookaheads.)
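The recursive pattern treats @ and each closing character like nested parentheses. A depth counter gives a rough Python model of one lookahead, and running it once per closing character mirrors the three independent lookaheads. This sketch assumes the @ run starts the string; the names `balanced_with` and `all_closers_balanced` are hypothetical.

```python
def balanced_with(text, closer, opener="@"):
    # Models (@(?>[^@=]+|(?-1))*=)(?!=) for one closing character.
    if not text.startswith(opener):
        return False
    depth = 0
    for i, ch in enumerate(text):
        if ch == opener:
            depth += 1  # entering one more nested level
        elif ch == closer:
            depth -= 1  # closing the innermost open level
            if depth == 0:
                # Outermost level closed: apply the boundary check (?!=).
                return not text.startswith(closer, i + 1)
        # Any other character is skipped, like [^@=]+ does.
    return False  # ran out of text before every '@' was closed

def all_closers_balanced(text, closers="=-/"):
    # Each closer is checked on its own, so their order does not matter.
    return all(balanced_with(text, c) for c in closers)

sample = '@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"'
print(all_closers_balanced(sample))
```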

Note: For the example string, to be more specific, you can replace the negative lookbehind with an anchor (^ or \A). And if you want to obtain the whole string as the match result, you must add .* at the end (otherwise the match result will be empty, as playful noticed).
