Is there a way to define custom shorthands in regular expressions?

Question

I have a regular expression of the following form:

import re

def parse(self, format_string):
    for m in re.finditer(
        r"""(?: \$ \( ( [^)]+ ) \) )   # the field access specifier
          | (
                (?:
                    \n | . (?= \$ \( ) # any one single character before the '$('
                )
              | (?:
                    \n | . (?! \$ \( ) # any one single character, except the one before the '$('
                )*
            )""",
        format_string,
        re.VERBOSE):
        ...

I would like to replace every repeated occurrence of the sequence \$ \( with a custom shorthand "constant", which would look like this:

def parse(self, format_string):
    re.<something>('\BEGIN = \$\(')
    for m in re.finditer(
        r"""(?: \BEGIN ( [^)]+ ) \) )   # the field access specifier
          | (
                (?:
                    \n | . (?= \BEGIN ) # any one single character before the '$('
                )
              | (?:
                    \n | . (?! \BEGIN ) # any one single character, except the one before the '$('
                )*
            )""",
        format_string,
        re.VERBOSE):
        ...

Is there a way to do this with regular expressions themselves (i.e. without using Python's string formatting to substitute \BEGIN with \$\()?

Clarification: the Python source is purely for context and illustration. I'm looking for an RE solution that is available in some RE dialect (maybe not Python's), not a solution specific to Python.

Recommended answer

I don't think this is possible in the regex flavor of Python's built-in re module. You would need recursion (or rather pattern reuse), which PCRE supports, as do Perl and a few other flavors. In fact, the PCRE man page even describes how to define shorthands (search for "Defining subpatterns").

In PCRE, you can use the recursion syntax in a similar way to backreferences, except that the pattern is applied again instead of trying to match the same literal text that the group captured. Example:

/(\d\d)-(?1)-(?1)/

This matches something like a date (where (?1) is replaced with \d\d and evaluated again). This is really powerful, because if you use this construct within the referenced group itself, you get recursion, but we don't even need that here. The above also works with named groups:

/(?<my>\d\d)-(?&my)-(?&my)/
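
To make the difference between a backreference and pattern reuse concrete, here is a minimal sketch in Python. Python's built-in re module has backreferences but no (?1), so the reuse is emulated by writing the subpattern out again; the variable names are only illustrative.

```python
import re

# Backreference: \1 must repeat the exact text that group 1 captured.
backref = re.compile(r"(\d\d)-\1-\1")

# Emulation of PCRE's (\d\d)-(?1)-(?1): the subpattern \d\d is applied
# again at each position, so the digits may differ.
reuse_equivalent = re.compile(r"(\d\d)-\d\d-\d\d")

print(bool(backref.fullmatch("12-12-12")))           # same text three times
print(bool(backref.fullmatch("12-34-56")))           # different digits
print(bool(reuse_equivalent.fullmatch("12-34-56")))  # different digits
```

The backreference accepts only the first string, while the spelled-out pattern accepts any three two-digit numbers, which is what (?1) gives you in PCRE without the repetition.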

Now we're already really close, but the definition is also the first use of the pattern, which clutters up the expression somewhat. The trick is to put the definition in a position that is never evaluated. The man page suggests a conditional that depends on a (non-existent) group named DEFINE:

/
(?(DEFINE)
  (?<my>\d\d)
)
(?&my)-(?&my)-(?&my)
/x

The construct (?(group)true|false) applies the pattern true if the group named group has already matched at that point, and the (optional) pattern false otherwise. Since there is no group called DEFINE, the condition is always false and the true pattern is skipped. Hence we can put all kinds of definitions there without worrying that they will ever be applied and mess up our results. This way we get them into the pattern without actually using them.
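
The conditional itself (though not the DEFINE trick or (?&name)) is also available in Python's built-in re module, so its behaviour can be sketched there. The pattern below is a made-up example, not part of the original answer: the closing bracket is required only if the optional group 1 took part in the match.

```python
import re

# Group 1 optionally matches '<'. The conditional (?(1)>) then demands
# a closing '>' only when group 1 actually matched; there is no "false"
# branch, so otherwise nothing further is required.
bracketed = re.compile(r"(<)?\w+(?(1)>)")

for candidate in ["<tag>", "tag", "<tag", "tag>"]:
    print(candidate, bool(bracketed.fullmatch(candidate)))
```

Only "<tag>" and "tag" match in full. With (?(DEFINE)...) the condition can never be true, so the "true" branch is parsed but never applied.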

An alternative is a negative lookahead that never reaches the point where the expression is defined:

/
(?!
  (?!)     # fail - this makes the surrounding lookahead pass unconditionally
  # the engine never gets here; now we can write down our definitions
  (?<my>\d\d) 
)
(?&my)-(?&my)-(?&my)
/x

However, you only really need this form if your flavor has named pattern reuse but no conditionals (and I don't think such a flavor exists). The DEFINE variant has the advantage that it makes obvious what the group is for, while the lookahead approach is a bit obfuscated.

Coming back to your original example:

/
# Definitions
(?(DEFINE)
  (?<BEGIN>[$][(])
)
# And now your pattern
  (?: (?&BEGIN) ( [^)]+ ) \) ) # the field access specifier
|
  (
    (?: # any one single character before the '$('
      \n | . (?= (?&BEGIN) ) 
    )
  | 
    (?: # any one single character, except the one before the '$('
      \n | . (?! (?&BEGIN) ) 
    )*
  )
/x
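
Python's built-in re module cannot run the pattern above, but its matching behaviour can be checked with a rough equivalent in which BEGIN is substituted by string formatting, i.e. exactly the workaround the question wanted to avoid. This is only a sketch; the names BEGIN, TOKEN and tokenize are hypothetical helpers introduced for illustration.

```python
import re

BEGIN = r"[$][(]"  # same subpattern as the (?<BEGIN>...) definition

TOKEN = re.compile(
    rf"""
      (?: {BEGIN} ( [^)]+ ) \) )       # the field access specifier
    |
      (
        (?: \n | . (?= {BEGIN} ) )     # one character right before the opener
      |
        (?: \n | . (?! {BEGIN} ) )*    # characters not followed by the opener
      )
    """,
    re.VERBOSE,
)

def tokenize(format_string):
    """Split a format string into ("field", name) and ("text", chunk) pairs."""
    tokens = []
    for m in TOKEN.finditer(format_string):
        if m.group(1) is not None:
            tokens.append(("field", m.group(1)))
        elif m.group(2):  # skip the empty match at the end of the input
            tokens.append(("text", m.group(2)))
    return tokens

print(tokenize("Hello $(name)!"))
```

Note that the single character directly before '$(' comes out as its own text chunk, because only the first branch of the second alternative can consume it.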

There are two major caveats to this approach:

  1. Recursive references are atomic (this holds for classic PCRE; PCRE2 relaxed it from release 10.30 on). That is, once the reference has matched something, it will never be backtracked into. In certain cases this means you have to be a bit clever in crafting your expression, so that the first match is always the one you want.
  2. You cannot use capturing inside the defined patterns. If you use something like (?<myPattern>a(b)c) and reuse it, the b will never be captured: when a pattern is reused, all of its groups are non-capturing.

The most important advantage over any kind of interpolation or concatenation, however, is that you can never produce invalid patterns this way, and you cannot mess up your capturing group counts either.
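
For comparison, here is a small sketch of how plain interpolation can go wrong in Python. The fragments PART and SEP are hypothetical and deliberately careless; the point is only to show the two failure modes mentioned above.

```python
import re

# Pitfall 1: an interpolated fragment brings its own capturing group
# and silently shifts the numbering of the groups that follow it.
PART = r"(\d\d)"
shifted = re.compile(rf"{PART}-{PART}: (\w+)")
m = shifted.fullmatch("12-34: note")
print(shifted.groups, m.group(1), m.group(3))

# Pitfall 2: a fragment with a top-level alternation changes the
# meaning of the surrounding pattern once it is pasted in.
SEP = r"-|/"
careless = re.compile(rf"\d\d{SEP}\d\d")   # becomes \d\d-|/\d\d
print(bool(careless.fullmatch("12/34")), bool(careless.fullmatch("12-")))
```

In the first pattern the word ends up in group 3 rather than group 1, and the second pattern rejects "12/34" while accepting "12-". A fragment defined inside (?(DEFINE)...) is a group of its own and does not capture when reused, so neither problem arises there.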
