Reuse part of a Regex pattern

Problem description

Consider this (very simplified) example string:

1aw2,5cx7

As you can see, it is two digit/letter/letter/digit values separated by a comma.

Now, I could match this with the following:

>>> from re import match
>>> match(r"\d\w\w\d,\d\w\w\d", "1aw2,5cx7")
<_sre.SRE_Match object at 0x01749D40>
>>>

The problem, though, is that I have to write \d\w\w\d twice. With small patterns this isn't so bad, but with more complex regexes, writing the exact same thing twice makes the final pattern enormous and cumbersome to work with. It also seems redundant.

I tried using a named capture group:

>>> from re import match
>>> match(r"(?P<id>\d\w\w\d),(?P=id)", "1aw2,5cx7")
>>>

But it didn't work because it was looking for two occurrences of 1aw2, not digit/letter/letter/digit.
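To make that difference concrete, here is a minimal sketch using only the standard re module. The helper name backref_matches and the second input string are made up for illustration.

```python
import re

# (?P=id) is a backreference: it must match the same *text* that group "id"
# captured, not the same sub-pattern.
BACKREF = re.compile(r"(?P<id>\d\w\w\d),(?P=id)")


def backref_matches(text):
    """Return True if the backreference pattern matches at the start of text."""
    return BACKREF.match(text) is not None


if __name__ == "__main__":
    print(backref_matches("1aw2,5cx7"))  # different text after the comma
    print(backref_matches("1aw2,1aw2"))  # identical text after the comma
```

The first call prints False and the second prints True, which is why the named group alone does not solve the problem.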

Is there any way to save part of a pattern, such as \d\w\w\d, so it can be used later on in the same pattern? In other words, can I reuse a sub-pattern in a pattern?

Solution

No, when using the standard library re module, regular expression patterns cannot be 'symbolized'.

You can always do so by re-using Python variables, of course:

digit_letter_letter_digit = r'\d\w\w\d'

then use string formatting to build the larger pattern:

match(r"{0},{0}".format(digit_letter_letter_digit), inputtext)

or, using Python 3.6+ f-strings:

dlld = r'\d\w\w\d'
match(fr"{dlld},{dlld}", inputtext)

I often use this technique to compose larger, more complex patterns from re-usable sub-patterns.
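As a self-contained sketch of that composition approach, using only the standard library: the names DLLD, PAIR, TRIPLE and is_pair below are illustrative choices, not part of any API.

```python
import re

# Hypothetical sub-pattern names, chosen for illustration.
DLLD = r"\d\w\w\d"                     # digit, word char, word char, digit
PAIR = fr"{DLLD},{DLLD}"               # two values separated by a comma
# In an f-string (or str.format), literal quantifier braces must be doubled.
TRIPLE = fr"{DLLD}(?:,{DLLD}){{2}}"    # three comma-separated values

PAIR_RE = re.compile(PAIR)
TRIPLE_RE = re.compile(TRIPLE)


def is_pair(text):
    """Return True if the whole string is two dlld values joined by a comma."""
    return PAIR_RE.fullmatch(text) is not None


if __name__ == "__main__":
    print(PAIR)                  # the expanded pattern text
    print(is_pair("1aw2,5cx7"))
```

One caveat with this approach: if a sub-pattern contains a top-level alternation such as a|b, wrap it in a non-capturing group, (?:a|b), before embedding it, so the alternation does not spill over into the surrounding pattern.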

If you are prepared to install an external library, the regex project can solve this problem with a regex subroutine call. The syntax (?N), where N is a group number as in (?1), re-uses the pattern of an already used (implicitly numbered) capturing group:

(\d\w\w\d),(?1)
^........^ ^..^
|           \
|             re-use pattern of capturing group 1  
\
  capturing group 1

You can do the same with named capturing groups, where (?<groupname>...) is the named group groupname, and (?&groupname), (?P&groupname) or (?P>groupname) re-use the pattern of groupname rather than the text it matched (the latter two forms are alternatives for compatibility with other engines).

And finally, regex supports the (?(DEFINE)...) block to 'define' subroutine patterns without them actually matching anything at that stage. You can put multiple (..) and (?<name>...) capturing groups in that construct to then later refer to them in the actual pattern:

(?(DEFINE)(?<dlld>\d\w\w\d))(?&dlld),(?&dlld)
          ^...............^ ^......^ ^......^
          |                    \       /          
 creates 'dlld' pattern      uses 'dlld' pattern twice
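The subroutine forms described above can be tried side by side with a short sketch. It assumes the third-party regex package is installed (pip install regex). The guarded import, the PATTERNS table and the try_patterns helper are conveniences invented for this example.

```python
# Requires the third-party "regex" package; the stdlib re module rejects these patterns.
try:
    import regex
except ImportError:  # keep the sketch importable when the package is missing
    regex = None

TEXT = "1aw2,5cx7"

# One pattern per subroutine-call form.
PATTERNS = {
    "numbered": r"(\d\w\w\d),(?1)",
    "named": r"(?<dlld>\d\w\w\d),(?&dlld)",
    "define": r"(?(DEFINE)(?<dlld>\d\w\w\d))(?&dlld),(?&dlld)",
}


def try_patterns(text=TEXT):
    """Return {label: matched text or None}; empty dict if regex is unavailable."""
    if regex is None:
        return {}
    results = {}
    for label, pattern in PATTERNS.items():
        m = regex.fullmatch(pattern, text)
        results[label] = m.group(0) if m else None
    return results


if __name__ == "__main__":
    print(try_patterns() or "regex package not installed")
```

With the package installed, each of the three patterns should match the whole sample string, because the second half re-uses the sub-pattern rather than the captured text.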

Just to be explicit: the standard library re module does not support subroutine patterns.
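A quick way to confirm this on your own interpreter is to ask re to compile the patterns and see whether it raises. The helper name stdlib_accepts is made up for this sketch.

```python
import re


def stdlib_accepts(pattern):
    """Return True if the standard library re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


if __name__ == "__main__":
    print(stdlib_accepts(r"(\d\w\w\d),(?1)"))           # subroutine call by number
    print(stdlib_accepts(r"(?P<id>\d\w\w\d),(?P=id)"))  # ordinary backreference
```

The subroutine call is rejected at compile time, while the ordinary named backreference compiles fine.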
