正则表达式中的递归模式 [英] Recursive pattern in regex

查看:85
本文介绍了正则表达式中的递归模式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这与匹配外括号的正则表达式非常相关,我特别想知道如何或是否可以做到这一点正则表达式的递归模式?我还没有找到使用这种策略的 Python 示例,所以我认为这应该是一个有用的问题!

看过 一些 声明 递归模式可用于匹配平衡括号,但没有使用python的regex 包(注意:re支持递归模式,需要使用regex).

I've seen some claims that recursive patterns can be used to match balanced parenthesis, but no examples using python's regex package (Note: re does not support recursive pattern, you need to use regex).

一个声明是语法是b(?:m|(?R))*e 其中:

b 是构造的开始,m 是可以出现在构造中间的,而 e 是可以出现的构造结束

b is what begins the construct, m is what can occur in the middle of the construct, and e is what can occur at the end of the construct

<小时>

我想提取以下外部大括号的匹配项:

"{1, {2, 3}} {4, 5}"
["1, {2, 3}", "4, 5"]  # desired

请注意,对于大括号,这很容易做到:

Note that this is easy to do the same for inner braces:

re.findall(r"{([^{}]*)}", "{1, {2, 3}} {4, 5}")
['2, 3', '4, 5']

(在我的示例中,我使用的是 finditer(匹配对象),参见 这里.)

所以我曾希望以下或一些变体能够奏效:

So I had hoped that the following, or some variation, would work:

regex.findall(r"{(:[^{}]*|?R)}", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:[^{}]*|?R)})", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:.*|(?R))*})", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:.*)|(?R)*})", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:[^{}])|(?R)})", "{1, {2, 3}} {4, 5}")

但我被 [] 或 错误:太多回溯所困扰.

but I'm scuppered by either [] or error: too much backtracking.

是否可以使用正则表达式的递归提取外括号的匹配对象?

显然,我冒着被击落的风险:

Obviously, I run the risk of being shot down with:

我想强调这是关于如何使用递归模式(如果我的理解是正确的,这会将我们带到常规语言解析之外,所以实际上可能是可能的!).如果可以的话,这应该是一个更干净的解决方案.

I want to emphasis this is about how to use the recursive pattern (which if my understanding is correct, takes us outside of regular language parsing, so may can actually be possible!). If it can be done, this ought to be a cleaner solution.

推荐答案

模式是:

{((?>[^{}]+|(?R))*)}

您可以看到这适用于您的示例:

You can see this works for your example:

regex.findall("{((?>[^{}]+|(?R))*)}", "{1, {2, 3}} {4, 5}")
# ['1, {2, 3}', '4, 5']

说明:

m 部分需要排除括号.如果您希望同时允许 [^{}] 的量词并在没有灾难性回溯问题的情况下重复该组,则需要使用原子组.更清楚的是,如果最后一个大括号丢失,这个正则表达式引擎将逐个原子组而不是逐个字符地回溯原子组.为了说明这一点,您可以像这样使量词具有所有格:{((?>[^{}]+|(?R))*+)}(或 {((?:[^{}]+|(?R))*+)} 因为原子组不再有用.

Explanation:

The m part needs to exclude the brackets. The use of an atomic group is needed if you want at the same time to allow a quantifier for [^{}] and to repeat the group without catastropic backtracking problems. To be more clear, if the last closing curly bracket is missing this regex engine will backtrack atomic group by atomic group instead of character by character. To drive home this point, you can make the quantifier possessive like that: {((?>[^{}]+|(?R))*+)} (or {((?:[^{}]+|(?R))*+)} since the atomic group is no more useful).

原子群(?>....) 和所有格量词?+, *+, ++ 是同一特征的两个方面.此功能禁止正则表达式引擎在成为原子"的字符组内回溯(您无法将其分成更小的部分).

The atomic group (?>....) and the possessive quantifier ?+, *+, ++ are the two sides of the same feature. This feature forbids the regex engine to backtrack inside the group of characters that becomes an "atom" (something you can't divide in smaller parts).

基本示例是以下两种对于字符串 aaaaaaaaaab 总是失败的模式:

The basic examples are the following two patterns that always fail for the string aaaaaaaaaab:

(?>a+)ab
a++ab

即:

regex.match("a++ab", "aaaaaaaaaab")
regex.match("(?>a+)ab", "aaaaaaaaaab")

当您使用 (?:a+)a+ 时,正则表达式引擎(默认情况下)会记录(预先设定)所有字符的所有回溯位置.但是当您使用原子组或所有格量词时,不再记录这些回溯位置(组的开头除外).所以当回溯机制发生时,最后一个a"字符不能被返回.只能返还整个组.

When you use (?:a+) or a+ the regex engine (by default) records (in prevision) all backtracking positions for all characters. But when you use an atomic group or a possessive quantifier, theses backtracking positions are no more recorded (except for the begining of the group). So when the backtracking mechanism occurs the last "a" character can't be given back. Only the entire group can be given back.

:如果您使用展开"子模式来描述括号之间的内容,则可以以更有效的方式编写模式:

: the pattern can be written in a more efficient way if you use an "unrolled" subpattern to describe the content between brackets:

{([^{}]*+(?:(?R)[^{}]*)*+)}

这篇关于正则表达式中的递归模式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆