是否可以在不使用递归或平衡组的情况下将嵌套括号与正则表达式匹配? [英] Is it possible to match nested brackets with a regex without using recursion or balancing groups?

查看:17
本文介绍了是否可以在不使用递归或平衡组的情况下将嵌套括号与正则表达式匹配?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

问题:匹配任意嵌套的括号组,例如 Java 的 java.util.regex 既不支持递归也不支持平衡组.即匹配三个外部组:

The problem: Match an arbitrarily nested group of brackets in a flavour of regex such as Java's java.util.regex that supports neither recursion nor balancing groups. I.e., match the three outer groups in:

(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) ((((((((第三)))))))

(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))

这个练习纯粹是学术性的,因为我们都知道不应该使用正则表达式来匹配这些东西,就像 Q-tips 不应该用于清洁耳朵.

This exercise is purely academic, since we all know that regular expressions are not supposed to be used to match these things, just as Q-tips are not supposed to be used to clean ears.

Stack Overflow 鼓励自我回答问题,所以我决定创建这篇文章来分享我最近发现的东西.

推荐答案

确实如此!可以使用前向引用:

(?=()(?:(?=.*?((?!.*?1)(.*)(?!.*2).*))(?=.*?)(?!.*?2)(.*)).)+?.*?(?=1)[^(]*(?=2$)

证明

等等;就是这样.从头到尾匹配一整组嵌套括号.每个匹配的两个子字符串必须被捕获和保存;这些对你没用.只关注主要比赛的结果.

Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.

不,深度没有限制.不,那里没有隐藏的递归构造.只是简单的环顾四周,带有一些前向引用.如果你的风格不支持前向引用(我在看着你,JavaScript),那么我很抱歉.我真的是.我希望我能帮助你,但我不是一个奇怪的奇迹创造者.

No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.

好的,这就是交易.我们能够匹配这些外部组的原因是因为它们不重叠.一旦我们想要的比赛开始重叠,我们就必须稍微调整我们的策略.我们仍然可以检查主题中是否有正确平衡的括号组.但是,我们需要使用捕获组保存它们,而不是完全匹配它们,如下所示:

OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:

(?=()(?=((?:(?=.*?((?!.*?2)(.*)(?!.*3).*))(?=.*?)(?!.*?3)(.*)).)+?.*?(?=2)[^(]*(?=3$))) 

与之前的表达式完全相同,除了我将它的大部分包装在前瞻中以避免消耗字符,添加了一个捕获组,并调整了反向引用索引,以便它们与新朋友一起玩.现在表达式在下一个括号组之前的位置匹配,并且感兴趣的子串被保存为 1.

Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as 1.

我很高兴你这么问.一般方法非常简单:一次遍历一个字符,同时匹配下一次出现的 '(' 和 ')',在每种情况下捕获字符串的其余部分,以便确定在下一次迭代.让我一块一块地分解:

I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:

<头>
注意组件说明
(?=()在做任何艰苦的工作之前确保'('跟在后面.
(?:用于遍历字符串的组的开始,因此以下前瞻重复匹配.
处理'('(?=这个前瞻处理寻找下一个'('.
.*?((?!.*?1)匹配直到下一个 '(' 后面没有 1.下面,你会看到 1 填充了整个部分最后一个 '(' 匹配的字符串.所以 (?!.*?1) 确保我们不会再次匹配相同的 '('
(.*)(?!.*2).*)用字符串的其余部分填充 1.同时,检查是否至少有另一个)"出现.这是一个 PCRE 创可贴,用于克服在前瞻中捕获组的错误.
)
处理')'(?=这个前瞻处理寻找下一个')'
.*?)(?!.*?2)匹配直到下一个没有2的')'.与之前的 '(' 匹配一样,这会强制匹配之前未匹配过的 ')'.
(.*)用字符串的其余部分填充 2.上面提到的错误在这里不适用,所以简单的表达就足够了.
)
.消耗单个字符,以便该组可以继续匹配.使用字符是安全的,因为在新的匹配点之前不可能出现下一个 '(' 或 ')'.
)+?尽可能少地匹配,直到找到一个平衡的组.这是通过以下检查验证的
最终验证.*?(?=1)匹配并包含最后一个 '(' 找到.
[^(]*(?=2$)然后匹配直到找到最后一个 ')' 的位置,确保我们不会在此过程中遇到另一个 '('(这意味着一个不平衡的组).
Note Component Description
(?=() Make sure '(' follows before doing any hard work.
(?: Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '(' (?= This lookahead deals with finding the next '('.
.*?((?!.*?1) Match up until the next '(' that is not followed by 1. Below, you'll see that 1 is filled with the entire part of the string following the last '(' matched. So (?!.*?1) ensures we don't match the same '(' again
(.*)(?!.*2).*) Fill 1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads.
)
Handle ')' (?= This lookahead deals with finding the next ')'
.*?)(?!.*?2) Match up until the next ')' that is not followed by 2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
(.*) Fill 2 with the rest of the string. The above.mentioned bug is not applicable here, so a simple expression is sufficient.
)
. Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.
)+? Match as few times as possible until a balanced group has been found. This is validated by the following check
Final validation .*?(?=1) Match up to and including the last '(' found.
[^(]*(?=2$) Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).

结论

所以,你有它.一种使用前向引用与标准(扩展)正则表达式功能匹配平衡嵌套结构的方法 - 无递归或平衡组.它效率不高,当然也不漂亮,但这是可能的.而且以前从来没有这样做过.这对我来说非常令人兴奋.

Conclusion

So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regular expression features - no recursion or balanced groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.

我知道你们很多人使用正则表达式来完成并帮助其他用户完成更简单、更实用的任务,但是如果有人和我一样兴奋地用正则表达式推动可能性的极限,那么我很想听听你的意见.如果有兴趣,我有其他类似的材料要发布.

这篇关于是否可以在不使用递归或平衡组的情况下将嵌套括号与正则表达式匹配?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆