什么是正则表达式平衡组? [英] What are regular expression Balancing Groups?

查看:26
本文介绍了什么是正则表达式平衡组?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我刚刚读了一个关于如何在双花括号内获取数据的问题(这个问题),然后有人提出了平衡组.我仍然不太确定它们是什么以及如何使用它们.

我通读了平衡组定义,但解释很难理解,我对我提到的问题仍然很困惑.

有人能简单解释一下什么是平衡组以及它们有什么用处吗?

解决方案

据我所知,平衡组是 .NET 的正则表达式所独有的.

旁白:重复组

首先,您需要知道 .NET 是(再次,据我所知)唯一的正则表达式风格,可以让您访问单个捕获组的多个捕获(不是在反向引用中,而是在匹配完成后).

用一个例子来说明这一点,考虑模式

(.)+

和字符串"abcd".

在所有其他正则表达式中,捕获组 1 只会产生一个结果:d(注意,完整匹配当然是 abcd 正如预期的那样).这是因为捕获组的每次新使用都会覆盖之前的捕获.

另一方面,.NET 会记住它们.它是在堆栈中进行的.匹配上面的正则表达式后像

Match m = new Regex(@"(.)+").Match("abcd");

你会发现

m.Groups[1].Captures

是一个CaptureCollection,其元素对应四个捕获

0: "a"1:乙"2:c"3:d"

其中数字是 CaptureCollection 的索引.所以基本上每次再次使用该组时,都会将一个新的捕获推入堆栈.

如果我们使用命名的捕获组,它会变得更有趣.因为 .NET 允许重复使用相同的名称,所以我们可以编写像

这样的正则表达式

(?w+)W+(?w+)

将两个单词捕获到同一组中.同样,每次遇到具有特定名称的组时,都会将捕获推入其堆栈.因此,将此正则表达式应用于输入 "foo bar" 并检查

m.Groups["word"].Captures

我们发现了两个捕获

0: "foo"1:酒吧"

这使我们甚至可以将表达式的不同部分的内容推送到单个堆栈中.但是,这只是 .NET 能够跟踪此 CaptureCollection 中列出的多个捕获的功能.但我说,这个集合是一个堆栈.那么我们可以弹出东西吗?

输入:平衡组

事实证明我们可以.如果我们使用像 (?<-word>...) 这样的组,那么如果子表达式 ... 匹配.所以如果我们把之前的表达式改成

(?w+)W+(?<-word>w+)

然后第二组会弹出第一组的捕获,最后我们会收到一个空的CaptureCollection.当然,这个例子没什么用.

但是减号语法还有一个细节:如果堆栈已经为空,则组失败(无论其子模式如何).我们可以利用这种行为来计算嵌套级别 - 这就是名称平衡组的来源(以及它变得有趣的地方).假设我们要匹配括号正确的字符串.我们将每个左括号压入堆栈,并为每个右括号弹出一个捕获.如果我们遇到一个右括号太多,它会尝试弹出一个空栈并导致模式失败:

^(?:[^()]|(?[(])|(?<-Open>[)]))*$

所以我们在重复中有三个选择.第一个选择消耗所有不是括号的东西.第二个选择匹配 (s,同时将它们压入堆栈.第三个选择匹配 )s 同时从堆栈中弹出元素(如果可能!).

注意:只是为了澄清,我们只是检查没有不匹配的括号!这意味着根本不包含括号的字符串匹配,因为它们在语法上仍然有效(在某些需要括号匹配的语法中).如果您想确保至少有一组括号,只需在 ^ 之后添加一个前瞻 (?=.*[(]).>

尽管这种模式并不完美(或完全正确).

结局:条件模式

还有一个问题:这并不能确保字符串末尾的堆栈为空(因此 (foo(bar) 将是有效的)..NET(以及许多其他风格) 还有一个结构可以帮助我们解决这个问题:条件模式.一般语法是

(?(condition)truePattern|falsePattern)

其中 falsePattern 是可选的 - 如果省略它,则 false-case 将始终匹配.条件可以是模式,也可以是捕获组的名称.我将在这里重点讨论后一种情况.如果它是捕获组的名称,则当且仅当该特定组的捕获堆栈不为空时才使用 truePattern.也就是说,像 (?(name)yes|no) 这样的条件模式读取如果 name 匹配并捕获了一些东西(仍然在堆栈上),使用模式yes 否则使用模式 no".

所以在我们上述模式的末尾,我们可以添加类似 (?(Open)failPattern) 的内容,如果 Open-stack不是空的.使模式无条件失败的最简单的方法是 (?!)(一个空的负前瞻).所以我们有了最终的模式:

^(?:[^()]|(?[(])|(?<-Open>[)]))*(?(Open)(?!))$

请注意,这种条件语法本身与平衡组无关,但有必要充分利用它们的功能.

从这里开始,天空就是极限.许多非常复杂的用途是可能的,并且当与其他 .NET-Regex 功能结合使用时会遇到一些问题,例如可变长度的lookbehinds (我必须自己艰难地学习).然而,主要问题始终是:使用这些功能时,您的代码是否仍然可维护?你需要很好地记录它,并确保所有从事它工作的人也知道这些功能.否则,您可能会更好,只需手动逐个字符地遍历字符串并计算整数中的嵌套级别.

附录:(?...) 语法有什么用?

这部分的功劳归于 Kobi(有关详细信息,请参阅下面的回答).

现在有了以上所有内容,我们可以验证字符串是否正确加括号.但是,如果我们实际上可以为所有这些括号的内容获取(嵌套)捕获,那将会更有用.当然,我们可以记住在单独的未清空的捕获堆栈中打开和关闭括号,然后在单独的步骤中根据它们的位置进行一些子字符串提取.

但是 .NET 在这里提供了一个更方便的功能:如果我们使用 (?subPattern),不仅是从堆栈中弹出捕获 B,而且在 B 的弹出捕获和当前组之间的所有内容都被推送到堆栈 A 上.因此,如果我们使用这样的组作为右括号,同时从我们的堆栈中弹出嵌套级别,我们还可以将这对的内容推入另一个堆栈:

^(?:[^()]|(?[(])|(?[)]))*(?(Open)(?!))$

Kobi 提供了这个 现场演示在他的回答中

因此,我们可以将所有这些因素结合起来:

  • 记住任意多个捕获
  • 验证嵌套结构
  • 捕获每个嵌套级别

全部在一个正则表达式中.如果这不令人兴奋... ;)

当我第一次了解它们时我发现它们很有帮助的一些资源:

I was just reading a question about how to get data inside double curly braces (this question), and then someone brought up balancing groups. I'm still not quite sure what they are and how to use them.

I read through Balancing Group Definition, but the explanation is hard to follow, and I'm still quite confused on the questions that I mentioned.

Could someone simply explain what balancing groups are and how they are useful?

解决方案

As far as I know, balancing groups are unique to .NET's regex flavor.

Aside: Repeated Groups

First, you need to know that .NET is (again, as far as I know) the only regex flavor that lets you access multiple captures of a single capturing group (not in backreferences but after the match has completed).

To illustrate this with an example, consider the pattern

(.)+

and the string "abcd".

in all other regex flavors, capturing group 1 will simply yield one result: d (note, the full match will of course be abcd as expected). This is because every new use of the capturing group overwrites the previous capture.

.NET on the other hand remembers them all. And it does so in a stack. After matching the above regex like

Match m = new Regex(@"(.)+").Match("abcd");

you will find that

m.Groups[1].Captures

Is a CaptureCollection whose elements correspond to the four captures

0: "a"
1: "b"
2: "c"
3: "d"

where the number is the index into the CaptureCollection. So basically every time the group is used again, a new capture is pushed onto the stack.

It gets more interesting if we are using named capturing groups. Because .NET allows repeated use of the same name we could write a regex like

(?<word>w+)W+(?<word>w+)

to capture two words into the same group. Again, every time a group with a certain name is encountered, a capture is pushed onto its stack. So applying this regex to the input "foo bar" and inspecting

m.Groups["word"].Captures

we find two captures

0: "foo"
1: "bar"

This allows us to even push things onto a single stack from different parts of the expression. But still, this is just .NET's feature of being able to track multiple captures which are listed in this CaptureCollection. But I said, this collection is a stack. So can we pop things from it?

Enter: Balancing Groups

It turns out we can. If we use a group like (?<-word>...), then the last capture is popped from the stack word if the subexpression ... matches. So if we change our previous expression to

(?<word>w+)W+(?<-word>w+)

Then the second group will pop the first group's capture, and we will receive an empty CaptureCollection in the end. Of course, this example is pretty useless.

But there's one more detail to the minus-syntax: if the stack is already empty, the group fails (regardless of its subpattern). We can leverage this behavior to count nesting levels - and this is where the name balancing group comes from (and where it gets interesting). Say we want to match strings that are correctly parenthesized. We push each opening parenthesis on the stack, and pop one capture for each closing parenthesis. If we encounter one closing parenthesis too many, it will try to pop an empty stack and cause the pattern to fail:

^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*$

So we have three alternatives in a repetition. The first alternative consumes everything that is not a parenthesis. The second alternative matches (s while pushing them onto the stack. The third alternative matches )s while popping elements from the stack (if possible!).

Note: Just to clarify, we're only checking that there are no unmatched parentheses! This means that string containing no parentheses at all will match, because they are still syntactically valid (in some syntax where you need your parentheses to match). If you want to ensure at least one set of parentheses, simply add a lookahead (?=.*[(]) right after the ^.

This pattern is not perfect (or entirely correct) though.

Finale: Conditional Patterns

There is one more catch: this does not ensure that the stack is empty at the end of the string (hence (foo(bar) would be valid). .NET (and many other flavors) have one more construct that helps us out here: conditional patterns. The general syntax is

(?(condition)truePattern|falsePattern)

where the falsePattern is optional - if it is omitted the false-case will always match. The condition can either be a pattern, or the name of a capturing group. I'll focus on the latter case here. If it's the name of a capturing group, then truePattern is used if and only if the capture stack for that particular group is not empty. That is, a conditional pattern like (?(name)yes|no) reads "if name has matched and captured something (that is still on the stack), use pattern yes otherwise use pattern no".

So at the end of our above pattern we could add something like (?(Open)failPattern) which causes the entire pattern to fail, if the Open-stack is not empty. The simplest thing to make the pattern unconditionally fail is (?!) (an empty negative lookahead). So we have our final pattern:

^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))$

Note that this conditional syntax has nothing per se to do with balancing groups but it's necessary to harness their full power.

From here, the sky is the limit. Many very sophisticated uses are possible and there are some gotchas when used in combination with other .NET-Regex features like variable-length lookbehinds (which I had to learn the hard way myself). The main question however is always: is your code still maintainable when using these features? You need to document it really well, and be sure that everyone who works on it is also aware of these features. Otherwise you might be better off, just walking the string manually character-by-character and counting nesting levels in an integer.

Addendum: What's with the (?<A-B>...) syntax?

Credits for this part go to Kobi (see his answer below for more details).

Now with all of the above, we can validate that a string is correctly parenthesized. But it would be a lot more useful, if we could actually get (nested) captures for all those parentheses' contents. Of course, we could remember opening and closing parentheses in a separate capture stack that is not emptied, and then do some substring extraction based on their positions in a separate step.

But .NET provides one more convenience feature here: if we use (?<A-B>subPattern), not only is a capture popped from stack B, but also everything between that popped capture of B and this current group is pushed onto stack A. So if we use a group like this for the closing parentheses, while popping nesting levels from our stack, we can also push the pair's content onto another stack:

^(?:[^()]|(?<Open>[(])|(?<Content-Open>[)]))*(?(Open)(?!))$

Kobi provided this Live-Demo in his answer

So taking all of these things together we can:

  • Remember arbitrarily many captures
  • Validate nested structures
  • Capture each nesting level

All in a single regular expression. If that's not exciting... ;)

Some resources that I found helpful when I first learned about them:

这篇关于什么是正则表达式平衡组?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆