What are regular expression Balancing Groups?

Question

I was just reading a question about how to get data inside double curly braces (this question), and then someone brought up balancing groups. I'm still not quite sure what they are and how to use them.

I read through Balancing Group Definition, but the explanation is hard to follow, and I'm still quite confused on the questions that I mentioned.

Could someone simply explain what balancing groups are and how they are useful?

Recommended answer

As far as I know, balancing groups are unique to .NET's regex flavor.

First, you need to know that .NET is (again, as far as I know) the only regex flavor that lets you access multiple captures of a single capturing group (not in backreferences but after the match has completed).

To illustrate this with an example, consider the pattern

(.)+

and the string "abcd".

In most other regex flavors, capturing group 1 will simply yield one result: "d" (note that the full match will of course be "abcd", as expected). This is because every new use of the capturing group overwrites the previous capture.

.NET on the other hand remembers them all. And it does so in a stack. After matching the above regex like

Match m = new Regex(@"(.)+").Match("abcd");

you will find that

m.Groups[1].Captures

is a CaptureCollection whose elements correspond to the four captures

0: "a"
1: "b"
2: "c"
3: "d"

where the number is the index into the CaptureCollection. So basically every time the group is used again, a new capture is pushed onto the stack.
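The difference between the last capture and the full capture list can be checked with a small console program. This is a minimal sketch, assuming a .NET 5+ project that allows top-level statements; the variable names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

Match m = new Regex(@"(.)+").Match("abcd");

// Group.Value only exposes the most recent capture of the group.
Console.WriteLine(m.Groups[1].Value);              // d

// Group.Captures lists every capture, oldest first.
string[] captures = m.Groups[1].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToArray();
Console.WriteLine(string.Join(", ", captures));    // a, b, c, d
```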

It gets more interesting if we are using named capturing groups. Because .NET allows repeated use of the same name we could write a regex like

(?<word>\w+)\W+(?<word>\w+)

to capture two words into the same group. Again, every time a group with a certain name is encountered, a capture is pushed onto its stack. So applying this regex to the input "foo bar" and inspecting

m.Groups["word"].Captures

we find two captures

0: "foo"
1: "bar"
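As a quick check of the repeated-name behavior, the following sketch (again assuming top-level statements in a .NET 5+ console project) prints the contents of the shared stack:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

Match m = Regex.Match("foo bar", @"(?<word>\w+)\W+(?<word>\w+)");

// Both groups named "word" push onto the same capture stack.
string[] words = m.Groups["word"].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToArray();
Console.WriteLine(string.Join(", ", words));   // foo, bar
```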

This allows us to even push things onto a single stack from different parts of the expression. But still, this is just .NET's feature of being able to track multiple captures which are listed in this CaptureCollection. But I said, this collection is a stack. So can we pop things from it?

It turns out we can. If we use a group like (?<-word>...), then the last capture is popped from the stack word if the subexpression ... matches. So if we change our previous expression to

(?<word>\w+)\W+(?<-word>\w+)

Then the second group will pop the first group's capture, and we will receive an empty CaptureCollection in the end. Of course, this example is pretty useless.
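The popping behavior can be observed directly. In this small sketch (assuming top-level statements in a .NET 5+ console project), the overall match still succeeds, but the word stack ends up empty:

```csharp
using System;
using System.Text.RegularExpressions;

Match m = Regex.Match("foo bar", @"(?<word>\w+)\W+(?<-word>\w+)");

Console.WriteLine(m.Success);                         // True
Console.WriteLine(m.Value);                           // foo bar

// The second group popped "foo", so nothing is left on the stack.
Console.WriteLine(m.Groups["word"].Captures.Count);   // 0
```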

But there's one more detail to the minus-syntax: if the stack is already empty, the group fails (regardless of its subpattern). We can leverage this behavior to count nesting levels - and this is where the name balancing group comes from (and where it gets interesting). Say we want to match strings that are correctly parenthesized. We push each opening parenthesis on the stack, and pop one capture for each closing parenthesis. If we encounter one closing parenthesis too many, it will try to pop an empty stack and cause the pattern to fail:

^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*$

So we have three alternatives in a repetition. The first alternative consumes everything that is not a parenthesis. The second alternative matches ( characters while pushing them onto the stack. The third alternative matches ) characters while popping elements from the stack (if possible!).

Note: Just to clarify, we're only checking that there are no unmatched parentheses! This means that strings containing no parentheses at all will match, because they are still syntactically valid (in some syntax where you need your parentheses to match). If you want to ensure at least one set of parentheses, simply add a lookahead (?=.*[(]) right after the ^.
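To see what this first version accepts and rejects, here is a minimal sketch (assuming top-level statements in a .NET 5+ console project; the variable name and the sample inputs are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

var noUnmatchedClose = new Regex(@"^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*$");

string[] inputs = { "(foo)(bar)", "no parentheses", "foo)", "(foo(bar)" };
foreach (string input in inputs)
{
    // A surplus ")" tries to pop an empty stack, which makes the group fail.
    Console.WriteLine($"{input} -> {noUnmatchedClose.IsMatch(input)}");
}
```

The first two inputs match and "foo)" does not. Note that "(foo(bar)" also matches here, because nothing in this pattern looks at what is left on the Open stack once the end of the string is reached.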

This pattern is not perfect (or entirely correct) though.

There is one more catch: this does not ensure that the stack is empty at the end of the string (hence (foo(bar) would be valid). .NET (and many other flavors) have one more construct that helps us out here: conditional patterns. The general syntax is

(?(condition)truePattern|falsePattern)

where the falsePattern is optional - if it is omitted the false-case will always match. The condition can either be a pattern, or the name of a capturing group. I'll focus on the latter case here. If it's the name of a capturing group, then truePattern is used if and only if the capture stack for that particular group is not empty. That is, a conditional pattern like (?(name)yes|no) reads "if name has matched and captured something (that is still on the stack), use pattern yes otherwise use pattern no".
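A conditional on a group name is easiest to see in isolation. The following sketch is not from the original answer: the pattern and the group name quote are made up for illustration. It requires a closing double quote only if an opening one was captured (assuming top-level statements in a .NET 5+ console project):

```csharp
using System;
using System.Text.RegularExpressions;

// In a verbatim string, "" denotes a single double-quote character.
// Effective pattern: ^(?<quote>")?\w+(?(quote)")$
var optionallyQuoted = new Regex(@"^(?<quote>"")?\w+(?(quote)"")$");

string[] inputs = { "\"foo\"", "foo", "\"foo", "foo\"" };
foreach (string input in inputs)
{
    Console.WriteLine($"{input} -> {optionallyQuoted.IsMatch(input)}");
}
```

Only the first two inputs match. The false branch is omitted, so when no opening quote was captured the conditional matches the empty string and the trailing quote in the last input is left unconsumed.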

So at the end of our above pattern we could add something like (?(Open)failPattern) which causes the entire pattern to fail, if the Open-stack is not empty. The simplest thing to make the pattern unconditionally fail is (?!) (an empty negative lookahead). So we have our final pattern:

^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))$
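The final pattern can be tried out as follows. This is a minimal sketch, assuming top-level statements in a .NET 5+ console project; the sample inputs are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

var balanced = new Regex(@"^(?:[^()]|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))$");

string[] inputs = { "(foo(bar))", "no parentheses", "(foo(bar)", "foo)", ")(" };
foreach (string input in inputs)
{
    // (?(Open)(?!)) forces a failure if any "(" is still on the stack.
    Console.WriteLine($"{input} -> {balanced.IsMatch(input)}");
}
```

Only the first two inputs match: the conditional now rejects "(foo(bar)", while the empty-stack rule rejects "foo)" and ")(".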

Note that this conditional syntax has nothing per se to do with balancing groups but it's necessary to harness their full power.

From here, the sky is the limit. Many very sophisticated uses are possible, and there are some gotchas when balancing groups are combined with other .NET regex features like variable-length lookbehinds (which I had to learn the hard way myself: http://stackoverflow.com/questions/13389560/balancing-groups-in-variable-length-lookbehind). The main question, however, is always: is your code still maintainable when using these features? You need to document it really well, and be sure that everyone who works on it is also aware of these features. Otherwise you might be better off just walking the string manually character by character and counting nesting levels in an integer.
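For comparison, the manual alternative mentioned above might look roughly like this. IsBalanced is a hypothetical helper name, not part of any library, and the sketch assumes top-level statements in a .NET 5+ console project.

```csharp
using System;

// Hypothetical helper: counts nesting depth in a plain integer.
bool IsBalanced(string input)
{
    int depth = 0;
    foreach (char c in input)
    {
        if (c == '(')
        {
            depth++;
        }
        else if (c == ')' && --depth < 0)
        {
            return false;   // closing parenthesis without a matching opener
        }
    }
    return depth == 0;      // every opener must have been closed
}

Console.WriteLine(IsBalanced("(foo(bar))"));   // True
Console.WriteLine(IsBalanced("(foo(bar)"));    // False
Console.WriteLine(IsBalanced("foo)"));         // False
```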

Credits for this part go to Kobi (see his answer below for more details).

Now with all of the above, we can validate that a string is correctly parenthesized. But it would be a lot more useful, if we could actually get (nested) captures for all those parentheses' contents. Of course, we could remember opening and closing parentheses in a separate capture stack that is not emptied, and then do some substring extraction based on their positions in a separate step.

But .NET provides one more convenience feature here: if we use (?<A-B>subPattern), not only is a capture popped from stack B, but also everything between that popped capture of B and this current group is pushed onto stack A. So if we use a group like this for the closing parentheses, while popping nesting levels from our stack, we can also push the pair's content onto another stack:

^(?:[^()]|(?<Open>[(])|(?<Content-Open>[)]))*(?(Open)(?!))$

Kobi provided this live demo in his answer: http://regexstorm.net/tester?p=%28?%3a%5B%5E%7B%7D%5D%7C%28?%3COpen%3E%7B%29%7C%28?%3CContent-Open%3E%7D%29%29%2b%28?%28Open%29%28?!%29%29&i=0%20%7B1%202%20%7B3%7D%20%7B4%205%20%7B6%7D%7D%207%7D%208
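The Content captures produced by the parenthesis pattern can be inspected like this. It is a minimal sketch, assuming top-level statements in a .NET 5+ console project; the input string is made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var nested = new Regex(
    @"^(?:[^()]|(?<Open>[(])|(?<Content-Open>[)]))*(?(Open)(?!))$");

Match m = nested.Match("(foo(bar)baz)");

// Each ")" pops one Open capture and pushes the text between that
// "(" and the current ")" onto the Content stack.
string[] contents = m.Groups["Content"].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToArray();
Console.WriteLine(string.Join(" | ", contents));   // bar | foo(bar)baz
```

The inner pair is listed first because its closing parenthesis is reached first.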

So taking all of these things together we can:

  • Remember arbitrarily many captures
  • Validate nested structures
  • Capture each nesting level

All in a single regular expression. If that's not exciting... ;)

Some resources that I found helpful when I first learned about them:

  • http://blog.stevenlevithan.com/archives/balancing-groups
  • MSDN on balancing groups
  • MSDN on conditional patterns
  • http://kobikobi.wordpress.com/tag/balancing-group/ (slightly academic, but has some interesting applications)
