Balancing groups in variable-length lookbehind

Question

TL;DR: Using capturing groups (and in particular balancing groups) inside .NET's lookbehinds changes the obtained captures, although it shouldn't make a difference. What is it about .NET's lookbehinds that breaks the expected behavior?

I was trying to come up with an answer to this other question, as an excuse to play around with .NET's balancing groups. However, I cannot get them to work inside a variable-length lookbehind.

First of all, note that I do not intend to use this particular solution productively. It's more for academic reasons, because I feel that there is something going on with the variable-length lookbehind which I am not aware of. And knowing that could come in handy in the future, when I actually need to use something like this to solve a problem.

Consider this input:

~(a b (c) d (e f (g) h) i) j (k (l (m) n) p) q

The goal is to match all letters that are inside parentheses preceded by ~, no matter how deeply nested (so everything from a to i). My attempt was to check for the correct position in a lookbehind, so that I can get all letters in a single call to Matches. Here is my pattern:

(?<=~[(](?:[^()]*|(?<Depth>[(])|(?<-Depth>[)]))*)[a-z]

In the lookbehind I try to find a ~(, and then I use the named group stack Depth to count extraneous opening parentheses. As long as the parenthesis opened in ~( is never closed, the lookbehind should match. If the closing parenthesis to that is reached, (?<-Depth>...) cannot pop anything from the stack and the lookbehind should fail (that is, for all letters from j). Unfortunately, this does not work. Instead, I match a, b, c, e, f, g and m. So only these:

~(a b (c) _ (e f (g) _) _) _ (_ (_ (m) _) _) _

That seems to mean that the lookbehind cannot match anything once I have closed a single parenthesis, unless I go back down to the highest nesting level I have been to before.

Okay, this could just mean there is something odd with my regular expression, or I did not understand the balancing groups properly. But then I tried this without the lookbehind. I created a string for every letter like this:

~(z b (c) d (e f (x) y) g) h (i (j (k) l) m) n
~(a z (c) d (e f (x) y) g) h (i (j (k) l) m) n
~(a b (z) d (e f (x) y) g) h (i (j (k) l) m) n
....
~(a b (c) d (e f (x) y) g) h (i (j (k) l) z) n
~(a b (c) d (e f (x) y) g) h (i (j (k) l) m) z

And used this pattern on each of those:

~[(](?:[^()]*|(?<Depth>[(])|(?<-Depth>[)]))*z

As desired, every case where z replaces a letter between a and i matches, and every case after that fails.
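As a cross-check that does not depend on any regex engine, the forward pattern above amounts to a plain depth counter. The following is a minimal sketch in Python, not .NET; the function name `letters_inside_tilde_group` is made up for illustration, and it assumes letters means a-z as in the question.

```python
def letters_inside_tilde_group(text):
    """Collect the a-z letters that sit inside a group opened by '~('."""
    found = []
    depth = 0                      # 0 means we are outside any '~(' group
    i = 0
    while i < len(text):
        ch = text[i]
        if depth == 0:
            if text.startswith("~(", i):
                depth = 1          # entered the group opened by '~('
                i += 2
                continue
        elif ch == "(":
            depth += 1             # nested opening parenthesis
        elif ch == ")":
            depth -= 1             # dropping to 0 closes the '~(' group
        elif "a" <= ch <= "z":
            found.append(ch)
        i += 1
    return found


sample = "~(a b (c) d (e f (g) h) i) j (k (l (m) n) p) q"
print(letters_inside_tilde_group(sample))
```

On the sample input this yields the letters a through i, which is the result the lookbehind version was meant to produce.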

So what does the (variable-length) lookbehind do that breaks this use of balancing groups? I tried to research this all evening (and found pages like this one), but I could not find a single use of this in a lookbehind.

I would also be glad, if someone could link me to some in-depth information about how the .NET regex engine handles .NET-specific features internally. I found this amazing article, but it does not seem to go into (variable-length) lookbehinds, for instance.

Recommended answer

I think I got it.
First, as I mentioned in one of the comments, (?<=(?<A>.)(?<-A>.)) never matches.
But then I thought, what about (?<=(?<-A>.)(?<A>.))? It does match!
And how about (?<=(?<A>.)(?<A>.))? Matched against "12", A captures "1", and if we look at the Captures collection, it is {"2", "1"}: first two, then one. It is reversed.
So, while inside a lookbehind, .NET matches and captures from right to left.
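To see why right-to-left processing produces exactly the matches reported in the question, the original lookbehind can be mimicked by hand. The sketch below is a Python simulation under that assumption, not the .NET engine: it walks leftwards from each letter, pushes on "(" and pops on ")", and fails as soon as a pop hits an empty stack. The function name `lookbehind_right_to_left` is hypothetical.

```python
def lookbehind_right_to_left(text, pos):
    """Mimic the question's lookbehind at `pos`, scanning leftwards."""
    depth = 0                      # size of the simulated Depth stack
    j = pos
    while j > 0:
        if j >= 2 and text[j - 2:j] == "~(":
            return True            # reached '~(' with everything consumed
        ch = text[j - 1]
        if ch == "(":
            depth += 1             # (?<Depth>[(]) pushes
        elif ch == ")":
            if depth == 0:
                return False       # (?<-Depth>[)]) has nothing to pop
            depth -= 1
        j -= 1                     # any other character is just consumed
    return False


sample = "~(a b (c) d (e f (g) h) i) j (k (l (m) n) p) q"
matched = [c for i, c in enumerate(sample)
           if "a" <= c <= "z" and lookbehind_right_to_left(sample, i)]
print(matched)
```

Scanning leftwards, a ")" is always met before its "(", so the pop fails unless enough "(" were already seen to the right of it. On the sample this leaves a, b, c, e, f, g and m, the same set the question observed.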

Now, how can we make it capture from left to right? This is quite simple, really - we can trick the engine using a lookahead:

(?<=(?=(?<A>.)(?<A>.))..)

Applied to your original pattern, the simplest option I came up with was:

(?<=
    ~[(]
    (?=
        (?:
            [^()]
            |
            (?<Depth>[(])
            |
            (?<-Depth>[)])
        )*
        (?<=(\k<Prefix>))   # Make sure we matched until the current position
    )
    (?<Prefix>.*)           # This is captured BEFORE getting to the lookahead
)
[a-z]

The challenge here was that the balanced part may now end anywhere, so we make it reach all the way to the current position. (An anchor like \G or \Z that refers to the position where the lookbehind started would be useful here, but I don't think .NET has one.)

It is quite possible that this behavior is documented somewhere; I'll try to look it up.

Here's another approach. The idea is simple: .NET wants to match from right to left? Fine! Take that:
(Tip: start reading from the bottom, since that is how .NET does it.)

(?<=
    (?(Depth)(?!))  # 4. Finally, make sure there are no extra closed parentheses.
    ~\(
    (?>                     # (non backtracking)
        [^()]               # 3. Allow any other character
        |
        \( (?<-Depth>)?     # 2. When seeing an open paren, decrease depth.
                            #    Also allow excess parentheses: '~((((((a' is OK.
        |
        (?<Depth>  \) )     # 1. When seeing a closed paren, add to depth.
    )*
)
\w                          # Match your letter
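The numbered steps in this pattern can be traced with a small hand simulation. This is again a Python sketch rather than the .NET engine, assuming the pattern is evaluated right to left as described; the name `reversed_lookbehind` is hypothetical, and letters are taken as a-z as in the question rather than the broader \w.

```python
def reversed_lookbehind(text, pos):
    """Mimic the reversed-role lookbehind at `pos`, scanning leftwards."""
    depth = 0                      # count of ')' not yet matched by a '('
    j = pos
    while j > 0:
        if j >= 2 and text[j - 2:j] == "~(" and depth == 0:
            return True            # step 4: '~(' reached, no unmatched ')'
        ch = text[j - 1]
        if ch == ")":
            depth += 1             # step 1: closing paren adds to depth
        elif ch == "(" and depth > 0:
            depth -= 1             # step 2: opening paren decreases depth
        # step 3: anything else, and any excess '(', is simply consumed
        j -= 1
    return False


sample = "~(a b (c) d (e f (g) h) i) j (k (l (m) n) p) q"
matched = [c for i, c in enumerate(sample)
           if "a" <= c <= "z" and reversed_lookbehind(sample, i)]
print(matched)
```

With the push and pop roles swapped to suit the leftward scan, the simulation accepts a through i and rejects everything from j onwards, since by then the ")" that closes the ~( group is left unmatched.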
