Regex only capturing last instance of capture group in match
Problem description
I have the following regular expression in two different languages (JavaScript and Flash), and it produces the same odd result in both. What I want to know is not how to fix it, but why the behavior occurs.
The regular expression:
\[(\\{2}|\\\]|[^\]])*\]
The goal here is to match a bracketed string and to ensure that the match does not stop at an escaped bracket.
If I have the text input `[abcdefg]`, it is matched correctly, but the only thing returned as part of the capture group is `g`, whereas I expect `abcdefg`. If I change the expression to
\[((?:\\{2}|\\\]|[^\]])*)\]
then I get the result that I want.
So why is this happening? Will this be consistent across other languages?
Note: simplifying the expression to `\[([^\]])*\]` produces the same issue.
Recommended answer
Regardless of the problem, ActionScript and JavaScript should always yield the same results, as they both implement ECMAScript (or a superset thereof, but for regular expressions they should not disagree).
But yes, this will happen in any language (or rather, any regex flavor). The reason is that you are repeating the capturing group. Let's take a simpler example: match `(.)*` against `abc`. What we are repeating is `(.)`. The first time it is tried, the engine enters the group, matches `a` with `.`, leaves the group and captures `a`. Only now does the quantifier kick in and repeat the whole thing. So we enter the group again, and match and capture `b`. This capture overwrites the previous one, so `\1` now contains `b`. The same happens on the third repetition: the capture is overwritten with `c`.
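The walkthrough above can be reproduced with a short JavaScript sketch. The variable names are only illustrative; the two patterns are the `(.)*` example from the text and a variant with the quantifier moved inside the capturing group.

```javascript
// Repeated capturing group: each iteration overwrites group 1.
const repeated = /(.)*/.exec("abc");
console.log(repeated[0]); // whole match: "abc"
console.log(repeated[1]); // only the last iteration survives: "c"

// Quantifier moved inside the capturing group: group 1 spans every iteration.
const wrapped = /((?:.)*)/.exec("abc");
console.log(wrapped[1]); // "abc"
```

The overall match is the same in both cases; only the content of group 1 differs.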
I don't know of a regex flavor that behaves differently, and the only one that lets you access all previous captures (instead of just overwriting them) is .NET.
The solution is the one p.s.w.g. proposed. Make the grouping you need for the repetition non-capturing (this will improve performance, because you don't need all that capturing and overwriting anyway) and wrap the whole thing in a new group. Your expression has one little flaw though: you need to include the backslash in the negated character class. Otherwise, backtracking could give you a match in `[abc\]`. So here is an expression that will work as you expect:
\[((?:\\{2}|\\\]|[^\]\\])*)\]
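As a rough check of the backtracking flaw described above, the following JavaScript sketch compares the pattern without the backslash in the negated class against the version that includes it. The names `flawed`, `fixed`, `unterminated` and `escaped` are made up for this illustration.

```javascript
// Negated class without the backslash (flawed) and with it (fixed).
const flawed = /\[((?:\\{2}|\\\]|[^\]])*)\]/;
const fixed = /\[((?:\\{2}|\\\]|[^\]\\])*)\]/;

const unterminated = "[abc\\]";  // the characters [abc\]
const escaped = "[abc\\]def]";   // the characters [abc\]def]

// Flawed: after backtracking, [^\]] consumes the lone backslash,
// so the escaped bracket is wrongly treated as the closing bracket.
console.log(flawed.exec(unterminated)[1]); // abc\

// Fixed: the backslash can only be consumed as part of \\ or \],
// so there is no closing bracket left and the match fails.
console.log(fixed.exec(unterminated)); // null

// Fixed: ordinary and escaped inputs are captured in full.
console.log(fixed.exec("[abcdefg]")[1]); // abcdefg
console.log(fixed.exec(escaped)[1]);     // abc\]def
```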
Working demo. (Unfortunately, it doesn't show the captures, but it shows that the expression gives correct matches in all cases.)
Note that your expression does not allow for other escape sequences. In particular, a single `\` followed by anything but a `]` will cause your pattern to fail. If this is not what you want, you can just use:
\[((?:\\.|[^\]\\])*)\]
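To illustrate the difference, here is a small JavaScript sketch comparing the two patterns on an input containing a backslash followed by the letter `n`. The names `strict`, `lenient` and `input` are hypothetical and exist only for this example.

```javascript
// strict: only \\ and \] are accepted as escape sequences.
const strict = /\[((?:\\{2}|\\\]|[^\]\\])*)\]/;
// lenient: a backslash followed by any character matched by "." is accepted.
const lenient = /\[((?:\\.|[^\]\\])*)\]/;

const input = "[a\\nb]"; // the characters [a\nb] (backslash, then the letter n)

console.log(strict.exec(input));     // null: the lone backslash cannot be consumed
console.log(lenient.exec(input)[1]); // a\nb
```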
Performance can be improved further with the "unrolling-the-loop" technique:
\[([^\]\\]*(?:\\.[^\]\\]*)*)\]
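The sketch below only checks that the unrolled pattern and the alternation-based pattern capture the same text on a handful of inputs; it does not measure performance. The sample strings and the helper `capture` are assumptions made for the illustration.

```javascript
const alternation = /\[((?:\\.|[^\]\\])*)\]/;
const unrolled = /\[([^\]\\]*(?:\\.[^\]\\]*)*)\]/;

// Sample inputs (written as JavaScript string literals, so \\ is one backslash).
const samples = [
  "[abcdefg]",    // plain content
  "[abc\\]def]",  // escaped closing bracket
  "[a\\\\]",      // escaped backslash right before the closing bracket
  "[abc\\]",      // no unescaped closing bracket: neither pattern should match
  "x [a\\nb] y",  // other escape sequence, surrounded by other text
];

// Return the content of group 1, or null when the pattern does not match.
const capture = (re, s) => {
  const m = re.exec(s);
  return m ? m[1] : null;
};

const agree = samples.every((s) => capture(alternation, s) === capture(unrolled, s));
console.log(agree); // true
```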