Capturing <thisPartOnly> and (thisPartOnly) with the same group
Question
Let's say we have the following input:
<amy>
(bob)
<carol)
(dean>
We also have the following regex:
<(\w+)>|\((\w+)\)
Now we get two matches (as seen on rubular.com):
- <amy> is a match; \1 captures amy, \2 fails
- (bob) is a match; \2 captures bob, \1 fails
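The behaviour described above can be reproduced in `java.util.regex` with a minimal sketch like the one below. The class name `TwoGroupDemo` and the helper `describe` are made up for illustration; only the regex and the input come from the question. A group that did not take part in a match is reported by Java as `null`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex and the sample input are from the question.
public class TwoGroupDemo {
    // One alternative (and one capturing group) per bracket type.
    static final Pattern P = Pattern.compile("<(\\w+)>|\\((\\w+)\\)");

    // Returns one descriptive line per match found in the input.
    static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            // group(n) is null when group n did not participate in this match.
            out.add(m.group() + " -> \\1=" + m.group(1) + ", \\2=" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        describe("<amy>\n(bob)\n<carol)\n(dean>").forEach(System.out::println);
    }
}
```

Only `<amy>` and `(bob)` are reported. The mixed-bracket lines produce no match, and the captured name sits in a different group depending on which alternative matched.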
This regex does most of what we want, namely:
- It matches the opening and closing brackets properly (i.e. no mixing)
- It captures the part that we're interested in
However, it does have a few drawbacks:
- The capturing pattern (i.e. the "main" part) is repeated
  - It's only \w+ in this case, but generally speaking this can be quite complex
  - If it involves backreferences, then they must be renumbered for each alternate!
  - Repetition makes maintenance a nightmare! (what if it changes?)
- Depending on which alternate matches, we must query different groups
  - It's only \1 or \2 in this case, but generally the "main" part can have capturing groups of its own!
So the question is obvious: how can we do this without repeating the "main" pattern?
Note: for the most part I'm interested in the java.util.regex flavor, but other flavors are welcome.
Appendix
There's nothing new in this section; it only illustrates the problem mentioned above with an example.
Let's take the above example to the next step: we now want to match these:
<amy=amy>
(bob=bob)
[carol=carol]
But not these:
<amy=amy)   # non-matching bracket
<amy=bob>   # left hand side not equal to right hand side
Using the alternate technique, we have the following that works (as seen on rubular.com):
<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\]
As explained above:
- The main pattern can't simply be repeated; backreferences must be renumbered
- Repetition also means a maintenance nightmare if it ever changes
- Depending on which alternate matches, we must query either \1 \2, \3 \4, or \5 \6
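To make the last point concrete, here is a rough Java sketch of what the caller ends up writing with the alternation regex above. The class `AlternationDemo` and the method `name` are illustrative names, not part of the question; the loop over groups 2, 4 and 6 is exactly the bookkeeping the question would like to avoid.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the regex is the one from the appendix.
public class AlternationDemo {
    static final Pattern P = Pattern.compile(
        "<((\\w+)=\\2)>|\\(((\\w+)=\\4)\\)|\\[((\\w+)=\\6)\\]");

    // Returns the captured name, or null if the whole input does not match.
    static String name(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // The name lands in group 2, 4 or 6 depending on which alternative matched.
        for (int g = 2; g <= 6; g += 2) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] inputs = {"<amy=amy>", "(bob=bob)", "[carol=carol]", "<amy=amy)", "<amy=bob>"};
        for (String s : inputs) {
            System.out.println(s + " -> " + name(s));
        }
    }
}
```

Adding a fourth bracket type here means adding another alternative, renumbering its backreference to \8, and widening the loop.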
Recommended answer
You can use a lookahead to "lock in" the group number before doing the real match.
String s = "<amy=amy>(bob=bob)[carol=carol]";
Pattern p = Pattern.compile(
    "(?=[<(\\[]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\])");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.printf("found %s in %s%n", m.group(2), m.group());
}
Output:
found amy in <amy=amy>
found bob in (bob=bob)
found carol in [carol=carol]
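Because the lookahead captures the "main" part once as group 1, the regex can also be assembled from a single copy of the main pattern plus a list of bracket pairs. The sketch below is one possible way to do that, not something from the answer itself: the class `LockedGroupDemo` and the helpers `build` and `names` are hypothetical, and the sketch assumes the main pattern numbers its own groups starting at 2 (group 1 is the wrapper added by `build`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating the lookahead "lock in" idea.
public class LockedGroupDemo {
    // Builds (?=(?:open1|open2|...)(main))(?:open1\1close1|open2\1close2|...).
    // Assumption: groups inside `main` are numbered from 2, since group 1 wraps it.
    static Pattern build(String main, String[][] pairs) {
        StringBuilder openers = new StringBuilder();
        StringBuilder alternatives = new StringBuilder();
        for (String[] pair : pairs) {
            if (openers.length() > 0) {
                openers.append('|');
                alternatives.append('|');
            }
            // Pattern.quote makes the bracket characters literal.
            openers.append(Pattern.quote(pair[0]));
            alternatives.append(Pattern.quote(pair[0]))
                        .append("\\1")
                        .append(Pattern.quote(pair[1]));
        }
        return Pattern.compile(
            "(?=(?:" + openers + ")(" + main + "))(?:" + alternatives + ")");
    }

    // Collects group 2 (the name) from every match; the group number never changes.
    static List<String> names(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] pairs = {{"<", ">"}, {"(", ")"}, {"[", "]"}};
        Pattern p = build("(\\w+)=\\2", pairs);
        // The mixed-bracket and unequal-sides inputs at the end are skipped.
        System.out.println(names(p, "<amy=amy>(bob=bob)[carol=carol]<amy=amy)<amy=bob>"));
    }
}
```

With this arrangement, supporting another bracket type is one more entry in `pairs`; the main pattern and its backreference stay untouched.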
It's still ugly as hell, but you don't have to recalculate all the group numbers every time you make a change. For example, to add support for curly brackets, it's just:
"(?=[<(\\[{]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\]|\\{\\1\\})"