Java regex: how to back-reference capturing groups in a certain context when their number is not known in advance
Problem description
As an introductory note, I am aware of the old saying about solving problems with regex, and I am also aware of the precautions about processing XML with regex. But please bear with me for a moment...
I am trying to do a regex search and replace on a group of characters. I don't know in advance how often this group will be matched, and I want to search only within a certain context.
An example:
If I have the string "abdfabsdfabfdsaabbb", and I want to search for "ab" and replace it with "@ab@", this works fine using the following regex:
Search regex:
(.*?)(ab)(.*?)
Replacement:
$1@$2@$3
I get four matches in total, as expected. Within each match the group numbers are the same, so the back-references ($1, $2, ...) work fine, too.
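The context-free replacement can be reproduced with a minimal sketch like the one below. It assumes the input is the plain string abdfabsdfabfdsaabbb; the class and method names (SimpleReplaceDemo, markAb) are made up for illustration.

```java
class SimpleReplaceDemo {
    // Hypothetical helper: applies the question's search regex and replacement.
    static String markAb(String input) {
        // Each match is "<lazy prefix>ab"; group 3 always ends up empty.
        return input.replaceAll("(.*?)(ab)(.*?)", "$1@$2@$3");
    }

    public static void main(String[] args) {
        // prints "@ab@df@ab@sdf@ab@fdsa@ab@bb"
        System.out.println(markAb("abdfabsdfabfdsaabbb"));
    }
}
```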
However, if I now add a certain context to the string, the regex above fails:
Search string:
<context>abdfabsdfabfdsaabbb</context>
Search regex:
<context>(.*?)(ab)(.*?)</context>
This finds only the first match.
Even if I wrap the original regex in a repeated non-capturing group, it doesn't work ("<context>(?:(.*?)(ab)(.*?))*</context>").
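To make the failure concrete, here is a small sketch that counts how many matches each attempt yields. The helper name countMatches is an assumption for illustration; it only uses the standard Pattern and Matcher classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContextAttemptDemo {
    // Hypothetical helper: counts successive find() hits of regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String plain = "abdfabsdfabfdsaabbb";
        String wrapped = "<context>" + plain + "</context>";
        // Without context: one match per "ab".
        System.out.println(countMatches("(.*?)(ab)(.*?)", plain));
        // With context: the whole element is consumed by a single match.
        System.out.println(countMatches("<context>(.*?)(ab)(.*?)</context>", wrapped));
        // Repeated group: still a single match, and the three groups
        // can only hold the values of one iteration.
        System.out.println(countMatches("<context>(?:(.*?)(ab)(.*?))*</context>", wrapped));
    }
}
```

Both context-aware attempts consume the entire element in one match, which is why there is no per-occurrence list of groups to back-reference.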
What I would like is a list of matches as in the first search (without the context), where the group numbers are the same within each match.
Any idea how to achieve this?
Recommended answer
Solution
Your requirement is similar to the one in this question: match and capture multiple instances of a pattern between a prefix and a suffix. Using the method described in my answer there:
(?s)(?:<context>|(?!^)\G)(?:(?!</context>|ab).)*ab
Add capturing groups as needed.
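As one possible way to add the capturing groups, the sketch below wraps everything before ab in group 1 and ab itself in group 2, so the replacement from the question becomes $1@$2@. The grouping and the names (ContextReplaceDemo, markAbInContext) are assumptions for illustration, not part of the original answer.

```java
class ContextReplaceDemo {
    // The answer's regex with two capturing groups added:
    // group 1 = "<context>" or the skipped text, group 2 = "ab".
    static final String REGEX =
        "(?s)((?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)";

    // Hypothetical helper: wraps every "ab" inside <context>...</context> in "@".
    static String markAbInContext(String input) {
        return input.replaceAll(REGEX, "$1@$2@");
    }

    public static void main(String[] args) {
        String input = "ab<context>abdfabsdfabfdsaabbb</context>ab";
        // prints "ab<context>@ab@df@ab@sdf@ab@fdsa@ab@bb</context>ab"
        System.out.println(markAbInContext(input));
    }
}
```

Every match has the same two groups, and the ab occurrences outside the tag are left untouched.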
Note that the regex only works for tags that are allowed to contain only text. If a tag contains other tags, it won't work correctly.
It also matches ab inside a <context> tag that has no closing </context> tag. If you want to prevent this, use:
(?s)(?:<context>(?=.*?</context>)|(?!^)\G)(?:(?!</context>|ab).)*ab
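The difference between the two variants can be checked with a short sketch. As before, the capturing groups and the names (StrictContextDemo, LOOSE, STRICT, mark) are added here for illustration only.

```java
class StrictContextDemo {
    // First variant: does not check for a closing tag.
    static final String LOOSE =
        "(?s)((?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)";
    // Second variant: the lookahead requires a later </context>.
    static final String STRICT =
        "(?s)((?:<context>(?=.*?</context>)|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)";

    // Hypothetical helper: wraps each matched "ab" in "@".
    static String mark(String regex, String input) {
        return input.replaceAll(regex, "$1@$2@");
    }

    public static void main(String[] args) {
        String unclosed = "<context>abdfab";
        System.out.println(mark(LOOSE, unclosed));   // "<context>@ab@df@ab@"
        System.out.println(mark(STRICT, unclosed));  // unchanged: "<context>abdfab"
        System.out.println(mark(STRICT, "<context>abdfab</context>"));
    }
}
```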
Explanation
Let us break down the regex:
(?s) # Make . match any character, including line terminators
(?:
<context>
|
(?!^)\G
)
(?:(?!</context>|ab).)*
ab
(?:<context>|(?!^)\G) makes sure that we either enter a new <context> tag, or continue from the end of the previous match and attempt to match more instances of the sub-pattern. The (?!^) guard is needed because \G also matches at the start of the input when there is no previous match.
(?:(?!</context>|ab).)* matches whatever text we don't care about (anything that is not ab) and prevents us from going past the closing tag </context>. Then we match the pattern we want, ab, at the end.
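To see how \G chains the matches together, the following sketch collects every match of the regex in order. The class and method names (ChainedMatchDemo, allMatches) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChainedMatchDemo {
    static final Pattern P =
        Pattern.compile("(?s)(?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*ab");

    // Hypothetical helper: returns the text of each successive match.
    static List<String> allMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The first match starts at <context>; each later match starts
        // exactly where the previous one ended, because of \G.
        // prints [<context>ab, dfab, sdfab, fdsaab]
        System.out.println(allMatches("<context>abdfabsdfabfdsaabbb</context>"));
    }
}
```

Once a match attempt reaches </context> without finding another ab, the chain breaks, and only a new <context> can start it again.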