Java regex: how to back-reference capturing groups in a certain context when their number is not known in advance

Problem description

As an introductory note, I am aware of the old saying about solving problems with regex, and I am also aware of the precautions about processing XML with regex. But please bear with me for a moment...

I am trying to do a regex search and replace on a group of characters. I don't know in advance how often this group will be matched, and I want to search within a certain context only.

An example: if I have the string "**ab**df**ab**sdf**ab**fdsa**ab**bb" and want to search for "ab" and replace it with "@ab@", this works fine using the following regex:

Search regex:

(.*?)(ab)(.*?)

Replacement:

$1@$2@$3

I get four matches in total, as expected. Within each match, the group IDs are the same, so the back-references ($1, $2 ...) work fine, too.
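As a minimal sketch of this first step, the search and replacement from the question can be run in Java with `Matcher.replaceAll`. The class and method names are made up for illustration, and the input is the sample string written without the `**` markers, which is an assumption about the intended text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleReplaceDemo {
    // Search regex from the question: lazy prefix, the literal "ab", lazy suffix.
    static final Pattern AB = Pattern.compile("(.*?)(ab)(.*?)");

    // Applies the replacement string from the question to every match.
    static String wrapAb(String input) {
        return AB.matcher(input).replaceAll("$1@$2@$3");
    }

    // Counts how many separate matches find() reports.
    static int countMatches(String input) {
        Matcher m = AB.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "abdfabsdfabfdsaabbb"; // assumed plain form of the sample
        System.out.println(countMatches(input)); // 4
        System.out.println(wrapAb(input));       // @ab@df@ab@sdf@ab@fdsa@ab@bb
    }
}
```

Because every match is produced by the same pattern, groups 1 to 3 mean the same thing in each of the four matches, which is why the fixed back-references work here.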

However, if I now add a certain context to the string, the regex above fails:

Search string:

<context>abdfabsdfabfdsaabbb</context>

Search regex:

<context>(.*?)(ab)(.*?)</context>

This finds only the first match. Even if I wrap the original regex in a repeated non-capturing group, it doesn't work: <context>(?:(.*?)(ab)(.*?))*</context>
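The following sketch reproduces both failed attempts so the behavior can be checked directly. The class and helper names are hypothetical. Both patterns consume the whole element in a single match, and in the repeated variant the number of groups stays at three because a capturing group that repeats only keeps the text from its last iteration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContextWrapDemo {
    static final String INPUT = "<context>abdfabsdfabfdsaabbb</context>";
    static final String PLAIN = "<context>(.*?)(ab)(.*?)</context>";
    static final String REPEATED = "<context>(?:(.*?)(ab)(.*?))*</context>";

    // Counts how many separate matches find() reports for a regex.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(PLAIN, INPUT));    // 1
        System.out.println(countMatches(REPEATED, INPUT)); // 1

        Matcher m = Pattern.compile(REPEATED).matcher(INPUT);
        if (m.find()) {
            // groupCount() depends on the pattern text only,
            // not on how many times the inner group repeated.
            System.out.println(m.groupCount()); // 3
            System.out.println(m.group(2));     // ab (last iteration only)
        }
    }
}
```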

What I would like is a list of matches as in the first search (without the context), whereby within each match the group IDs are the same.

Any idea how to achieve this?

Recommended answer

Solution

Your requirement is similar to the one in this question: match and capture multiple instances of a pattern between a prefix and a suffix. Using the method as described in this answer of mine:

(?s)(?:<context>|(?!^)\G)(?:(?!</context>|ab).)*ab

Add capturing groups as needed.
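One possible way to add the groups, shown as a sketch rather than the answer's exact code: group 1 wraps everything before the hit and group 2 wraps the hit itself, so each match has the same two group numbers. The grouping, the replacement string `$1@$2@`, and the class and method names are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContextGroupsDemo {
    // Group 1: the opening tag or resume point plus the skipped text.
    // Group 2: the "ab" we want to rewrite.
    static final Pattern IN_CONTEXT = Pattern.compile(
        "(?s)((?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)");

    // Wraps every "ab" that lies inside a <context> element.
    static String wrapAb(String input) {
        return IN_CONTEXT.matcher(input).replaceAll("$1@$2@");
    }

    // Collects group 2 of every match, one entry per hit.
    static List<String> hits(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = IN_CONTEXT.matcher(input);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "<context>abdfabsdfabfdsaabbb</context>";
        System.out.println(hits(input).size()); // 4
        System.out.println(wrapAb(input));
        // <context>@ab@df@ab@sdf@ab@fdsa@ab@bb</context>

        // "ab" outside the element is left alone.
        System.out.println(wrapAb("ab<context>xaby</context>ab"));
        // ab<context>x@ab@y</context>ab
    }
}
```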

Note that the regex only works for tags that may contain nothing but text. If a tag contains other tags, it won't work correctly.

It also matches ab inside a <context> tag that has no closing </context> tag. If you want to prevent this, use:

(?s)(?:<context>(?=.*?</context>)|(?!^)\G)(?:(?!</context>|ab).)*ab
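A small sketch comparing the two variants on an unclosed element may make the difference concrete. The class name, constant names, and test strings are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnclosedTagDemo {
    // First variant: does not check that the element is ever closed.
    static final Pattern LOOSE = Pattern.compile(
        "(?s)(?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*ab");

    // Second variant: the lookahead requires a closing tag somewhere ahead.
    static final Pattern STRICT = Pattern.compile(
        "(?s)(?:<context>(?=.*?</context>)|(?!^)\\G)(?:(?!</context>|ab).)*ab");

    // Counts how many separate matches find() reports.
    static int countMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String unclosed = "<context>abxab";
        String closed = "<context>abxab</context>";
        System.out.println(countMatches(LOOSE, unclosed));  // 2
        System.out.println(countMatches(STRICT, unclosed)); // 0
        System.out.println(countMatches(STRICT, closed));   // 2
    }
}
```

The lookahead is evaluated only when a match enters through the opening tag. Matches that resume through `\G` do not need it, because they can only follow a match that already passed the check.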



Explanation

Let us break down the regex:

(?s)                        # Make . match any character, including line terminators
(?:
  <context>
    |
  (?!^)\G
)
(?:(?!</context>|ab).)*
ab

(?:<context>|(?!^)\G) makes sure that we either enter a new <context> tag or continue from the end of the previous match and attempt to match more instances of the sub-pattern. \G alone would also match at the very start of the input when there is no previous match, so (?!^) rules that position out.

(?:(?!</context>|ab).)* matches whatever text we don't care about (anything that is not ab) and prevents us from going past the closing tag </context>. Then we match the pattern we want, ab, at the end.
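The annotated layout of the regex can be kept in Java source by compiling with `Pattern.COMMENTS`, which ignores whitespace in the pattern and treats `#` as the start of a comment. This is a sketch; the class name and the comment wording are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VerboseRegexDemo {
    // Same regex as in the breakdown, written in free-spacing mode.
    static final Pattern VERBOSE = Pattern.compile(
        "(?s)                      # . matches any character\n"
      + "(?:\n"
      + "  <context>               # enter a new element\n"
      + "    |\n"
      + "  (?!^)\\G                # or resume after the previous match\n"
      + ")\n"
      + "(?:(?!</context>|ab).)*   # skip text up to ab or the closing tag\n"
      + "ab                        # the text we are looking for\n",
        Pattern.COMMENTS);

    // Counts how many separate matches find() reports.
    static int countMatches(String input) {
        Matcher m = VERBOSE.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(
            countMatches("<context>abdfabsdfabfdsaabbb</context>")); // 4
    }
}
```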
