All overlapping substrings matching a Java regex

Problem description

Is there an API method that returns all (possibly overlapping) substrings that match a regular expression?

For example, I have a text string: String t = "04/31 412-555-1235";, and I have a pattern: Pattern p = Pattern.compile("\\d\\d+"); that matches strings of two or more digits.

The matches I get are: 04, 31, 412, 555, 1235.

How do I get overlapping matches?

I want the code to return: 04, 31, 41, 412, 12, 55, 555, 55, 12, 123, 1235, 23, 235, 35.

Theoretically it should be possible: there is an obvious O(n^2) algorithm that enumerates all the substrings and checks each one against the pattern.
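The brute-force idea can be sketched as follows. This is only an illustration of the enumeration, not an API method; the class name SubstringScan and the method name allMatchingSubstrings are made up for this example. It performs O(n^2) regex checks, and each check has its own cost on top of that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: enumerate every substring and keep those that match entirely.
class SubstringScan {
    static List<String> allMatchingSubstrings(String text, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                String candidate = text.substring(i, j);
                // matches() succeeds only if the whole candidate matches the pattern.
                if (pattern.matcher(candidate).matches()) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }
}
```

Calling SubstringScan.allMatchingSubstrings("04/31 412-555-1235", "\\d\\d+") returns the fourteen substrings listed in the question, in the same order.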

Edit

Rather than enumerating all substrings, it is safer to use the region(int start, int end) method in Matcher. Checking the pattern against a separate, extracted substring might change the result of the match (e.g. if there is a non-capturing group or word boundary check at the start/end of the pattern).

Edit 2

Actually, it's unclear whether region() does what you expect for zero-width matches. The specification is vague, and experiments yield disappointing results.

For example:

String line = "xx90xx";
String pat = "\\b90\\b";
System.out.println(Pattern.compile(pat).matcher(line).find()); // prints false
for (int i = 0; i < line.length(); ++i) {
  for (int j = i + 1; j <= line.length(); ++j) {
    Matcher m = Pattern.compile(pat).matcher(line).region(i, j);
    if (m.find() && m.group().length() == (j - i)) {
      System.out.println(m.group() + " (" + i + ", " + j + ")"); // prints 90 (2, 4)
    }
  }
}

I'm not sure what the most elegant solution is. One approach would be to take a substring of line and pad it with the appropriate boundary characters before checking whether pat matches.
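Another option, which is not part of the original post, is to change how the region bounds behave. Matcher has two standard settings for this: useTransparentBounds(true) lets boundary and lookaround constructs see the text outside the region, and useAnchoringBounds(false) stops ^ and $ from matching at the region edges. The sketch below assumes that this is the behavior you want; the class name RegionScan and the method name allMatches are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: test every region [i, j) of the text in place.
class RegionScan {
    static List<String> allMatches(String text, String regex) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        m.useTransparentBounds(true);  // \b and lookarounds can look past the region edges
        m.useAnchoringBounds(false);   // ^ and $ refer to the whole input, not the region
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                m.region(i, j); // resets the matcher but keeps the two bound settings
                // matches() requires the whole region to match the pattern.
                if (m.matches()) {
                    found.add(m.group() + " [" + i + ", " + j + ")");
                }
            }
        }
        return found;
    }
}
```

With these settings, RegionScan.allMatches("xx90xx", "\\b90\\b") returns an empty list, because the surrounding x characters are visible to \b, while RegionScan.allMatches("x 90 x", "\\b90\\b") reports the single match at [2, 4).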

Edit 3

Here is the full solution that I came up with. It can handle zero-width patterns, boundaries, etc. in the original regular expression. It looks through all substrings of the text string and checks whether the regular expression matches only at the specific position by padding the pattern with the appropriate number of wildcards at the beginning and end. It seems to work for the cases I tried, although I haven't done extensive testing. It is most certainly less efficient than it could be.

  public static void allMatches(String text, String regex)
  {
    for (int i = 0; i < text.length(); ++i) {
      for (int j = i + 1; j <= text.length(); ++j) {
        String positionSpecificPattern = "((?<=^.{"+i+"})("+regex+")(?=.{"+(text.length() - j)+"}$))";
        Matcher m = Pattern.compile(positionSpecificPattern).matcher(text);

        if (m.find()) 
        {   
          System.out.println("Match found: \"" + (m.group()) + "\" at position [" + i + ", " + j + ")");
        }   
      }   
    }   
  }

Edit 4

Here's a better way of doing this: https://stackoverflow.com/a/11372670/244526

Edit 5

The JRegex library supports finding all overlapping substrings matching a Java regex (although it appears not to have been updated in a while). Specifically, the documentation on non-breaking search specifies:


Using non-breaking search you can find all possible occurrences of a pattern, including those that are intersecting or nested. This is achieved by using the Matcher's method proceed() instead of find()


Recommended answer

I faced a similar situation and tried the above answers, but in my case setting the start and end index of the matcher took too much time. I think I've found a better solution, so I'm posting it here for others. Below is my code snippet.

if (textToParse != null) {
    Matcher matcher = PLACEHOLDER_PATTERN.matcher(textToParse);
    // Keep searching until the matcher reports that it hit the end of the input.
    while (!matcher.hitEnd()) {
        boolean result = matcher.find();
        int count = matcher.groupCount();
        System.out.println("Result " + result + " count " + count);
        // Read group(1) only when a match was found and the pattern has exactly one capturing group.
        if (result && count == 1) {
            mergeFieldName = matcher.group(1);
            mergeFieldNames.add(mergeFieldName);
        }
    }
}

I have used the matcher.hitEnd() method to check whether I have reached the end of the text.
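For readers who want to run the loop, here is a self-contained sketch. The answer does not show PLACEHOLDER_PATTERN, so the {{name}} pattern below is purely an assumption, as are the class name PlaceholderScan and the method name mergeFieldNames.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the hitEnd() loop from the answer.
class PlaceholderScan {
    // Assumed pattern with one capturing group; the real one is not given in the answer.
    static final Pattern PLACEHOLDER_PATTERN = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    static List<String> mergeFieldNames(String textToParse) {
        List<String> names = new ArrayList<>();
        if (textToParse == null) {
            return names;
        }
        Matcher matcher = PLACEHOLDER_PATTERN.matcher(textToParse);
        // Stop once the last search ran into the end of the input.
        while (!matcher.hitEnd()) {
            if (matcher.find()) {
                names.add(matcher.group(1));
            }
        }
        return names;
    }
}
```

Note that find() resumes after the end of the previous match, so a loop like this collects successive matches rather than overlapping ones.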

Hope this helps. Thanks!
