Regex look-behind without obvious maximum length in Java

Question

I always thought that a look-behind assertion in Java's regex API (and in many other languages, for that matter) must have an obvious length. So, the STAR and PLUS quantifiers are not allowed inside look-behinds.

The excellent online resource regular-expressions.info seems to confirm (some of) my assumptions:

"[...] Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. Java recognizes the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6. [...]"

-- http://www.regular-expressions.info/lookaround.html

Using curly brackets works as long as the total length of the range of characters inside the look-behind is smaller than or equal to Integer.MAX_VALUE. So these regexes are valid:

"(?<=a{0,"   +(Integer.MAX_VALUE)   + "})B"
"(?<=Ca{0,"  +(Integer.MAX_VALUE-1) + "})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-2) + "})B"

But these are not:

"(?<=Ca{0,"  +(Integer.MAX_VALUE)   +"})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-1) +"})B"

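The boundary between the valid and invalid patterns above can be checked with plain arithmetic. The sketch below is only a model of that rule: the class LookBehindLength and the method fitsInInt are hypothetical names, not part of java.util.regex, and the idea that the engine adds the prefix length to the repeated length is an assumption drawn from the examples, not from the Pattern source.

```java
// Hypothetical helper: models the length rule suggested by the examples,
// it does not call into java.util.regex at all.
class LookBehindLength {

    // True if prefixLength + atomLength * maxRepeat still fits in an int.
    // The sum is computed as a long so that it cannot overflow here.
    static boolean fitsInInt(int prefixLength, int atomLength, int maxRepeat) {
        long total = (long) prefixLength + (long) atomLength * maxRepeat;
        return total <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int max = Integer.MAX_VALUE;
        System.out.println(fitsInInt(0, 1, max));      // (?<=a{0,MAX})     -> true
        System.out.println(fitsInInt(1, 1, max - 1));  // (?<=Ca{0,MAX-1})  -> true
        System.out.println(fitsInInt(2, 1, max - 2));  // (?<=CCa{0,MAX-2}) -> true
        System.out.println(fitsInInt(1, 1, max));      // (?<=Ca{0,MAX})    -> false
        System.out.println(fitsInInt(2, 1, max - 1));  // (?<=CCa{0,MAX-1}) -> false
    }
}
```

Under this model, each invalid pattern is exactly one character past Integer.MAX_VALUE.
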
However, I don't understand the following:

When I run a test using the * and + quantifiers inside a look-behind, all goes well (see the output of Test 1 and Test 2).

But when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see the output of Test 3).

Making the greedy * from Test 3 reluctant has no effect; it still breaks (see Test 4).

Here is the test harness:

public class Main {

    private static String testFind(String regex, String input) {
        try {
            boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();
            return "testFind       : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testFind       : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testReplaceAll(String regex, String input) {
        try {
            String returned = input.replaceAll(regex, "FOO");
            return "testReplaceAll : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testSplit(String regex, String input) {
        try {
            String[] returned = input.split(regex);
            return "testSplit      : Valid   -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);
        } catch(Exception e) {
            return "testSplit      : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    public static void main(String[] args) {
        String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};
        String input = "CaaaaaaaaaaaaaaaBaaaa";
        int test = 0;
        for(String regex : regexes) {
            test++;
            System.out.println("********************** Test "+test+" **********************");
            System.out.println("    "+testFind(regex, input));
            System.out.println("    "+testReplaceAll(regex, input));
            System.out.println("    "+testSplit(regex, input));
            System.out.println();
        }
    }
}

The output:

********************** Test 1 **********************
    testFind       : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 2 **********************
    testFind       : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 3 **********************
    testFind       : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testSplit      : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^

********************** Test 4 **********************
    testFind       : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testSplit      : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^

My question may be obvious, but I'll still ask it: can anyone explain to me why Test 1 and 2 work, and Test 3 and 4 fail? I would have expected them all to fail, not half of them to work and half of them to fail.

Thanks.

PS. I'm using Java version 1.6.0_14.

Recommended answer

Glancing at the source code for Pattern.java reveals that '*' and '+' are implemented as instances of Curly (which is the object created for curly operators). So,

a*

is implemented as

a{0,0x7FFFFFFF}

a+

is implemented as

a{1,0x7FFFFFFF}

which is why you see exactly the same behaviors for curlies and stars.
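
The equivalence can be observed from the outside without reading Pattern.java. The sketch below compares each quantifier with its curly-brace spelling on ordinary (non-look-behind) matches. The class StarAsCurly and its method names are hypothetical, and the check only shows that both spellings accept the same inputs; it does not prove how the engine represents them internally.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: compares a* / a+ with their curly-brace spellings.
class StarAsCurly {

    // 0x7FFFFFFF is the same value as Integer.MAX_VALUE (2147483647).
    static final String MAX = Integer.toString(0x7FFFFFFF);

    // True if "a*" and "a{0,2147483647}" agree on whether the whole input matches.
    static boolean starAgrees(String input) {
        return Pattern.matches("a*", input)
            == Pattern.matches("a{0," + MAX + "}", input);
    }

    // True if "a+" and "a{1,2147483647}" agree on whether the whole input matches.
    static boolean plusAgrees(String input) {
        return Pattern.matches("a+", input)
            == Pattern.matches("a{1," + MAX + "}", input);
    }

    public static void main(String[] args) {
        for (String input : new String[] {"", "a", "aaa", "aab"}) {
            System.out.println("'" + input + "' star agrees: " + starAgrees(input)
                + ", plus agrees: " + plusAgrees(input));
        }
    }
}
```

Read together with the question, this suggests why the tests split the way they do: (?<=a*) has a maximum length of exactly Integer.MAX_VALUE, while (?<=Ca*) needs one character more than that.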
