为什么^(?: x + y){5} $的性能比^ x + yx + yx + yx + yx + y $慢 [英] Why is performance of ^(?:x+y){5}$ slower than ^x+yx+yx+yx+yx+y$

查看:62
本文介绍了为什么^(?: x + y){5} $的性能比^ x + yx + yx + yx + yx + y $慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我让以下编译的正则表达式在.net(N)和Java(J)中的一串字符串上匹配.在.net和Java中,通过多次运行,正则表达式 1 和正则表达式 2 之间存在一致的差异:

I let the following compiled regular expressions match on a bunch of strings, both in .net (N) and Java (J). Through multiple runs there are consistent differences between regex 1 and regex 2, both in .net and in Java:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  #  regex                             N secs   N x  J secs   J x 
──────────────────────────────────────────────────────────────────
  1  ^[^@]+@[^@]+@[^@]+@[^@]+@[^@]+@$    8.10     1    5.67     1 
  2  ^(?:[^@]+@){5}$                    11.07  1.37    6.48  1.14 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

正则表达式编译器是否应该并且应该将等效结构标准化为性能最佳的形式?

如果他们应该并且应该",至少可以写一个 regex优化器,该优化器在编译正则表达式字符串之前对其进行修改.

If they "could and should", it whould at least be possible to write a regex optimizer, which modifies the regex string before it gets compiled.

所用代码的关键部分:

.net

// init
regex = new Regex(r, RegexOptions.Compiled | RegexOptions.CultureInvariant);

// test
sw = Stopwatch.Start();
foreach (var s in strs)
  if (regex.isMatch(s))
    matches++;
elapsed = sw.Elapsed;

Java

// init
pat = Pattern.compile(r);

// test
before = System.currentTimeMillis();
for (String s : strs)
  if (pat.matcher(s).matches())
    matches++;
elapsed = System.currentTimeMillis() - before;

推荐答案

我不了解.NET ,因为我没有详细研究其源代码.

I don't know about .NET, since I haven't studied its source code in detail.

但是,在Java中,尤其是Oracle/Sun实现,我可以说这可能是由于循环结构开销所致.

However, in Java, particularly Oracle/Sun implementation, I can say that it is likely due to the the looping structure overhead.

在这个答案中,每当我提到Java中的regex实现时,我所指的就是Oracle/Sun实现.我还没有研究其他实现,所以我什么也不能说.

In this answer, whenever I refer to the implementation of regex in Java, it is Oracle/Sun implementation that I'm referring to. I haven't studied other implementation yet, so I can't really say anything.

我只是意识到这部分与问题无关.但是,它介绍了如何通过这种方式实现贪婪量词,因此我将其留在此处.

给一个原子 A 和一个贪婪的量词 A * (此处重复的次数并不重要),贪婪的量词将尝试匹配尽可能多的<尽可能使用code> A ,然后尝试续集( A * 之后的任何内容),如果失败,一次回溯一次重复,然后重试该续集.

Given an atom A with a greedy quantifier A* (the number of repetition doesn't really matter here), the greedy quantifier will try to match as many A as possible, then try the sequel (whatever comes after A*), on failure, backtrack one repetition at a time and retry with the sequel.

问题是要回溯到的位置.仅通过找出位置来重新匹配整个对象是非常低效的,因此我们需要为每个重复存储一个重复结束匹配的位置.重复的次数越多,则需要更多的内存来保持到目前为止的所有状态以进行回溯,而不提及捕获组(如果有).

The problem is where to backtrack to. It is extremely inefficient to rematch the whole thing just to figure out the position, so we need to store the position where a repetition finishes its match at, for every single repetition. The more you repeat, the more memory you need to keep all the states so far for backtracking, not mentioning the capturing groups (if any).

按照问题中的步骤展开正则表达式不能逃脱上面的内存要求.

Unrolling the regex as done in the question doesn't escape the memory requirement above.

但是,对于诸如 [^ @] * 这样的简单情况,您知道原子 A (在本例中为 [^ @] )只能匹配固定长度的字符串(长度为1),只需要最后匹配的位置和匹配的长度即可有效地进行匹配.Java的实现包括一个 study 方法,以检测固定长度的模式,将其编译为循环实现( Pattern.Curly Pattern.GroupCurly ).

However, for simple cases such as [^@]* where you know the atom A (in this case [^@]) can only match fixed length string (length 1), only the last match position and the length of the match are needed to perform the match efficiently. Java's implementation includes a study method to detect fixed length patterns like these to compile to a loop implementation (Pattern.Curly and Pattern.GroupCurly).

这是第一个正则表达式 ^ [^ @] + @ [^ @] + @ [^ @] + @ [^ @] + @ [^ @] + @ $ 的样子将其编译为 Pattern 类内的节点链之后:

This is how the first regex ^[^@]+@[^@]+@[^@]+@[^@]+@[^@]+@$ looks like after it is compiled into a chain of nodes inside Pattern class:

Begin. \A or default ^
Curly. Greedy quantifier {1,2147483647}
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      @
  Node. Accept match
Single. Match code point: U+0040 COMMERCIAL AT
Curly. Greedy quantifier {1,2147483647}
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      @
  Node. Accept match
Single. Match code point: U+0040 COMMERCIAL AT
Curly. Greedy quantifier {1,2147483647}
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      @
  Node. Accept match
Single. Match code point: U+0040 COMMERCIAL AT
Curly. Greedy quantifier {1,2147483647}
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      @
  Node. Accept match
Single. Match code point: U+0040 COMMERCIAL AT
Curly. Greedy quantifier {1,2147483647}
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      @
  Node. Accept match
Single. Match code point: U+0040 COMMERCIAL AT
Dollar(multiline=false). \Z or default $
java.util.regex.Pattern$LastNode
Node. Accept match

循环结构开销

在长度不固定的情况下,例如正则表达式 ^(?:[^ @] + @){中的原子(?:[^ @] + @) {5} $ 的问题,Java的实现切换到递归以处理匹配( Pattern.Loop ).

Begin. \A or default ^
Prolog. Loop wrapper
Loop[1733fe5d]. Greedy quantifier {5,5}
  java.util.regex.Pattern$GroupHead
  Curly. Greedy quantifier {1,2147483647}
    CharProperty.complement. S̄:
      BitClass. Match any of these 1 character(s):
        @
    Node. Accept match
  Single. Match code point: U+0040 COMMERCIAL AT
  GroupTail. --[next]--> Loop[1733fe5d]
Dollar(multiline=false). \Z or default $
java.util.regex.Pattern$LastNode
Node. Accept match

每次遍历节点 GroupTail->都会产生额外的开销.循环->GroupHead .

您可能会问,即使重复次数是固定的,实现也为什么要这样做呢?由于重复的次数是固定的,所以完全没有回溯的地方,所以我们不能只跟踪重复之前的状态和当前重复中的状态吗?

You may ask why the implementation does this even though the number of repetitions is fixed. Since the number of repetition is fixed, by right, there is nowhere to backtrack at all, so can't we just keep track of the state before the repetition and the state in the current repetition?

好吧,这是一个反例: ^(?: a {1,5}?){5} $ 在仅 a 的字符串长度15上.回溯仍然可能发生在原子内部,因此我们需要像往常一样存储每个重复的匹配位置.

Well, here is a counterexample: ^(?:a{1,5}?){5}$ on a string length 15 of only a's. The backtracking can still happen inside the atom, so we need to store the match position of every repetition as per usual.

我上面讨论的都是Java源代码(因此是字节码)级别.尽管源代码可能会揭示实现中的某些问题,但性能最终取决于JVM如何生成机器代码并执行优化.

What I have discussed above are all at the Java source code (and thus bytecode) level. While the source code may reveal certain problem in the implementation, the performance ultimately depends on how the JVM generates the machine code and performs optimization.

这是我用于测试的源代码:

This is the source code I used for testing:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.regex.Pattern;


public class SO28161874 {
    // Makes sure the same set of strings is generated between different runs
    private static Random r = new Random();

    public static void main(String args[]) {
        final int rep = 5;

        // String r1 = "^[^@]+@[^@]+@[^@]+@[^@]+@[^@]+@$";
        String r1 = genUnroll(rep);
        // String r2 = "^(?:[^@]+@){5}$";
        String r2 = genQuantifier(rep);

        List<String> strs = new ArrayList<String>();
        // strs.addAll(generateRandomString(500, 40000, 0.002, false));
        // strs.addAll(generateRandomString(500, 40000, 0.01, false));
        // strs.addAll(generateRandomString(500, 40000, 0.01, true));
        // strs.addAll(generateRandomString(500, 20000, 0, false));
        // strs.addAll(generateRandomString(500, 40000, 0.002, true));
        strs.addAll(generateNearMatchingString(500, 40000, rep));

        /*
        // Assertion for generateNearMatchingString
        for (String s: strs) {
            assert(s.matches(r1.replaceAll("[+]", "*")));
        }
        */

        System.out.println("Test string generated");

        System.out.println(r1);
        System.out.println(test(Pattern.compile(r1), strs));

        System.out.println(r2);
        System.out.println(test(Pattern.compile(r2), strs));
    }

    private static String genUnroll(int rep) {
        StringBuilder out = new StringBuilder("^");

        for (int i = 0; i < rep; i++) {
            out.append("[^@]+@");
        }

        out.append("$");
        return out.toString();
    }

    private static String genQuantifier(int rep) {
        return "^(?:[^@]+@){" + rep + "}$";
    }

    /*
     * count -- number of strings
     * maxLength -- maximum length of the strings
     * chance -- chance that @ will appear in the string, from 0 to 1
     * end -- the string appended with @
     */
    private static List<String> generateRandomString(int count, int maxLength, double chance, boolean end) {
        List<String> out = new ArrayList<String>();

        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder();
            int length = r.nextInt(maxLength);
            for (int j = 0; j < length; j++) {
                if (r.nextDouble() < chance) {
                    sb.append('@');
                } else {
                    char c = (char) (r.nextInt(96) + 32);
                    if (c != '@') {
                        sb.append(c);
                    } else {
                        j--;
                    }
                }
            }

            if (end) {
                sb.append('@');
            }

            out.add(sb.toString());

        }

        return out;
    }

    /*
     * count -- number of strings
     * maxLength -- maximum length of the strings
     * rep -- number of repetitions of @
     */
    private static List<String> generateNearMatchingString(int count, int maxLength, int rep) {
        List<String> out = new ArrayList<String>();

        int pos[] = new int[rep - 1]; // Last @ is at the end

        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder();
            int length = r.nextInt(maxLength);

            for (int j = 0; j < pos.length; j++) {
                pos[j] = r.nextInt(length);
            }
            Arrays.sort(pos);

            int p = 0;

            for (int j = 0; j < length - 1; j++) {
                if (p < pos.length && pos[p] == j) {
                    sb.append('@');
                    p++;
                } else {
                    char c = (char) (r.nextInt(95) + 0x20);
                    if (c != '@') {
                        sb.append(c);
                    } else {
                        j--;
                    }
                }
            }

            sb.append('@');

            out.add(sb.toString());

        }

        return out;
    }

    private static long test(Pattern re, List<String> strs) {
        int matches = 0;

        // 500 rounds warm-up
        for (int i = 0; i < 500; i++) {
            for (String s : strs)
              if (re.matcher(s).matches());
        }

        long accumulated = 0;

        long before = System.currentTimeMillis();
        for (int i = 0; i < 1000; i++) {
            matches = 0;
            for (String s : strs)
              if (re.matcher(s).matches())
                matches++;
        }

        accumulated += System.currentTimeMillis() - before;

        System.out.println("Found " + matches + " matches");

        return accumulated;
    }
}

注释/取消注释不同的生成行以进行测试,并弄乱数字.在每次测试之前,我通过执行regex 500次来预热VM,然后将累计时间运行regex 1000次.

Comment/uncomment different generate lines to test, and mess around with the numbers. I warm up the VM before each test by executing the regex 500 times before timing the accumulated time to run the regex 1000 times.

我没有可发布的具体数字,因为我发现自己的结果相当不稳定.但是,从我的测试中,我通常发现第一个正则表达式比第二个正则表达式要快.

I don't have any concrete number to post, since I find my own result rather unstable. However, from my testing, I generally find the first regex faster than the second regex.

通过生成500个字符串,每个字符串最多可以包含40000个字符,我发现当输入导致它们不到10秒运行时,两个正则表达式之间的区别更加明显(大约1到2秒).当输入使它们运行更长(40+秒)时,两个正则表达式的运行时间大致相同,相差最多为几百毫秒.

By generating 500 strings, each string can be up to 40000 characters long, I find that the difference between the 2 regexes is more prominent (around 1 to 2 seconds) when the input causes them to run in less than 10 seconds. When the input causes them to run longer (40+ seconds), both regex have more or less the same run time, with at most a few hundred of milliseconds difference.

这篇关于为什么^(?: x + y){5} $的性能比^ x + yx + yx + yx + yx + y $慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆