慢正则表达式的性能 [英] Slow Regex performance

查看:132
本文介绍了慢正则表达式的性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

下面的代码包含了一个正则表达式,提取C#字符串字面量,但正则表达式匹配的输入字符串的性能超过几个字符是前景堪忧。

The code below contains a regular expression designed to extract a C# string literal but the performance of the regex matching for input strings of more than a few characters is woeful.

class Program
{
   private static void StringMatch(string s)
    {
        // regex: quote, zero-or-more-(zero-or-more-non-backslash-quote, optional-backslash-anychar), quote
        Match m = Regex.Match(s, "\"(([^\\\\\"]*)(\\\\.)?)*\"");
        if (m.Success)
            Trace.WriteLine(m.Value);
        else
            Trace.WriteLine("no match");
    }

    public static void Main()
    {
        // this first string is unterminated (so the match fails), but it returns instantly
        StringMatch("\"OK");

        // this string is terminated (the match succeeds)
        StringMatch("\"This is a longer terminated string - it matches and returns instantly\"");

        // this string is unterminated (so the match will fail), but it never returns
        StringMatch("\"This is another unterminated string and takes FOREVER to match");
    }
}

我可以重构正则表达式成不同的形式,但能?为什么表现这么差的人提供了一个解释

I can refactor the regex into a different form, but can anyone offer an explanation why the performance is so bad?

推荐答案

您正在运行到的catastrophic回溯:

让我们简化正则表达式位(不包括引号逃脱,没有第二可选的组,因为在您的评论,它无关紧要的测试字符串):

Let's simplify the regex a bit (without the escaped quotes and without the second optional group because, as in your comment, it's irrelevant for the tested strings):

"(([^\\"]*))*" 

([^ \\ ] *)任何字符串匹配除了引号或反斜线。这<青霉>再次的被包围在可重复任意次数的可选基团。

([^\\"]*) matches any string except quotes or backslashes. This again is enclosed in an optional group that can repeat any number of times.

字符串现在ABC ,正则表达式引擎需要尝试以下排列:

Now for the string "ABC, the regex engine needs to try the following permutations:


  • ABC

  • ABC <空字符串>

  • AB C

  • AB C <空字符串>

  • AB <空字符串> C

  • AB <空字符串> ; C <空字符串>

  • <空字符串> AB C

  • <空字符串> AB C <空字符串>

  • <空字符串> AB <空字符串> C <空字符串>

  • <空字符串> AB <空字符串> C

  • A BC

  • A BC <空字符串>

  • A <空字符串> BC

  • < ;空字符串> A BC

  • 等。

  • A B C

  • A B C <空字符串>

  • A B <空字符串> C

  • 等等。

  • ", ABC
  • ", ABC, <empty string>
  • ", AB, C
  • ", AB, C, <empty string>
  • ", AB, <empty string>, C
  • ", AB, <empty string>, C, <empty string>
  • ", <empty string>, AB, C
  • ", <empty string>, AB, C, <empty string>
  • ", <empty string>, AB, <empty string>, C, <empty string>
  • ", <empty string>, AB, <empty string>, C
  • ", A, BC
  • ", A, BC, <empty string>
  • ", A, <empty string>, BC
  • ", <empty string>, A, BC
  • etc.
  • ", A, B, C
  • ", A, B, C, <empty string>
  • ", A, B, <empty string>, C
  • etc. etc.

每个失败的话,因为没有后续的

each of which then fails because there is no following ".

此外,你只测试为子而不是强迫的正则表达式匹配整个字符串。而你通常要使用逐字字符串正则表达式来减少你需要的反斜杠数目。这个怎么样:

Also, you're only testing for substrings instead of forcing the regex to match the entire string. And you usually want to use verbatim strings for regexes to cut down on the number of backslashes you need. How about this:

foundMatch = Regex.IsMatch(subjectString, 
    @"\A     # Start of the string
    ""       # Match a quote
    (?:      # Either match...
     \\.     # an escaped character
    |        # or
     [^\\""] # any character except backslash or quote
    )*       # any number of times
    ""       # Match a quote
    \Z       # End of the string", 
    RegexOptions.IgnorePatternWhitespace);

这篇关于慢正则表达式的性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆