How can I improve the performance of a .NET regular expression?


Question

I have a regular expression which parses a (very small) subset of the Razor template language. Recently, I added a few more rules to the regex which dramatically slowed its execution. I'm wondering: are there certain regex constructs that are known to be slow? Is there a restructuring of the pattern I'm using that would maintain readability and yet improve performance? Note: I've confirmed that this performance hit occurs post-compilation.

Here is the pattern:

new Regex(
              @"  (?<escape> \@\@ )"
            + @"| (?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )"
            + @"| (?<using> \@using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"

            // captures expressions of the form "foreach (var [var] in [expression]) { <text>" 
/* ---> */      + @"| (?<foreach> \@foreach \s* \( \s* var \s+ (?<var> \w+ ) \s+ in \s+ (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"

            // captures expressions of the form "if ([expression]) { <text>" 
/* ---> */      + @"| (?<if> \@if \s* \( \s* (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"  

            // captures the close of a razor text block
            + @"| (?<endBlock> </text> \s* \} )"

            // an expression of the form @([(int)] a.b.c)
            + @"| (?<parenAtExpression> \@\( \s* (?<castToInt> \(int\)\s* )? (?<expressionValue> [\w\.]+ ) \s* \) )"
            + @"| (?<atExpression> \@ (?<expressionValue> [\w\.]+ ) )"
/* ---> */      + @"| (?<literal> ([^\@<]+|[^\@]) )",
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);

/* ---> */ indicates the new "rules" that caused the slowdown.

Recommended answer

As you are not anchoring the expression, the engine has to try each alternative sub-pattern at every position of the string before it can be sure that it can't find a match. This will always be time-consuming, but how can it be made less so?
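One possible way to anchor, sketched below, is the `\G` anchor combined with `Regex.Match(input, startat)`, so that every attempt must begin exactly where the previous token ended. This assumes the input is meant to be consumed as a contiguous stream of tokens, which the question does not state. The two-rule pattern and the `Tokenize` helper are hypothetical and only illustrate the idea. Note the change in behavior: text that no rule matches now raises an error instead of being silently skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical two-rule tokenizer; the group names mirror the question's pattern.
// \G pins each attempt to the start position passed to Match, so the engine
// never scans forward looking for a later place to begin.
var tokenizer = new Regex(
    @"\G(?: (?<escape> @@ ) | (?<literal> [^@]+ ) )",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

List<string> Tokenize(string input)
{
    var tokens = new List<string>();
    var position = 0;
    while (position < input.Length)
    {
        Match m = tokenizer.Match(input, position);
        if (!m.Success)
        {
            // No rule applies at this position: fail fast instead of skipping ahead.
            throw new FormatException($"Unexpected input at index {position}");
        }
        tokens.Add(m.Value);
        position = m.Index + m.Length;
    }
    return tokens;
}

Console.WriteLine(string.Join("|", Tokenize("a@@b"))); // a|@@|b
```

Whether this is faster for the full pattern would need to be measured against real templates.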

Some thoughts:

I don't like the sub-pattern on the second line that tries to match comments and I don't think it will work correctly.

I can see what you're trying to do with the ( ([^\*]\@) | (\*[^\@]) | . )* - allow @ and * within the comments as long as they are not preceded by * or followed by @ respectively. But because of the group's * quantifier and the third option ., the sub-pattern will happily match *@, therefore rendering the other options redundant.

Assuming that the subset of Razor you are trying to match does not allow multiline comments, I suggest this for the second line:

+ @"| (?<comment> @\*.*?\*@ )"

i.e. lazily match any characters (but newlines) until the first *@ is encountered. You are using RegexOptions.ExplicitCapture meaning only named groups are being captured, so the lack of () should not be a problem.
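The difference between the two comment rules can be checked with a small sketch. The sample string is made up; the first pattern is the comment rule from the question and the second is the lazy rule suggested here, both with the same whitespace and capture options as the original.

```csharp
using System;
using System.Text.RegularExpressions;

var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture;

// Comment rule from the question: the bare "." alternative can also consume "*@".
var original = new Regex(
    @"(?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )", options);

// Lazy rule suggested in the answer: stops at the first "*@".
var lazy = new Regex(@"(?<comment> @\*.*?\*@ )", options);

// Made-up sample with two separate comments on one line.
var sample = "@* one *@ text @* two *@";

Console.WriteLine(original.Match(sample).Value); // @* one *@ text @* two *@
Console.WriteLine(lazy.Match(sample).Value);     // @* one *@
```

The original rule runs from the first `@*` to the last `*@` on the line, swallowing the text between the two comments, while the lazy rule ends at the first terminator.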

I also do not like the ([^\@<]+|[^\@]) sub-pattern in the last line, which equates to ([^\@<]+|<). The [^\@<]+ will greedily match to the end of the string unless it comes across a @ or <.
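A quick way to convince yourself of that equivalence is to run both forms over the same input. The sample string and the `Literals` helper below are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture;

// Literal rule as written in the question.
var asWritten = new Regex(@"(?<literal> ([^\@<]+|[^\@]) )", options);

// Simplified form: the second alternative can only ever be reached by "<".
var simplified = new Regex(@"(?<literal> ([^@<]+|<) )", options);

// Hypothetical helper: returns every match value in order.
string[] Literals(Regex regex, string input) =>
    regex.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();

var sample = "a<b>c@d";
Console.WriteLine(string.Join("|", Literals(asWritten, sample)));  // a|<|b>c|d
Console.WriteLine(string.Join("|", Literals(simplified, sample))); // a|<|b>c|d
```

Both forms split the text at `<` and stop before `@`, so the simplified form only makes the intent easier to read.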

I do not see any adjacent sub-patterns that will match the same text, which are the usual culprits for excessive backtracking, but all the \s* seem suspect because of their greed and flexibility, including matching nothing and newlines. Perhaps you could change some of the \s* to [ \t]* where you know you don't want to match newlines, for example, perhaps before the opening bracket following an if.
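To illustrate how much more `\s*` accepts than `[ \t]*`, here is a small sketch on a fragment of the `if` rule. The fragment and the sample inputs are made up, and the patterns are written without `IgnorePatternWhitespace` to keep the character class unambiguous.

```csharp
using System;
using System.Text.RegularExpressions;

// \s* also accepts newlines between "@if" and the opening bracket.
var loose = new Regex(@"@if\s*\(");

// [ \t]* only accepts spaces and tabs there.
var strict = new Regex(@"@if[ \t]*\(");

var sameLine = "@if (Model.Flag)";
var acrossLines = "@if\n\n   (Model.Flag)";

Console.WriteLine(loose.IsMatch(sameLine));     // True
Console.WriteLine(strict.IsMatch(sameLine));    // True
Console.WriteLine(loose.IsMatch(acrossLines));  // True
Console.WriteLine(strict.IsMatch(acrossLines)); // False
```

Whether a newline should be allowed at a given spot depends on the templates you need to parse, so this is a change to make case by case.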

I notice that nhahtdh has suggested you use atomic grouping to prevent the engine from backtracking into previously matched text. That is certainly worth experimenting with, as the slow-down is almost certainly caused by excessive backtracking when the engine can no longer find a match.
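As a rough sketch of what atomic grouping does, the first pair of patterns below is a contrived example showing that `(?> ... )` never gives back characters it has consumed. The last pattern applies the same construct to the `if` rule from the question; where exactly to place atomic groups in the full pattern is an assumption here, not something nhahtdh's suggestion is quoted on.

```csharp
using System;
using System.Text.RegularExpressions;

// Contrived contrast: \w+ first swallows "abc1", then must give back "1" for \d.
var backtracking = new Regex(@"\w+\d");
// The atomic group refuses to give anything back, so \d can never match.
var atomic = new Regex(@"(?>\w+)\d");

Console.WriteLine(backtracking.IsMatch("abc1")); // True
Console.WriteLine(atomic.IsMatch("abc1"));       // False

// The "if" rule from the question with the expression wrapped in an atomic group.
// On valid input the result is unchanged; on a failed attempt the engine does not
// retry shorter and shorter expression values.
var ifRule = new Regex(
    @"\@if \s* \( \s* (?<expressionValue> (?> [\w\.]+ ) ) \s* \) \s* \{ \s* <text>",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

Match m = ifRule.Match("@if (Model.Show) { <text>");
Console.WriteLine(m.Groups["expressionValue"].Value); // Model.Show
```

The actual time saved depends on the input, so it is worth profiling before and after.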

What are you trying to achieve with the RegexOptions.Multiline option? You do not appear to be using ^ or $, so it will have no effect.
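A short sketch makes the point: `RegexOptions.Multiline` only changes where `^` and `$` match, so a pattern without those anchors behaves identically with or without it. The sample text and the `Count` helper are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

var sample = "one\ntwo";

// Hypothetical helper: number of matches of a pattern in the sample.
int Count(string pattern, RegexOptions options) =>
    Regex.Matches(sample, pattern, options).Count;

// No ^ or $ in the pattern: the option makes no difference.
Console.WriteLine(Count(@"\w+", RegexOptions.None));       // 2
Console.WriteLine(Count(@"\w+", RegexOptions.Multiline));  // 2

// With ^ the option matters: it lets ^ match at the start of every line.
Console.WriteLine(Count(@"^\w+", RegexOptions.None));      // 1
Console.WriteLine(Count(@"^\w+", RegexOptions.Multiline)); // 2
```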

The escaping of the @ is unnecessary.
