为什么这个正则表达式需要很长时间才能执行? [英] Why does this regex take a long time to execute?

查看:984
本文介绍了为什么这个正则表达式需要很长时间才能执行?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我发现,例如这一行的执行时间非常长:

I found out, that for example this line has a very very long execution time:

System.out.println(
        ".. .. .. .. .. .. .. .. ..  .. .. .. .. .. .. .. .. .. .. .. .... .. .."
        .matches("(?i)(?:.* )?\\W?([a-z0-9-_\\.]+((?: *)\\.(?: *))+(?:DE))(?:[0-9]{1,5})?")
);

如果我减少字符串开头的点数,执行时间会降低(看起来像这是指数)。这是挂起线程的堆栈跟踪:

If I reduce the amount of dots at the start of the String the execution time gets lower (seems like it's exponential). Here is the suspended thread's stack trace:

[Repeating text]...
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.match(Matcher, int, CharSequence) line: 4785
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.match(Matcher, int, CharSequence) line: 4785
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.matchInit(Matcher, int, CharSequence) line: 4801
Pattern$Prolog.match(Matcher, int, CharSequence) line: 4741
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Ques.match(Matcher, int, CharSequence) line: 4182
Pattern$BranchConn.match(Matcher, int, CharSequence) line: 4568
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Branch.match(Matcher, int, CharSequence) line: 4604
Matcher.match(int, int) line: 1270
Matcher.matches() line: 604
Pattern.matches(String, CharSequence) line: 1135
String.matches(String) line: 2121
Main.main(String[]) line: 11

为什么会发生这种情况?

Why does this happen?

推荐答案

当模式 x 成为可选项时 - 使用 * 量词(或 {0,} ) - 引擎有两条路径可以接近使用量词的本质:

When pattern x is made optional - using ? or * quantifiers (or {0,}) - engine has two paths to approach according to the nature of quantifier being used:


  • 消耗然后回溯其他模式(贪婪的情况,即。* 。?

  • 首先不消费并立即查看其他模式(懒惰的情况。*?

  • Consumes then backtracks for other patterns (case of greediness i.e. .*, .?)
  • First doesn't consume and looks immediately for other patterns (case of laziness .*?)

有人可能不知道正则表达式或不是'关心性能并抛出。* 无论他需要在字符串中的某个地方进行匹配,引擎都是如此快速地来回走动而不是除非无法找到模式,否则兴趣似乎很奇怪或缓慢。

Someone probably is not aware about regular expressions or doesn't care about performance and throws .* wherever he needs a match somewhere in string and engines are so fast in taking steps back and forth that nothing seems weird or slow unless a pattern can not be found.

时间复杂度从 O(n)开始,继续 O(n ^ 2b)其中 b 是嵌套量词的级别。因此,在失败的情况下,引擎所需的步数是巨大的。

Time complexity starts at O(n) and continues with O(n^2b) where b is level of nesting quantifiers. So on failure number of steps an engine takes is HUGE.

为了避免这种情况,有人需要考虑一些指导原则:

To avoid such situations someone needs to consider some guiding principles:


  • 指定边界。如果模式应该在数字之前的某处停止。* 。而是 \ D *

使用条件。在使用前瞻 ^(?= [^ x] * x) x 是否存在C $ C>。这会导致早期失败。

Use conditions. You can check if pattern / letter x exists before running a whole match using a lookahead ^(?=[^x]*x). This leads to an early failure.

使用占有量词或原子组(如果有)。这两个避免回溯。有时你不需要回溯。

Use possessive quantifiers or atomic groups (if available). These two avoid backtracks. Sometimes you do not need backtracks.

不要做(。*)+ 或类似的模式。而是重新考虑您的要求或至少使用原子组(?>。*)+

Do not do (.*)+ or similar patterns. Instead reconsider your requirements or at least use atomic groups (?>.*)+.

您自己的正则表达式也不例外。它有很多贪婪和可选的比赛,需要一段时间才能重新获得。

Your own Regular Expression isn't an exception. It suffers from much greediness and optional matches and needs a time to be restudied.

这篇关于为什么这个正则表达式需要很长时间才能执行?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆