Java正则表达式运行速度非常慢 [英] Java Regular Expression running very slow

查看:145
本文介绍了Java正则表达式运行速度非常慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用使用Fireball正则表达式来匹配Java中的URL ,并且我找到了一个使评估永久生效的URL.我已经修改了原始正则表达式以使用Java语法.

I'm trying to use the Daring Fireball Regular Expression for matching URLs in Java, and I've found a URL that causes the evaluation to take forever. I've modified the original regex to work with Java syntax.

private final static String pattern = 
"\\b" + 
"(" +                            // Capture 1: entire matched URL
  "(?:" +
    "[a-z][\\w-]+:" +                // URL protocol and colon
    "(?:" +
      "/{1,3}" +                        // 1-3 slashes
      "|" +                             //   or
      "[a-z0-9%]" +                     // Single letter or digit or '%'
                                        // (Trying not to match e.g. "URI::Escape")
    ")" +
    "|" +                            //   or
    "www\\d{0,3}[.]" +               // "www.", "www1.", "www2." … "www999."
    "|" +                            //   or
    "[a-z0-9.\\-]+[.][a-z]{2,4}/" +  // looks like domain name followed by a slash
  ")" +
  "(?:" +                           // One or more:
    "[^\\s()<>]+" +                      // Run of non-space, non-()<>
    "|" +                               //   or
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
  ")+" +
  "(?:" +                           // End with:
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
    "|" +                                   //   or
    "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»""‘’]" +        // not a space or one of these punct chars (updated to add a 'dash'
  ")" +
")";

// @see http://daringfireball.net/2010/07/improved_regex_for_matching_urls
private static final Pattern DARING_FIREBALL_PATTERN = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

如果我尝试运行以下命令,那将花费很多时间.我将其范围缩小到匹配平衡的括号(我认为).如果您在括号内更改文本,它可以正常工作,但是大约15个字符时,它开始以指数级的速度变慢.

If I attempt to run the following, it takes forever. I've narrowed it down to the matching of balanced parens (I think). If you change the text within the parens, it works fine, but at about 15 characters, it starts to slow down exponentially.

final Matcher matcher = pattern.matcher("https://goo.gl/a(something_really_long_in_balanced_parens)");
boolean found = matcher.find();

是否有一种方法可以改善此正则表达式,使有关的字句不会永远花费下去?在JUnit测试类中,我有大约100个不同的URL,我需要这些URL才能继续正常工作.

Is there a way to improve this regex so that the lines about don't take forever? I have about 100 different URLs in a JUnit test class, and I need those to continue to work as well.

推荐答案

问题在这里:

"(?:" +                           // One or more:
"[^\\s()<>]+" +                      // Run of non-space, non-()<>
"|" +                               //   or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
")+"

这里有嵌套量词.这会对任何回溯算法造成严重破坏-例如,请考虑将正则表达式/^(a +)+ $/与字符串匹配

What you've got here is nested quantifiers. This plays havoc with any backtracking algorithm - as an example, consider the regex /^(a+)+$/ matching against the string

aaaaaaaaaab

第一次尝试,内部量词将匹配所有 a .然后正则表达式失败,因此它退出了正则表达式.然后外部量词尝试再次匹配,吞下最后一个 a ,然后正则表达式再次失败.当量词尝试以各种方式拆分 a s的运行时,我们基本上会得到指数行为,而实际上并未取得任何进展.

As a first attempt, the inner quantifier will match all of the as. Then the regex fails, so it backs off one. Then the outer quantifier tries to match again, swallowing up the last a, then the regex fails once more. We basically get exponential behaviour as the quantifiers try all sorts of ways of splitting up the run of as, without actually making any progress.

解决方案是可能的量词(我们通过将 + 附加到量词的末尾来表示)-我们设置内部量词,以便一旦它们具有一个匹配,他们不会放任不管-他们会坚持下去,直到匹配失败或较早的量词退出并且必须从字符串中的其他位置开始重新匹配.如果我们改为使用/^(a ++)+ $/作为正则表达式,则在上面的不匹配字符串上我们将立即失败,而不是尝试对其进行指数运算.

The solution is possessive quantifiers (which we denote by tacking a + onto the end of a quantifier) - we set up the inner quantifiers so that once they have a match, they don't let it go - they'll hold onto that until the match fails or an earlier quantifier backs off and they have to rematch starting somewhere else in the string. If we instead used /^(a++)+$/ as our regex, we would fail immediately on the non-matching string above, rather than going exponential trying to match it.

尝试使那些内部量词具有所有格,看看是否有帮助.

Try making those inner quantifiers possessive and see if it helps.

这篇关于Java正则表达式运行速度非常慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆