如何使这个正则表达式不会导致“灾难性的回溯”? [英] How can I make this regular expression not result in "catastrophic backtracking"?

查看:131
本文介绍了如何使这个正则表达式不会导致“灾难性的回溯”?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用匹配的正则表达式来自 http:// daringfireball。 net / 2010/07 / improved_regex_for_matching_urls

I'm trying to use a URL matching regular expression that I got from http://daringfireball.net/2010/07/improved_regex_for_matching_urls

(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://               # http or https protocol
    |                       #   or
    www\d{0,3}[.]           # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |                           #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                               #   or
    [^\s`!()\[\]{};:'".,<>?«»""‘’]        # not a space or one of these punct chars
  )
)

根据另一个问题,似乎有些情况导致此正则表达式灾难性地回溯。例如:

Based on the answers to another question, it appears that there are cases that cause this regex to backtrack catastrophically. For example:

var re = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»""‘’]))/i;
re.test("http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA)")

...可能需要很长时间才能执行(例如在Chrome中)

... can take a really long time to execute (e.g. in Chrome)

在我看来问题出在这部分代码中:

It seems to me that the problem lies in this part of the code:

(?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |                           #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+

...这似乎大致相当于(。+ | \((。+ |(\(。+ \))) * \))+ ,看起来像包含(。+)+

... which seems to be roughly equivalent to (.+|\((.+|(\(.+\)))*\))+, which looks like it contains (.+)+

我能做出哪些改变可以避免这种情况吗?

Is there a change I can make that will avoid that?

推荐答案

将其更改为以下内容可以防止灾难性的回溯:

Changing it to the following should prevent the catastrophic backtracking:

(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://               # http or https protocol
    |                       #   or
    www\d{0,3}[.]           # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |                           #   or
    \(([^\s()<>]|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                               #   or
    [^\s`!()\[\]{};:'".,<>?«»""‘’]        # not a space or one of these punct chars
  )
)

唯一的变化是在第一个之后删除 + 正则表达式的每个平衡的parens部分中的[^ \s()<>]

The only change that was made was to remove the + after the first [^\s()<>] in each of the "balanced parens" portions of the regex.

这是一条线用于测试JS的版本:

Here is the one-line version for testing with JS:

var re = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»""‘’]))/i;
re.test("http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA")

原始正则表达式的问题部分是平衡括号部分,以简化回溯发生原因的解释我将完全删除它的嵌套括号部分,因为它在这里不相关:

The problem portion of the original regex is the balanced parentheses section, to simplify the explanation of why the backtracking occurs I am going to completely remove the nested parentheses portion of it because it isn't relevant here:

\(([^\s()<>]+|(\([^\s()<>]+\)))*\)    # original
\(([^\s()<>]+)*\)                     # expanded below

\(                # literal '('
(                 # start group, repeat zero or more times
    [^\s()<>]+        # one or more non-special characters
)*                # end group
\)                # literal ')'

考虑这里用字符串'(AAAAA',文字匹配然后)会发生什么AAAAA 将是消费者由该组编辑,而将无法匹配。此时该组将放弃一个 A ,离开 AAAA 捕获并尝试在此时继续匹配。因为该组后面有一个 * ,所以该组可以匹配多次,所以现在你有([^ \(()<> ;] +)* 在第二遍中匹配 AAAA ,然后 A 。如果失败,原始捕获将放弃额外的 A ,并在第二次捕获时消耗。

Consider what happens here with the string '(AAAAA', the literal ( would match and then AAAAA would be consumed by the group, and the ) would fail to match. At this point the group would give up one A, leaving AAAA captured and attempting to continue the match at this point. Because the group has a * following it, the group can match multiple times so now you would have ([^\s()<>]+)* matching AAAA, and then A on the second pass. When this fails an additional A would be given up by the original capture and consumed by the second capture.

将持续很长一段时间导致以下匹配尝试,其中每个逗号分隔组指示组匹配的不同时间,以及实例匹配的字符数:

This would go on for a long while resulting in the following attempts to match, where each comma-separated group indicates a different time that the group is matched, and how many characters that instance matched:

AAAAA
AAAA, A
AAA, AA
AAA, A, A
AA, AAA
AA, AA, A
AA, A, AA
AA, A, A, A
....

我可能错了,但在确定正则表达式无法匹配之前,我非常确定它会增加16个步骤。当您继续向字符串添加其他字符时,计算出来的步骤数呈指数级增长。

I may have counted wrong, but I'm pretty sure it adds up to 16 steps before it is determined that the regex cannot match. As you continue to add additional characters to the string the number of steps to figure this out grows exponentially.

删除 + 并将其更改为 \(([^ \s()<>])* \),您可以避免此回溯方案。

By removing the + and changing this to \(([^\s()<>])*\), you would avoid this backtracking scenario.

重新添加交替以检查嵌套括号不会导致任何问题。

Adding the alternation back in to check for the nested parentheses doesn't cause any problems.

请注意可能想要在字符串末尾添加某种锚点,因为目前http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA在之前(,所以 re.test(...)将返回 true 因为 http://google.com/?q= 匹配。

Note that you may want to add some sort of anchor to the end of the string, because currently "http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA" will match up to just before the (, so re.test(...) would return true because http://google.com/?q= matches.

这篇关于如何使这个正则表达式不会导致“灾难性的回溯”?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆