修复正则表达式中的灾难性回溯 [英] Fixing Catastrophic Backtracking in Regular Expression
问题描述
我正在使用以下正则表达式来检查有效的文件路径:
I'm using the following regular expression to check for valid file paths:
^(?:[a-zA-Z]\:\\|\\\\)([^\\\/\:\*\?\<\>\"\|]+(\\){0,1})+$
使用测试字符串V:\Sample Names\Libraries\DeveloperLib\DeveloperComDlgs\res
被识别为有效.我什至可以将无效字符添加到字符串的开头,而不会出现问题.但是,当我在字符串的末尾添加无效字符时,该网页就会从灾难性的回溯中冻结.
Using the test string V:\Sample Names\Libraries\DeveloperLib\DeveloperComDlgs\res
is recognized as valid. I can even add invalid characters to the beginning of the string without issues. However, when I add an invalid character towards the end of the string, the webpage freezes up from catastrophic backtracking.
此正则表达式字符串中的原因是什么?
What is causing this in this regex string?
完整字符串: ^(?:[a-zA-Z]\:\\|\\\\)([^\\\/\:\*\?\<\>\"\|]+(\\){0,1})+$
第一组: (?:[a-zA-Z]\:\\|\\\\)
- 检查其中一个
- 大写或小写字母,后跟冒号和反斜杠
- 双反斜杠
- Checks for either
- A capital or lowercase alphabetical letter followed by a colon and a backslash
- A double backslash
第二组:
([^\\\/\:\*\?\<\>\"\|]+(\\){0,1})
- 第一部分:
[^\\\/\:\*\?\<\>\"\|]+
- 确保没有非法字符(\/:*?<>" |)
- 根据需要多次检查节之间的反斜杠
我认为可能是导致问题的
{0, 1}
,因为这允许回溯,但我不确定.有什么想法吗?I think it may be the
{0, 1}
causing the issue since this allows for backtracking but I am not sure. Any thoughts?推荐答案
您当前的正则表达式可以写为
^(?:[a-zA-Z]:\\|\\\\)([^\\\/\:*?<>"|]+\\?)+$
:注意在?量词(等于 {0,1}
限制量词) >在+
量化组中.Your current regex can be written as
^(?:[a-zA-Z]:\\|\\\\)([^\\\/\:*?<>"|]+\\?)+$
: pay attention at the?
quantifier (it is equal to{0,1}
limiting quantifier) after\\
inside a+
quantified group.一旦在模式内部出现像
(a+b?)+
这样的模式,则很有可能发生灾难性的回溯.如果有匹配项,例如c:\12\34\aaaaaaaaaaaaaaaaaaa
匹配良好,一切都会很好出现不允许的字符,导致不匹配(尝试在末尾添加*
,c:\12\34\aaaaaaaaaaaaaaaaaaa*
),问题.Once such a pattern like
(a+b?)+
is present inside a pattern, there is a high chance of a catastrophical backtracking. Everything is nice when there is a match, say,c:\12\34\aaaaaaaaaaaaaaaaaaa
is matched fine, but once a char that is not allowed appears causing a no-match, (try adding*
at the end,c:\12\34\aaaaaaaaaaaaaaaaaaa*
), the issue will appear.为解决此问题,可以匹配相同文本的量化子模式不能立即连续跟随.并且在每个子模式都必须使用的可选组中启用了此功能.
To solve this, the quantified subpatterns that can match the same text cannot follow one another in immediate succession. And using optional groups where each subpattern is obligatory enables this.
在大多数情况下,您需要将这些模式部分替换为展开的
a+(ba+)*
(1个或多个出现的a
,然后是0个或多个序列的b
(本身不再是可选的)),然后是1或多次出现a
(因此,在一个a
和下一个a
之间必须有一个b
).如果您需要在末尾匹配一个可选的\
(因为^(a+b?)+$
实际上可能匹配字符串末尾的b
),则需要在末尾添加b?
:a+(ba+)*b?
.In most scenarios, you need to replace these pattern parts with unrolled
a+(ba+)*
(1 or more occurrences ofa
followed with 0 or more sequences ofb
(that is no longer optional by itself) and then 1 or more occurrences ofa
(so, between onea
and the nexta
there must be ab
). If you need to match an optional\
at the end (as^(a+b?)+$
actually may matchb
at the end of the string), you need to add ab?
at the end:a+(ba+)*b?
.因此,将其转换为您当前的情况:
So, translating this to your current scenario:
^(?:[a-zA-Z]:\\|\\\\)[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*$
,或者如果最后允许
\
:^(?:[a-zA-Z]:\\|\\\\)[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*\\?$ | a+ ( b a+ )* b?
查看在没有匹配项时正常失败的方式,或者
See how it fails gracefully upon a no match, or matches as expected.
正如@anubhava所建议的,您可以使用 posssive 量词(或原子组,因为,.NET regex引擎不支持所有格)来进一步提高性能,禁止任何回溯进入分组模式.一旦匹配,就不会重试这些模式,因此,失败的速度可能会更快:
As @anubhava suggests, you can further enhance the performance by using possessive quantifiers (or atomic groups instead, since, e.g. .NET regex engine does not support possessives) that disallow any backtracking into the grouped patterns. Once matched, these patterns are not re-tried, thus, failure may come much quicker:
^(?:[a-zA-Z]:\\|\\\\)[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*+\\?$ ^
或原子团示例:
^(?:[a-zA-Z]:\\|\\\\)(?>[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*)\\?$ ^^^ ^
请注意,
:
不是特殊的正则表达式元字符,因此不应转义.在字符类中,通常仅-
,^
,\
和]
需要转义,其他所有字符也不都是特殊的.Note that
:
is not a special regex metacharacter and should not be escaped. Inside a character class, only-
,^
,\
and]
usually require escaping, all others are not special there either.在 爆炸性量词陷阱 .
这篇关于修复正则表达式中的灾难性回溯的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!