为什么这需要这么长时间才能匹配?这是一个错误吗? [英] Why does this take so long to match? Is it a bug?
问题描述
我需要匹配 Web 应用程序中的某些 URL,即 /123,456,789
,并编写此正则表达式以匹配模式:
I need to match certain URLs in web application, i.e. /123,456,789
, and wrote this regex to match the pattern:
r'(\d+(,)?)+/$'
我注意到它似乎没有评估,即使在测试模式几分钟后:
I noticed that it does not seem to evaluate, even after several minutes when testing the pattern:
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523')
预期的结果是没有匹配项.
The expected result would be that there were no matches.
然而,这个表达式几乎立即执行(注意尾部斜杠):
This expression, however, executes almost immediately (note the trailing slash):
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523/')
这是一个错误吗?
推荐答案
有一些灾难性的回溯正在进行中这将导致处理量呈指数级增长,具体取决于非匹配字符串的长度.这与您的嵌套重复和可选的逗号有关(即使某些正则表达式引擎可以确定这与尝试所有无关的重复不匹配).这是通过优化表达式来解决的.
There is some catastrophic backtracking going on that will cause an exponential amount of processing depending on how long the non-match string is. This has to do with your nested repetitions and optional comma (even though some regex engines can determine that this wouldn't be a match with attempting all of the extraneous repetition). This is solved by optimizing the expression.
最简单的方法是查找 1+ 位数字或逗号,后跟斜杠和字符串结尾:[\d,]+/$
.然而,这并不完美,因为它允许诸如 ,123,,4,5/
之类的东西.
The easiest way to accomplish this is to just look for 1+ digits or commas followed by a slash and the end of the string: [\d,]+/$
. However, that is not perfect since it would allow for something like ,123,,4,5/
.
为此,您可以使用初始尝试的稍微优化的版本:(?:\d,?)+/$
.首先,我让你的重复组非捕获 ((?:...)
) 这不是必需的,但它提供了更干净的匹配".接下来,也是唯一关键的一步,我不再重复组内的 \d
,因为组已经在重复了. 最后,我删除了可选的 ,
周围的不必要组,因为 ?
只影响最后一个字符.这几乎会寻找一个数字,也许是一个逗号,然后重复,最后跟着一个尾随的 /
.
For this you can use a slightly optimized version of your initial try: (?:\d,?)+/$
. First, I made your repeating group non-capturing ((?:...)
) which isn't necessary but it provides for a "cleaner match". Next, and the only crucial step, I stopped repeating the \d
inside of the group since the group is already repeating. Finally, I removed the unnecessary group around the optional ,
since ?
only affects the last character. Pretty much this will look for one digit, maybe a comma, then repeat, and finally followed by a trailing /
.
这仍然可以匹配一个奇怪的字符串 1,2,3,/
,所以对于它,我用 负面回顾:(?:\d,?)+(?
.这将断言在尾随 /
之前没有逗号.
This can still match an odd string 1,2,3,/
, so for the heck of it I improved your original regex with a negative lookbehind: (?:\d,?)+(?<!,)/$
. This will assert that there is no comma directly before the trailing /
.
这篇关于为什么这需要这么长时间才能匹配?这是一个错误吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!