为什么这需要这么长时间才能匹配?这是一个错误吗? [英] Why does this take so long to match? Is it a bug?

查看：31 发布时间：2021/6/15 19:02:06 python regex performance state-machine

本文介绍了为什么这需要这么长时间才能匹配?这是一个错误吗?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我需要匹配 Web 应用程序中的某些 URL，即 /123,456,789，并编写此正则表达式以匹配模式:

I need to match certain URLs in web application, i.e. /123,456,789, and wrote this regex to match the pattern:

r'(\d+(,)?)+/$'

我注意到它似乎没有评估，即使在测试模式几分钟后:

I noticed that it does not seem to evaluate, even after several minutes when testing the pattern:

re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523')

预期的结果是没有匹配项.

The expected result would be that there were no matches.

然而，这个表达式几乎立即执行(注意尾部斜杠):

This expression, however, executes almost immediately (note the trailing slash):

re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523/')

这是一个错误吗?

推荐答案

有一些灾难性的回溯正在进行中这将导致处理量呈指数级增长，具体取决于非匹配字符串的长度.这与您的嵌套重复和可选的逗号有关(即使某些正则表达式引擎可以确定这与尝试所有无关的重复不匹配).这是通过优化表达式来解决的.

There is some catastrophic backtracking going on that will cause an exponential amount of processing depending on how long the non-match string is. This has to do with your nested repetitions and optional comma (even though some regex engines can determine that this wouldn't be a match with attempting all of the extraneous repetition). This is solved by optimizing the expression.

最简单的方法是查找 1+ 位数字或逗号，后跟斜杠和字符串结尾:[\d,]+/$.然而，这并不完美，因为它允许诸如 ,123,,4,5/ 之类的东西.

The easiest way to accomplish this is to just look for 1+ digits or commas followed by a slash and the end of the string: [\d,]+/$. However, that is not perfect since it would allow for something like ,123,,4,5/.

为此，您可以使用初始尝试的稍微优化的版本:(?:\d,?)+/$.首先，我让你的重复组非捕获 ((?:...)) 这不是必需的，但它提供了更干净的匹配".接下来，也是唯一关键的一步，我不再重复组内的 \d，因为组已经在重复了. 最后，我删除了可选的 , 周围的不必要组，因为 ? 只影响最后一个字符.这几乎会寻找一个数字，也许是一个逗号，然后重复，最后跟着一个尾随的 /.

For this you can use a slightly optimized version of your initial try: (?:\d,?)+/$. First, I made your repeating group non-capturing ((?:...)) which isn't necessary but it provides for a "cleaner match". Next, and the only crucial step, I stopped repeating the \d inside of the group since the group is already repeating. Finally, I removed the unnecessary group around the optional , since ? only affects the last character. Pretty much this will look for one digit, maybe a comma, then repeat, and finally followed by a trailing /.

这仍然可以匹配一个奇怪的字符串 1,2,3,/，所以对于它，我用负面回顾:(?:\d,?)+(?.这将断言在尾随 / 之前没有逗号.


This can still match an odd string 1,2,3,/, so for the heck of it I improved your original regex with a negative lookbehind: (?:\d,?)+(?<!,)/$. This will assert that there is no comma directly before the trailing /.

                        这篇关于为什么这需要这么长时间才能匹配?这是一个错误吗?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

为什么这需要这么长时间才能匹配?这是一个错误吗? [英] Why does this take so long to match? Is it a bug?

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

为什么这需要这么长时间才能匹配?这是一个错误吗? [英] Why does this take so long to match? Is it a bug?

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭