Why does Python's re.search method hang?
I'm using Python's regex library to parse some strings, and I've found that either my regex is too complicated or the string I'm searching is too long.
Here's an example of the hang:
>>> import re
>>> reg = r"(\w+'?\s*)+[-|~]\s*((\d+\.?\d+\$?)|(\$?\d+\.?\d+))"
>>> re.search(reg, "**LOOKING FOR PAYPAL OFFERS ON THESE PAINTED UNCOMMONS**") #Hangs here...
I'm not sure what's going on. Any help appreciated!
EDIT: Here's a link with examples of what I'm trying to match: Regxr
The code hangs because of catastrophic backtracking. The quantified group (\w+'?\s*)+ contains one obligatory pattern and one or more optional patterns (ones that can match an empty string), which lets the regex engine try so many matching paths that the search takes too long to complete.
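To see the blow-up on a small scale, the sketch below times only the quantified group and separator from the question against short inputs that cannot match. The helper name time_failed_search and the test inputs are made up for illustration, and the lengths are kept small on purpose, since each extra word character roughly doubles the work.

```python
import re
import time

# The problematic part of the original pattern: a quantified group whose
# pieces can split a run of word characters in many different ways.
SLOW = re.compile(r"(\w+'?\s*)+[-|~]")


def time_failed_search(n):
    """Time a search over n word characters followed by '!' (hypothetical helper)."""
    text = "a" * n + "!"  # no '-', '|' or '~', so the search must fail
    start = time.perf_counter()
    result = SLOW.search(text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for n in (8, 12, 16):
        result, elapsed = time_failed_search(n)
        print(n, result, f"{elapsed:.4f}s")
```

The elapsed time should grow sharply as n increases, which is why the much longer string in the question appears to hang.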
I suggest unrolling the problematic group so that ' or \s becomes obligatory, and wrapping that part in an optional group:
(\w+(?:['\s]+\w+)*)\s*[-~]\s*(\$?\d+(?:\.\d+)?\$?)
^^^^^^^^^^^^^^^^^^^***
See the regex demo
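Here, (\w+(?:['\s]+\w+)*) matches 1+ word chars, and then 0+ sequences of 1+ ' or whitespace chars followed by 1+ word chars. Because the separator between words is now obligatory, a run of word chars can only be matched in one way, so backtracking over the group becomes linear rather than exponential and the regex engine fails much faster on a non-matching string.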
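A quick way to check this is to run the rewritten pattern on the string from the question. The sketch below is only an illustration: the second input, "Painted Octane - 1.50$", is a made-up example of the kind of text the pattern seems intended for, not something taken from the original post.

```python
import re

FIXED = re.compile(r"(\w+(?:['\s]+\w+)*)\s*[-~]\s*(\$?\d+(?:\.\d+)?\$?)")

# The string from the question: it has no separator or price, so there is
# no match, and the search returns promptly instead of hanging.
no_match = FIXED.search("**LOOKING FOR PAYPAL OFFERS ON THESE PAINTED UNCOMMONS**")
print(no_match)

# Hypothetical input with a name, a separator and a price.
match = FIXED.search("Painted Octane - 1.50$")
print(match.group(1), "|", match.group(2))
```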
The rest of the pattern:
\s*[-~]\s* - either - or ~, wrapped with 0+ whitespaces
(\$?\d+(?:\.\d+)?\$?) - Group 2, capturing:
    \$? - 1 or 0 $ symbols
    \d+ - 1+ digits
    (?:\.\d+)? - 1 or 0 sequences of:
        \. - a dot
        \d+ - 1+ digits
    \$? - 1 or 0 $ symbols
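The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is just one possible layout; the name PRICE_LINE and the sample input are assumptions made for the example.

```python
import re

PRICE_LINE = re.compile(
    r"""
    (\w+(?:['\s]+\w+)*)   # group 1: words joined by apostrophes or whitespace
    \s*[-~]\s*            # a hyphen or tilde, optionally padded with whitespace
    (                     # group 2: the price
        \$?               #   optional leading dollar sign
        \d+               #   one or more digits
        (?:\.\d+)?        #   optional fractional part: a dot, then digits
        \$?               #   optional trailing dollar sign
    )
    """,
    re.VERBOSE,
)

# Hypothetical input, used only to show what each group captures.
m = PRICE_LINE.search("Dragon's Breath ~ $12")
print(m.group(1))
print(m.group(2))
```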