使用包含(?:.| \ s)*?的模式进行正则表达式搜索需要越来越长的时间 [英] Regex search with pattern containing (?:.|\s)*? takes increasingly long time

查看:53
本文介绍了使用包含(?:.| \ s)*?的模式进行正则表达式搜索需要越来越长的时间的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的正则表达式花费的时间越来越长(第五次大约30秒),但需要进行大约500轮比赛.我怀疑灾难性的回溯.请帮忙!如何优化此正则表达式:

 字符串regex =< tr bgcolor = \" ffffff \> \\ s *?< td width = \" 20%\>< b>(((?:.|\\ s)+?):*?</b></td> \\ s *?< td width = \"80%\">((?? .. | \\ s)*?)(?=(?:</td> \\ s *?</tr> \\ s *?< tr bgcolor = \"ffffff \">)|(?\ s *?</tr> \\ s *?</table> \\ s *?< b>标签</b>))"; 

由于尚不清楚(我不好):我试图提取一个html格式的文档并通过提取两个搜索组并在其后添加格式来重新格式化.

解决方案

交替(?:.| \\ s)+?效率很低,因为它涉及太多的回溯.

基本上,此模式的所有变体都效率极低:(?:.| \ s)*?(?:.| \ n)*?(?:.| \ r \ n)*?,也有贪婪的对应对象((?:.| \ s)* (?:.| \ n)* (?:.| \ r \ n)* ).(.| \ s)*?可能是其中最差的.

为什么?

两个选项. \ s 可能在同一位置匹配相同的文本,两者都至少匹配常规空格.请参阅此演示,它需要3555个步骤来完成并.*? <(带有 s 修饰符)的href ="https://regex101.com/r/H1shYh/3" rel ="nofollow noreferrer">演示.>

Java中类似(?:.| \ n)*?/(?:.| \ n)* 的模式通常会导致 Java HTML解析器的帖子.

My regex is taking increasingly long to match (about 30 seconds the 5th time) but needs to be applied for around 500 rounds of matches. I suspect catastrophic backtracking. Please help! How can I optimize this regex:

String regex = "<tr bgcolor=\"ffffff\">\\s*?<td width=\"20%\"><b>((?:.|\\s)+?): *?</b></td>\\s*?<td width=\"80%\">((?:.|\\s)*?)(?=(?:</td>\\s*?</tr>\\s*?<tr bgcolor=\"ffffff\">)|(?:</td>\\s*?</tr>\\s*?</table>\\s*?<b>Tags</b>))";

EDIT: since it was not clear(my bad): i am trying to take a html formatted document and reformat by extracting the two search groups and adding formating afterwards.

解决方案

The alternation (?:.|\\s)+? is very inefficient, as it involves too much backtracking.

Basically, all variations of this pattern are extremely inefficient: (?:.|\s)*?, (?:.|\n)*?, (?:.|\r\n)*? and there greedy counterparts, too ((?:.|\s)*, (?:.|\n)*, (?:.|\r\n)*). (.|\s)*? is probably the worst of them all.

Why?

The two alternatives, . and \s may match the same text at the same location, the both match regular spaces at least. See this demo taking 3555 steps to complete and .*? demo (with s modifier) taking 1335 steps to complete.

Patterns like (?:.|\n)*? / (?:.|\n)* in Java often cause a Stack Overflow issue, and the main problem here is related to the use of alternation (that already alone causes backtracking) that matches char by char, and then the group is modified with a quantifier of unknown length. Although some regex engines can cope with this and do not throw errors, this type of pattern still causes slowdowns and is not recommended to use (only in ElasticSearch Lucene regex engine the (.|\n) is the only way to match any char).

Solution

If you want to match any characters including whitespace with regex, do it with

[\\s\\S]*?

Or enable singleline mode with (?s) (or Pattern.DOTALL Matcher option) and just use . (e.g. (?s)start(.*?)end).

NOTE: To manipulate HTML, use a dedicated parser, like jsoup. Here is an SO post discussing Java HTML parsers.

这篇关于使用包含(?:.| \ s)*?的模式进行正则表达式搜索需要越来越长的时间的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆