Catastrophic backtracking issue with HTML


Problem description


I'm trying to scrape a series of webpages using PHP, grabbing all of the content between the <div id="body"> tag and the earliest </div> tag. This is the regex that I'm using:

|(?<=div id="body">).*?</div>|s

This seems to be working perfectly fine for most of the pages I'm looking at. However, it's not returning anything for a few others. I plugged the regex into the regex101.com tester, and it told me that the problem was with catastrophic backtracking. I tried removing the lookbehind language, and even playing around with things like:

|id="body">.*?</div>|s

However, the problem persists. I've looked at some other questions about catastrophic backtracking, as well as the http://www.regular-expressions.info/catastrophic.html article, but I can't figure out how to apply their fixes to this particular case.

Solution

Regular expressions are known to cause catastrophic backtracking on large HTML inputs. In this case, the problem most likely lies in the combination of the look-behind and the lazy dot: each time the regex engine advances one character to the right, it must check whether that position is preceded by the specified substring, and, once the lazy .*? has started, whether it has consumed enough characters for </div> to match.

To see how this regex behaves step by step, look at the regex debugger section on regex101.
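If you want to confirm from PHP itself why a page returns nothing, one option is to check the return value of preg_match() and then ask preg_last_error() for the reason. The sketch below is only an illustration: the helper name matchBodyDiv() and the shape of the array it returns are made up for this example, and the pattern is the one from the question.

```php
<?php
// Hypothetical helper (not part of the original answer): runs the pattern
// from the question and reports why preg_match() produced no result.
function matchBodyDiv(string $html): array
{
    $pattern = '|(?<=div id="body">).*?</div>|s';
    $result = preg_match($pattern, $html, $matches);

    if ($result === false) {
        // false means the PCRE engine gave up; preg_last_error() says why.
        $code = preg_last_error();
        $reason = $code === PREG_BACKTRACK_LIMIT_ERROR
            ? 'backtrack limit exceeded'
            : 'PCRE error code ' . $code;

        return ['matched' => false, 'error' => $reason, 'text' => null];
    }

    // 1 means a match was found, 0 means the pattern simply did not match.
    return [
        'matched' => $result === 1,
        'error'   => null,
        'text'    => $result === 1 ? $matches[0] : null,
    ];
}

// Small input, so the match succeeds; a very large page may hit the error branch.
$outcome = matchBodyDiv('<p>intro</p><div id="body">Hello</div>');
echo $outcome['matched'] ? $outcome['text'] : 'no match', PHP_EOL;
```

Whether a given page actually hits the error branch depends on its size and on your PCRE settings (the pcre.backtrack_limit ini value, and whether the JIT is enabled), so treat this as a diagnostic aid rather than a fix.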

As to how to parse your HTML, PHP DOMDocument and DOMXPath are your best friends:

$html = "<<YOUR_HTML_STRING_HERE>>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Above is the DOM initialization from string example, below is parsing
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div[@id="body"]'); // Get all DIV tags with id=body

foreach($divs as $div) { 
  echo $dom->saveHTML($div); // Echo the HTML, can be added to array
}

See IDEONE demo
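Since you are scraping a series of pages, it may be convenient to wrap the same DOMDocument and DOMXPath approach in a function that returns the fragments as an array. The following is a minimal sketch under a few assumptions: the function name extractDivsById() is hypothetical, the id value is assumed to contain no double quotes (it is pasted straight into the XPath expression), and the libxml flags are left out so that libxml can add its own html/body wrapper around partial markup.

```php
<?php
// Hypothetical helper (illustrative name and signature): returns the outer
// HTML of every <div> whose id attribute equals $id.
function extractDivsById(string $html, string $id): array
{
    // loadHTML() rejects an empty string, so handle that case up front.
    if (trim($html) === '') {
        return [];
    }

    // Scraped pages are rarely valid HTML; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $fragments = [];

    // Assumption: $id contains no double quotes.
    foreach ($xpath->query('//div[@id="' . $id . '"]') as $div) {
        // saveHTML($node) serializes only that node and its descendants.
        $fragments[] = $dom->saveHTML($div);
    }

    return $fragments;
}

$page = '<html><body><div id="body">Hello <div>inner</div> world</div></body></html>';
foreach (extractDivsById($page, 'body') as $fragment) {
    echo $fragment, PHP_EOL;
}
```

Note one difference from the regex: the parser tracks nesting, so a <div> nested inside the target does not end the extracted fragment at its closing tag, whereas the lazy pattern stops at the earliest </div>.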
