How can I scrape a website with invalid HTML
Question
I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it but loses some info because of how it handles the invalid HTML. The built-in DOM parser with DOMXPath isn't working; it returns a blank result set. I was able to get it (DOMDocument and DOMXPath) working locally after running the fetched HTML through PHP Tidy, but PHP Tidy isn't installed on the server, and it's a shared hosting server, so I have no control over that. I tried HTMLPurifier, but that seems to be meant only for sanitizing user input, since it completely removes the doctype, head, and body tags.
Is there any kind of standalone alternative to PHP Tidy? I would really prefer to use DOMXPath to navigate around and grab what I need; it just seems to need some help cleaning up the HTML before it can parse it.
I'm scraping this website: http://courseschedules.njit.edu/index.aspx?semester=2010f. For now I'm just trying to get all the course links.
Answer
DOM handles broken HTML fine if you use loadHTML or loadHTMLFile:
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);   // collect markup errors internally instead of emitting warnings
$dom->loadHTMLFile('http://courseschedules.njit.edu/index.aspx?semester=2010f');
libxml_clear_errors();              // discard the collected errors once parsing is done

$xPath = new DOMXPath($dom);
$links = $xPath->query('//div[@class="courseList_section"]//a');
foreach ($links as $link) {
    printf("%s (%s)\n", $link->nodeValue, $link->getAttribute('href'));
}
will output
ACCT - Accounting (index.aspx?semester=2010f&subjectID=ACCT)
AD - Art and Design (index.aspx?semester=2010f&subjectID=AD )
ARCH - Architecture (index.aspx?semester=2010f&subjectID=ARCH)
... many more ...
TRAN - Transportation Engr (index.aspx?semester=2010f&subjectID=TRAN)
TUTR - Tutoring (index.aspx?semester=2010f&subjectID=TUTR)
URB - Urban Systems (index.aspx?semester=2010f&subjectID=URB )
Using echo $dom->saveXML($link), PHP_EOL; inside the foreach loop will output the full outerHTML of each link.
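A sketch of that variant, again on a hypothetical inline fragment rather than the live URL:

```php
<?php
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTML('<div class="courseList_section">'
             . '<a href="index.aspx?semester=2010f&subjectID=ACCT">ACCT - Accounting</a></div>');
libxml_clear_errors();

$xPath = new DOMXPath($dom);
foreach ($xPath->query('//div[@class="courseList_section"]//a') as $link) {
    // saveXML() called with a node argument serializes just that node,
    // i.e. the element's outerHTML (with attributes XML-escaped)
    echo $dom->saveXML($link), PHP_EOL;
}
```

Note that saveXML() escapes special characters, so the & in the href comes out as &amp; in the serialized output.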