How can I scrape a website with invalid HTML
Question
I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it, but loses some info because of how it handles the invalid HTML. The built-in DOM parser with DOMXPath isn't working; it returns a blank result set. I was able to get it (DOMDocument and DOMXPath) working locally after running the fetched HTML through PHP Tidy, but PHP Tidy isn't installed on the server, and it's a shared hosting server, so I have no control over that. I tried HTMLPurifier, but that just seems to be for securing user input, since it completely removes the doctype, head, and body tags.
Is there any kind of standalone alternative to PHP Tidy? I would really prefer to use DOMXPath to navigate around and grab what I need; it just seems to need some help cleaning up the HTML before it can parse it.
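For reference, the standalone route is DOMDocument's own HTML parser: with libxml warnings suppressed, it repairs invalid markup as it parses, so no Tidy pass is needed before DOMXPath. A minimal sketch, using a made-up broken fragment (unclosed tags, no doctype, head, or body):

```php
<?php
// Hypothetical invalid markup: unclosed anchors, no doctype/head/body.
$html = '<div class="courses"><a href="a.aspx">Course A<a href="b.aspx">Course B';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // record parse errors instead of emitting warnings
$dom->loadHTML($html);            // libxml repairs the broken markup as it parses
libxml_clear_errors();            // discard the recorded errors

// DOMXPath now works against the repaired tree.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//div[@class="courses"]//a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}
echo implode("\n", $hrefs), "\n";
```

Both anchors survive the repair, so this prints `a.aspx` and `b.aspx` even though neither tag was closed in the input.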
I'm scraping this website: http://courseschedules.njit.edu/index.aspx?semester=2010f. For now I'm just trying to get all the course links.
Answer
DOM handles broken HTML fine if you use loadHTML or loadHTMLFile:
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE); // record parse errors instead of emitting warnings
$dom->loadHTMLFile('http://courseschedules.njit.edu/index.aspx?semester=2010f');
libxml_clear_errors();            // discard the errors the invalid HTML produced
$xPath = new DOMXPath($dom);
$links = $xPath->query('//div[@class="courseList_section"]//a');
foreach ($links as $link) {
    printf("%s (%s)\n", $link->nodeValue, $link->getAttribute('href'));
}
This outputs:
ACCT - Accounting (index.aspx?semester=2010f&subjectID=ACCT)
AD - Art and Design (index.aspx?semester=2010f&subjectID=AD )
ARCH - Architecture (index.aspx?semester=2010f&subjectID=ARCH)
... many more ...
TRAN - Transportation Engr (index.aspx?semester=2010f&subjectID=TRAN)
TUTR - Tutoring (index.aspx?semester=2010f&subjectID=TUTR)
URB - Urban Systems (index.aspx?semester=2010f&subjectID=URB )
Using echo $dom->saveXML($link), PHP_EOL; in the foreach loop will output the full outerHTML of the links.
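To illustrate (with a hypothetical one-link fragment rather than the live page), saveXML() called with a single node serializes that node and everything inside it, i.e. its outerHTML:

```php
<?php
$dom = new DOMDocument;
libxml_use_internal_errors(true); // tolerate the fragment's missing doctype/head/body
$dom->loadHTML('<p><a href="index.aspx?semester=2010f">ACCT - Accounting</a></p>');
libxml_clear_errors();

// Grab the first (only) anchor and serialize just that node.
$link = $dom->getElementsByTagName('a')->item(0);
echo $dom->saveXML($link), PHP_EOL;
```

This prints the whole `<a ...>...</a>` element, whereas $link->nodeValue alone would give only the text content.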