How can I scrape a website with invalid HTML

Question

I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it but loses some info because of how it handles the invalid HTML. The built-in DOM parser with DOMXPath isn't working; it returns a blank result set. I was able to get it (DOMDocument and DOMXPath) working locally after running the fetched HTML through PHP Tidy, but PHP Tidy isn't installed on the server, and it's a shared hosting server, so I have no control over that. I tried HTMLPurifier, but that seems to be aimed at sanitizing user input, since it completely removes the doctype, head, and body tags.

Is there any kind of standalone alternative to PHP Tidy? I would really prefer to use DOMXPath to navigate around and grab what I need; it just seems to need some help cleaning up the HTML before it can parse it.
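
A minimal sketch of the Tidy-free route, assuming the page markup has already been fetched into a string (the fetch step and variable names here are illustrative; the accepted approach is shown in full in the answer below):

// Parse already-fetched, invalid HTML without PHP Tidy.
$html = file_get_contents('http://courseschedules.njit.edu/index.aspx?semester=2010f');

$dom = new DOMDocument;
libxml_use_internal_errors(true);  // collect parse errors quietly instead of emitting warnings
$dom->loadHTML($html);             // libxml's HTML parser tolerates broken markup
libxml_clear_errors();             // discard the collected errors

$xpath = new DOMXPath($dom);       // ready for XPath queries as usual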

I'm scraping this website: http://courseschedules.njit.edu/index.aspx?semester=2010f. For now I'm just trying to get all the course links.

Answer

DOM handles broken HTML fine if you use loadHTML or loadHTMLFile:

$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);   // collect libxml parse errors quietly instead of emitting warnings
$dom->loadHTMLFile('http://courseschedules.njit.edu/index.aspx?semester=2010f');
libxml_clear_errors();              // discard the errors collected while parsing the broken markup

$xPath = new DOMXPath($dom);
// Select every link inside the course-list sections
$links = $xPath->query('//div[@class="courseList_section"]//a');
foreach($links as $link) {
    printf("%s (%s)\n", $link->nodeValue, $link->getAttribute('href'));
}

This will output:

ACCT - Accounting (index.aspx?semester=2010f&subjectID=ACCT)
AD   - Art and Design (index.aspx?semester=2010f&subjectID=AD  )
ARCH - Architecture (index.aspx?semester=2010f&subjectID=ARCH)
... many more ...
TRAN - Transportation Engr (index.aspx?semester=2010f&subjectID=TRAN)
TUTR - Tutoring (index.aspx?semester=2010f&subjectID=TUTR)
URB  - Urban Systems (index.aspx?semester=2010f&subjectID=URB )

Using

echo $dom->saveXML($link), PHP_EOL;

inside the foreach loop will output the full outerHTML of each link.
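
Putting that together, a sketch of the adjusted loop (reusing $dom and $links from the code above):

foreach ($links as $link) {
    // saveXML() on a single node serializes the node itself plus its children,
    // so this prints each <a> element's outerHTML rather than just its text
    echo $dom->saveXML($link), PHP_EOL;
}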
