How can I scrape a website with invalid HTML


Question


I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it but loses some info because of how it handles the invalid HTML. The built-in DOM parser with DOMXPath isn't working; it returns a blank result set. I was able to get it (DOMDocument and DOMXPath) working locally after running the fetched HTML through PHP Tidy, but PHP Tidy isn't installed on the server, and it's a shared hosting server, so I have no control over that. I tried HTMLPurifier, but that just seems to be for securing user input, since it completely removes the doctype, head, and body tags.


Is there any kind of standalone alternative to PHP Tidy? I would really prefer to use DOMXPath to navigate around and grab what I need; it just seems to need some help cleaning the HTML up before it can parse it.
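Roughly, the Tidy-based cleanup that works for me locally looks like the sketch below (the fetch via file_get_contents and the specific tidy options are just illustrative, not the exact code):

// Sketch only: requires the tidy extension, which the shared host lacks.
$html = file_get_contents('http://courseschedules.njit.edu/index.aspx?semester=2010f');

// Repair the broken markup into well-formed XHTML before parsing it.
$clean = tidy_repair_string($html, array(
    'output-xhtml' => true,  // emit well-formed XHTML
    'wrap'         => 0,     // don't re-wrap long lines
), 'utf8');

$dom = new DOMDocument;
$dom->loadHTML($clean);
$xPath = new DOMXPath($dom);
// ... query with XPath as needed ...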


I'm scraping this website: http://courseschedules.njit.edu/index.aspx?semester=2010f. For now I'm just trying to get all the course links.

Answer


DOM handles broken HTML fine if you use loadHTML or loadHTMLFile:

$dom = new DOMDocument;
// Suppress the parse warnings libxml would otherwise emit for the invalid markup.
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('http://courseschedules.njit.edu/index.aspx?semester=2010f');
libxml_clear_errors();

$xPath = new DOMXPath($dom);
$links = $xPath->query('//div[@class="courseList_section"]//a');
foreach ($links as $link) {
    printf("%s (%s)\n", $link->nodeValue, $link->getAttribute('href'));
}

will output

ACCT - Accounting (index.aspx?semester=2010f&subjectID=ACCT)
AD   - Art and Design (index.aspx?semester=2010f&subjectID=AD  )
ARCH - Architecture (index.aspx?semester=2010f&subjectID=ARCH)
... many more ...
TRAN - Transportation Engr (index.aspx?semester=2010f&subjectID=TRAN)
TUTR - Tutoring (index.aspx?semester=2010f&subjectID=TUTR)
URB  - Urban Systems (index.aspx?semester=2010f&subjectID=URB )

Using

echo $dom->saveXML($link), PHP_EOL;


in the foreach loop will output the full outerHTML of the links.
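In other words, the loop body can be swapped out like this (same $dom and $links as above):

foreach ($links as $link) {
    // saveXML() on a single node serializes that node together with its
    // children, i.e. the link's outerHTML rather than just its text.
    echo $dom->saveXML($link), PHP_EOL;
}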

