Recursive web download following links according to DOM criteria

Question

MSDN is a huge hierarchical documentation site.

To be more precise, the content is organized in a hierarchical manner, but the URLs are not. The URL space is flat, making it look like everything is in the same directory. (In reality, there probably isn't a directory; I guess things are coming out of some other database; but that's not relevant here.)

So if you want to download part of MSDN, say, the NMake manual, you can't just recursively download everything below a given directory, because that would be all of MSDN: too much for your hard drive and bandwidth.

But you could write a script that inspects the DOM (HTML) and then follows and downloads only the links contained in certain navigational sections of the document, such as those with the CSS class toc_children or toc_siblings, but not toc_parent.
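To make the idea concrete, here is a minimal sketch in Python (the question allows any language) that uses only the standard library. The names `ScopedLinkParser` and `extract_links` are hypothetical, and the sketch assumes the navigation links sit somewhere inside elements whose `class` attribute contains toc_children or toc_siblings; the real MSDN markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements that never have a closing tag; they are kept off the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ScopedLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that sit inside elements of the wanted classes."""

    def __init__(self, base_url, wanted_classes):
        super().__init__()
        self.base_url = base_url
        self.wanted = set(wanted_classes)
        self.stack = []   # (tag, carries_wanted_class) for every open element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        own = bool(self.wanted.intersection(classes))
        in_scope = own or any(flag for _, flag in self.stack)
        if tag == "a" and in_scope and attrs.get("href"):
            # Resolve relative hrefs against the URL of the page being parsed.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        if tag not in VOID_TAGS:
            self.stack.append((tag, own))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed <li> or <p>.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def extract_links(html, base_url, wanted_classes=("toc_children", "toc_siblings")):
    """Return absolute URLs of links found inside the wanted sections only."""
    parser = ScopedLinkParser(base_url, wanted_classes)
    parser.feed(html)
    parser.close()
    return parser.links
```

Links under toc_parent, or anywhere else outside the wanted sections, are simply never collected, which is what keeps the download from spreading to the whole site.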

What you'd need would be some downloader that allows you to say:

$webclient->add_links( $xpath_expression ); # or
$webclient->add_links( $css_selector );
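One possible reading of that wished-for interface, sketched in Python: a crawl loop that takes the link-selection rule as a parameter. Everything here is an assumption for illustration: `crawl`, `fetch`, `select_links`, and `max_pages` are hypothetical names, not part of any existing tool. `fetch(url)` is expected to return the page's HTML, and `select_links(html, url)` to return the absolute URLs worth following (for example, only those inside toc_children and toc_siblings).

```python
from collections import deque
from urllib.parse import urldefrag


def crawl(start_url, fetch, select_links, max_pages=100):
    """Breadth-first download that follows only the links select_links returns.

    fetch(url) -> HTML text; select_links(html, url) -> iterable of absolute URLs.
    Both callables are supplied by the caller (hypothetical interface).
    """
    pages = {}                 # url -> downloaded HTML, in download order
    queue = deque([start_url])
    seen = {start_url}         # URLs already queued, so each is fetched once
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for link in select_links(html, url):
            link = urldefrag(link).url   # drop "#fragment" so a page is not fetched twice
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing the selection rule in as a function keeps the loop independent of whether links are chosen by XPath, CSS selector, or class name. A real `fetch` could wrap `urllib.request.urlopen`; the `max_pages` cap is a safety net in case the selection rule turns out to be too permissive.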

It shouldn't be too difficult to cobble something together using Perl, LWP and XML::LibXML (HTML parser), but maybe you know of a tool that allows you to do just that so I don't need to reinvent it.

It doesn't have to be Perl; any other language is fine, too, as is a ready-made program flexible enough for this job.

Recommended answer

Check out the find_link function (and siblings) from WWW::Mechanize. It can use arbitrary criteria to find links including the "id" and "class" attributes.
