Recursive web download following links according to DOM criteria

Views: 69 Published: 2020/10/25 20:38:08 Tags: python ruby perl dom download


Question

To be more precise, the content is organized in a hierarchical manner, but the URLs are not. The URL space is flat, making it look like everything is in the same directory. (In reality, there probably isn't a directory; I guess things are coming out of some other database; but that's not relevant here.)

So if you want to download part of MSDN, say, the NMake manual, you can't just recursively download everything below a given directory, because that would be all of MSDN: too much for your hard drive and bandwidth.

But you could write a script that looks at the DOM (HTML) and then follows and downloads only those links contained in certain navigational sections of the document, such as those with the CSS class attribute toc_children or toc_siblings, but not toc_parent.
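
As an illustration of that filtering step, here is a minimal sketch in Python using only the standard library. The names TocLinkParser and extract_links are hypothetical, and it assumes that the navigation blocks are ordinary elements whose class attribute contains toc_children or toc_siblings, as described above. It checks class names on the element and its ancestors rather than evaluating a real XPath expression or CSS selector.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Void elements never get a closing tag, so they are kept off the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TocLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that sit inside an element with a wanted class."""

    def __init__(self, base_url, wanted_classes):
        super().__init__()
        self.base_url = base_url
        self.wanted = set(wanted_classes)
        self.stack = []   # (tag, has_wanted_class) for every currently open element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        flagged = bool(self.wanted.intersection(classes))
        inside = flagged or any(flag for _, flag in self.stack)
        if tag == "a" and inside and attrs.get("href"):
            # Resolve relative hrefs against the URL of the page being parsed.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        if tag not in VOID_TAGS:
            self.stack.append((tag, flagged))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed <li>, <p>, etc.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def extract_links(html, base_url, wanted_classes=("toc_children", "toc_siblings")):
    """Return absolute URLs of links found inside the wanted navigation sections."""
    parser = TocLinkParser(base_url, wanted_classes)
    parser.feed(html)
    parser.close()
    return parser.links
```

Links under toc_parent are skipped simply because that class is not in the wanted set. Duplicates are not removed here; a crawler built on top of this would normally keep its own set of visited URLs.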

What you'd need is a downloader that lets you say:

$webclient->add_links( $xpath_expression ); # or
$webclient->add_links( $css_selector );

It shouldn't be too difficult to cobble something together using Perl, LWP and XML::LibXML (HTML parser), but maybe you know of a tool that allows you to do just that so I don't need to reinvent it.

It doesn't have to be Perl; any other language is fine too, and so is a ready-made program that has the flexibility required for this job.
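
For reference, the recursive part of such a downloader can be sketched as a breadth-first walk with a visited set. This is only an illustration in Python, not an existing tool: crawl and http_fetch are hypothetical names, and extract_links stands for any caller-supplied function that takes (html, base_url) and returns the absolute URLs selected by the DOM criteria. The max_pages limit is an added safety assumption so a wrong selector cannot pull in the whole site.

```python
from collections import deque
from urllib.parse import urldefrag
from urllib.request import urlopen


def http_fetch(url):
    """Download one page and decode it to text (assumes an HTML response)."""
    with urlopen(url, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(start_url, extract_links, fetch=http_fetch, max_pages=200):
    """Fetch start_url, then follow only the links chosen by extract_links.

    extract_links(html, base_url) is supplied by the caller and decides,
    based on the page's DOM, which links are worth following.
    Returns a dict mapping each downloaded URL to its HTML text.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            # Network and HTTP errors from urllib derive from OSError; skip the page.
            continue
        pages[url] = html
        for link in extract_links(html, url):
            link = urldefrag(link).url   # treat page#a and page#b as the same page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing fetch as a parameter keeps the traversal independent of the HTTP library, so it can be exercised against canned pages before being pointed at a real site. Writing the returned pages to disk and rewriting links for offline use are left out.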
