Recrawl URLs with Nutch just for updated sites
Question
I crawled a URL with Nutch 2.1, and now I want to re-crawl pages after they have been updated. How can I do this? How can I know that a page has been updated?
Answer
Simply put, you can't. You have to re-crawl a page to determine whether it has changed. So, according to your needs, prioritize the pages/domains and re-crawl them at regular intervals. For that you need a job scheduler such as Quartz.
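Quartz is one option; as a minimal sketch of the same idea using only the JDK, a recurring re-crawl job can be driven by `ScheduledExecutorService`. Note that `recrawl()` here is a hypothetical placeholder (it does not call any Nutch API; in practice it might shell out to `bin/nutch` or trigger your crawl pipeline):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RecrawlScheduler {
    // Placeholder for the real work, e.g. launching a Nutch crawl script.
    static void recrawl(String seedUrl) {
        System.out.println("Re-crawling " + seedUrl);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        CountDownLatch ran = new CountDownLatch(3);

        // Fire every 100 ms for demo purposes; a real re-crawl period
        // would be hours or days, tuned per domain priority.
        scheduler.scheduleAtFixedRate(() -> {
            recrawl("http://example.com/");
            ran.countDown();
        }, 0, 100, TimeUnit.MILLISECONDS);

        ran.await();          // block until the job has fired three times
        scheduler.shutdownNow();
        System.out.println("done");
    }
}
```

A production setup would use Quartz (or cron) instead, so schedules survive restarts and can be expressed per domain.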
You need to write a function that compares the pages. However, Nutch stores pages as index files; in other words, it writes the fetched HTML into new binary files. I don't think comparing those binary files directly is feasible, because Nutch combines all crawl results within a single file. If you want to save pages in raw HTML format for comparison, see my answer to this question.
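Once you do have the raw HTML of successive fetches, one simple comparison function (my own sketch, not something Nutch provides) is to compare content digests:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PageChangeDetector {
    // Hash the raw HTML; equal digests mean the content is byte-identical.
    static String digest(String html) throws NoSuchAlgorithmException {
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(html.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Hypothetical snapshots of the same URL from two crawl runs.
        String firstFetch  = "<html><body>v1</body></html>";
        String secondFetch = "<html><body>v2</body></html>";

        boolean updated = !digest(firstFetch).equals(digest(secondFetch));
        System.out.println(updated ? "page updated" : "page unchanged");
    }
}
```

Be aware that pages with volatile markup (timestamps, ads, session tokens) will always hash differently, so you may need to strip or normalize those parts before digesting.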