Recrawl URL with Nutch just for updated sites


Question


I crawled a URL with Nutch 2.1, and now I want to re-crawl the pages after they have been updated. How can I do this? How can I tell that a page has been updated?

Answer


Simply put, you can't know in advance. You have to recrawl a page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them on a schedule. For that you need a job scheduler such as Quartz.
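To decide how often each page deserves a recrawl, a common trick (the same idea behind Nutch's own AdaptiveFetchSchedule) is to shorten the interval for pages that changed on the last fetch and back off for pages that didn't. Below is a minimal, hypothetical sketch of that logic; the class name, factors, and bounds are illustrative assumptions, not Nutch or Quartz API:

```java
import java.time.Duration;

// Hypothetical adaptive recrawl planner: pages that changed last time are
// fetched sooner, unchanged pages are fetched less often. The 0.5/1.5
// factors and the 1h/30d bounds are arbitrary illustrative values.
public class RecrawlPlanner {
    static final Duration MIN = Duration.ofHours(1);
    static final Duration MAX = Duration.ofDays(30);

    /** Next fetch interval, given the current one and whether the page changed. */
    public static Duration nextInterval(Duration current, boolean changed) {
        double factor = changed ? 0.5 : 1.5; // shrink on change, grow otherwise
        Duration next = Duration.ofSeconds((long) (current.getSeconds() * factor));
        if (next.compareTo(MIN) < 0) return MIN; // clamp to bounds
        if (next.compareTo(MAX) > 0) return MAX;
        return next;
    }

    public static void main(String[] args) {
        Duration d = Duration.ofDays(1);
        System.out.println(nextInterval(d, false).toHours()); // unchanged: 36
        System.out.println(nextInterval(d, true).toHours());  // changed: 12
    }
}
```

A Quartz (or cron) job would then periodically pick the pages whose next-fetch time has passed and feed them back into the crawl.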

You need to write a function that compares the pages. However, Nutch saves pages as index files; in other words, it generates new binary files to store the HTML. I don't think it's possible to compare those binary files, as Nutch combines all crawl results within a single file. If you want to save pages in raw HTML format so you can compare them, see my answer to this question: http://stackoverflow.com/questions/10007178/how-do-i-save-the-origin-html-file-with-apache-nutch/10060160#10060160
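Rather than diffing raw HTML, it is usually enough to store a digest of each fetched page and compare digests on the next crawl (Nutch's Signature classes, such as MD5Signature, are built on the same idea). A minimal sketch, assuming you keep the previous digest per URL; the class and method names here are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical change detector: keep a SHA-256 digest of each fetched page
// and compare it on the next crawl instead of storing the raw HTML.
public class PageChangeDetector {

    /** SHA-256 hex digest of a page body. */
    public static String digest(String html) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(html.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    /** True if the newly fetched body differs from the stored digest. */
    public static boolean hasChanged(String storedDigest, String newHtml) {
        return !digest(newHtml).equals(storedDigest);
    }

    public static void main(String[] args) {
        String v1 = "<html><body>price: 10</body></html>";
        String v2 = "<html><body>price: 12</body></html>";
        String stored = digest(v1);
        System.out.println(hasChanged(stored, v1)); // false
        System.out.println(hasChanged(stored, v2)); // true
    }
}
```

Note that a digest flags any byte-level change, including timestamps or rotating ads, so you may want to strip volatile markup before hashing.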

