Recrawl URL with Nutch just for updated sites


Question


I crawled a URL with Nutch 2.1, and now I want to re-crawl the pages after they have been updated. How can I do this? How can I tell that a page has been updated?

Answer


Simply put, you can't know in advance. You have to recrawl a page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them on a schedule. For that you need a job scheduler such as Quartz.
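To decide how often each page deserves a recrawl, a common trick (the same idea behind Nutch's own AdaptiveFetchSchedule) is to shorten the interval for pages that changed on the last fetch and back off for pages that didn't. Below is a minimal, hypothetical sketch of that logic; the class name, factors, and bounds are illustrative assumptions, not Nutch or Quartz API:

```java
import java.time.Duration;

// Hypothetical adaptive recrawl planner: pages that changed last time are
// fetched sooner, unchanged pages are fetched less often. The 0.5/1.5
// factors and the 1h/30d bounds are arbitrary illustrative values.
public class RecrawlPlanner {
    static final Duration MIN = Duration.ofHours(1);
    static final Duration MAX = Duration.ofDays(30);

    /** Next fetch interval, given the current one and whether the page changed. */
    public static Duration nextInterval(Duration current, boolean changed) {
        double factor = changed ? 0.5 : 1.5; // shrink on change, grow otherwise
        Duration next = Duration.ofSeconds((long) (current.getSeconds() * factor));
        if (next.compareTo(MIN) < 0) return MIN; // clamp to bounds
        if (next.compareTo(MAX) > 0) return MAX;
        return next;
    }

    public static void main(String[] args) {
        Duration d = Duration.ofDays(1);
        System.out.println(nextInterval(d, false).toHours()); // unchanged: 36
        System.out.println(nextInterval(d, true).toHours());  // changed: 12
    }
}
```

A Quartz (or cron) job would then periodically pick the pages whose next-fetch time has passed and feed them back into the crawl.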

You need to write a function that compares the pages. However, Nutch saves pages as index files; in other words, it generates new binary files to store the HTML. I don't think it's possible to compare those binary files, as Nutch combines all crawl results within a single file. If you want to save pages in raw HTML format so you can compare them, see my answer to this question: http://stackoverflow.com/questions/10007178/how-do-i-save-the-origin-html-file-with-apache-nutch/10060160#10060160
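Rather than diffing raw HTML, it is usually enough to store a digest of each fetched page and compare digests on the next crawl (Nutch's Signature classes, such as MD5Signature, are built on the same idea). A minimal sketch, assuming you keep the previous digest per URL; the class and method names here are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical change detector: keep a SHA-256 digest of each fetched page
// and compare it on the next crawl instead of storing the raw HTML.
public class PageChangeDetector {

    /** SHA-256 hex digest of a page body. */
    public static String digest(String html) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(html.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    /** True if the newly fetched body differs from the stored digest. */
    public static boolean hasChanged(String storedDigest, String newHtml) {
        return !digest(newHtml).equals(storedDigest);
    }

    public static void main(String[] args) {
        String v1 = "<html><body>price: 10</body></html>";
        String v2 = "<html><body>price: 12</body></html>";
        String stored = digest(v1);
        System.out.println(hasChanged(stored, v1)); // false
        System.out.println(hasChanged(stored, v2)); // true
    }
}
```

Note that a digest flags any byte-level change, including timestamps or rotating ads, so you may want to strip volatile markup before hashing.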

