Recrawl URLs with Nutch just for updated sites
Question
I crawled a URL with Nutch 2.1, and now I want to re-crawl pages after they have been updated. How can I do this? How can I know that a page has been updated?
Answer
Simply put, you can't. You have to re-crawl a page to determine whether it has changed. So, according to your needs, prioritize the pages/domains and re-crawl them at regular intervals. For that you need a job scheduler such as Quartz.
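Quartz is one option; as a minimal sketch of the same idea using only the JDK, a recurring re-crawl job can be driven by `ScheduledExecutorService`. Note that `recrawl()` here is a hypothetical placeholder (it does not call any Nutch API; in practice it might shell out to `bin/nutch` or trigger your crawl pipeline):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RecrawlScheduler {
    // Placeholder for the real work, e.g. launching a Nutch crawl script.
    static void recrawl(String seedUrl) {
        System.out.println("Re-crawling " + seedUrl);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        CountDownLatch ran = new CountDownLatch(3);

        // Fire every 100 ms for demo purposes; a real re-crawl period
        // would be hours or days, tuned per domain priority.
        scheduler.scheduleAtFixedRate(() -> {
            recrawl("http://example.com/");
            ran.countDown();
        }, 0, 100, TimeUnit.MILLISECONDS);

        ran.await();          // block until the job has fired three times
        scheduler.shutdownNow();
        System.out.println("done");
    }
}
```

A production setup would use Quartz (or cron) instead, so schedules survive restarts and can be expressed per domain.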
You need to write a function that compares the pages. However, Nutch stores pages as index files; in other words, it writes the fetched HTML into new binary files. I don't think comparing those binary files directly is feasible, because Nutch combines all crawl results within a single file. If you want to save pages in raw HTML format for comparison, see my answer to this question.
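Once you do have the raw HTML of successive fetches, one simple comparison function (my own sketch, not something Nutch provides) is to compare content digests:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PageChangeDetector {
    // Hash the raw HTML; equal digests mean the content is byte-identical.
    static String digest(String html) throws NoSuchAlgorithmException {
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(html.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Hypothetical snapshots of the same URL from two crawl runs.
        String firstFetch  = "<html><body>v1</body></html>";
        String secondFetch = "<html><body>v2</body></html>";

        boolean updated = !digest(firstFetch).equals(digest(secondFetch));
        System.out.println(updated ? "page updated" : "page unchanged");
    }
}
```

Be aware that pages with volatile markup (timestamps, ads, session tokens) will always hash differently, so you may need to strip or normalize those parts before digesting.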