Re-crawling websites fast

Question

I am developing a system that has to track the content of a few portals and check for changes every night (for example, download and index new pages that were added during the day). The content of these portals will be indexed for searching. The problem is in re-crawling these portals: the first crawl of a portal takes very long (examples of portals: www.onet.pl, www.bankier.pl, www.gazeta.pl), and I want to re-crawl them as fast as possible, for example by checking the date of modification. However, when I used wget to download www.bankier.pl, it complained that there was no Last-Modified header. Is there any way to re-crawl so many sites? I have also tried Nutch, but its re-crawl script does not seem to work properly, or perhaps it also depends on this header (Last-Modified). Maybe there is a tool or crawler (like Nutch or similar) that can update already downloaded sites by adding the new ones?

Best regards, Wojtek

Recommended answer

I recommend using curl to fetch only the headers and check whether the Last-Modified header has changed.

Example:

 curl --head www.bankier.pl
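
Building on that answer, here is a minimal sketch of a nightly check. The portal list, the state directory, and the plain-HTTP URLs are illustrative assumptions, not part of the original answer:

    #!/bin/sh
    # Minimal sketch: re-crawl a portal only when its Last-Modified header changes.
    # The state directory and the portal list below are illustrative assumptions.
    mkdir -p state
    for url in www.onet.pl www.bankier.pl www.gazeta.pl; do
        # -s: silent, -I: send a HEAD request and print only the response headers
        new=$(curl -sI "http://$url" | grep -i '^Last-Modified:')
        file="state/$url"
        old=$(cat "$file" 2>/dev/null)
        if [ -z "$new" ]; then
            # As the question notes, some servers (e.g. www.bankier.pl) send
            # no Last-Modified header, so this check cannot be applied there.
            echo "$url: no Last-Modified header"
        elif [ "$new" != "$old" ]; then
            echo "$url: changed, schedule a re-crawl"
            printf '%s\n' "$new" > "$file"
        else
            echo "$url: unchanged, skip"
        fi
    done

For a single fetch, curl's --time-cond (-z) option and wget's --timestamping (-N) option can send a conditional If-Modified-Since request themselves, but as the question already points out, all of these approaches only work when the server actually sends a Last-Modified header.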
