How can I bring Google-like recrawling to my application (web or console)?

Question

How can I implement Google-like recrawling in my application (web or console)? I only need to recrawl the pages that were updated after a particular date.

The LastModified header in the System.Net.WebResponse gives only the current date of the server. For example, if I download a page with HttpWebRequest on 27 January 2012 and check the LastModified date in the header, it shows the server's current time when the page was served, which in this case is just 27 January 2012.

Can anyone suggest any other methods?

Recommended answer

First, what you're trying to do is very difficult, and there are many research-level papers that try to address it (I link to a few of them further down). There is no way to tell whether a site has changed without crawling it, although you can take shortcuts such as checking the Content-Length response header without downloading the rest of the page. This lets your system save on traffic, but it won't really solve your problem, because a page's content can change while its length stays the same.
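As a rough illustration of the Content-Length shortcut, here is a minimal Python sketch (the question is about .NET, but the idea carries over). It assumes the server answers HEAD requests and reports a Content-Length; many servers do neither reliably. The function names fetch_content_length and probably_changed are hypothetical and chosen only for this example.

```python
from typing import Optional
from urllib.request import Request, urlopen


def fetch_content_length(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request and return the Content-Length, or None if absent.

    Assumption: the server supports HEAD. Only headers are transferred,
    so the page body is never downloaded.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None and value.isdigit() else None


def probably_changed(previous: Optional[int], current: Optional[int]) -> bool:
    """Compare the stored length with the new one.

    A missing length is treated as "unknown", which forces a full recrawl.
    Equal lengths do NOT prove the content is unchanged.
    """
    if previous is None or current is None:
        return True
    return previous != current
```

A crawler would store the last seen length per URL and download the page again only when probably_changed returns True. This saves traffic but still misses edits that keep the length constant.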

Second, since you're concerned about content, the Last-Modified header field will not be very useful to you; I would even go as far as to say that it will not be useful at all.
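If the headers cannot be trusted, one common alternative (not something the answer itself prescribes) is to compare a hash of the downloaded body against the hash stored from the previous crawl. The sketch below assumes you already have the page body as bytes; content_fingerprint and content_changed are hypothetical names used only for this example.

```python
import hashlib
from typing import Optional


def content_fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of the raw page body."""
    return hashlib.sha256(body).hexdigest()


def content_changed(stored_fingerprint: Optional[str], body: bytes) -> bool:
    """True if the body differs from the previous crawl.

    A page with no stored fingerprint (never crawled) counts as changed.
    """
    return stored_fingerprint != content_fingerprint(body)
```

Note that this still requires downloading the page, and that pages with embedded timestamps, ads, or session tokens will appear to change on every crawl unless that noise is stripped before hashing.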

And third, what you're describing has somewhat conflicting requirements: you're interested in crawling only the pages that have updated content, and that's not exactly how Google does things (yet you want Google-like crawling). Google's crawling is focused on providing the freshest content for the most frequently searched/visited websites. For example, Google has very little interest in frequently crawling a website that updates its content twice a day but has 10 visitors a day; it is more interested in crawling a website that gets 10 million visitors a day, even if its content updates less frequently. It may well be true that websites that update their content frequently also have a lot of visitors, but from Google's perspective that's not exactly relevant.

If you have to discover new websites (coverage) and, at the same time, want the latest content of the sites you already know about (freshness), then you have conflicting goals (this is true for most crawlers, even Google's). Usually what ends up happening is that more coverage means less freshness, and more freshness means less coverage. If you're interested in balancing both, I suggest you read the following articles:

  • Web Crawler: An Overview
  • After that, I would recommend reading Adaptive On-Line Page Importance Computation
  • And finally: Scaling to 6 Billion Pages and Beyond

In summary, you have to crawl a website several times (maybe several hundred times) to build up a good measure of its history. Once you have a good set of historical measures, you use a predictive model to estimate when the website will change next, and you schedule a crawl for some time after the expected change.
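To make the scheduling idea concrete, here is a deliberately simple Python sketch. It is not the model from the linked papers: it assumes the "predictive model" is just the average gap between observed changes, and that the crawl history is sorted by time. The names Observation, mean_change_interval, and next_crawl_time, as well as the default_interval and slack parameters, are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# One entry per crawl: (crawl time, did the content differ from the previous crawl?)
Observation = Tuple[datetime, bool]


def mean_change_interval(history: List[Observation]) -> Optional[timedelta]:
    """Average gap between crawls that observed a change, or None if too few."""
    change_times = [when for when, changed in history if changed]
    if len(change_times) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(change_times, change_times[1:])]
    return sum(gaps, timedelta()) / len(gaps)


def next_crawl_time(
    history: List[Observation],
    default_interval: timedelta = timedelta(days=1),
    slack: timedelta = timedelta(hours=1),
) -> datetime:
    """Schedule the next crawl shortly after the next expected change."""
    if not history:
        raise ValueError("at least one crawl is needed to schedule the next one")
    interval = mean_change_interval(history)
    if interval is None:
        interval = default_interval  # not enough history yet
    change_times = [when for when, changed in history if changed]
    last_reference = change_times[-1] if change_times else history[-1][0]
    return last_reference + interval + slack
```

For example, if changes were observed on 1, 3, and 5 January 2012, the average gap is two days, so the next crawl is scheduled for 7 January at 01:00. With only a handful of observations the estimate is very noisy, which is why the answer suggests collecting many crawls first.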
