Strategy for how to crawl/index frequently updated webpages?

Question


I'm trying to build a very small, niche search engine, using Nutch to crawl specific sites. Some of the sites are news/blog sites. If I crawl, say, techcrunch.com, and store and index their frontpage or any of their main pages, then within hours my index for that page will be out of date.

Does a large search engine such as Google have an algorithm to re-crawl frequently updated pages very frequently, hourly even? Or does it just score frequently updated pages very low so they don't get returned?

How can I handle this in my own application?

Solution

Good question. This is actually an active topic in the WWW research community. The technique involved is called Re-crawl Strategy or Page Refresh Policy.

As far as I know, three different factors have been considered in the literature:

  • Change frequency (how often the content of a web page is updated)
    • [1]: Formalizes the notion of "freshness" of data and uses a Poisson process to model the change of web pages (a small numerical sketch of this model follows the list)
    • [2]: Frequency estimator
    • [3]: More on scheduling policies
  • Relevance (how much influence the updated page content has on search results)
    • [4]: Maximizes the quality of the user experience for those who query the search engine
    • [5]: Determines the (nearly) optimal crawling frequencies
  • Information Longevity (the lifetimes of content fragments that appear and disappear from web pages over time, which is shown to be not strongly correlated with change frequency)
    • [6]: Distinguishes between ephemeral and persistent content
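
To make the change-frequency factor concrete, here is a small numerical sketch of the Poisson change model used in [1]. The function and variable names are mine; the formula is just the time-average of P(no change since last crawl) = exp(-λt) over one crawl interval.

    import math

    def expected_freshness(change_rate: float, crawl_rate: float) -> float:
        """Time-averaged probability that the cached copy is still up to date.

        Assumes the page changes as a Poisson process with `change_rate`
        changes per day and is re-crawled `crawl_rate` times per day,
        i.e. once every 1 / crawl_rate days.
        """
        ratio = change_rate / crawl_rate  # expected changes per crawl interval
        return (1.0 - math.exp(-ratio)) / ratio

    # A page that changes about 3 times a day:
    print(expected_freshness(3.0, 1.0))   # crawled daily         -> ~0.32 fresh
    print(expected_freshness(3.0, 12.0))  # crawled every 2 hours -> ~0.88 fresh

The point of the model is that freshness depends only on the ratio of change rate to crawl rate, which is why estimating the change rate well (the topic of [2]) matters so much.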

You might want to decide which factor is most important for your application and users. Then you can check the references below for more details.


Edit: I briefly discuss the frequency estimator mentioned in [2] to get you started. Based on this, you should be able to figure out what might be useful to you in the other papers. :)

Please read the paper in the order I point out below. It should not be too hard to understand as long as you know some probability and stats 101 (and maybe much less if you just take the estimator formula as given):

Step 1. Please go to Section 6.4 -- Application to a Web crawler. Here Cho listed 3 approaches to estimate the web page change frequency.

  • Uniform policy: A crawler revisits all pages at the frequency of once every week.
  • Naive policy: In the first 5 visits, a crawler visits each page at the frequency of once every week. After the 5 visits, the crawler estimates the change frequencies of the pages using the naive estimator (Section 4.1); a minimal sketch of this policy follows the list.
  • Our policy: The crawler uses the proposed estimator (Section 4.2) to estimate change frequency.
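
Here is a minimal sketch of the naive policy in Python; the class, the field names, the weekly default, and the clamping bounds are my own choices, not something taken from the paper or from Nutch.

    from dataclasses import dataclass

    WEEK_DAYS = 7.0

    @dataclass
    class PageSchedule:
        """Per-page revisit state for the naive policy sketched above."""
        visits: int = 0              # n: number of fetches so far
        changes: int = 0             # X: fetches on which a change was detected
        monitored_days: float = 0.0  # T: total time the page has been monitored
        interval_days: float = WEEK_DAYS

        def record_visit(self, changed: bool) -> None:
            self.visits += 1
            self.changes += int(changed)
            self.monitored_days += self.interval_days
            if self.visits >= 5 and self.changes > 0:
                # Naive X/T estimate (Section 4.1): detected changes per day.
                rate = self.changes / self.monitored_days
                # Revisit roughly once per expected change, clamped to [6 h, 30 d].
                self.interval_days = min(max(1.0 / rate, 0.25), 30.0)

    # A page that changed on 4 of its first 6 visits:
    page = PageSchedule()
    for changed in [True, False, True, True, False, True]:
        page.record_visit(changed)
    print(round(page.interval_days, 2))  # next revisit interval, in days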

Step 2. The naive policy. Please go to section 4. You will read:

Intuitively, we may use X/T (X: the number of detected changes, T: the monitoring period) as the estimated frequency of change.

The subsequent Section 4.1 then proves that this estimator is biased, inconsistent, and inefficient.
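
To see the bias concretely, here is a quick simulation of my own (not from the paper): each visit only reveals whether the page changed since the last visit, so several changes inside one interval are counted as one, and X/T systematically underestimates the true rate.

    import math, random

    def naive_estimate(true_rate: float, interval: float, visits: int, seed: int = 0) -> float:
        """X/T estimate of the change rate on synthetic data.

        The page is assumed to change as a Poisson process with `true_rate`
        changes per day and is checked every `interval` days; a check only
        reveals whether at least one change happened since the last visit.
        """
        rng = random.Random(seed)
        p_change = 1.0 - math.exp(-true_rate * interval)  # P(>= 1 change per interval)
        detected = sum(rng.random() < p_change for _ in range(visits))
        return detected / (visits * interval)             # X / T

    # True rate: 3 changes/day, checked once a day.
    print(naive_estimate(true_rate=3.0, interval=1.0, visits=10_000))  # ~0.95, not 3.0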

Step 3. The improved estimator. Please go to Section 4.2. The new estimator is

    \hat r = -\log( (\bar X + 0.5) / (n + 0.5) )

where \bar X is n - X (the number of accesses on which the element did not change) and n is the number of accesses; \hat r estimates the number of changes per access interval. So just take this formula and estimate the change frequency. You don't need to understand the proof in the rest of the sub-section.
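
A minimal sketch of that estimator, using my own function names; converting the per-interval estimate into changes per day by dividing by the visit interval is my addition and assumes evenly spaced visits.

    import math

    def improved_estimate(n: int, x: int, interval_days: float) -> float:
        """r_hat = -log((X_bar + 0.5) / (n + 0.5)), with X_bar = n - x.

        n: number of accesses; x: accesses on which a change was detected.
        r_hat is the estimated number of changes per access interval; divide
        by the interval length to get changes per day.
        """
        x_bar = n - x
        r_hat = -math.log((x_bar + 0.5) / (n + 0.5))  # changes per visit interval
        return r_hat / interval_days

    # Same synthetic setup as the naive example above: true rate 3 changes/day,
    # daily checks, a change detected on roughly 95% of visits.
    print(improved_estimate(n=10_000, x=9_500, interval_days=1.0))  # ~3.0 changes/day

Comparing the two printouts shows why the improved estimator matters: for pages that change more often than they are crawled, the naive estimate saturates near one change per visit, while this one recovers the true rate.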

Step 4. There are some tricks and useful techniques discussed in Section 4.3 and Section 5 that might be helpful to you. Section 4.3 discusses how to deal with irregular access intervals. Section 5 answers the question: when the last-modification date of an element is available, how can we use it to estimate the change frequency? The paper gives an estimator based on last-modification dates, and the explanation that follows Fig. 10 there is very clear.

Step 5. If you are interested, you can take a look at the experimental setup and results in Section 6.

So that's it. If you feel more confident now, go ahead and try the freshness paper in [1].


References

[1] http://oak.cs.ucla.edu/~cho/papers/cho-tods03.pdf

[2] http://oak.cs.ucla.edu/~cho/papers/cho-freq.pdf

[3] http://hal.inria.fr/docs/00/07/33/72/PDF/RR-3317.pdf

[4] http://wwwconference.org/proceedings/www2005/docs/p401.pdf

[5] http://www.columbia.edu/~js1353/pubs/wolf-www02.pdf

[6] http://infolab.stanford.edu/~olston/publications/www08.pdf
