Designing a web crawler


Question

I have come across an interview question "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.

How does it all work from the beginning? Say Google started with some hub pages, say hundreds of them (how these hub pages were found in the first place is a different sub-question). As Google follows links from a page and so on, does it keep building a hash table to make sure that it doesn't follow previously visited pages?
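For concreteness, a minimal sketch of that hash-table idea might look like the following; `fetch_links` is a hypothetical helper standing in for downloading and parsing a page, and this is only an illustration, not how Google's crawler is actually built:

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=10_000):
    """Breadth-first crawl that skips already-visited URLs.

    fetch_links(url) is a hypothetical helper that downloads `url`
    and returns the outgoing links found on that page.
    """
    visited = set()              # the "hash table" of pages already handled
    frontier = deque(seed_urls)  # pages waiting to be crawled

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:       # skip pages we have followed before
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return visited
```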

What if the same page has two names (URLs), say in these days when we have URL shorteners, etc.?
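One common way to cope with a page having several names, offered here only as an illustration rather than as part of the accepted answer, is to canonicalize each URL and, for shorteners, follow redirects to the final address before running the seen-test. A rough standard-library sketch:

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

def canonicalize(url):
    """Normalize trivially different spellings of the same URL."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))

def resolve_redirects(url, timeout=5):
    """Follow HTTP redirects (e.g. from URL shorteners) to the final URL."""
    with urlopen(url, timeout=timeout) as resp:  # urlopen follows redirects
        return resp.geturl()
```

Two genuinely different URLs that serve identical content are usually caught later by a content-seen test on a fingerprint of the downloaded document, as the paper quoted in the answer below describes.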

I have taken Google as an example. Though Google doesn't reveal how its web crawler algorithms and page ranking etc. work, any guesses?

Solution

If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern scraper:

In the course of extracting links, any Web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must be performed on each extracted link before adding it to the URL frontier. (An alternative design would be to instead perform the URL-seen test when the URL is removed from the frontier, but this approach would result in a much larger frontier.)
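Expressed as a minimal sketch (names are mine, not Mercator's), the quoted test-before-enqueue policy looks roughly like this:

```python
from collections import deque

def add_to_frontier(link, url_seen, frontier):
    """URL-seen test performed *before* the link joins the frontier.

    The quoted alternative (testing when a URL is popped off the
    frontier) also works, but duplicates pile up in the meantime and
    make the frontier much larger.
    """
    if link not in url_seen:
        url_seen.add(link)
        frontier.append(link)

# tiny usage example
url_seen, frontier = set(), deque()
for link in ["http://a.example/", "http://b.example/", "http://a.example/"]:
    add_to_frontier(link, url_seen, frontier)
# frontier now holds only the two distinct URLs
```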

To perform the URL-seen test, we store all of the URLs seen by Mercator in canonical form in a large table called the URL set. Again, there are too many entries for them all to fit in memory, so like the document fingerprint set, the URL set is stored mostly on disk.

To save space, we do not store the textual representation of each URL in the URL set, but rather a fixed-sized checksum. Unlike the fingerprints presented to the content-seen test’s document fingerprint set, the stream of URLs tested against the URL set has a non-trivial amount of locality. To reduce the number of operations on the backing disk file, we therefore keep an in-memory cache of popular URLs. The intuition for this cache is that links to some URLs are quite common, so caching the popular ones in memory will lead to a high in-memory hit rate.
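A toy version of that layering might look like the sketch below, where SQLite stands in for the on-disk URL set and an OrderedDict approximates the clock/LRU cache; the names, sizes, and storage choice are illustrative assumptions, not Mercator's implementation:

```python
from collections import OrderedDict
import sqlite3

class UrlSet:
    """Disk-backed set of URL checksums with a small in-memory cache of
    popular entries (SQLite stands in for the on-disk table; an
    OrderedDict approximates the clock/LRU replacement policy)."""

    def __init__(self, path="urlset.db", cache_size=2**18):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS urls (h TEXT PRIMARY KEY)")
        self.cache = OrderedDict()   # checksum -> None, kept in LRU order
        self.cache_size = cache_size

    def seen_or_add(self, checksum):
        """Return True if the checksum was already in the set, else add it."""
        if checksum in self.cache:                     # in-memory cache hit
            self.cache.move_to_end(checksum)
            return True
        row = self.db.execute("SELECT 1 FROM urls WHERE h = ?",
                              (checksum,)).fetchone()  # disk lookup
        if row is None:                                # genuinely new URL
            self.db.execute("INSERT INTO urls (h) VALUES (?)", (checksum,))
            self.db.commit()
        self._remember(checksum)
        return row is not None

    def _remember(self, checksum):
        self.cache[checksum] = None
        self.cache.move_to_end(checksum)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)             # evict least recently used
```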

In fact, using an in-memory cache of 2^18 entries and the LRU-like clock replacement policy, we achieve an overall hit rate on the in-memory cache of 66.2%, and a hit rate of 9.5% on the table of recently-added URLs, for a net hit rate of 75.7%. Moreover, of the 24.3% of requests that miss in both the cache of popular URLs and the table of recently-added URLs, about 1/3 produce hits on the buffer in our random access file implementation, which also resides in user-space. The net result of all this buffering is that each membership test we perform on the URL set results in an average of 0.16 seek and 0.17 read kernel calls (some fraction of which are served out of the kernel’s file system buffers). So, each URL set membership test induces one-sixth as many kernel calls as a membership test on the document fingerprint set. These savings are purely due to the amount of URL locality (i.e., repetition of popular URLs) inherent in the stream of URLs encountered during a crawl.

Basically, they hash every URL with a hashing function that yields an effectively unique, fixed-size checksum for each URL, and because the URL stream has a lot of locality, membership lookups stay cheap. Google even open-sourced a hashing function of its own: CityHash
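For illustration only, a fixed-size URL checksum can be produced with any strong hash; the sketch below uses the standard library's hashlib as a stand-in for CityHash (whose Python bindings are a separate third-party package):

```python
import hashlib

def url_fingerprint(url: str) -> str:
    """Fixed-size (64-bit, hex-encoded) checksum of a URL.

    hashlib.sha1 is only a stand-in for a fast non-cryptographic hash
    such as CityHash; collisions are not strictly impossible, just
    astronomically unlikely at crawl scale.
    """
    return hashlib.sha1(url.encode("utf-8")).digest()[:8].hex()
```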

WARNING!
They might also be talking about bot traps!!! A bot trap is a section of a page that keeps generating new links with unique URLs and you will essentially get trapped in an "infinite loop" by following the links that are being served by that page. This is not exactly a loop, because a loop would be the result of visiting the same URL, but it's an infinite chain of URLs which you should avoid crawling.

Update 12/13/2012 - the day after the world was supposed to end :)

Per Fr0zenFyr's comment: if one uses the AOPIC algorithm for selecting pages, then it's fairly easy to avoid bot traps of the infinite-loop kind. Here is a summary of how AOPIC works (a small code sketch follows the list):

  1. Get a set of N seed pages.
  2. Allocate X amount of credit to each page, such that each page has X/N credit (i.e. equal amount of credit) before crawling has started.
  3. Select a page P, where P has the highest amount of credit (or if all pages have the same amount of credit, then crawl a random page).
  4. Crawl page P (let's say that P had 100 credits when it was crawled).
  5. Extract all the links from page P (let's say there are 10 of them).
  6. Set the credits of P to 0.
  7. Take a 10% "tax" and allocate it to a Lambda page.
  8. Allocate an equal amount of credit to each link found on page P, out of P's original credit minus the tax: so (100 (P's credits) - 10 (the 10% tax)) / 10 (links) = 9 credits per link.
  9. Repeat from step 3.
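Below is a toy, self-contained rendering of that credit/tax loop; `get_links` is a hypothetical helper for fetching and parsing a page, the virtual Lambda page is just a sentinel key, and none of this is the AOPIC paper's actual code:

```python
import random

def aopic_crawl(seed_pages, get_links, iterations,
                total_credit=100.0, tax_rate=0.10):
    """Toy AOPIC loop following the steps listed above."""
    LAMBDA = "__lambda__"                        # virtual tax-collecting page
    credit = {p: total_credit / len(seed_pages) for p in seed_pages}
    credit[LAMBDA] = 0.0

    for _ in range(iterations):
        # Step 3: pick the page with the most credit (random tie-break).
        best = max(credit, key=lambda p: (credit[p], random.random()))

        if best == LAMBDA:
            # "Crawl" Lambda: redistribute its credit evenly, no HTTP request.
            share = credit[LAMBDA] / (len(credit) - 1)
            for p in credit:
                if p != LAMBDA:
                    credit[p] += share
            credit[LAMBDA] = 0.0
            continue

        earned = credit[best]
        credit[best] = 0.0                       # step 6
        credit[LAMBDA] += earned * tax_rate      # step 7: pay the tax
        links = get_links(best)                  # steps 4-5: real crawl + parse
        if links:
            share = earned * (1 - tax_rate) / len(links)
            for link in links:
                credit[link] = credit.get(link, 0.0) + share   # step 8
    return credit
```

In a toy run where a "trap" cluster links only to itself, its credit should steadily drain to the Lambda page, which is exactly the behaviour described next.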

Since the Lambda page continuously collects tax, eventually it will be the page with the largest amount of credit and we'll have to "crawl" it. I say "crawl" in quotes, because we don't actually make an HTTP request for the Lambda page, we just take its credits and distribute them equally to all of the pages in our database.

Since bot traps only give credits to internal links and rarely get credit from the outside, they will continually leak credits (via taxation) to the Lambda page. The Lambda page will distribute those credits evenly to all of the pages in the database, and upon each cycle the bot-trap page will lose more and more credit, until it has so little that it almost never gets crawled again. This will not happen with good pages, because they often get credits from back-links found on other pages. This also results in a dynamic page rank: you will notice that any time you take a snapshot of your database and order the pages by the amount of credit they have, they will most likely be ordered roughly according to their true page rank.

This only avoids bot traps of the infinite-loop kind, but there are many other bot traps you should watch out for, and there are ways to get around them too.
