Designing a web crawler


Question

I have come across an interview question "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.

How does it all begin? Say Google started with some hub pages, say hundreds of them (how these hub pages were found in the first place is a different sub-question). As Google follows links from a page and so on, does it keep building a hash table to make sure it doesn't follow previously visited pages?
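(For concreteness, the bookkeeping the question is asking about could look like the toy sketch below: a breadth-first crawler with an in-memory set of seen URLs. fetch_links is a hypothetical stand-in for the download-and-parse step, and a production crawler would need a disk-backed structure, as the answer below describes.)

    from collections import deque

    def crawl(seed_urls, fetch_links, max_pages=1000):
        """Breadth-first crawl that never revisits a URL it has already seen.

        fetch_links(url) is a placeholder for the real download/parse step and
        is assumed to return the out-links of the page at url."""
        seen = set(seed_urls)        # the "hash table" of URLs the crawler knows about
        frontier = deque(seed_urls)  # URLs waiting to be downloaded
        pages_fetched = 0
        while frontier and pages_fetched < max_pages:
            url = frontier.popleft()
            pages_fetched += 1
            for link in fetch_links(url):
                if link not in seen:      # URL-seen test before enqueueing
                    seen.add(link)
                    frontier.append(link)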

What if the same page has two names (URLs), say in these days of URL shorteners and the like?
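(Concretely, one common way to treat two spellings of the same page as a single entry is to canonicalize URLs before the seen-test; the normalization rules below are illustrative only, and shortened URLs would additionally need their redirect chain resolved so the final URL is the one recorded.)

    from urllib.parse import urlsplit, urlunsplit

    def canonicalize(url):
        """Reduce trivially different spellings of the same URL to one form."""
        scheme, netloc, path, query, _fragment = urlsplit(url.strip())
        scheme = scheme.lower()
        netloc = netloc.lower().rstrip(".")
        if path == "":
            path = "/"
        # Drop the fragment: it never changes the document the server returns.
        return urlunsplit((scheme, netloc, path, query, ""))

    # canonicalize("HTTP://Example.com") == canonicalize("http://example.com/#top")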

I have taken Google as an example. Google doesn't reveal how its web crawler algorithms and page ranking work, but any guesses?

Answer

If you want a detailed answer, take a look at section 3.8 of this paper (the Mercator crawler paper, quoted below), which describes the URL-seen test of a modern scraper:

In the course of extracting links, any Web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must be performed on each extracted link before adding it to the URL frontier. (An alternative design would be to instead perform the URL-seen test when the URL is removed from the frontier, but this approach would result in a much larger frontier.)

To perform the URL-seen test, we store all of the URLs seen by Mercator in canonical form in a large table called the URL set. Again, there are too many entries for them all to fit in memory, so like the document fingerprint set, the URL set is stored mostly on disk.

To save space, we do not store the textual representation of each URL in the URL set, but rather a fixed-sized checksum. Unlike the fingerprints presented to the content-seen test’s document fingerprint set, the stream of URLs tested against the URL set has a non-trivial amount of locality. To reduce the number of operations on the backing disk file, we therefore keep an in-memory cache of popular URLs. The intuition for this cache is that links to some URLs are quite common, so caching the popular ones in memory will lead to a high in-memory hit rate.

In fact, using an in-memory cache of 2^18 entries and the LRU-like clock replacement policy, we achieve an overall hit rate on the in-memory cache of 66.2%, and a hit rate of 9.5% on the table of recently-added URLs, for a net hit rate of 75.7%. Moreover, of the 24.3% of requests that miss in both the cache of popular URLs and the table of recently-added URLs, about 1/3 produce hits on the buffer in our random access file implementation, which also resides in user-space. The net result of all this buffering is that each membership test we perform on the URL set results in an average of 0.16 seek and 0.17 read kernel calls (some fraction of which are served out of the kernel's file system buffers). So, each URL set membership test induces one-sixth as many kernel calls as a membership test on the document fingerprint set. These savings are purely due to the amount of URL locality (i.e., repetition of popular URLs) inherent in the stream of URLs encountered during a crawl.
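A rough sketch of that two-level design, not Mercator's actual code: a small in-memory LRU cache of recently seen fingerprints sitting in front of an on-disk set, here simulated with an OrderedDict and a plain dbm file standing in for the backing disk structure.

    import dbm
    from collections import OrderedDict

    class UrlSeenTest:
        """URL-seen test: small in-memory LRU cache in front of an on-disk set."""
        def __init__(self, path, cache_size=2**18):
            self.disk = dbm.open(path, "c")   # stand-in for the on-disk URL set
            self.cache = OrderedDict()        # fingerprint -> None, kept in LRU order
            self.cache_size = cache_size

        def seen_before(self, fingerprint):
            """Return True if the fingerprint was already known; record it otherwise."""
            if fingerprint in self.cache:     # hit in the in-memory cache
                self.cache.move_to_end(fingerprint)
                return True
            key = fingerprint.encode()
            hit = key in self.disk            # miss: consult the disk-backed set
            if not hit:
                self.disk[key] = b""
            self.cache[fingerprint] = None    # cache it either way
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)  # evict the least recently used entry
            return hit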

Basically, they hash all of the URLs with a hashing function that makes collisions between distinct URLs vanishingly unlikely, and thanks to the locality of URLs, looking a URL up in the set becomes very fast. Google even open-sourced one of their hashing functions: CityHash.
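As a small illustration of the fixed-size checksum idea (CityHash itself ships as a C++ library; the standard-library hash below is just a stand-in):

    import hashlib

    def url_fingerprint(canonical_url: str) -> str:
        """64-bit hex fingerprint of a canonical URL (stand-in for a CityHash64-style checksum)."""
        return hashlib.blake2b(canonical_url.encode("utf-8"), digest_size=8).hexdigest()

    # url_fingerprint("http://example.com/") -> a fixed-length 16-hex-character string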

WARNING!
They might also be talking about bot traps!!! A bot trap is a section of a page that keeps generating new links with unique URLs and you will essentially get trapped in an "infinite loop" by following the links that are being served by that page. This is not exactly a loop, because a loop would be the result of visiting the same URL, but it's an infinite chain of URLs which you should avoid crawling.

Per Fr0zenFyr's comment: if one uses the AOPIC algorithm (Adaptive Online Page Importance Computation) for selecting pages, then it's fairly easy to avoid bot-traps of the infinite-loop kind. Here is a summary of how AOPIC works (a toy sketch follows the list):

  1. Get a set of N seed pages.
  2. Assign X amount of credit to the pages, so that each page has X/N credits (i.e. an equal amount) before crawling starts.
  3. Select a page P with the highest amount of credit (or, if all pages have the same amount of credit, crawl a random page).
  4. Crawl page P (let's say P had 100 credits when it was crawled).
  5. Extract all the links from page P (let's say there are 10 of them).
  6. Set P's credits to 0.
  7. Take a 10% "tax" and allocate it to a Lambda page.
  8. Allocate an equal amount of credit to each link found on page P out of P's original credit minus the tax: so (100 (P's credits) - 10 (10% tax)) / 10 (links) = 9 credits per link.
  9. Repeat from step 3.
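A toy, in-memory rendition of those steps, purely as a sketch: it assumes the link graph is already known (a real crawler discovers it while fetching) and that a page with no out-links hands its remaining credit to Lambda, which the summary above does not specify.

    def aopic_crawl(graph, iterations=1000, total_credit=10_000.0, tax_rate=0.10):
        """Toy AOPIC loop over an already-known link graph {url: [out-links]}."""
        credit = {url: total_credit / len(graph) for url in graph}  # step 2: equal credit
        lambda_credit = 0.0                                         # the virtual Lambda page
        for _ in range(iterations):
            page = max(credit, key=credit.get)                      # step 3: richest page
            links = graph.get(page, [])                             # steps 4-5: "crawl", extract links
            tax = credit[page] * tax_rate
            remainder = credit[page] - tax
            lambda_credit += tax                                    # step 7: 10% tax to Lambda
            credit[page] = 0.0                                      # step 6
            if links:                                               # step 8: split the rest over out-links
                share = remainder / len(links)
                for link in links:
                    if link in credit:                              # toy version: ignore unknown URLs
                        credit[link] += share
            else:
                lambda_credit += remainder                          # assumption: dead ends pay Lambda too
            if lambda_credit > max(credit.values()):                # Lambda became the richest "page":
                refund = lambda_credit / len(credit)                # redistribute evenly, no HTTP request
                for url in credit:
                    credit[url] += refund
                lambda_credit = 0.0
        return credit  # sorting pages by credit approximates their ranking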

Since the Lambda page continuously collects tax, eventually it will be the page with the largest amount of credit and we'll have to "crawl" it. I say "crawl" in quotes, because we don't actually make an HTTP request for the Lambda page, we just take its credits and distribute them equally to all of the pages in our database.

Since bot traps only give credits to internal links and rarely get credit from the outside, they will continually leak credits (through taxation) to the Lambda page. The Lambda page distributes those credits evenly to all of the pages in the database, and with each cycle the bot-trap page loses more and more credit, until it has so few credits that it almost never gets crawled again. This will not happen with good pages, because they often get credits from back-links found on other pages. This also results in a dynamic page rank: you will notice that any time you take a snapshot of your database and order the pages by the amount of credit they have, they will most likely be ordered roughly according to their true page rank.

This only avoids bot traps of the infinite-loop kind, but there are many other bot traps you should watch out for, and there are ways to get around them too.
