Is Scrapy able to crawl any type of website?


Question

Is the Scrapy framework efficient at crawling any website? I ask because I found in their tutorial that they usually build regular expressions that depend on the architecture (the link structure) of the website being crawled. Does this mean Scrapy cannot be generic and crawl any website, regardless of how its URLs are structured? In my case I have to deal with a very large number of websites: it is impossible to program regular expressions for each one of them.

Answer

Broad Crawls

Scrapy's defaults are optimized for crawling specific sites. These sites are often handled by a single Scrapy spider, although this is not necessary or required (for example, there are generic spiders that handle any given site thrown at them).
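The generic approach does not depend on per-site URL regexes: a spider can simply extract every link on a page and follow it (Scrapy's `LinkExtractor` behaves this way when given no `allow`/`deny` patterns). A minimal standard-library sketch of that idea, with a hypothetical `LinkCollector` class that is not part of Scrapy:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect every href on a page, with no site-specific URL patterns."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL,
                    # so any link structure works.
                    self.links.append(urljoin(self.base_url, value))

html = '<a href="/about">About</a> <a href="https://example.org/x">X</a>'
collector = LinkCollector("https://example.com/")
collector.feed(html)
print(collector.links)
# → ['https://example.com/about', 'https://example.org/x']
```

A real spider would then schedule each collected URL as a new request, repeating until a time or page limit is hit, without ever encoding a site-specific URL pattern.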

In addition to this "focused crawl", there is another common type of crawling which covers a large (potentially unlimited) number of domains and is limited only by time or some other arbitrary constraint, rather than stopping when the domain has been crawled to completion or when there are no more requests to perform. These are called "broad crawls" and are the typical crawls employed by search engines.

These are some common properties often found in broad crawls:

  • they crawl many domains (often, unbounded) instead of a specific set of sites
  • they don't necessarily crawl domains to completion, because it would be impractical (or impossible) to do so; instead they limit the
    crawl by time or by number of pages crawled
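Scrapy can enforce the "limit by time or page count" behaviour directly through its CloseSpider extension. A sketch of the relevant settings as a plain dict (the values shown are arbitrary examples, not recommendations):

```python
# Settings that bound a broad crawl rather than crawling each domain
# to completion (values here are arbitrary examples).
broad_crawl_limits = {
    "CLOSESPIDER_TIMEOUT": 3600,     # stop the spider after one hour...
    "CLOSESPIDER_PAGECOUNT": 10000,  # ...or after 10,000 responses, whichever comes first
}

print(broad_crawl_limits)
```

These would normally live in a project's `settings.py`; whichever limit is reached first closes the spider.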

  • they are simpler in logic (as opposed to very complex spiders with many extraction rules), because data is often post-processed in a
    separate stage
  • they crawl many domains concurrently, which allows them to achieve faster crawl speeds by not being limited by any particular site
    constraint (each site is crawled slowly to respect politeness, but many sites are crawled in parallel)

As said above, Scrapy's default settings are optimized for focused crawls rather than broad crawls. However, due to its asynchronous architecture, Scrapy is very well suited to performing fast broad crawls.
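The Scrapy documentation's "Broad Crawls" page suggests tuning a handful of settings for this mode; a sketch of the commonly adjusted ones (the numeric values are illustrative and depend on your hardware and target sites):

```python
# Settings commonly adjusted for broad crawls (values are illustrative).
broad_crawl_settings = {
    "CONCURRENT_REQUESTS": 100,        # raise global concurrency across domains
    "REACTOR_THREADPOOL_MAXSIZE": 20,  # more threads for DNS resolution
    "LOG_LEVEL": "INFO",               # less logging overhead than DEBUG
    "COOKIES_ENABLED": False,          # cookies rarely matter in broad crawls
    "RETRY_ENABLED": False,            # don't retry failed pages
    "DOWNLOAD_TIMEOUT": 15,            # give up on slow responses quickly
}

print(broad_crawl_settings)
```

The common theme is trading per-site thoroughness for aggregate throughput: with many domains in flight, politeness per site is preserved while overall speed comes from parallelism.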
