Scrapy Crawl Spider Only Scrape Certain Number Of Layers

Question

Hi, I want to crawl all the pages of a web site using the Scrapy CrawlSpider class (documentation here).

# Imports as used in older Scrapy versions (scrapy.contrib), matching the original snippet
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']

    rules = (
        # trailing comma so that rules is a tuple, not a bare Rule
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        ...

(1) So, this spider will start from the page defined in start_urls, www.abc.com, parse it automatically, and then follow every single link on www.abc.com that matches the rule, right? I am wondering whether there is a way to scrape only a certain number of layers, say only the first layer (links derived directly from www.abc.com)?

(2) Since I have defined in allowed_domains that only abc.com URLs should be scraped, I don't need to redefine that in the rules, do I? That is, do something like this:

Rule(SgmlLinkExtractor(allow=('item\.php', )), allow_domains="www.abc.com", callback='parse_item')

(3) If I am using CrawlSpider, what happens if I don't define any rules in the spider class? Will it follow all the pages, or will it not follow any at all because no rule has been 'met'?

Answer

  1. Set the DEPTH_LIMIT setting (a minimal sketch follows after this list):

DEPTH_LIMIT

Default: 0

The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.

  • No, you don't need to add an additional URL check. If you don't specify allow_domains at the Rule level, it will extract only URLs with the abc.com domain.
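
Putting both points together, here is a minimal sketch (mine, not the answerer's code) of how the spider from the question could be limited to the first layer. It assumes a reasonably recent Scrapy release, where LinkExtractor replaces SgmlLinkExtractor and DEPTH_LIMIT can be set per spider through custom_settings; on older versions, put DEPTH_LIMIT = 1 in settings.py instead. The FirstLayerSpider class name is made up for illustration.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FirstLayerSpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']          # off-site links are dropped because of this
    start_urls = ['http://www.abc.com']

    # Depth 0 is the start page, depth 1 is the links found directly on it;
    # DEPTH_LIMIT = 1 stops the crawl from going any deeper.
    custom_settings = {'DEPTH_LIMIT': 1}

    rules = (
        # No allow_domains here: allowed_domains above already restricts
        # the crawl to abc.com URLs.
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

Setting DEPTH_LIMIT in custom_settings keeps the restriction local to this spider instead of affecting every spider in the project.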

Hope that helps.
