Scrapy CLOSESPIDER_PAGECOUNT setting doesn't work as it should


Problem Description

I use Scrapy 1.0.3 and can't figure out how the CLOSESPIDER extension works. For the command scrapy crawl domain_links --set=CLOSESPIDER_PAGECOUNT=1 there is correctly one request, but for a page count of two, scrapy crawl domain_links --set CLOSESPIDER_PAGECOUNT=2, the requests go on without end.

So please explain to me how it works with a simple example.

Here is my spider code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# PathsSpiderItem comes from the project's items module (module name assumed here)
from myproject.items import PathsSpiderItem


class DomainLinksSpider(CrawlSpider):
    name = "domain_links"
    #allowed_domains = ["www.example.org"]
    start_urls = ["http://www.example.org/"]

    rules = (
        # Follow every link inside the allowed domain and parse it with parse_page
        Rule(LinkExtractor(allow_domains="www.example.org"), callback='parse_page'),
    )

    def parse_page(self, response):
        print '<<<', response.url
        items = []

        # Emit one item per unique in-domain link found on the page
        for link in LinkExtractor(allow_domains="www.example.org", unique=True).extract_links(response):
            item = PathsSpiderItem()
            item['url'] = link.url
            items.append(item)
        return items

It does not even work for this simple spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['karen.pl']
    start_urls = ['http://www.karen.pl']

    rules = (
        # Follow every link inside the allowed domain and parse it
        # with the spider's parse_item method.
        Rule(LinkExtractor(allow_domains="www.karen.pl"), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()

        return item

but it does not go on forever:

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=1   'downloader/request_count': 1,

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=2   'downloader/request_count': 17,

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=3   'downloader/request_count': 19,

Maybe it is because of parallel downloading. Yes, with CONCURRENT_REQUESTS = 1 the CLOSESPIDER_PAGECOUNT setting works for the second example. I checked the first one too; it works as well. It was almost endless for me because a sitemap with many URLs (my items) was being crawled as the next page :)
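For reference, here is a minimal sketch, based on the second spider above, of pinning both settings on the spider itself via custom_settings so the page limit behaves deterministically without extra command-line flags; the limit values here are only an example:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['karen.pl']
    start_urls = ['http://www.karen.pl']

    # With only one request in flight at a time, no extra requests are
    # already downloading when the CloseSpider extension hits its limit.
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 2,
        'CONCURRENT_REQUESTS': 1,
    }

    rules = (
        Rule(LinkExtractor(allow_domains='www.karen.pl'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)

The same effect can be had from the command line with scrapy crawl example -s CLOSESPIDER_PAGECOUNT=2 -s CONCURRENT_REQUESTS=1.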

Answer

CLOSESPIDER_PAGECOUNT is controlled by the CloseSpider extension, which counts every response until it reaches its limit; at that point it tells the crawler process to start shutting down (finishing in-flight requests and closing the available slots).

Now, the reason your spider ends when you specify CLOSESPIDER_PAGECOUNT=1 is that at that moment (when it gets its first response) there are no pending requests; they are only created after the first one, so the crawler process is ready to end and does not take the following requests into account (because they are born after the first response).

When you specify CLOSESPIDER_PAGECOUNT>1, your spider is caught in the middle of creating requests and filling the request queue. By the time the spider knows it should finish, there are still pending requests to process, and these are executed as part of closing the spider.
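To make that counting behaviour concrete, below is an illustrative sketch of an extension that mimics what CloseSpider does for CLOSESPIDER_PAGECOUNT (it is not the actual extension source): it counts response_received signals and asks the engine to close the spider once the limit is reached.

from scrapy import signals
from scrapy.exceptions import NotConfigured


class PageCountCloser(object):
    """Illustrative sketch of the CLOSESPIDER_PAGECOUNT behaviour."""

    def __init__(self, crawler, limit):
        self.crawler = crawler
        self.limit = limit
        self.counter = 0
        # Count every response the downloader hands back.
        crawler.signals.connect(self.response_received,
                                signal=signals.response_received)

    @classmethod
    def from_crawler(cls, crawler):
        limit = crawler.settings.getint('CLOSESPIDER_PAGECOUNT')
        if not limit:
            raise NotConfigured
        return cls(crawler, limit)

    def response_received(self, response, request, spider):
        self.counter += 1
        if self.counter >= self.limit:
            # Ask the engine to stop. Requests that are already scheduled or
            # downloading may still be processed while the spider shuts down,
            # which is why downloader/request_count can exceed the page count.
            self.crawler.engine.close_spider(spider, 'closespider_pagecount')

The real extension ships with Scrapy and is enabled automatically whenever CLOSESPIDER_PAGECOUNT is set to a non-zero value, so there is nothing to install; the sketch above only shows why already-scheduled requests can still complete after the limit is hit.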
