Scrapy spider not terminating with use of CloseSpider extension


Question

I have set up a Scrapy spider that parses an XML feed, processing some 20,000 records.

For development purposes, I'd like to limit the number of items processed. From reading the Scrapy docs, I identified that I need to use the CloseSpider extension.

I have followed the guide on how to enable this - in my spider config I have the following:

CLOSESPIDER_ITEMCOUNT = 1
EXTENSIONS = {
    'scrapy.extensions.closespider.CloseSpider': 500,
}

However, my spider never terminates - I'm aware that the CONCURRENT_REQUESTS setting affects when the spider actually terminates (as it will carry on processing each concurrent request), but this is only at the default of 16, yet my spider continues to process all the items.
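
(For context, CLOSESPIDER_ITEMCOUNT is a soft limit: requests already in flight are still processed before the spider closes. A sketch of how these settings could sit together in settings.py for development, with illustrative values rather than the ones from my actual project:)

# settings.py - development-only sketch; values are illustrative
CLOSESPIDER_ITEMCOUNT = 1   # soft limit: in-flight responses are still processed
CONCURRENT_REQUESTS = 1     # fewer in-flight requests -> tighter stopping point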

I've tried using the CLOSESPIDER_TIMEOUT setting instead, but similarly this has no effect.
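
(CLOSESPIDER_TIMEOUT is likewise just a settings entry; for reference, a minimal sketch - the 60-second value is illustrative, not the one I actually used:)

# settings.py - illustrative value
CLOSESPIDER_TIMEOUT = 60  # ask the spider to close ~60 seconds after it opens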

Here is some debug info from when I run the spider:

2017-06-15 12:14:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: myscraper)
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myscraper', 'CLOSESPIDER_ITEMCOUNT': 1, 'FEED_URI': 'file:///tmp/myscraper/export.jsonl', 'NEWSPIDER_MODULE': 'myscraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['myscraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.closespider.CloseSpider']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled item pipelines:
['myscraper.pipelines.MyScraperPipeline']
2017-06-15 12:14:11 [scrapy.core.engine] INFO: Spider opened

As can be seen, the CloseSpider extension and the CLOSESPIDER_ITEMCOUNT setting are both being applied.

Any ideas why this is not working?

Answer

I came up with a solution, helped by parik's answer along with my own research. It does have some unexplained behaviour though, which I will cover below (comments appreciated).

In my spider's myspider_spider.py file, I have (edited for brevity):

import scrapy
from scrapy.spiders import XMLFeedSpider
from scrapy.exceptions import CloseSpider
from myspiders.items import MySpiderItem

class MySpiderSpider(XMLFeedSpider):
    name = "myspiders"
    allowed_domains = ["www.mysource.com"]
    start_urls = [
        "https://www.mysource.com/source.xml"
    ]
    iterator = 'iternodes'
    itertag = 'item'
    item_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Hand the crawler settings to the spider so parse_node() can read them
        settings = crawler.settings
        return cls(settings)

    def __init__(self, settings):
        self.settings = settings

    def parse_node(self, response, node):
        # Close the spider manually once the configured item count has been parsed
        if (self.settings['CLOSESPIDER_ITEMCOUNT']
                and int(self.settings['CLOSESPIDER_ITEMCOUNT']) == self.item_count):
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - ' + str(self.settings['CLOSESPIDER_ITEMCOUNT']))
        self.item_count += 1

        item = MySpiderItem()
        item['id'] = node.xpath('id/text()').extract()
        item['title'] = node.xpath('title/text()').extract()
        return item
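
(MySpiderItem is imported from the project's items.py, which isn't shown above; given the two fields used in parse_node, its definition would be roughly the following - an assumed sketch, not copied from the project:)

# myspiders/items.py - assumed definition, inferred from the fields used in parse_node
import scrapy

class MySpiderItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()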

This works - if I set CLOSESPIDER_ITEMCOUNT to 10, it terminates after 10 items are processed (so, in that respect it seems to ignore CONCURRENT_REQUESTS - which was unexpected).
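
(As a usage note, the limit can also be overridden per run with Scrapy's standard -s option instead of editing settings.py; the value 10 here is just an example:)

scrapy crawl myspiders -s CLOSESPIDER_ITEMCOUNT=10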

I commented this out in my settings.py:

#EXTENSIONS = {
#   'scrapy.extensions.closespider.CloseSpider': 500,
#}

So, it's simply using the CloseSpider exception. However, the log displays the following:

2017-06-16 10:04:15 [scrapy.core.engine] INFO: Closing spider (closespider_itemcount)
2017-06-16 10:04:15 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (10 items) in: file:///tmp/myspiders/export.jsonl
2017-06-16 10:04:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 600,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 8599860,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2017, 6, 16, 9, 4, 15, 615501),
 'item_scraped_count': 10,
 'log_count/DEBUG': 8,
 'log_count/INFO': 8,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 6, 16, 9, 3, 47, 966791)}
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)

The key things to highlight are the first INFO line and the finish_reason - the message displayed under INFO is not the one I'm setting when raising the CloseSpider exception. It implies that it's the CloseSpider extension stopping the spider, but as far as I know it isn't? Very confusing.
