Scrapy spider not terminating with use of CloseSpider extension
Problem description
I have set up a Scrapy spider that parses an XML feed, processing some 20,000 records.
For the purposes of development, I'd like to limit the number of items processed. From reading the Scrapy docs, I identified that I need to use the CloseSpider extension.
I have followed the guide on how to enable this - in my spider config I have the following:
CLOSESPIDER_ITEMCOUNT = 1
EXTENSIONS = {
    'scrapy.extensions.closespider.CloseSpider': 500,
}
However, my spider never terminates. I'm aware that the CONCURRENT_REQUESTS setting affects when the spider actually terminates (as it will carry on processing each concurrent request), but this is only set to the default of 16, and yet my spider continues to process all the items.
I've tried using the CLOSESPIDER_TIMEOUT setting instead, but similarly this has no effect.
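For context, the CloseSpider extension's item-count behaviour amounts to counting scraped items and requesting a close once the threshold is reached. A rough, Scrapy-free sketch of that idea (a hypothetical class, not Scrapy's actual implementation):

```python
class ItemCountCloser:
    """Toy stand-in for the item-count logic of Scrapy's CloseSpider
    extension (hypothetical class, not Scrapy's actual code)."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.count = 0
        self.close_reason = None  # set when a close is requested

    def item_scraped(self):
        # Invoked once per scraped item, like an item_scraped signal handler.
        self.count += 1
        if self.max_items and self.count >= self.max_items:
            self.close_reason = 'closespider_itemcount'

closer = ItemCountCloser(max_items=1)
closer.item_scraped()
# closer.close_reason is now 'closespider_itemcount'
```

The important point is that the close is a *request* to the engine: requests already in flight may still be processed before the spider actually stops.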
Here is some debug info from when I run the spider:
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: myscraper)
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myscraper', 'CLOSESPIDER_ITEMCOUNT': 1, 'FEED_URI': 'file:///tmp/myscraper/export.jsonl', 'NEWSPIDER_MODULE': 'myscraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['myscraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.closespider.CloseSpider']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled item pipelines:
['myscraper.pipelines.MyScraperPipeline']
2017-06-15 12:14:11 [scrapy.core.engine] INFO: Spider opened
As can be seen, the CloseSpider extension is enabled and the CLOSESPIDER_ITEMCOUNT setting is being applied.
Any ideas why this is not working?
Recommended answer
I came up with a solution, helped by parik's answer along with my own research. It does have some unexplained behaviour, though, which I will cover below (comments appreciated).
In my spider's myspider_spider.py file, I have (edited for brevity):
import scrapy
from scrapy.spiders import XMLFeedSpider
from scrapy.exceptions import CloseSpider
from myspiders.items import MySpiderItem

class MySpiderSpider(XMLFeedSpider):
    name = "myspiders"
    allowed_domains = ["www.mysource.com"]  # a list, not a set
    start_urls = [
        "https://www.mysource.com/source.xml"
    ]
    iterator = 'iternodes'
    itertag = 'item'
    item_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings)

    def __init__(self, settings):
        super(MySpiderSpider, self).__init__()  # keep base Spider setup intact
        self.settings = settings

    def parse_node(self, response, node):
        if (self.settings['CLOSESPIDER_ITEMCOUNT']
                and int(self.settings['CLOSESPIDER_ITEMCOUNT']) == self.item_count):
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - '
                              + str(self.settings['CLOSESPIDER_ITEMCOUNT']))
        else:
            self.item_count += 1
            id = node.xpath('id/text()').extract()
            title = node.xpath('title/text()').extract()
            item = MySpiderItem()
            item['id'] = id
            item['title'] = title
            return item
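The guard at the top of parse_node can be exercised in isolation, without Scrapy. A minimal sketch (hypothetical names) showing its off-by-one shape: it raises on the call *after* the limit has been counted, so exactly `limit` items get through:

```python
class CloseSpider(Exception):
    """Stand-in for scrapy.exceptions.CloseSpider."""

def make_guard(limit):
    """Mimics the check in parse_node: raises once `limit` items
    have already been counted, i.e. on the (limit + 1)-th call."""
    state = {'count': 0}

    def guard():
        if limit and state['count'] == limit:
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - %d' % limit)
        state['count'] += 1

    return guard

guard = make_guard(3)
emitted = 0
try:
    while True:
        guard()
        emitted += 1
except CloseSpider:
    pass
# exactly 3 items pass before the guard trips on the 4th call
```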
This works - if I set CLOSESPIDER_ITEMCOUNT to 10, it terminates after 10 items are processed (so, in that respect, it seems to ignore CONCURRENT_REQUESTS, which was unexpected).
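The CONCURRENT_REQUESTS setting is likely irrelevant here because all 20,000 records come from a single XML response: raising CloseSpider inside parse_node abandons the iteration of that one response immediately, so no further nodes are parsed. A small sketch of that behaviour using the standard library (toy feed and stand-in exception, no Scrapy):

```python
import xml.etree.ElementTree as ET

# Toy feed standing in for the real 20,000-record XML (hypothetical data).
FEED = "<feed>" + "".join(
    "<item><id>%d</id></item>" % i for i in range(100)
) + "</feed>"

class CloseSpider(Exception):
    """Stand-in for scrapy.exceptions.CloseSpider."""

processed = []

def parse_node(node, limit):
    # Mirrors the guard in the spider: raise once `limit` items are done.
    if limit and len(processed) == limit:
        raise CloseSpider('limit reached - %d' % limit)
    processed.append(node.findtext('id'))

try:
    for node in ET.fromstring(FEED).iter('item'):
        parse_node(node, limit=10)
except CloseSpider:
    pass  # the remaining 90 nodes of this single response are never visited
```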
I commented this out in my settings.py:
#EXTENSIONS = {
#    'scrapy.extensions.closespider.CloseSpider': 500,
#}
So, it's simply using the CloseSpider exception. However, the log displays the following:
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Closing spider (closespider_itemcount)
2017-06-16 10:04:15 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (10 items) in: file:///tmp/myspiders/export.jsonl
2017-06-16 10:04:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 600,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 8599860,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'closespider_itemcount',
'finish_time': datetime.datetime(2017, 6, 16, 9, 4, 15, 615501),
'item_scraped_count': 10,
'log_count/DEBUG': 8,
'log_count/INFO': 8,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 6, 16, 9, 3, 47, 966791)}
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
The key things to highlight are the first INFO line and the finish_reason: the message displayed under INFO is not the one I set when raising the CloseSpider exception. It implies that it's the CloseSpider extension that's stopping the spider, but I thought I had disabled it? Very confusing. (A likely explanation: the CloseSpider extension is enabled by default in Scrapy's EXTENSIONS_BASE, so commenting out the EXTENSIONS setting does not disable it. With CLOSESPIDER_ITEMCOUNT still set, the extension fires as soon as the tenth item is scraped, one node before the manual check would raise, and so it supplies its own finish reason.)