Running code when Scrapy spider has finished crawling
Question
Is there a way to get Scrapy to execute code once the crawl has completely finished, to deal with moving / cleaning the data? I'm sure it is trivial, but my Google-fu seems to have left me for this issue.
It all depends on how you're launching Scrapy.
If running from the command line with `crawl` or `runspider`, just wait for the process to finish. Beware that a 0 exit code doesn't mean you've crawled everything successfully.
If using Scrapy as a library, you can append the code after the `CrawlerProcess.start()` call.
If you need to reliably track the status, the first thing to do is to track the `spider_closed` signal and check its `reason` parameter. There's an example at the start of the page; it expects you to modify the code of the spider.
To track all spiders you have added, when using Scrapy as a library:

```python
import scrapy
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({})
process.crawl(MySpider)  # MySpider is your own spider class

def spider_ended(spider, reason):
    print('Spider ended:', spider.name, reason)

for crawler in process.crawlers:
    crawler.signals.connect(spider_ended, signal=scrapy.signals.spider_closed)

process.start()
```
Check the `reason`: if it is not `'finished'`, something has interrupted the crawler.
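As a sketch, the check boils down to comparing against the `'finished'` string; other values you may see include `'shutdown'` (keyboard interrupt) and the `closespider_*` reasons set by the CloseSpider extension (the helper name here is illustrative):

```python
def crawl_succeeded(reason: str) -> bool:
    # 'finished' is the only close reason that means a clean, complete crawl;
    # e.g. 'shutdown' (Ctrl-C) or 'closespider_timeout' indicate interruption
    return reason == "finished"

print(crawl_succeeded("finished"))   # True
print(crawl_succeeded("shutdown"))   # False
```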
The function will be called for each spider, so it may require some complex error handling if you have many. Also keep in mind that after receiving two keyboard interrupts, Scrapy begins unclean shutdown and the function won't be called, but the code placed after `process.start()` will run anyway.
Alternatively, you can use the extensions mechanism to connect to these signals without messing with the rest of the codebase. The sample extension shows how to track this signal.
But all of this was just to detect a failure caused by interruption. You also need to subscribe to the `spider_error` signal, which will be fired in case of a Python exception in a spider. And there is also network error handling that has to be done; see this question.
In the end I ditched the idea of tracking failures and just tracked success with a global variable that is checked after `process.start()` returns. In my case the moment of success was not finding the "next page" link. But I had a linear scraper, so it was easy; your case may be different.