Running code when Scrapy spider has finished crawling


Problem Description


Is there a way to get Scrapy to execute code once the crawl has completely finished, to deal with moving / cleaning the data? I'm sure it is trivial, but my Google-fu seems to have left me for this issue.

Solution

It all depends on how you're launching Scrapy.

If running from the command line with crawl or runspider, just wait for the process to finish. Beware that a 0 exit code doesn't mean you've crawled everything successfully.
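For example, if you launch the crawl as a subprocess, a minimal sketch of waiting for it and checking the exit code could look like this (the spider name "myspider" and the post-processing function move_and_clean_data are placeholders):

import subprocess

# Run the crawl and block until the scrapy process exits
result = subprocess.run(['scrapy', 'crawl', 'myspider'])

if result.returncode == 0:
    # 0 only means the process exited cleanly, not that every page was scraped
    move_and_clean_data()   # hypothetical post-processing function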

If using Scrapy as a library, you can append your code after the CrawlerProcess.start() call.
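A minimal sketch of that approach, assuming a hypothetical MySpider class and post-processing function:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={})
process.crawl(MySpider)   # MySpider is a placeholder for your spider class
process.start()           # blocks until all crawling has finished

# Anything placed here runs once the crawl is over
move_and_clean_data()     # hypothetical post-processing step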

If you need to track the status reliably, the first thing to do is to track the spider_closed signal and check its reason parameter. There's an example at the start of that signal's documentation page; it expects you to modify the code of the spider.
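A minimal sketch of that in-spider approach, following the pattern from the documentation (the spider name, start URL and parsing logic are placeholders):

import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']   # placeholder

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_spider_closed, signal=signals.spider_closed)
        return spider

    def on_spider_closed(self, spider, reason):
        # reason is 'finished' on a clean shutdown
        self.logger.info('Spider closed: %s (%s)', spider.name, reason)

    def parse(self, response):
        pass   # your parsing logic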

To track all the spiders you have added when using Scrapy as a library:

import scrapy
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({})
process.crawl(MySpider)   # MySpider is your spider class

def spider_ended(spider, reason):
    # Called once per spider when it closes; reason is 'finished' on a clean run
    print('Spider ended:', spider.name, reason)

# Connect the handler to every crawler added via process.crawl()
for crawler in process.crawlers:
    crawler.signals.connect(spider_ended, signal=scrapy.signals.spider_closed)

process.start()

Check the reason: if it is not 'finished', something has interrupted the crawler.
The function will be called for each spider, so it may require some complex error handling if you have many. Also keep in mind that after receiving two keyboard interrupts, Scrapy begins an unclean shutdown and the function won't be called, but the code placed after process.start() will run anyway.
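One way to keep that manageable, sketched here while reusing process and the scrapy import from the block above, is to collect the close reason per spider and inspect the collection after process.start() returns:

close_reasons = {}

def spider_ended(spider, reason):
    close_reasons[spider.name] = reason

for crawler in process.crawlers:
    crawler.signals.connect(spider_ended, signal=scrapy.signals.spider_closed)

process.start()

failed = [name for name, reason in close_reasons.items() if reason != 'finished']
if failed:
    print('Interrupted or failed spiders:', failed)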

Alternatively, you can use the extensions mechanism to connect to these signals without messing with the rest of the code base. The sample extension in the documentation shows how to track this signal.
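A rough sketch of such an extension (the class name and module path are placeholders; you would enable it through the EXTENSIONS setting):

from scrapy import signals

class SpiderClosedLogger:
    # Enable with e.g. EXTENSIONS = {'myproject.extensions.SpiderClosedLogger': 500}

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        spider.logger.info('Spider %s closed (%s)', spider.name, reason)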

But all of this was just to detect a failure caused by interruption. You also need to subscribe to the spider_error signal, which will be called in case of a Python exception in a spider. And there is also network error handling that has to be done; see this question.
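For the exception side, a sketch of a spider_error handler, reusing the process object from the earlier block (the errors list is only for illustration):

import scrapy

errors = []

def on_spider_error(failure, response, spider):
    # failure wraps the Python exception raised while processing response
    errors.append((response.url, repr(failure.value)))

for crawler in process.crawlers:
    crawler.signals.connect(on_spider_error, signal=scrapy.signals.spider_error)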

In the end I ditched the idea of tracking failures and just tracked success with a global variable that is checked after process.start() returns. In my case the moment of success was not finding the "next page" link. But I had a linear scraper, so it was easy; your case may be different.
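Roughly, that success-flag pattern can look like the following sketch (the spider, URLs and "next page" selector are placeholders, not the answerer's actual code):

import scrapy
from scrapy.crawler import CrawlerProcess

success = False

class LinearSpider(scrapy.Spider):
    name = 'linear'
    start_urls = ['https://example.com/page/1']   # placeholder

    def parse(self, response):
        global success
        next_page = response.css('a.next::attr(href)').get()   # placeholder selector
        if next_page is None:
            success = True    # no "next page" link: reached the end
        else:
            yield response.follow(next_page, self.parse)

process = CrawlerProcess({})
process.crawl(LinearSpider)
process.start()

print('Crawl succeeded' if success else 'Crawl did not reach the last page')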
