Run a Scrapy spider in a Celery Task


Question

This is not working anymore; scrapy's API has changed.

Now the documentation features a way to "Run Scrapy from a script", but I get the ReactorNotRestartable error.

My task:

from celery import Task
from twisted.internet import reactor

from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from .spiders import MySpider


class MyTask(Task):
    def run(self, *args, **kwargs):
        spider = MySpider
        settings = get_project_settings()
        crawler = Crawler(settings)
        # Stop the reactor once the spider closes.
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()

        log.start()
        # Works once, then fails: a Twisted reactor cannot be restarted
        # within the same process (ReactorNotRestartable).
        reactor.run()


Answer

The Twisted reactor cannot be restarted. A workaround for this is to let the Celery task fork a new child process for each crawl you want to execute, as proposed in the following post:

  • Running Scrapy spiders in a Celery task

This gets around the "reactor cannot be restarted" issue by using the multiprocessing package. The problem is that this workaround is now obsolete with the latest Celery version, because you instead run into another issue where a daemon process can't spawn subprocesses. So for the workaround to work you need to go down in Celery version.
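
To make that limitation concrete, here is a hypothetical minimal reproduction (not from the original post); the noop function is just a placeholder:

import multiprocessing

def noop():
    pass

# When this runs inside a daemonized Celery pool worker, start() fails with
# "AssertionError: daemonic processes are not allowed to have children".
p = multiprocessing.Process(target=noop)
p.start()
p.join()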

Yes, and the scrapy API has changed. But with minor modifications (import Crawler instead of CrawlerProcess) you can get the workaround to work by going down in Celery version.


The Celery issue can be found here:
Celery issue #1709

Here is my updated crawl script that works with newer Celery versions by using billiard instead of multiprocessing:

from billiard import Process
from twisted.internet import reactor

from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

from myspider import MySpider


class UrlCrawlerScript(Process):
    def __init__(self, spider):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(settings)
        self.crawler.configure()
        # Stop the reactor when the spider finishes so the child process can exit.
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        reactor.run()


def run_spider(url):
    spider = MySpider(url)
    crawler = UrlCrawlerScript(spider)
    # start()/join() run the crawl in a separate billiard process, so each
    # crawl gets a fresh Twisted reactor.
    crawler.start()
    crawler.join()
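
For completeness, a minimal sketch of a Celery task that calls run_spider; the task name crawl_url, the shared_task decorator, and the crawl module path are assumptions rather than part of the original answer:

from celery import shared_task

from .crawl import run_spider  # hypothetical module containing the script above

@shared_task
def crawl_url(url):
    # Each invocation forks a fresh billiard child process inside run_spider,
    # so the Twisted reactor never has to be restarted in the worker itself.
    run_spider(url)

Calling crawl_url.delay(some_url) then queues one crawl per task, with each crawl running in its own child process.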




Edit: Reading Celery issue #1709, they suggest using billiard instead of multiprocessing so that the subprocess limitation is lifted. In other words, we should try billiard and see if it works!

Edit 2: Yes, by using billiard my script works with the latest Celery build! See my updated script.

