Scrapy raises ReactorNotRestartable when CrawlerProcess is run twice


Problem description

I have some code which looks something like this:

from scrapy.crawler import CrawlerProcess

def run(spider_name, settings):
    runner = CrawlerProcess(settings)
    runner.crawl(spider_name)
    runner.start()
    return True

I have two py.test tests which each call run(). When the second test executes, I get the following error:

    runner.start()
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/scrapy/crawler.py:291: in start
    reactor.run(installSignalHandlers=False)  # blocking call
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1242: in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1222: in startRunning
    ReactorBase.startRunning(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <twisted.internet.selectreactor.SelectReactor object at 0x10fe21588>

    def startRunning(self):
        """
            Method called when reactor starts: do some initialization and fire
            startup events.

            Don't call this directly, call reactor.run() instead: it should take
            care of calling this.

            This method is somewhat misnamed.  The reactor will not necessarily be
            in the running state by the time this method returns.  The only
            guarantee is that it will be on its way to the running state.
            """
        if self._started:
            raise error.ReactorAlreadyRunning()
        if self._startedBefore:
>           raise error.ReactorNotRestartable()
E           twisted.internet.error.ReactorNotRestartable

I get that this reactor thing is already running, so I cannot call runner.start() when the second test runs. But is there some way to reset its state between the tests, so they are more isolated and can actually run one after another?

Recommended answer

If you use CrawlerRunner instead of CrawlerProcess, in conjunction with pytest-twisted, you should be able to run your tests like this:

Install the Twisted integration for pytest: pip install pytest-twisted

from scrapy.crawler import CrawlerRunner

def _run_crawler(spider_cls, settings):
    """
    spider_cls: Scrapy Spider class
    settings: Scrapy settings
    returns: Twisted Deferred
    """
    runner = CrawlerRunner(settings)
    return runner.crawl(spider_cls)     # return Deferred


def test_scrapy_crawler():
    deferred = _run_crawler(MySpider, settings)

    @deferred.addCallback
    def _success(results):
        """
        After crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred

To put it plainly, _run_crawler() schedules a crawl in the Twisted reactor and executes callbacks when the scrape completes. Those callbacks (_success() and _error()) are where you do your assertions. Lastly, you have to return the Deferred object produced by _run_crawler() from the test, so that the test waits until the crawl is complete. This part with the Deferred is essential and must be done for all tests.
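
If you want the _success() callback to have something concrete to assert on, one option is to collect the scraped items with Scrapy's item_scraped signal. This is a minimal sketch, not part of the original answer: the helper name _run_crawler_collecting_items is made up here, and MySpider/settings are the same placeholders as above.

from scrapy import signals
from scrapy.crawler import CrawlerRunner

def _run_crawler_collecting_items(spider_cls, settings):
    """
    Like _run_crawler(), but also collects the scraped items so the
    test's callback has something concrete to assert on.
    returns: Twisted Deferred that fires with the list of items
    """
    items = []
    runner = CrawlerRunner(settings)
    crawler = runner.create_crawler(spider_cls)
    # item_scraped fires once for every item the spider yields
    crawler.signals.connect(
        lambda item, response, spider: items.append(item),
        signal=signals.item_scraped,
    )
    deferred = runner.crawl(crawler)
    # hand the collected items to whatever callbacks the test attaches
    deferred.addCallback(lambda _: items)
    return deferred


def test_scrapy_crawler_items():
    deferred = _run_crawler_collecting_items(MySpider, settings)

    @deferred.addCallback
    def _success(items):
        assert len(items) > 0   # e.g. the spider scraped at least one item

    return deferred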

Here's an example of how to run multiple crawls and aggregate the results using gatherResults.

from twisted.internet import defer

def test_multiple_crawls():
    d1 = _run_crawler(Spider1, settings)
    d2 = _run_crawler(Spider2, settings)

    d_list = defer.gatherResults([d1, d2])

    @d_list.addCallback
    def _success(results):
        assert True

    @d_list.addErrback
    def _error(failure):
        assert False

    return d_list
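
A usage note on gatherResults (added here, not in the original answer): it fails fast, firing its errback as soon as any single crawl fails, and by default the failures of the individual Deferreds are also logged as unhandled errors. Passing consumeErrors=True suppresses that duplicate logging:

d_list = defer.gatherResults([d1, d2], consumeErrors=True)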

I hope this helps; if it doesn't, please ask where you're struggling.

