Running Scrapy tasks in Python


Problem description

My Scrapy script seems to work just fine when I run it in 'one-off' scenarios from the command line, but if I try running the code twice in the same Python session I get this error:

"ReactorNotRestartable"

Why?

The offending code (the last line throws the error):

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

# schedule spider
#crawler.crawl(MySpider())
spider = MySpider()
crawler.queue.append_spider(spider)

# start engine scrapy/twisted
crawler.start()

Recommended answer

Close to Joël's answer, but I want to elaborate a bit more than is possible in the comments. If you look at the Crawler source code, you will see that the CrawlerProcess class has a start function, but also a stop function. This stop function takes care of cleaning up the internals of the crawl so that the system ends up in a state from which it can start again.

So, if you want to restart the crawling without leaving your process, call crawler.stop() at the appropriate time. Later on, simply call crawler.start() again to resume operations.
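In code, that suggested call pattern would look like the minimal sketch below (assuming the crawler object from the question; note the follow-up right after this: in practice the second start() still fails because the Twisted reactor cannot be restarted):

crawler.start()   # first crawl runs to completion
crawler.stop()    # clean up the crawler's internal state
# ... later, in the same process ...
crawler.start()   # in practice this still raises ReactorNotRestartable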

EDIT: In retrospect, this is not possible (due to the Twisted reactor, as mentioned in a different answer); stop only takes care of a clean termination. Looking back at my code, I happened to have a wrapper for the Crawler processes. Below you can find some (redacted) code to make it work using Python's multiprocessing module. This way you can restart crawlers more easily. (Note: I found the code online last month, but I didn't include the source... so if someone knows where it came from, I'll update the credits.)

from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing.queues import Queue
from multiprocessing import Process

class CrawlerWorker(Process):
    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results

        # One CrawlerProcess per worker; install it only once per Python process
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        # Collect every scraped item via the item_passed signal
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        # Runs in the child process, so the Twisted reactor starts fresh each time
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)

# The part below can be called as often as you want
results = Queue()
crawler = CrawlerWorker(MySpider(myArgs), results)
crawler.start()
for item in results.get():
    pass # Do something with item
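For reference: the imports above target the old Scrapy 0.x API (scrapy.conf, scrapy.xlib.pydispatch and the project singleton no longer exist in current releases). Below is a rough, assumption-laden sketch of the same one-crawl-per-child-process idea against a modern Scrapy (2.x) API; MySpider again stands in for your own spider class, and this is an adaptation rather than the original answer's code.

from multiprocessing import Process, Queue

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class CrawlerWorker(Process):
    # Run one crawl per child process, so each crawl gets a fresh Twisted reactor
    def __init__(self, spider_cls, results, **spider_kwargs):
        Process.__init__(self)
        self.spider_cls = spider_cls
        self.spider_kwargs = spider_kwargs
        self.results = results
        self.items = []

    def _item_scraped(self, item, response, spider):
        self.items.append(item)

    def run(self):
        process = CrawlerProcess(get_project_settings())
        crawler = process.create_crawler(self.spider_cls)
        # Collect every scraped item via the item_scraped signal
        crawler.signals.connect(self._item_scraped, signal=signals.item_scraped)
        process.crawl(crawler, **self.spider_kwargs)
        process.start()  # blocks until the crawl finishes
        self.results.put(self.items)

# As before, the part below can be called as often as you want
results = Queue()
worker = CrawlerWorker(MySpider, results)
worker.start()
for item in results.get():
    pass # Do something with item
worker.join()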
