Run multiple spiders from script in scrapy in loop


Question

I have more than 100 spiders and I want to run 5 at a time using a script. For this I have created a table in the database to track the status of each spider, i.e. whether it has finished running, is running, or is waiting to run; a sketch of such a table is below.
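
For illustration, a minimal sketch of such a status table, assuming SQLite and column names of my own choosing (the actual schema is not shown here):

import sqlite3

# Hypothetical schema: one row per spider; status is one of
# 'waiting', 'running' or 'finished'.
conn = sqlite3.connect('spiders.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS spider_status (
        spider_name TEXT PRIMARY KEY,
        status      TEXT NOT NULL DEFAULT 'waiting'
    )
''')
conn.commit()
conn.close()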

I know how to run multiple spiders inside a script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for i in range(10):  # demo range; in practice the spiders that are
                     # waiting to run are fetched from the database
    process.crawl(spider1)  # spider changes based on which one should run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
    process.start()  # fails from the second iteration, see traceback below

But this is not allowed, as the following error occurs:

Traceback (most recent call last):
File "test.py", line 24, in <module>
  process.start()
File "/home/g/projects/venv/lib/python3.4/site-packages/scrapy/crawler.py", line 285, in start
  reactor.run(installSignalHandlers=False)  # blocking call
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1242, in run
  self.startRunning(installSignalHandlers=installSignalHandlers)
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1222, in startRunning
  ReactorBase.startRunning(self)
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 730, in startRunning
  raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

I have searched for the above error but have not been able to resolve it. The root cause is that Twisted's reactor can only be started once per process, so calling process.start() on every iteration of the loop fails from the second iteration on. Managing spiders could be done via ScrapyD, but we do not want to use ScrapyD because many spiders are still in the development phase.
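
Queuing every crawl before a single start() call avoids the error, but it launches all spiders at once rather than 5 at a time, so it does not fit the scenario. A sketch of that pattern, reusing the placeholder spiders from above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(spider1)  # queue every spider first...
process.crawl(spider2)
process.start()  # ...then start the reactor exactly once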

Any workaround for the above scenario is appreciated.

Thanks

Answer

I was able to implement similar functionality by removing the loop from the script and setting up a scheduler that runs it every 3 minutes.

The looping functionality was achieved by keeping a record of how many spiders are currently running and checking whether more need to be started; as a result, at most 5 spiders (the number can be changed) run concurrently. A sketch of this approach follows.
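
A minimal sketch of that scheduler-driven launcher, assuming the hypothetical spider_status table sketched in the question; the queries and names are illustrative, since the answer above does not include code:

import sqlite3

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

MAX_RUNNING = 5  # concurrent-spider limit; can be changed

def main():
    conn = sqlite3.connect('spiders.db')

    # How many spiders are running right now?
    running = conn.execute(
        "SELECT COUNT(*) FROM spider_status WHERE status = 'running'"
    ).fetchone()[0]
    slots = MAX_RUNNING - running
    if slots <= 0:
        conn.close()
        return  # at the limit; the next scheduled run will check again

    # Claim up to `slots` spiders that are waiting to run.
    waiting = [row[0] for row in conn.execute(
        "SELECT spider_name FROM spider_status WHERE status = 'waiting' LIMIT ?",
        (slots,),
    )]
    for name in waiting:
        conn.execute(
            "UPDATE spider_status SET status = 'running' WHERE spider_name = ?",
            (name,),
        )
    conn.commit()
    conn.close()

    if waiting:
        process = CrawlerProcess(get_project_settings())
        for name in waiting:
            process.crawl(name)  # crawl() also accepts a spider name string
        # start() runs once per process; the scheduler launches a fresh
        # process each time, so ReactorNotRestartable never occurs.
        process.start()

if __name__ == '__main__':
    main()

Scheduling the script every 3 minutes can then be done externally, e.g. with a cron entry such as */3 * * * * python run_spiders.py. Marking a spider 'finished' when it closes (for example from a spider_closed signal handler) is left out of the sketch.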

