How to schedule Scrapy crawl execution programmatically


Question

I want to create a scheduler script to run the same spider multiple times in a sequence.

So far I got the following:

#!/usr/bin/python3
"""Scheduler for spiders."""
import time

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider


def crawl_job():
    """Job to start spiders."""
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(DealsSpider)
    process.start() # the script will block here until the end of the crawl


if __name__ == '__main__':

    while True:
        crawl_job()
        time.sleep(30) # wait 30 seconds then crawl again

For now, the spider executes properly the first time. After the time delay it starts up again, but right before it would start scraping I get the following error message:

Traceback (most recent call last):
  File "scheduler.py", line 27, in <module>
    crawl_job()
  File "scheduler.py", line 17, in crawl_job
    process.start() # the script will block here until the end of the crawl
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Unfortunately I'm not familiar with the Twisted framework and its Reactors, so any help would be appreciated!

Answer

You're getting the ReactorNotRestartable error because a Twisted reactor cannot be started again once it has been stopped. Each call to process.start() tries to start the reactor, so the second pass through your loop fails. There's plenty of information around the web about this. Here's a simple solution:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider


def crawl_job():
    """
    Job to start spiders.
    Return Deferred, which will execute after crawl has completed.
    """
    settings = get_project_settings()
    runner = CrawlerRunner(settings)
    return runner.crawl(DealsSpider)

def schedule_next_crawl(null, sleep_time):
    """
    Schedule the next crawl
    """
    reactor.callLater(sleep_time, crawl)

def crawl():
    """
    A "recursive" function that schedules a crawl 30 seconds after
    each successful crawl.
    """
    # crawl_job() returns a Deferred
    d = crawl_job()
    # once the crawl finishes, call schedule_next_crawl(<deferred result>, 30)
    d.addCallback(schedule_next_crawl, 30)
    d.addErrback(catch_error)

def catch_error(failure):
    print(failure.value)

if __name__=="__main__":
    crawl()
    reactor.run()

There are a few noticeable differences from your snippet: the reactor is run directly, CrawlerRunner replaces CrawlerProcess, time.sleep has been removed so that the reactor is not blocked, and the while loop has been replaced by repeated calls to the crawl function scheduled via callLater. It's short and should do what you want. If any parts confuse you, let me know and I'll elaborate.

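To crawl at a fixed time each day instead of after a fixed delay, schedule_next_crawl can compute the number of seconds until that time. This variant reuses reactor, crawl_job and catch_error from the script above:

import datetime as dt

def schedule_next_crawl(null, hour, minute):
    # target time: tomorrow at hour:minute
    tomorrow = (
        dt.datetime.now() + dt.timedelta(days=1)
        ).replace(hour=hour, minute=minute, second=0, microsecond=0)
    sleep_time = (tomorrow - dt.datetime.now()).total_seconds()
    reactor.callLater(sleep_time, crawl)

def crawl():
    d = crawl_job()
    # crawl every day at 1:30 pm
    d.addCallback(schedule_next_crawl, hour=13, minute=30)
    d.addErrback(catch_error)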
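One thing to note about the snippet above: it always targets tomorrow, so a crawl that finishes at 09:00 will skip today's 13:30 slot. If you would rather run at the next occurrence of the time, whether that is today or tomorrow, the delay could be computed along these lines. This is only a sketch: seconds_until is a hypothetical helper name, and it uses naive local datetimes, so it does not account for daylight-saving changes.

```python
import datetime as dt


def seconds_until(hour, minute, now=None):
    """Seconds from `now` until the next occurrence of hour:minute.

    `now` defaults to the current local time; passing it in explicitly
    makes the function easy to check by hand.
    """
    if now is None:
        now = dt.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed (or is right now): use tomorrow's.
        target += dt.timedelta(days=1)
    return (target - now).total_seconds()


# At noon, 13:30 is 90 minutes away.
print(seconds_until(13, 30, now=dt.datetime(2024, 1, 1, 12, 0)))  # 5400.0
# At 14:00, the next 13:30 is tomorrow, 23.5 hours away.
print(seconds_until(13, 30, now=dt.datetime(2024, 1, 1, 14, 0)))  # 84600.0
```

Inside schedule_next_crawl the result would be passed straight to reactor.callLater(seconds_until(hour, minute), crawl).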
