Scrapy on a schedule


Question

Getting Scrapy to run on a schedule is driving me around the Twist(ed).

I thought the below test code would work, but I get a twisted.internet.error.ReactorNotRestartable error when the spider is triggered a second time:

from quotesbot.spiders.quotes import QuotesSpider
import schedule
import time
from scrapy.crawler import CrawlerProcess

def run_spider_script():
    process.crawl(QuotesSpider)
    process.start()


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})


schedule.every(5).seconds.do(run_spider_script)

while True:
    schedule.run_pending()
    time.sleep(1)

I'm going to guess that as part of the CrawlerProcess, the Twisted Reactor is called to start again, when that's not required and so the program crashes. Is there any way I can control this?

Also, at this stage, if there's an alternative way to automate a Scrapy spider to run on a schedule, I'm all ears. I tried scrapy.cmdline.execute, but couldn't get that to loop either:

from quotesbot.spiders.quotes import QuotesSpider
from scrapy import cmdline
import schedule
import time
from scrapy.crawler import CrawlerProcess


def run_spider_cmd():
    print("Running spider")
    cmdline.execute("scrapy crawl quotes".split())


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})


schedule.every(5).seconds.do(run_spider_cmd)

while True:
    schedule.run_pending()
    time.sleep(1)

EDIT

Adding code, which uses Twisted task.LoopingCall() to run a test spider every few seconds. Am I going about this completely the wrong way to schedule a spider that runs at the same time each day?

from twisted.internet import reactor
from twisted.internet import task
from scrapy.crawler import CrawlerRunner
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):

        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:

            author = quote.xpath('.//small[@class="author"]/text()').extract_first()
            text = quote.xpath('.//span[@class="text"]/text()').extract_first()

            print(author, text)


def run_crawl():

    runner = CrawlerRunner()
    runner.crawl(QuotesSpider)


l = task.LoopingCall(run_crawl)
l.start(3)

reactor.run()

Answer

First noteworthy point: there's only one Twisted reactor per process, and it's not restartable (as you've discovered). Second, blocking tasks/functions such as time.sleep(n) should be avoided inside the reactor and replaced with asynchronous alternatives (e.g. task.deferLater(reactor, n, ...) from twisted.internet.task).

To use Scrapy effectively from a Twisted project, you need the scrapy.crawler.CrawlerRunner core API rather than scrapy.crawler.CrawlerProcess. The main difference between the two is that CrawlerProcess runs Twisted's reactor for you (which is why the reactor can't simply be restarted), whereas CrawlerRunner leaves starting the reactor to the developer. Here's what your code could look like with CrawlerRunner:

from twisted.internet import reactor
from quotesbot.spiders.quotes import QuotesSpider
from scrapy.crawler import CrawlerRunner

def run_crawl():
    """
    Run a spider within Twisted. Once it completes,
    wait 5 seconds and run another spider.
    """
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        })
    deferred = runner.crawl(QuotesSpider)
    # when the crawl finishes, wait 5 seconds and start the next one
    # (addCallback passes the crawl result as the first argument, so wrap
    # reactor.callLater in a lambda rather than passing it directly)
    deferred.addCallback(lambda _: reactor.callLater(5, run_crawl))
    return deferred

run_crawl()
reactor.run()   # you have to run the reactor yourself
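On the follow-up question about running at the same time each day: one approach (a sketch, not part of the original answer) is to compute the delay until the next occurrence of a target wall-clock time and hand that delay to reactor.callLater. The helper below is plain standard library; the wiring into the CrawlerRunner example above is shown only in a comment, and the run_crawl name there refers to the function defined earlier.

```python
from datetime import datetime, timedelta

def seconds_until(hour, minute, now=None):
    """Seconds from `now` until the next wall-clock hour:minute."""
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # that time has already passed today, so schedule for tomorrow
        target += timedelta(days=1)
    return (target - now).total_seconds()

# Hypothetical wiring into the CrawlerRunner example above (an assumption,
# not from the original answer):
#   reactor.callLater(seconds_until(3, 0), run_crawl)   # run daily at 03:00
# and inside run_crawl, reschedule the next day's run with another callLater.
```

Note this uses naive local time, so a daylight-saving shift will move the run by an hour; for exact scheduling across DST you'd want an aware timezone library.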
