Scrapy on a schedule

Question
Getting Scrapy to run on a schedule is driving me around the Twist(ed).

I thought the below test code would work, but I get a twisted.internet.error.ReactorNotRestartable error when the spider is triggered a second time:
from quotesbot.spiders.quotes import QuotesSpider
import schedule
import time
from scrapy.crawler import CrawlerProcess

def run_spider_script():
    process.crawl(QuotesSpider)
    process.start()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})

schedule.every(5).seconds.do(run_spider_script)

while True:
    schedule.run_pending()
    time.sleep(1)
I'm going to guess that, as part of the CrawlerProcess, the Twisted reactor is called to start again when that's not required, and so the program crashes. Is there any way I can control this?

Also, at this stage, if there's an alternative way to automate a Scrapy spider to run on a schedule, I'm all ears. I tried scrapy.cmdline.execute, but couldn't get that to loop either:
from quotesbot.spiders.quotes import QuotesSpider
from scrapy import cmdline
import schedule
import time
from scrapy.crawler import CrawlerProcess

def run_spider_cmd():
    print("Running spider")
    cmdline.execute("scrapy crawl quotes".split())

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})

schedule.every(5).seconds.do(run_spider_cmd)

while True:
    schedule.run_pending()
    time.sleep(1)
EDIT

Adding code which uses Twisted task.LoopingCall() to run a test spider every few seconds. Am I going about this completely the wrong way to schedule a spider that runs at the same time each day?
from twisted.internet import reactor
from twisted.internet import task
from scrapy.crawler import CrawlerRunner
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            author = quote.xpath('.//small[@class="author"]/text()').extract_first()
            text = quote.xpath('.//span[@class="text"]/text()').extract_first()
            print(author, text)

def run_crawl():
    runner = CrawlerRunner()
    runner.crawl(QuotesSpider)

l = task.LoopingCall(run_crawl)
l.start(3)

reactor.run()
Answer
First noteworthy point: there's usually only one Twisted reactor running, and it's not restartable (as you've discovered). Second, blocking tasks/functions (i.e. time.sleep(n)) should be avoided and replaced with asynchronous alternatives (e.g. twisted.internet.task.deferLater(reactor, n, ...)).
To use Scrapy effectively from a Twisted project requires the scrapy.crawler.CrawlerRunner core API as opposed to scrapy.crawler.CrawlerProcess. The main difference between the two is that CrawlerProcess runs Twisted's reactor for you (thus making it difficult to restart the reactor), whereas CrawlerRunner relies on the developer to start the reactor. Here's what your code could look like with CrawlerRunner:
from twisted.internet import reactor
from quotesbot.spiders.quotes import QuotesSpider
from scrapy.crawler import CrawlerRunner

def run_crawl():
    """
    Run a spider within Twisted. Once it completes,
    wait 5 seconds and run another spider.
    """
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })
    deferred = runner.crawl(QuotesSpider)
    # schedule the next run once this crawl finishes; the callback
    # receives the crawl result, which is ignored here
    deferred.addCallback(lambda _: reactor.callLater(5, run_crawl))
    return deferred

run_crawl()
reactor.run()  # you have to run the reactor yourself
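Since the EDIT asks about running at the same time each day: the same pattern works, you just compute the delay until the target time and hand it to reactor.callLater instead of a fixed 5 seconds. A stdlib-only sketch of that calculation (the seconds_until helper and the 10:30 target are illustrative, not from the original code):

```python
from datetime import datetime, timedelta

def seconds_until(hour, minute):
    """Seconds from now until the next local-time occurrence of hour:minute."""
    now = datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # that time already passed today, so aim for tomorrow
        target += timedelta(days=1)
    return (target - now).total_seconds()

# the delay can then replace the fixed 5 seconds above, e.g.
# reactor.callLater(seconds_until(10, 30), run_crawl)
print(round(seconds_until(10, 30)))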