Is there a way to run code after reactor.run() in scrapy?
Problem Description
I am working on a Scrapy API. One of my issues was that the Twisted reactor isn't restartable. I fixed this by using CrawlerRunner as opposed to CrawlerProcess. My spider extracts links from a website and validates them. My issue is that if I add the validation code after reactor.run(), it doesn't work. This is my code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from urllib.parse import urlparse, quote

links = set()           # all links found (renamed from `list` to avoid shadowing the builtin)
links_validate = set()  # internal links to validate

runner = CrawlerRunner()

class Crawler(CrawlSpider):
    name = "Crawler"
    start_urls = ['https://www.example.com']
    allowed_domains = ['www.example.com']
    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

    def parse_links(self, response):
        # Derive the base URL from the response instead of an undefined `url`
        parsed = urlparse(response.url)
        base_url = f"{parsed.scheme}://{parsed.netloc}"
        href = response.xpath('//a/@href').getall()
        links.add(quote(response.url, safe=':/'))
        for link in href:
            if base_url not in link:
                links.add(quote(response.urljoin(link), safe=':/'))
        for link in links:
            if base_url in link:
                links_validate.add(link)

runner.crawl(Crawler)
reactor.run()
If I add the code that validates the links after reactor.run(), it doesn't get executed. And if I put the code before reactor.run(), nothing happens, because the spider hasn't yet finished crawling all the links. What should I do? The code that validates the links is perfectly fine; I have used it before and it works.
Recommended Answer
We can attach callbacks to the Deferred returned by runner.crawl() via d.addCallback(<callback_function>) and d.addErrback(<errback_function>):
runner = CrawlerRunner()
d = runner.crawl(MySpider)

def finished(result):
    print("finished :D")
    # Run your validation code here, then stop the reactor
    # so that execution can continue past reactor.run().
    reactor.stop()

def spider_error(failure):
    print("Spider error :/")
    reactor.stop()

d.addCallback(finished)
d.addErrback(spider_error)
reactor.run()
# Any code placed here runs once the crawl has finished and the reactor has stopped.
This concludes the article "Is there a way to run code after reactor.run() in scrapy?" We hope the recommended answer helps, and thank you for supporting IT屋!