Is there a way to run code after reactor.run() in scrapy?


Problem description


I am working on a Scrapy API. One of my issues was that the Twisted reactor isn't restartable; I fixed this by using CrawlerRunner instead of CrawlerProcess. My spider extracts links from a website and validates them. My issue is that if I add the validation code after reactor.run(), it doesn't run. This is my code:

import urllib.parse
from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

links = set()          # renamed from `list`, which shadowed the builtin
list_validate = set()  # links on the crawled domain, to be validated after the crawl
runner = CrawlerRunner()

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})


class Crawler(CrawlSpider):

    name = "Crawler"
    start_urls = ['https://www.example.com']  # was 'https:www.example.com' (missing //)
    allowed_domains = ['www.example.com']
    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

    def parse_links(self, response):
        # `url` was undefined here; reconstruct the base URL from the response instead
        base_url = '{0.scheme}://{0.netloc}'.format(urlparse(response.url))
        href = response.xpath('//a/@href').getall()
        links.add(urllib.parse.quote(response.url, safe=':/'))
        for link in href:
            if base_url not in link:
                links.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
        for link in links:
            if base_url in link:
                list_validate.add(link)


runner.crawl(Crawler)
reactor.run()


If I add the code that validates the links after reactor.run(), it never gets executed. And if I put the code before reactor.run(), nothing happens, because the spider hasn't finished crawling all the links yet. What should I do? The code that validates the links is fine; I have used it before and it works.

Recommended answer

We can run code after the crawl finishes by attaching callbacks to the Deferred that runner.crawl() returns, via d.addCallback(<callback_function>) and d.addErrback(<errback_function>):

...
runner = CrawlerRunner()
d = runner.crawl(MySpider)  # crawl() returns a Deferred that fires when the crawl ends

def finished(result):
    print("finished :D")      # runs once the spider finishes successfully

def spider_error(failure):
    print("Spider error :/")  # runs if the crawl fails

d.addCallback(finished)
d.addErrback(spider_error)
d.addBoth(lambda _: reactor.stop())  # stop the reactor so reactor.run() can return
reactor.run()
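
Applied to the question's code, that means moving the validation step into the callback, so it only runs once the crawl has finished and list_validate is fully populated. Below is a minimal sketch of that wiring; validate_links is a hypothetical placeholder for the asker's existing validation code, while Crawler, runner, and list_validate are the names defined in the question:

def validate_links(result):
    # hypothetical placeholder for the validation code from the question;
    # by the time this runs, the spider is done and list_validate is complete
    for link in list_validate:
        print("validating:", link)

def spider_error(failure):
    print("Spider error :/", failure)

d = runner.crawl(Crawler)
d.addCallback(validate_links)        # fires after a successful crawl
d.addErrback(spider_error)           # fires if the crawl fails
d.addBoth(lambda _: reactor.stop())  # makes reactor.run() return afterwards
reactor.run()                        # blocks here until reactor.stop() is called

An equivalent pattern from the Scrapy documentation is to wrap the calls in a twisted.internet.defer.inlineCallbacks function and yield runner.crawl(...), putting the follow-up code right after the yield.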

