Scrapy run multiple spiders from a main spider?


Problem Description

I have two spiders that take URLs and data scraped by a main spider. My approach was to use CrawlerProcess in the main spider and pass the data on to the two other spiders. Here's my approach:

import scrapy
from scrapy.crawler import CrawlerProcess


class LightnovelSpider(scrapy.Spider):

    name = "novelDetail"
    allowed_domains = ["readlightnovel.com"]

    def __init__(self, novels=None, *args, **kwargs):
        # Avoid a mutable default argument and call the base initializer.
        super().__init__(*args, **kwargs)
        self.novels = novels or []

    def start_requests(self):
        for novel in self.novels:
            self.logger.info(novel)
            yield scrapy.Request(novel, callback=self.parseNovel)

    def parseNovel(self, response):
        # stuff here
        pass


class chapterSpider(scrapy.Spider):
    name = "chapters"
    # not done here


class initCrawler(scrapy.Spider):
    name = "main"
    fromMongo = {}
    toChapter = {}
    toNovel = []
    fromScraper = []

    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            initCrawler.fromScraper.append(novel)
        self.checkchanged()

    def checkchanged(self):
        # some scraped data processing here
        self.dispatchSpiders()

    def dispatchSpiders(self):
        # Starting a second CrawlerProcess while the main spider is still
        # running is what raises ReactorAlreadyRunning.
        process = CrawlerProcess()
        process.crawl(LightnovelSpider, novels=initCrawler.toNovel)
        process.start()
        self.logger.info("Main Spider Finished")

I run "scrapy crawl main" and get a beautiful error.

The main error I can see is "twisted.internet.error.ReactorAlreadyRunning", which I know nothing about. Is there a better approach to running multiple spiders from another spider, and/or how can I stop this error?

Answer

After some research I was able to solve this problem by using the "@property" decorator to expose the data collected by the main spider, like this:

class initCrawler(scrapy.Spider):

    # stuff here from question

    @property
    def getNovel(self):
        return self.toNovel

    @property
    def getChapter(self):
        return self.toChapter
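One detail worth noting: a @property getter only runs on instance access; looking it up on the class itself just returns the property object, which is why the CrawlerRunner script reads the class attributes (initCrawler.toNovel) directly. A minimal sketch of that behaviour, using a hypothetical Holder class unrelated to Scrapy:

```python
class Holder:
    # Class-level attribute, shared by all instances.
    data = ["a", "b"]

    @property
    def get_data(self):
        return self.data


h = Holder()
print(h.get_data)             # instance access runs the getter -> ['a', 'b']
print(type(Holder.get_data))  # class access yields the property object itself
```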

Then I used CrawlerRunner like this:

from spiders.lightnovel import chapterSpider, LightnovelSpider, initCrawler
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor, defer

configure_logging()

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Run the main spider first; its results land in class attributes.
    yield runner.crawl(initCrawler)
    toNovel = initCrawler.toNovel
    toChapter = initCrawler.toChapter
    # Then run the dependent spiders with that data.
    yield runner.crawl(chapterSpider, chapters=toChapter)
    yield runner.crawl(LightnovelSpider, novels=toNovel)
    reactor.stop()

crawl()
reactor.run()
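The handoff works because the spiders write into class-level attributes, which outlive the crawl and can be read once the first run has finished. A pure-Python sketch of that pattern, with hypothetical MainSpider/DetailSpider stand-ins and no Scrapy involved:

```python
class MainSpider:
    # Class-level list: shared by all instances and readable after the run.
    toNovel = []

    def run(self):
        # Stand-in for parse(): append "scraped" URLs to the class attribute.
        for url in ["http://example.com/novel/1", "http://example.com/novel/2"]:
            MainSpider.toNovel.append(url)


class DetailSpider:
    def __init__(self, novels=None):
        # Avoid a mutable default argument; copy what the main run produced.
        self.novels = list(novels or [])

    def run(self):
        return [f"scraped {u}" for u in self.novels]


# Sequential "crawl": run the main spider, then feed its results onward,
# mirroring the yield-then-read order inside crawl() above.
MainSpider().run()
detail = DetailSpider(novels=MainSpider.toNovel)
results = detail.run()
print(results)  # one "scraped ..." entry per collected URL
```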

