Scrapy run multiple spiders from a main spider?
Question
I have two spiders that take URLs and data scraped by a main spider. My approach was to use CrawlerProcess in the main spider and pass the data to the two spiders. Here's my approach:
import scrapy
from scrapy.crawler import CrawlerProcess


class LightnovelSpider(scrapy.Spider):
    name = "novelDetail"
    allowed_domains = ["readlightnovel.com"]

    def __init__(self, novels=None, **kwargs):
        super().__init__(**kwargs)
        self.novels = novels or []

    def start_requests(self):
        for novel in self.novels:
            self.logger.info(novel)
            yield scrapy.Request(novel, callback=self.parseNovel)

    def parseNovel(self, response):
        # stuff here
        pass


class chapterSpider(scrapy.Spider):
    name = "chapters"
    # not done here
    pass


class initCrawler(scrapy.Spider):
    name = "main"
    fromMongo = {}
    toChapter = {}
    toNovel = []
    fromScraper = []

    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            initCrawler.fromScraper.append(novel)
        self.checkchanged()

    def checkchanged(self):
        # some scraped data processing here
        self.dispatchSpiders()

    def dispatchSpiders(self):
        # Starting a second crawl from inside a running spider
        process = CrawlerProcess()
        process.crawl(LightnovelSpider, novels=initCrawler.toNovel)
        process.start()
        self.logger.info("Main Spider Finished")
I run "scrapy crawl main" and get a beautiful error.
The main error I can see is "twisted.internet.error.ReactorAlreadyRunning", which I have no idea about. Is there a better approach to running multiple spiders from another spider, and/or how can I stop this error?
Answer
After some research I was able to solve this problem by using a property decorator, "@property", to retrieve data from the main spider like this:
class initCrawler(scrapy.Spider):
    # stuff here from question

    @property
    def getNovel(self):
        return self.toNovel

    @property
    def getChapter(self):
        return self.toChapter
Then I used CrawlerRunner like this:
from spiders.lightnovel import chapterSpider, LightnovelSpider, initCrawler
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before
    # starting the next one, all on a single reactor.
    yield runner.crawl(initCrawler)
    toNovel = initCrawler.toNovel
    toChapter = initCrawler.toChapter
    yield runner.crawl(chapterSpider, chapters=toChapter)
    yield runner.crawl(LightnovelSpider, novels=toNovel)
    reactor.stop()

crawl()
reactor.run()