Scrapy: Get Start_Urls from Database by Pipeline
Question
Unfortunately I don't have enough reputation to make a comment, so I have to ask this as a new question, referring to https://stackoverflow.com/questions/23105590/how-to-get-the-pipeline-object-in-scrapy-spider
I have many URLs in a database, so I want to get my start_urls from the DB. So far, not a big problem. However, I don't want the MySQL code inside the spider, and when I move it into a pipeline I run into a problem. If I try to hand the pipeline object over to my spider as in the referenced question, I only get an AttributeError with the message
'NoneType' object has no attribute 'getUrl'
I think the actual problem is that the spider_opened function never gets called (I also inserted a print statement there, and its output never showed up in the console). Does somebody have an idea how to get the pipeline object inside the spider?
MySpider.py
```python
import scrapy

class MySpider(scrapy.Spider):

    def __init__(self):
        self.pipe = None

    def start_requests(self):
        url = self.pipe.getUrl()
        yield scrapy.Request(url, callback=self.parse)
```
Pipeline.py
```python
from scrapy import signals

class MyPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)

    def spider_opened(self, spider):
        spider.pipe = self

    def getUrl(self):
        ...
```
Answer
Scrapy pipelines already have the expected methods open_spider and close_spider.
Taken from the docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html#open_spider
open_spider(self, spider)
This method is called when the spider is opened.
Parameters: spider (Spider object) – the spider which was opened
close_spider(self, spider)
This method is called when the spider is closed.
Parameters: spider (Spider object) – the spider which was closed
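These two hooks can be sketched as a plain class, since a Scrapy pipeline needs no base class. This is a minimal sketch of the lifecycle described above; the class name, in-memory SQLite store, and item handling are illustrative assumptions, not part of the original answer:

```python
import sqlite3

class UrlStorePipeline:
    # Illustrative pipeline: acquires a DB connection when the spider opens
    # and releases it when the spider closes.

    def open_spider(self, spider):
        # Called once when the spider is opened: acquire resources here.
        self.conn = sqlite3.connect(":memory:")

    def close_spider(self, spider):
        # Called once when the spider is closed: release resources here.
        self.conn.close()

    def process_item(self, item, spider):
        # Normal per-item processing goes here; must return the item.
        return item
```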
However, your original issue doesn't make much sense: why do you want to assign a pipeline reference to your spider? That seems like a very bad idea.
What you should do is open up the database and read the URLs in the spider itself.
```python
from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.start_urls = spider.get_urls_from_db()
        return spider

    def get_urls_from_db(self):
        db = ...  # get db cursor here
        urls = ...  # use cursor to pop your urls
        return urls
```
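As a concrete sketch of what get_urls_from_db could look like, here is a standalone version assuming a SQLite database with a table urls holding a single TEXT column url. The path, table, and column names are assumptions; swap in your own MySQL cursor as needed:

```python
import sqlite3

def get_urls_from_db(db_path="urls.db"):
    # Hypothetical schema: table "urls" with a single TEXT column "url".
    conn = sqlite3.connect(db_path)
    try:
        # Fetch every stored URL as a flat list of strings.
        return [row[0] for row in conn.execute("SELECT url FROM urls")]
    finally:
        conn.close()
```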