Scrapy: Get Start_Urls from Database by Pipeline


Problem description

Unfortunately I don't have enough reputation to leave a comment, so I have to ask this as a new question, referring to https://stackoverflow.com/questions/23105590/how-to-get-the-pipeline-object-in-scrapy-spider

I have many URLs in a database, so I want to get the start URLs from my DB. So far, not a big problem. However, I don't want the MySQL code inside the spider, and when I move it into the pipeline I run into a problem: if I try to hand the pipeline object over to my spider as in the referenced question, I only get an AttributeError with the message

'NoneType' object has no attribute 'getUrl'

I think the actual problem is that the spider_opened function never gets called (I also inserted a print statement there, and its output never showed up in the console). Does anybody have an idea how to get the pipeline object inside the spider?

MySpider.py

def __init__(self):
    self.pipe = None

def start_requests(self):
    url = self.pipe.getUrl()
    yield scrapy.Request(url, callback=self.parse)

Pipeline.py

@classmethod
def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)

def spider_opened(self, spider):
    spider.pipe = self

def getUrl(self):
    ...

Solution

Scrapy pipelines already have the expected methods open_spider and close_spider.

Taken from the documentation: https://doc.scrapy.org/en/latest/topics/item-pipeline.html#open_spider


open_spider(self, spider)
This method is called when the spider is opened.
Parameters: spider (Spider object) – the spider which was opened


close_spider(self, spider)
This method is called when the spider is closed.
Parameters: spider (Spider object) – the spider which was closed
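
For illustration, a minimal pipeline using these hooks could look like the sketch below (the class name and logging calls are assumptions for this example, not code from the question):

class MyPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts; open resources (DB connections, files) here.
        spider.logger.info('pipeline opened for %s', spider.name)

    def close_spider(self, spider):
        # Called once when the spider finishes; release those resources here.
        spider.logger.info('pipeline closed for %s', spider.name)

    def process_item(self, item, spider):
        # Regular per-item processing still happens here.
        return item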

However, your original approach doesn't make much sense. Why would you assign a pipeline reference to your spider? That seems like a very bad idea.

What you should do instead is open the database and read the URLs in the spider itself:

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.start_urls = spider.get_urls_from_db()
        return spider

    def get_urls_from_db(self):
        db = ...  # get db cursor here
        urls = ...  # use cursor to pop your urls
        return urls
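
As a minimal sketch of what get_urls_from_db could contain, assuming a hypothetical SQLite file urls.db with a urls table and a url column (a MySQL connection would follow the same pattern through its own driver):

import sqlite3

def get_urls_from_db(self):
    # Hypothetical schema: table 'urls' with a single 'url' column.
    conn = sqlite3.connect('urls.db')
    try:
        return [row[0] for row in conn.execute('SELECT url FROM urls')]
    finally:
        conn.close()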

