How to get the pipeline object in a Scrapy spider


Question

I am using MongoDB to store the crawled data.

Now I want to query the last date present in the stored data, so the crawl can resume from there instead of restarting from the beginning of the URL list (each URL is determined by a date, e.g. /2014-03-22.html).

I want a single connection object to handle all database operations, and that connection lives in the pipeline.

So I want to know how to get hold of the existing pipeline object (not a new one) from inside the spider.

Or any better solution for incremental updates...

Thanks.

Sorry for my poor English... Here is a sample:

import pymongo
from scrapy.conf import settings  # pre-1.0 style global settings access

# This is my Pipeline
class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        # pymongo.Connection is the legacy API (MongoClient in modern pymongo)
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....
    def process_item(self, item, spider):
        ....
    def get_date(self):
        ....

And the spider:

from scrapy import Spider

class TestSpider(Spider):
    name = "test"
    ....

    def parse(self, response):
        # Want to get the pipeline object here
        mongo = MongoDBPipeline()  # doing it this way creates a *new* pipeline object;
        mongo.get_date()           # Scrapy already created one for the spider at startup,
                                   # and that existing object is the one I want.

OK, I just don't want to instantiate a second object... I admit I'm a bit OCD about it...

Answer

A Scrapy pipeline has an open_spider method that is executed after the spider is initialized. You can pass your spider a reference to the database connection, to the get_date() method, or to the pipeline itself. An example of the latter, using your code:

import pymongo
from scrapy.conf import settings  # pre-1.0 style global settings access

# This is my Pipeline
class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....
    def get_date(self):
        ....

    def open_spider(self, spider):
        # Called by Scrapy when the spider is opened: expose this
        # pipeline instance as an attribute on the spider.
        spider.myPipeline = self
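
As an aside, newer Scrapy versions usually hand settings to a pipeline through a from_crawler classmethod rather than a global settings import, and current pymongo uses MongoClient instead of the long-removed Connection class. A minimal sketch of the same idea in that style (the MONGODB_* setting names are carried over from the question; everything else is an assumption):

import pymongo

class MongoDBPipeline(object):
    def __init__(self, mongo_server, mongo_port):
        self.mongo_server = mongo_server
        self.mongo_port = mongo_port

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the pipeline, passing the crawler
        # so the pipeline can read the project settings.
        return cls(
            mongo_server=crawler.settings.get('MONGODB_SERVER'),
            mongo_port=crawler.settings.get('MONGODB_PORT'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_server, self.mongo_port)
        spider.myPipeline = self  # same trick: expose the live pipeline

    def close_spider(self, spider):
        self.client.close()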

Then, in the spider:

from scrapy import Spider

class TestSpider(Spider):
    name = "test"

    def __init__(self):
        self.myPipeline = None

    def parse(self, response):
        self.myPipeline.get_date()

I don't think the __init__() method is necessary here, but I included it to show that open_spider overwrites the attribute after the spider is initialized.
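
Tying this back to the incremental-update part of the question: once open_spider has injected the pipeline, the spider can ask it for the last stored date and generate only the missing date URLs. This is a rough sketch under assumptions from the question (get_date() returns a datetime.date, URLs look like /2014-03-22.html, and example.com stands in for the real site); in practice Scrapy opens the item pipelines before it starts consuming start_requests, so myPipeline should already be set at that point:

from datetime import date, timedelta

from scrapy import Request, Spider

class TestSpider(Spider):
    name = "test"
    base_url = 'http://example.com'  # hypothetical site root

    def start_requests(self):
        # myPipeline was attached by MongoDBPipeline.open_spider
        day = self.myPipeline.get_date() + timedelta(days=1)
        while day <= date.today():
            yield Request('%s/%s.html' % (self.base_url, day.isoformat()),
                          callback=self.parse)
            day += timedelta(days=1)

    def parse(self, response):
        pass  # extract and yield items as before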
