How to get the pipeline object in a Scrapy spider
Question
I use MongoDB to store the crawled data.
Now I want to query the latest date in the stored data, so that I can continue the crawl without restarting from the beginning of the URL list. (Each URL is determined by a date, e.g. /2014-03-22.html.)
I want only one connection object to handle the database operations, and that object lives in the pipeline.
So I want to know how to get the pipeline object (not a new one) in the spider.
Or any better solution for incremental updates...
Thanks in advance.
Sorry for my poor English... Here is a sample:
# This is my Pipeline
import pymongo

from scrapy.conf import settings  # pre-1.0 style of accessing project settings


class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....
And the spider:
class Spider(Spider):
    name = "test"
    ....

    def parse(self, response):
        # Want to get the pipeline object here
        mongo = MongoDBPipeline()  # done this way, a brand-new pipeline object is created
        mongo.get_date()           # Scrapy already has a pipeline object for this spider;
        # I want that pipeline object, the one created when Scrapy started.
OK, I just don't want to create a new object... I admit I have OCD...
Answer
A Scrapy pipeline has an open_spider method that is executed after the spider is initialized. You can pass your spider a reference to the database connection, to the get_date() method, or to the pipeline itself. An example of the latter, using your code:
# This is my Pipeline
import pymongo

from scrapy.conf import settings  # pre-1.0 style of accessing project settings


class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ....

    def process_item(self, item, spider):
        ....

    def get_date(self):
        ....

    def open_spider(self, spider):
        # Hand the spider a reference to this pipeline instance.
        spider.myPipeline = self
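The get_date() body is elided above; as one possibility, it could query MongoDB for the newest stored document. A minimal sketch, assuming the pipeline keeps a self.collection handle and each item stores its date under a 'date' key (both the attribute and the field name are assumptions):

    def get_date(self):
        # Return the date of the most recently stored item, or None if empty.
        # 'self.collection' and the 'date' field name are assumptions.
        doc = self.collection.find_one(sort=[('date', pymongo.DESCENDING)])
        return doc['date'] if doc else None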
Then, in the spider:
class Spider(Spider):
    name = "test"

    def __init__(self):
        self.myPipeline = None

    def parse(self, response):
        self.myPipeline.get_date()
I don't think the __init__() method is necessary here, but I put it in to show that open_spider replaces it after initialization.
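Note that open_spider is only called if the pipeline is enabled in the project settings. A minimal sketch, assuming the pipeline lives in myproject/pipelines.py (the module path and the server values are assumptions):

# settings.py
MONGODB_SERVER = 'localhost'  # assumed values, matching the keys the pipeline reads
MONGODB_PORT = 27017

ITEM_PIPELINES = {
    'myproject.pipelines.MongoDBPipeline': 300,
}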
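With the pipeline reference in place, the incremental update from the question can be sketched: build the next dated URL from get_date() in start_requests. This is only a sketch; the host, the TestSpider name, the assumption that get_date() returns a date/datetime, and the assumption that the pipeline's open_spider runs before the engine pulls the first start request are all unverified:

from datetime import timedelta

import scrapy


class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        # myPipeline is attached by MongoDBPipeline.open_spider; in current
        # Scrapy versions the pipelines are opened before requests are pulled
        # from this generator, but that ordering is worth verifying.
        last = self.myPipeline.get_date()
        next_day = last + timedelta(days=1)
        # URL pattern /YYYY-MM-DD.html from the question; the host is made up.
        url = "http://example.com/%s.html" % next_day.strftime("%Y-%m-%d")
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        pass  # extract items here

If get_date() can return None (an empty collection), fall back to the first date in the original URL list.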