Scrapy. How to change spider settings after start crawling?


Question

I can't change spider settings in the parse method, but there must be a way.

For example:

class SomeSpider(BaseSpider):
    name = 'mySpider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']
    settings.overrides['ITEM_PIPELINES'] = ['myproject.pipelines.FirstPipeline']
    print settings['ITEM_PIPELINES'][0]
    #printed 'myproject.pipelines.FirstPipeline'
    def parse(self, response):
        #...some code
        settings.overrides['ITEM_PIPELINES'] = ['myproject.pipelines.SecondPipeline']
        print settings['ITEM_PIPELINES'][0]
        # printed 'myproject.pipelines.SecondPipeline'
        item = MyItem()
        item['name'] = 'Name for SecondPipeline'

But the item is still processed by FirstPipeline; the new ITEM_PIPELINES value has no effect. How can I change settings after crawling has started? Thanks in advance!

Solution

If you want different spiders to have different pipelines, you can set a pipelines list attribute on the spider that defines the pipelines for that spider. Then, in each pipeline, check whether it is listed:

class MyPipeline(object):

    def process_item(self, item, spider):
        # Pass the item through unchanged unless this spider opted in.
        if self.__class__.__name__ not in getattr(spider, 'pipelines', []):
            return item
        ...
        return item

class MySpider(CrawlSpider):
    pipelines = set([
        'MyPipeline',
        'MyPipeline3',
    ])
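The check above can be exercised without Scrapy. Below is a minimal standalone sketch of the same opt-in mechanism; the plain classes stand in for real spiders, and the 'tag' field is a hypothetical example of pipeline work:

```python
# Standalone sketch of the per-spider pipeline check; no Scrapy needed.

class MyPipeline(object):
    def process_item(self, item, spider):
        # Skip spiders that did not list this pipeline by class name.
        if self.__class__.__name__ not in getattr(spider, 'pipelines', []):
            return item
        item['tag'] = 'MyPipeline'  # hypothetical processing step
        return item

class MySpider(object):
    pipelines = set(['MyPipeline', 'MyPipeline3'])

class OtherSpider(object):
    pass  # no 'pipelines' attribute, so MyPipeline leaves its items alone

pipeline = MyPipeline()
tagged = pipeline.process_item({}, MySpider())        # gets the tag
untouched = pipeline.process_item({}, OtherSpider())  # passes through
```

Because the lookup uses `getattr(..., [])`, spiders that never declare a `pipelines` attribute are simply left alone, which keeps the pipeline safe to enable project-wide.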

If you want different items to be processed by different pipelines, you can do this:

class MyPipeline2(object):
    def process_item(self, item, spider):
        if isinstance(item, MyItem):
            ...
            return item
        return item
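The same type-based routing can be sketched without Scrapy; here `MyItem` and `ArticleItem` are hypothetical dict subclasses standing in for `scrapy.Item` classes, and the 'source' field is illustrative:

```python
# Standalone sketch of routing items by type inside a pipeline.

class MyItem(dict):
    pass

class ArticleItem(dict):
    pass

class MyPipeline2(object):
    def process_item(self, item, spider):
        if isinstance(item, MyItem):
            item['source'] = 'MyPipeline2'  # hypothetical processing
            return item
        return item  # every other item type passes through untouched

pipeline = MyPipeline2()
handled = pipeline.process_item(MyItem(), spider=None)
skipped = pipeline.process_item(ArticleItem(), spider=None)
```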
