Scrapy: Dynamically passing parameter from command line to pipeline


Problem description


I'm working with scrapy. I have a spider that starts with:

class For_Spider(Spider):

    name = "for"
    table = 'hello' # creating dummy attribute. will be overwritten

    def start_requests(self):

        self.table = self.dc # dc is passed in

I have the following pipeline :

class DynamicSQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Here, you get whatever value was passed through the "table" parameter
        table = getattr(crawler.spider, "table")
        return cls(table)

    def __init__(self, table):
        try:
            db_path = "sqlite:///" + settings.SETTINGS_PATH + "\\data.db"
            db = dataset.connect(db_path)
            table_name = table[0:3]  # first 3 letters
            self.my_table = db[table_name]
        except Exception:
            raise  # surface connection/setup errors instead of swallowing them

When I start the spider with:

scrapy crawl for -a dc=input_string -a records=1

After stepping through the execution repeatedly, and with help from questions like What is the relationship between the crawler object with spider and pipeline objects?, it appears that the order of execution is:

1) For_spider
2) DynamicSQLlitePipeline
3) start_requests

The parameter "table" in the spider is passed to the DynamicSQLlitePipeline object by the from_crawler method, which has access to the different components of the scrapy system. table is initialized to "hello" (the dummy value I set). After steps 1 and 2 above, execution returns to the spider and start_requests begins. The command-line parameters only become available inside start_requests, so it is too late to set the table name dynamically: the pipeline has already been instantiated.

Therefore I don't know whether there is a way to set the pipeline table name dynamically. How can I do this?

edit:

elRuLL is correct, and his solution works. I looked through the spider object in step 1 and did not find any of the arguments listed on the spider. Am I missing them?

>>> Spider.__dict__
mappingproxy({'__module__': 'scrapy.spiders', '__doc__': 'Base class for scrapy spiders. All spiders must inherit from this\n    class.\n    ', 'name': None, 'custom_settings': None, '__init__': <function Spider.__init__ at 0x00000000047A6D90>, 'logger': <property object at 0x0000000003E0E598>, 'log': <function Spider.log at 0x00000000047A6EA0>, 'from_crawler': <classmethod object at 0x0000000003B28278>, 'set_crawler': <function Spider.set_crawler at 0x00000000047C9048>, '_set_crawler': <function Spider._set_crawler at 0x00000000047C90D0>, 'start_requests': <function Spider.start_requests at 0x00000000047C9158>, 'make_requests_from_url': <function Spider.make_requests_from_url at 0x00000000047C91E0>, 'parse': <function Spider.parse at 0x00000000047C9268>, 'update_settings': <classmethod object at 0x0000000003912C88>, 'handles_request': <classmethod object at 0x0000000003E0B7F0>, 'close': <staticmethod object at 0x0000000004756BA8>, '__str__': <function Spider.__str__ at 0x00000000047C9488>, '__repr__': <function Spider.__str__ at 0x00000000047C9488>, '__dict__': <attribute '__dict__' of 'Spider' objects>, '__weakref__': <attribute '__weakref__' of 'Spider' objects>})

Solution

Scrapy arguments are passed dynamically to the spider instance, and can be accessed later inside the Spider through self.

Now, start_requests is not the first place where you can check for the spider arguments; that would of course be the constructor of the Spider instance (but be careful, because scrapy also passes important arguments to that constructor).
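To illustrate, here is a minimal sketch of capturing the argument in the constructor. This is plain Python, not real scrapy: the `Spider` base class below only mimics how `scrapy.Spider.__init__` turns each `-a` argument into an instance attribute.

```python
# Stand-in for scrapy.Spider: keyword arguments become instance attributes,
# which is how "-a dc=input_string" ends up as self.dc on the spider.
class Spider:
    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        self.__dict__.update(kwargs)

class ForSpider(Spider):
    name = "for"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # let the base class do its setup first
        # -a dc=... is now available as self.dc, before any pipeline is built
        self.table = getattr(self, "dc", "hello")

spider = ForSpider(dc="input_string")
print(spider.table)  # input_string
```

Because the constructor runs in step 1, before the pipeline is instantiated in step 2, self.table already holds the command-line value by the time from_crawler looks at the spider.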

Now, your problem is that you were trying to read the class variable table in the Pipeline constructor (from_crawler executes before the pipeline exists), which is incorrect, because you only assign self.table inside start_requests, and that has not run yet at that point.

The correct way is to read getattr(crawler.spider, 'dc') directly, since the spider received the dc variable from the command line.
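The corrected pipeline might then look like the sketch below, runnable outside scrapy for testing purposes; the SimpleNamespace objects are stand-ins for the real crawler and spider, and the database wiring from the question is omitted.

```python
from types import SimpleNamespace

class DynamicSQLlitePipeline:
    @classmethod
    def from_crawler(cls, crawler):
        # By the time pipelines are instantiated, the spider instance already
        # carries the -a arguments, so read 'dc' straight off crawler.spider.
        table = getattr(crawler.spider, "dc")
        return cls(table)

    def __init__(self, table):
        # only the naming logic from the question is kept here
        self.table_name = table[0:3]  # first 3 letters

# Stand-in crawler, as if started with: scrapy crawl for -a dc=input_string
crawler = SimpleNamespace(spider=SimpleNamespace(dc="input_string"))
pipeline = DynamicSQLlitePipeline.from_crawler(crawler)
print(pipeline.table_name)  # inp
```

No dummy table attribute on the spider is needed at all with this approach, since the pipeline reads the command-line argument itself.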
