How to add instance variable to Scrapy CrawlSpider?

Question

I am running a CrawlSpider and I want to implement some logic to stop following some of the links in mid-run, by passing a function to process_request.

This function uses the spider's class variables in order to keep track of the current state, and depending on it (and on the referrer URL), links get dropped or continue to be processed:

from scrapy.exceptions import IgnoreRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BroadCrawlSpider(CrawlSpider):
    name = 'bitsy'
    start_urls = ['http://scrapy.org']
    foo = 5

    rules = (
        Rule(LinkExtractor(), callback='parse_item', process_request='filter_requests', follow=True),
    )

    def parse_item(self, response):
        # <some code>
        pass

    def filter_requests(self, request):
        if self.foo == 6 and request.headers.get('Referer', None) == someval:
            raise IgnoreRequest("Ignored request: bla %s" % request)
        return request

I think that if I were to run several spiders on the same machine, they would all use the same class variables which is not my intention.
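The concern above comes down to Python's class-vs-instance attribute semantics. A generic sketch (not Scrapy-specific; Counter is a made-up class for illustration) of how a class attribute is shared until an instance shadows it:

```python
class Counter:
    foo = 5  # class attribute, shared by every instance


a = Counter()
b = Counter()

Counter.foo = 6          # rebinding on the class is visible to all instances
print(a.foo, b.foo)      # -> 6 6

a.foo = 7                # assigning via an instance shadows the class attribute
print(a.foo, b.foo)      # -> 7 6
print(Counter.foo)       # -> 6 (the class attribute itself is unchanged)
```

In other words, `self.foo = ...` inside a spider method would already create a per-instance attribute; only reads of an unset attribute fall through to the class.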

Is there a way to add instance variables to CrawlSpiders? Is only a single instance of the spider created when I run Scrapy?

I could probably work around it with a dictionary with values per process ID, but that will be ugly...

Answer

I think spider arguments would be the solution in your case.

When invoking scrapy like scrapy crawl some_spider, you could add arguments like scrapy crawl some_spider -a foo=bar, and the spider would receive the values via its constructor, e.g.:

class SomeSpider(scrapy.Spider):
    def __init__(self, foo=None, *args, **kwargs):
        super(SomeSpider, self).__init__(*args, **kwargs)
        # Do something with foo

What's more, since scrapy.Spider actually sets all additional arguments as instance attributes, you don't even need to override the __init__ method explicitly; you can just access the .foo attribute. :)
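Under the hood, scrapy.Spider's constructor essentially copies every extra keyword argument into the instance's attribute dictionary. A minimal sketch of that mechanism (MiniSpider is a stand-in for illustration, not Scrapy's actual class; note that values passed with -a on the command line always arrive as strings):

```python
class MiniSpider:
    """Stand-in showing how scrapy.Spider turns -a arguments
    into instance attributes (not Scrapy's actual implementation)."""

    def __init__(self, name=None, **kwargs):
        self.name = name
        # Copy each remaining keyword argument onto the instance,
        # which is roughly what scrapy.Spider.__init__ does.
        self.__dict__.update(kwargs)


spider = MiniSpider(name='bitsy', foo='6')
print(spider.foo)  # -> '6' (command-line -a values are strings)
```

Because each `scrapy crawl` invocation builds its own spider instance this way, attributes set via -a are per-instance and won't leak between concurrently running spiders.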
