How to add attributes to a request in a Scrapy contract

Question

A Scrapy contract fails if the callback instantiates an Item or ItemLoader from the meta attribute of a Request() passed on by a previous parse method.

I was thinking of overriding ScrapesContract to preprocess the request and load some dummy values into request.meta, but I'm not sure whether that is good practice.

I have seen the pre_process method in the docs (illustrated by the HasHeaderContract at the bottom), where it reads attributes from the request object, but I'm not sure whether it can also be used to set attributes.

More details. Methods from an example crawler:

def parse_level_one(self, response):
    # populate loader
    return Request(url=url, callback=self.parse_level_two, meta={'loader': loader.load_item()})

def parse_level_two(self, response):
    """Parse product detail page

    @url http://example.com
    @scrapes some_field1 some_field2
    """
    loader = MyItemLoader(response.meta['loader'], response=response)

On the command line:

$ scrapy check crawlername
Traceback... loader = MyItemLoader(response.meta['loader'], response=response)
KeyError: 'loader'

The idea I'm considering is:

from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail
from scrapy.item import BaseItem  # older Scrapy API, deprecated in later releases


class LoadedScrapesContract(Contract):
    """Contract to check presence of fields in scraped items

    @loadedscrapes page_name page_body
    """

    name = 'loadedscrapes'

    def pre_process(self, response):
        # Idea: modify the response object here to add a meta attribute
        # (such as an empty Item() or dict), just so that the item loader
        # instantiation passes.
        pass

    # Same check as the built-in ScrapesContract
    def post_process(self, output):
        for x in output:
            if isinstance(x, BaseItem):
                for arg in self.args:
                    if arg not in x:
                        raise ContractFail("'%s' field is missing" % arg)

Recommended answer

The best solution I've found is to leave the contract alone and instead fall back to an empty item when the 'loader' key is missing from response.meta:

loader = MyItemLoader(response.meta.get('loader', MyItem()), response=response)

I prefer this method, but to stick to the question as asked: override adjust_request_args in a custom contract.
