How to use meta in a Scrapy Rule


Problem description

def parse(self,response):
    my_item={'test':123,'test2':321}
    google_url = 'https://www.google.com/search?q=coffee+cans'
    yield Request(url=google_url,callback=self.google,meta={'my_item':my_item})

def google(self,response):
    my_item = response.meta['my_item']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a',allow='/dp',allow_domains='chewy.com'),
            callback="chewy"),
        Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a',allow='/p/',allow_domains='homedepot.com'),
            process_request=request.meta['my_item']=my_item,callback='homedepot')
        )
def homedepot(self,response):
    #my_item = response.meta['my_item']

Error message:

Traceback (most recent call last):
  File "/home/timmy/.local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/cmdline.py", line 149, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 251, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 137, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 338, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 674, in exec_module
  File "<frozen importlib._bootstrap_external>", line 781, in get_code
  File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/timmy/scrapy_tut/myproject/spiders/amazon.py", line 62
    process_request=request.meta['my_item']=my_item,callback='homedepot')
                                           ^
SyntaxError: invalid syntax

I edited the question to make it more testable. How can I pass my_item to the links extracted by Rule(LinkExtractor...)? (I moved the rules out of the spider's initialization to make it easier to use meta, but I still can't get it to work.)

Any help is greatly appreciated.

I tried using:

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a', allow='/dp', allow_domains='chewy.com'),
             process_request=lambda request: request.meta.update({'my_item': my_item}), callback='chewy'),
        Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a', allow='/p/', allow_domains='homedepot.com'),
             process_request=lambda request: request.meta.update({'my_item': my_item}), callback='homedepot'),
    )

This gives no error, but the pages aren't requested.

Recommended answer

Your first example is not valid Python code, as Python reports.

Your second example does not work because the callable you pass as the process_request parameter of Rule, the lambda function, returns None.

If you look at the documentation:

process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
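In other words, a process_request callable only keeps a request if it returns it; dict.update() returns None, so every request your lambda touches is dropped. A minimal sketch of a callable that would not filter the requests (the hard-coded value is only illustrative, since before Scrapy 1.7 the response is not available at this point):

def add_my_item(request):
    # Before Scrapy 1.7 this callable receives only the request, so
    # response.meta cannot be read here; the value is a placeholder.
    request.meta['my_item'] = {'test': 123, 'test2': 321}
    return request  # returning the request keeps it; returning None filters it out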

That is actually not the only reason it does not work. To use rule-based link extractors, you must:

  • Subclass CrawlSpider. From your examples it's not clear whether you are doing so.

  • Not reimplement the parse method in your subclass, as you currently do. If start_urls is not good enough for you, use it in combination with parse_start_url.

  • Declare rules as a class attribute. You are instead defining them as a variable inside a method of your spider subclass; that won't work.

Please re-read the CrawlSpider documentation.
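Putting those three points together, a minimal sketch of the expected structure might look like this (the class name, spider name and start URL are placeholders, not taken from the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleCrawlSpider(CrawlSpider):
    name = 'example_crawl'
    start_urls = ['https://www.google.com/search?q=coffee+cans']

    # rules is a class attribute, not a local variable inside a method
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//div[@class="r"]/a',
                          allow='/p/', allow_domains='homedepot.com'),
            callback='homedepot',
        ),
    )

    def parse_start_url(self, response):
        # CrawlSpider uses parse() internally; per-start-URL logic goes here instead
        return []

    def homedepot(self, response):
        # parse the product page here
        pass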

As for passing a value from the meta of a response to the meta of the next request, you have two choices.

The first is to reimplement your spider as a Spider subclass, instead of a CrawlSpider subclass, and perform all the logic manually, without rule-based link extractors.

This is the natural step whenever a generic spider like CrawlSpider starts to feel too restrictive. Generic spider subclasses are good for simple use cases, but whenever you face something non-trivial you should consider switching to a regular Spider subclass.
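A minimal sketch of that first option, extracting the links manually with LinkExtractor and forwarding the item through meta (the item value and spider name are placeholders):

import scrapy
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(scrapy.Spider):
    name = 'example_manual'
    start_urls = ['https://www.google.com/search?q=coffee+cans']

    def parse(self, response):
        my_item = {'test': 123, 'test2': 321}
        extractor = LinkExtractor(restrict_xpaths='//div[@class="r"]/a',
                                  allow='/p/', allow_domains='homedepot.com')
        # extract_links() returns Link objects; forward my_item via meta on each request
        for link in extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.homedepot,
                                 meta={'my_item': my_item})

    def homedepot(self, response):
        my_item = response.meta['my_item']
        # combine my_item with data scraped from this page
        yield my_item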

The second is to wait for Scrapy 1.7 to be released, which should happen shortly (you could use the master branch of Scrapy in the meantime). Scrapy 1.7 introduces a new response parameter for process_request callbacks, which will allow you to do something like:


def my_request_processor(request, response):
    request.meta['item'] = response.meta['item']
    return request

class MySpider(CrawlSpider):

    # …

    rules = (
        Rule(
            LinkExtractor(
                restrict_xpaths='//div[@class="r"]/a',
                allow='/p/',
                allow_domains='homedepot.com'
            ),
            process_request=my_request_processor,
            callback='homedepot'
        )
    )

    # …

