How to use meta in a Scrapy Rule


Problem description

def parse(self,response):
    my_item={'test':123,'test2':321}
    google_url = 'https://www.google.com/search?q=coffee+cans'
    yield Request(url=google_url,callback=self.google,meta={'my_item':my_item})

def google(self,response):
    my_item = response.meta['my_item']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a',allow='/dp',allow_domains='chewy.com'),
            callback="chewy"),
        Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a',allow='/p/',allow_domains='homedepot.com'),
            process_request=request.meta['my_item']=my_item,callback='homedepot')
        )
def homedepot(self,response):
    #my_item = response.meta['my_item']

Error message:

Traceback (most recent call last):
  File "/home/timmy/.local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/cmdline.py", line 149, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 251, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 137, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 338, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 674, in exec_module
  File "<frozen importlib._bootstrap_external>", line 781, in get_code
  File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/timmy/scrapy_tut/myproject/spiders/amazon.py", line 62
    process_request=request.meta['my_item']=my_item,callback='homedepot')
                                           ^
SyntaxError: invalid syntax

I edited the question to make it more testable. How can I pass my_item to the links extracted by Rule(LinkExtractor...)? (I moved the rules out of the spider's initialization to make it easier to use meta, but I still can't get it to work.)

Any help is greatly appreciated.

I tried using:

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a', allow='/dp', allow_domains='chewy.com'),
             process_request=lambda request: request.meta.update({'my_item': my_item}), callback='chewy'),
        Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a', allow='/p/', allow_domains='homedepot.com'),
             process_request=lambda request: request.meta.update({'my_item': my_item}), callback='homedepot'),
    )

This gives no error, but the pages aren't requested.

Recommended answer

Your first example is not valid Python code, as Python reports.

Your second example does not work because the callable you pass as the process_request parameter of Rule, the lambda function, returns None.

If you look at the documentation:

process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
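In other words, a process_request callable only keeps a request if it returns it; dict.update() returns None, so every request your lambda touches is dropped. A minimal sketch of a callable that would not filter the requests (the hard-coded value is only illustrative, since before Scrapy 1.7 the response is not available at this point):

def add_my_item(request):
    # Before Scrapy 1.7 this callable receives only the request, so
    # response.meta cannot be read here; the value is a placeholder.
    request.meta['my_item'] = {'test': 123, 'test2': 321}
    return request  # returning the request keeps it; returning None filters it out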

That is actually not the only reason it does not work. To use rule-based link extractors, you must:

  • Subclass CrawlSpider. From your examples it's not clear whether you are doing so.

  • Not reimplement the parse method in your subclass, as you currently do. If start_urls is not good enough for you, use it in combination with parse_start_url.

  • Declare rules as a class attribute. You are instead defining them as a variable inside a method of your spider subclass; that won't work.

Please re-read the CrawlSpider documentation.
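Putting those three points together, a minimal sketch of the expected structure might look like this (the class name, spider name and start URL are placeholders, not taken from the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleCrawlSpider(CrawlSpider):
    name = 'example_crawl'
    start_urls = ['https://www.google.com/search?q=coffee+cans']

    # rules is a class attribute, not a local variable inside a method
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//div[@class="r"]/a',
                          allow='/p/', allow_domains='homedepot.com'),
            callback='homedepot',
        ),
    )

    def parse_start_url(self, response):
        # CrawlSpider uses parse() internally; per-start-URL logic goes here instead
        return []

    def homedepot(self, response):
        # parse the product page here
        pass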

As for passing a value from the meta of a response to the meta of the next request, you have two choices.

The first is to reimplement your spider as a Spider subclass, instead of a CrawlSpider subclass, and perform all the logic manually, without rule-based link extractors.

This is the natural step whenever a generic spider like CrawlSpider starts to feel too restrictive. Generic spider subclasses are good for simple use cases, but whenever you face something non-trivial you should consider switching to a regular Spider subclass.
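A minimal sketch of that first option, extracting the links manually with LinkExtractor and forwarding the item through meta (the item value and spider name are placeholders):

import scrapy
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(scrapy.Spider):
    name = 'example_manual'
    start_urls = ['https://www.google.com/search?q=coffee+cans']

    def parse(self, response):
        my_item = {'test': 123, 'test2': 321}
        extractor = LinkExtractor(restrict_xpaths='//div[@class="r"]/a',
                                  allow='/p/', allow_domains='homedepot.com')
        # extract_links() returns Link objects; forward my_item via meta on each request
        for link in extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.homedepot,
                                 meta={'my_item': my_item})

    def homedepot(self, response):
        my_item = response.meta['my_item']
        # combine my_item with data scraped from this page
        yield my_item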

The second is to wait for Scrapy 1.7 to be released, which should happen shortly (you could use the master branch of Scrapy in the meantime). Scrapy 1.7 introduces a new response parameter for process_request callbacks, which will allow you to do something like:


def my_request_processor(request, response):
    request.meta['item'] = response.meta['item']
    return request

class MySpider(CrawlSpider):

    # …

    rules = (
        Rule(
            LinkExtractor(
                restrict_xpaths='//div[@class="r"]/a',
                allow='/p/',
                allow_domains='homedepot.com'
            ),
            process_request=my_request_processor,
            callback='homedepot'
        )
    )

    # …

