How to use meta in a Scrapy Rule
Problem description
def parse(self, response):
    my_item = {'test': 123, 'test2': 321}
    google_url = 'https://www.google.com/search?q=coffee+cans'
    yield Request(url=google_url, callback=self.google, meta={'my_item': my_item})

def google(self, response):
    my_item = response.meta['my_item']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a', allow='/dp', allow_domains='chewy.com'),
             callback="chewy"),
        Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a', allow='/p/', allow_domains='homedepot.com'),
             process_request=request.meta['my_item']=my_item, callback='homedepot')
    )

def homedepot(self, response):
    # my_item = response.meta['my_item']
Error message:
Traceback (most recent call last):
  File "/home/timmy/.local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/cmdline.py", line 149, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 251, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 137, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 338, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "/home/timmy/.local/lib/python3.6/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 674, in exec_module
  File "<frozen importlib._bootstrap_external>", line 781, in get_code
  File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/timmy/scrapy_tut/myproject/spiders/amazon.py", line 62
    process_request=request.meta['my_item']=my_item,callback='homedepot')
                                           ^
SyntaxError: invalid syntax
I edited the question to make it more testable. How can I pass my_item to the links extracted by Rule(LinkExtractor...)? (I moved the rules out of the spider's initialization to make it easier for me to use meta, but I still can't get it to work.)
Any help is greatly appreciated.
I tried using:
rules = (
    Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a', allow='/dp', allow_domains='chewy.com'),
         process_request=lambda request: request.meta.update({'my_item': my_item}), callback='chewy'),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="r"]/a', allow='/p/', allow_domains='homedepot.com'),
         process_request=lambda request: request.meta.update({'my_item': my_item}), callback='homedepot')
)
This gives no error, but the page isn't requested.
Recommended answer
Your first example is invalid Python code, as Python reports.
Your second example does not work because your callback for the process_request parameter of Rule, the lambda function, returns None.
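The failure mode can be seen without Scrapy at all: dict.update() mutates in place and returns None, so the lambda's body evaluates to None, and a process_request callback that returns None tells the rule to drop the request. A minimal sketch:

```python
# dict.update() mutates the dict in place and returns None, so a lambda
# whose body is a bare .update(...) call hands None back to the Rule,
# and a process_request callback that returns None filters the request out.
meta = {}
process = lambda request_meta: request_meta.update({'my_item': 123})
result = process(meta)
# meta is mutated, but result is None -- so the request would be dropped
```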
If you look at the documentation:
process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
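Per that contract, a working process_request callable has to return the request after touching its meta. A minimal sketch of the idea (FakeRequest is a hypothetical stand-in for scrapy.Request, used here only to keep the example self-contained):

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: only the meta dict matters here."""
    def __init__(self):
        self.meta = {}

def attach_item(request, my_item):
    # Mutate the request's meta, then return the request so the Rule
    # keeps it; returning None (as the bare .update() lambda does)
    # would filter the request out.
    request.meta['my_item'] = my_item
    return request

kept = attach_item(FakeRequest(), {'test': 123})
```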
That is actually not the only reason it does not work. To use rule-based link extractors, you must:
Subclass CrawlSpider. From your examples it's not clear whether you are doing so.
Don't reimplement the parse method in your subclass, as you are currently doing. If start_urls is not good enough for you, use it in combination with parse_start_url.
Rules must be declared as a class attribute. You are instead defining them as a variable within a method of your Spider subclass. That won't work.
Please re-read the documentation about CrawlSpider.
As for passing a value from the meta of a response to the meta of the next request, you have two choices:
Reimplement your spider as a Spider subclass, instead of a CrawlSpider subclass, manually performing all the logic without rule-based link extractors.
This is the natural step whenever a generic spider like CrawlSpider starts to feel too restrictive. Generic spider subclasses are good for simple use cases, but whenever you face something non-trivial, you should consider switching to a regular Spider subclass.
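The manual approach amounts to threading the item through each request's meta yourself, with no Rule involved. A rough pure-Python sketch of that hand-off (FakeRequest and FakeResponse are hypothetical stand-ins for scrapy.Request and its response, used only to keep the example runnable here; in a real Spider subclass you would yield scrapy.Request objects and let Scrapy invoke the callbacks):

```python
class FakeRequest:
    """Stand-in for scrapy.Request: carries a url, a callback, and meta."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class FakeResponse:
    """Stand-in for a Scrapy response: exposes the request's meta."""
    def __init__(self, request):
        self.meta = request.meta

def parse(response):
    my_item = {'test': 123, 'test2': 321}
    # Instead of a Rule, extract the links yourself and attach the
    # item to each request's meta explicitly.
    for url in ['https://www.homedepot.com/p/example']:
        yield FakeRequest(url, callback=homedepot, meta={'my_item': my_item})

def homedepot(response):
    # The item travels with the request and comes back on the response.
    yield response.meta['my_item']

request = next(parse(FakeResponse(FakeRequest('start', parse))))
item = next(homedepot(FakeResponse(request)))
```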
Wait for Scrapy 1.7 to be released, which should happen shortly (you can use Scrapy's master branch in the meantime). Scrapy 1.7 introduces a new response parameter for process_request callbacks, which will allow you to do something like:
def my_request_processor(request, response):
    request.meta['item'] = response.meta['item']
    return request

class MySpider(CrawlSpider):
    # …

    rules = (
        Rule(
            LinkExtractor(
                restrict_xpaths='//div[@class="r"]/a',
                allow='/p/',
                allow_domains='homedepot.com'
            ),
            process_request=my_request_processor,
            callback='homedepot'
        ),
    )

    # …