Scrapy: Rules set inside __init__ are ignored by CrawlSpider
Question
I've been stuck on this for a few days, and it's driving me crazy.
I call my Scrapy spider like this:
scrapy crawl example -a follow_links="True"
I pass in the "follow_links" flag to determine whether the entire website should be scraped, or just the index page I have defined in the spider.
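Note that arguments passed with `-a` always arrive in the spider as strings, never as booleans, which is why the constructor below has to compare against the string "True". A small helper (plain Python, hypothetical name, no Scrapy required) sketches a more forgiving way to interpret such a flag:

```python
def parse_follow_links(value):
    # Scrapy delivers -a arguments as strings, so "True", "true", "1", etc.
    # must be interpreted explicitly; a real boolean never arrives.
    return str(value).lower() in ("true", "1", "yes")
```

With this, `-a follow_links=true` and `-a follow_links=1` would behave the same as `-a follow_links="True"`, and a missing flag falls through to `False`.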
This flag is checked in the spider's constructor to see which rule should be set:
def __init__(self, *args, **kwargs):
    super(ExampleSpider, self).__init__(*args, **kwargs)
    self.follow_links = kwargs.get('follow_links')
    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )
If it's "True", all links are allowed; if it's "False", all links are denied.
So far, so good; however, these rules are ignored. The only way I can get the rules to be followed is to define them outside of the constructor, meaning something like this works correctly:
class ExampleSpider(CrawlSpider):
    rules = (
        Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
    )

    def __init__(self, *args, **kwargs):
        ...
So basically, defining the rules within the __init__ constructor causes them to be ignored, whereas defining the rules outside of the constructor works as expected.
I cannot understand this. My code is below.
import re
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags, remove_comments, replace_escape_chars, replace_entities, remove_tags_with_content


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # if the rule below is uncommented, it works as expected (i.e. follow links and call parse_pages)
    # rules = (
    #     Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
    # )

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        # single page or follow links
        self.follow_links = kwargs.get('follow_links')
        if self.follow_links == "True":
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )

    def parse_pages(self, response):
        print("In parse_pages")
        print(response.xpath('/html/body').extract())
        return None

    def parse_start_url(self, response):
        print("In parse_start_url")
        print(response.xpath('/html/body').extract())
        return None
Thank you for taking the time to help me on this matter.
Answer
The problem here is that the CrawlSpider constructor (__init__) also processes the rules attribute, so if you need to assign rules yourself, you have to do it before calling the default constructor.
In other words, do everything you need before calling super(ExampleSpider, self).__init__(*args, **kwargs):
def __init__(self, *args, **kwargs):
    # setting my own rules -- this must happen BEFORE the call to super()
    self.rules = ( ... )
    super(ExampleSpider, self).__init__(*args, **kwargs)
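The ordering matters because CrawlSpider's constructor runs an internal compilation step (`_compile_rules` in current Scrapy versions) that snapshots whatever `self.rules` holds at that moment; anything assigned afterwards is never compiled. A toy model (plain Python, no Scrapy, hypothetical class names) shows the effect of the ordering:

```python
class BaseCrawler:
    """Mimics CrawlSpider: __init__ compiles self.rules exactly once."""
    rules = ()

    def __init__(self):
        # snapshot of self.rules at construction time, like _compile_rules
        self._compiled_rules = list(self.rules)


class RulesBeforeSuper(BaseCrawler):
    def __init__(self, follow_links):
        # assign rules first, THEN let the base class compile them
        self.rules = ('follow-rule',) if follow_links else ('deny-rule',)
        super().__init__()


class RulesAfterSuper(BaseCrawler):
    def __init__(self, follow_links):
        # base class compiles the empty class-level rules...
        super().__init__()
        # ...so this assignment arrives too late and is ignored
        self.rules = ('follow-rule',) if follow_links else ('deny-rule',)
```

In the spider from the question, the fix is simply to move both `self.rules = (...)` branches above the `super(ExampleSpider, self).__init__(*args, **kwargs)` line.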