Scrapy: Rules set inside __init__ are ignored by CrawlSpider
Question
I've been stuck on this for a few days, and it's driving me crazy.
I call my Scrapy spider like this:
scrapy crawl example -a follow_links="True"
I pass in the "follow_links" flag to determine whether the entire website should be scraped, or just the index page I have defined in the spider.
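Note that arguments passed with `-a` always arrive in the spider as strings, never as booleans, which is why the constructor below has to compare against the string "True". A small helper (plain Python, hypothetical name, no Scrapy required) sketches a more forgiving way to interpret such a flag:

```python
def parse_follow_links(value):
    # Scrapy delivers -a arguments as strings, so "True", "true", "1", etc.
    # must be interpreted explicitly; a real boolean never arrives.
    return str(value).lower() in ("true", "1", "yes")
```

With this, `-a follow_links=true` and `-a follow_links=1` would behave the same as `-a follow_links="True"`, and a missing flag falls through to `False`.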
This flag is checked in the spider's constructor to see which rule should be set:
def __init__(self, *args, **kwargs):
    super(ExampleSpider, self).__init__(*args, **kwargs)
    self.follow_links = kwargs.get('follow_links')
    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )
If it's "True", all links are allowed; if it's "False", all links are denied.
So far, so good; however, these rules are ignored. The only way I can get the rules to be followed is to define them outside of the constructor, meaning something like this works correctly:
class ExampleSpider(CrawlSpider):
    rules = (
        Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
    )

    def __init__(self, *args, **kwargs):
        ...
So basically, defining the rules within the __init__ constructor causes them to be ignored, whereas defining the rules outside of the constructor works as expected.
I cannot understand this. My code is below.
import re
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags, remove_comments, replace_escape_chars, replace_entities, remove_tags_with_content


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # if the rule below is uncommented, it works as expected (i.e. follow links and call parse_pages)
    # rules = (
    #     Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
    # )

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        # single page or follow links
        self.follow_links = kwargs.get('follow_links')
        if self.follow_links == "True":
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )

    def parse_pages(self, response):
        print("In parse_pages")
        print(response.xpath('/html/body').extract())
        return None

    def parse_start_url(self, response):
        print("In parse_start_url")
        print(response.xpath('/html/body').extract())
        return None
Thank you for taking the time to help me on this matter.
Answer
The problem here is that the CrawlSpider constructor (__init__) also processes the rules attribute, so if you need to assign rules yourself, you have to do it before calling the default constructor.
In other words, do everything you need before calling super(ExampleSpider, self).__init__(*args, **kwargs):
def __init__(self, *args, **kwargs):
    # setting my own rules -- this must happen BEFORE the call to super()
    self.rules = ( ... )
    super(ExampleSpider, self).__init__(*args, **kwargs)
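The ordering matters because CrawlSpider's constructor runs an internal compilation step (`_compile_rules` in current Scrapy versions) that snapshots whatever `self.rules` holds at that moment; anything assigned afterwards is never compiled. A toy model (plain Python, no Scrapy, hypothetical class names) shows the effect of the ordering:

```python
class BaseCrawler:
    """Mimics CrawlSpider: __init__ compiles self.rules exactly once."""
    rules = ()

    def __init__(self):
        # snapshot of self.rules at construction time, like _compile_rules
        self._compiled_rules = list(self.rules)


class RulesBeforeSuper(BaseCrawler):
    def __init__(self, follow_links):
        # assign rules first, THEN let the base class compile them
        self.rules = ('follow-rule',) if follow_links else ('deny-rule',)
        super().__init__()


class RulesAfterSuper(BaseCrawler):
    def __init__(self, follow_links):
        # base class compiles the empty class-level rules...
        super().__init__()
        # ...so this assignment arrives too late and is ignored
        self.rules = ('follow-rule',) if follow_links else ('deny-rule',)
```

In the spider from the question, the fix is simply to move both `self.rules = (...)` branches above the `super(ExampleSpider, self).__init__(*args, **kwargs)` line.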