How does adding the dont_filter=True argument in scrapy.Request make my parsing method work?


Problem Description

Here is a simple spider:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["https://www.dmoz.org"]
    start_urls = ('https://www.dmoz.org/',)

    def parse(self, response):
        yield scrapy.Request(self.start_urls[0], callback=self.parse2)

    def parse2(self, response):
        print(response.url)

When you run the program, the parse2 method doesn't work and it doesn't print response.url. I then found the solution to this in the thread below.

Why is my crawl spider's parse method not calling my second request?

It's just that I needed to add dont_filter=True as an argument in the Request method to make the parse2 function work.

yield scrapy.Request(self.start_urls[0], callback=self.parse2, dont_filter=True)

But in the examples given in the Scrapy documentation and in many YouTube tutorials, they never use the dont_filter=True argument in the scrapy.Request method, and yet their second parse functions still work.

Take a look at this:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

Why can't my spider work unless dont_filter=True is added? What am I doing wrong? What were the duplicate links that my spider filtered in my first example?

P.S. I could have raised this in the Q&A thread I posted above, but I'm not allowed to comment unless I have 50 reputation (poor me!!).

Answer

Short answer: you are making duplicate requests. Scrapy has built-in duplicate filtering which is turned on by default. That's why parse2 doesn't get called. When you add dont_filter=True, Scrapy doesn't filter out the duplicate request, so this time it is processed.

Longer version:

In Scrapy, if you have set start_urls or have the start_requests() method defined, the spider automatically requests those URLs and passes the responses to the parse method, which is the default method used for parsing responses. From there you can yield new requests, which will again be crawled by Scrapy. If you don't set a callback, the parse method will be used again; if you set a callback, that callback will be used.

Scrapy also has a built-in filter which stops duplicate requests. That is, if Scrapy has already crawled a page and parsed the response, even if you yield another request for that URL, Scrapy will not process it.
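
If you ever need to turn this filtering off for a whole crawl rather than per request, Scrapy exposes the filter through the DUPEFILTER_CLASS setting. A minimal sketch of a project settings.py, assuming the standard project layout (shown only as an alternative for illustration; it is not required for the fix in this question):

# settings.py -- applies to every request in the project, so use with care

# The default filter drops any request whose fingerprint has already been seen:
# DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Swapping in the no-op base class disables duplicate filtering entirely:
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'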

In your case, you have the URL in start_urls. Scrapy starts with that URL, crawls the page and passes the response to parse. Inside that parse method, you again yield a request to that same URL (which Scrapy has just processed), this time with parse2 as the callback. When this request is yielded, Scrapy sees it as a duplicate, so it ignores the request and never processes it. As a result, parse2 is never called.
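
To make the "duplicate" concrete: only the re-request of start_urls[0] is affected; a request to any URL Scrapy has not seen yet reaches parse2 without dont_filter. A minimal sketch of the parse method (the /computers/ path is a hypothetical example URL, not taken from the original code):

def parse(self, response):
    # Same URL as start_urls[0]: filtered as a duplicate unless dont_filter=True
    yield scrapy.Request(self.start_urls[0], callback=self.parse2, dont_filter=True)

    # A URL Scrapy has not requested yet: not filtered, parse2 runs normally
    yield scrapy.Request('https://www.dmoz.org/computers/', callback=self.parse2)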

If you want to control which URLs are processed and which callback is used, I recommend you override start_requests() and return a list of scrapy.Request objects instead of using the single start_urls attribute.
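
A minimal sketch of that approach applied to the spider above (the second URL is a hypothetical extra page, included only to show per-request callbacks):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "dmoz"

    def start_requests(self):
        # Every request carries its own callback, so there is nothing to
        # re-request from parse and nothing for the duplicate filter to drop.
        yield scrapy.Request('https://www.dmoz.org/', callback=self.parse2)
        # Hypothetical extra page, just to show a second explicit callback
        yield scrapy.Request('https://www.dmoz.org/computers/', callback=self.parse2)

    def parse2(self, response):
        print(response.url)

Because each callback is attached when the request is first created, dont_filter=True is no longer needed here.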

