Scrapy rules not working when process_request and callback parameter are set

Views: 292 | Published: 2020/9/29 0:17:39 | Tags: callback, scrapy, web-crawler, rules


Question

I have this rule for a Scrapy CrawlSpider:

rules = [
    Rule(
        LinkExtractor(
            allow=r'/topic/\d+/organize$',  # raw string so \d is not treated as a string escape
            restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]',
        ),
        process_request='request_tagPage',
        callback='parse_tagPage',
        follow=True,
    )
]
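As a quick check of what the allow pattern in the rule above accepts, the sketch below applies the same regular expression with Python's re module. LinkExtractor matches allow patterns with a regex search against the URL, so the trailing $ anchors the match at the end of the URL. The example.com URLs and the is_allowed helper are made up for illustration and are not part of the original code.

```python
import re

# Same pattern as the rule's allow argument.
ALLOW = re.compile(r"/topic/\d+/organize$")


def is_allowed(url):
    """Return True if the URL would pass the allow pattern (hypothetical helper)."""
    return ALLOW.search(url) is not None


# Hypothetical URLs, used only to exercise the pattern.
print(is_allowed("https://example.com/topic/19550517/organize"))
print(is_allowed("https://example.com/topic/19550517/organize/entire"))
print(is_allowed("https://example.com/topic/abc/organize"))
```

Only the first URL passes: the second has extra path segments after organize, and the third has no numeric topic id.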

request_tagPage() is a function that adds a cookie to each request, and parse_tagPage() is a function that parses the target pages. According to the documentation, CrawlSpider should use request_tagPage() to make the requests and, once the responses are returned, call parse_tagPage() to parse them. However, I found that when request_tagPage() is used, the spider doesn't call parse_tagPage() at all. So in the actual code, I manually set parse_tagPage() as the callback inside request_tagPage(), like this:

def request_tagPage(self, request):
    # Attach the cookie jar to the request, otherwise I can't log in.
    return Request(
        request.url,
        meta={"cookiejar": 1},
        headers=self.headers,
        callback=self.parse_tagPage,  # manually set the callback
    )

It worked, but now the spider doesn't use the rules to expand its crawl. It closes after crawling the links from start_urls. Before I manually set parse_tagPage() as the callback in request_tagPage(), the rules worked. So I am wondering whether this is a bug. Is there a way to use all three together: request_tagPage(), which I need to attach the cookie to the request; parse_tagPage(), which parses a page; and the rules, which direct the spider's crawl?

Recommended answer

I found the problem. CrawlSpider uses its default parse() callback to apply the rules. Once my custom parse_tagPage() is set as the callback, parse() is never called on those responses, so the rules stop being applied. The solution is simply to call the default parse() from my custom parse_tagPage(). It now looks like this:

def parse_tagPage(self, response):
    # parse the response, get the information I want...
    # save the information into a local file...
    return self.parse(response) # simply calls the default parse() function to enable the rules
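To make the reasoning above concrete, here is a toy model of the callback chain in plain Python. It is a simplified sketch and not Scrapy's real implementation: ToySpider, SITE, crawl_pages and the (url, callback) tuples are all made up for illustration. The only assumption carried over from the answer is that the default parse() is the one place where links are extracted, and that process_request replaces the callback.

```python
from collections import deque


class ToySpider:
    """Toy model of the callback chain; not Scrapy's real implementation."""

    def __init__(self, site, chain_default):
        self.site = site                    # hypothetical link graph: url -> linked urls
        self.chain_default = chain_default  # True = apply the fix from the answer
        self.parsed = []                    # urls handled by the custom callback

    def parse(self, url):
        # Default callback: the only place where the rule extracts links.
        requests = [(link, self.parse) for link in self.site.get(url, [])]
        return [self.process_request(r) for r in requests]

    def process_request(self, request):
        # Mirrors request_tagPage: build a new request with a custom callback.
        url, _default_callback = request
        return (url, self.parse_tag_page)

    def parse_tag_page(self, url):
        self.parsed.append(url)
        # Without chaining to parse(), no further links are ever extracted.
        return self.parse(url) if self.chain_default else []

    def crawl(self, start_url):
        queue, seen = deque([(start_url, self.parse)]), {start_url}
        while queue:
            url, callback = queue.popleft()
            for request in callback(url):
                if request[0] not in seen:
                    seen.add(request[0])
                    queue.append(request)
        return self.parsed


# Hypothetical site: each page links to the next one.
SITE = {"start": ["a"], "a": ["b"], "b": ["c"]}


def crawl_pages(chain_default):
    return ToySpider(SITE, chain_default).crawl("start")


print(crawl_pages(False))  # custom callback only
print(crawl_pages(True))   # custom callback chained to parse()
```

With chain_default=False the crawl stops after the pages linked from the start URL, which matches the behaviour described in the question. With chain_default=True every page reached through the custom callback is handed back to parse(), so the crawl keeps expanding.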
