Scrapy rules not working when process_request and callback parameter are set


Problem description

I have this rule for my CrawlSpider to crawl with:

rules = [
    Rule(LinkExtractor(
             allow=r'/topic/\d+/organize$',
             restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]'),
         process_request='request_tagPage', callback='parse_tagPage', follow=True)
]

request_tagPage() refers to a function that adds a cookie to requests, and parse_tagPage() refers to a function that parses the target pages. According to the documentation, CrawlSpider should use request_tagPage to make the requests and, once responses are returned, call parse_tagPage() to parse them. However, I realized that when request_tagPage() is used, the spider doesn't call parse_tagPage() at all. So in the actual code, I manually set parse_tagPage() as the callback inside request_tagPage, like this:

def request_tagPage(self, request):
    # Attach the cookie jar to the request (otherwise I can't log in)
    # and manually add parse_tagPage as the callback.
    return Request(request.url,
                   meta={"cookiejar": 1},
                   headers=self.headers,
                   callback=self.parse_tagPage)


It worked, but now the spider doesn't use the rules to expand its crawling: it closes after crawling the links from start_urls. However, before I manually set parse_tagPage() as the callback in request_tagPage(), the rules worked. So I'm thinking this may be a bug? Is there a way to combine request_tagPage(), which I need in order to attach the cookie to the request, parse_tagPage(), which parses a page, and the rules, which direct the spider to crawl?
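The behaviour described above can be reproduced without Scrapy. Below is a minimal, framework-free sketch of the mechanism; FakeRequest and rule_wrap are stand-ins invented for illustration, not real Scrapy API (the real internal dispatcher is roughly CrawlSpider._response_downloaded). It shows why pointing the callback straight at parse_tagPage bypasses link following:

```python
# Framework-free sketch: FakeRequest and rule_wrap are toy stand-ins.
class FakeRequest:
    def __init__(self, url, callback=None, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

def rule_wrap(url, user_callback):
    """Model of how a CrawlSpider Rule wires a request: the request's
    callback is an internal dispatcher that calls YOUR callback and
    then re-applies the rules to follow more links."""
    def dispatcher(response):
        results = list(user_callback(response) or [])
        results.append("follow-links")  # the rules fire here
        return results
    return FakeRequest(url, callback=dispatcher)

def parse_tagPage(response):
    yield "item"

# Request produced by the rule: link-following survives.
rule_req = rule_wrap("http://example.com/topic/1/organize", parse_tagPage)
print(rule_req.callback("resp"))          # ['item', 'follow-links']

# What the question's request_tagPage did: a fresh request whose
# callback points straight at parse_tagPage. The dispatcher (and with
# it the rules) is gone, so crawling stops after this page.
broken_req = FakeRequest(rule_req.url, callback=parse_tagPage,
                         meta={"cookiejar": 1})
print(list(broken_req.callback("resp")))  # ['item']
```

The sketch makes the symptom concrete: both requests run the user callback, but only the rule-wired one keeps following links.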

Recommended answer


I found the problem. CrawlSpider uses its default parse() to apply the rules. So when my custom parse_tagPage() is called, no parse() follows up to keep applying the rules. The solution is simply to call the default parse() from my custom parse_tagPage(). It now looks like this:

def parse_tagPage(self, response):
    # parse the response, get the information I want...
    # save the information into a local file...
    return self.parse(response) # simply calls the default parse() function to enable the rules
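An alternative worth noting (a sketch, not from the original post): Scrapy's Request.replace() returns a copy of a request with only the given attributes changed, so request_tagPage could attach the cookie without ever touching the callback that the rule installed. The FakeRequest below is a toy stand-in used only to illustrate the idea:

```python
from dataclasses import dataclass, field, replace as dc_replace

# Toy stand-in for scrapy.Request; the real Request.replace() likewise
# returns a copy with the given attributes swapped and everything else
# (the callback included) preserved.
@dataclass
class FakeRequest:
    url: str
    callback: object = None
    headers: dict = field(default_factory=dict)
    meta: dict = field(default_factory=dict)

    def replace(self, **kwargs):
        return dc_replace(self, **kwargs)

# Request as produced by the rule, already carrying its dispatcher callback.
rule_req = FakeRequest("http://example.com/topic/1/organize",
                       callback="rule-dispatcher")

# process_request only swaps in the cookie jar; the callback survives,
# so both parsing and rule-based link following keep working.
fixed = rule_req.replace(meta={"cookiejar": 1})
print(fixed.callback)  # rule-dispatcher
```

With the real API this would be `return request.replace(headers=self.headers, meta={"cookiejar": 1})` inside request_tagPage, with no callback= argument at all.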
