Scrapy rules not working when process_request and callback parameter are set
Problem description
I have this rule for a Scrapy CrawlSpider:
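from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = [
    Rule(
        LinkExtractor(
            allow=r'/topic/\d+/organize$',  # raw string so \d is passed to the regex unchanged
            restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]',
        ),
        process_request='request_tagPage', callback='parse_tagPage', follow=True,
    )
]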
request_tagPage() is a function that adds a cookie to requests, and parse_tagPage() is a function that parses target pages. According to the documentation, CrawlSpider should use request_tagPage() to make the requests and, once the responses are returned, call parse_tagPage() to parse them. However, I realized that when request_tagPage() is used, the spider doesn't call parse_tagPage() at all. So in the actual code, I manually set parse_tagPage() as the callback inside request_tagPage(), like this:
def request_tagPage(self, request):
    # Attach the cookie jar to the request; otherwise I can't log in.
    return Request(
        request.url,
        meta={"cookiejar": 1},
        headers=self.headers,
        callback=self.parse_tagPage,  # manually add a callback function
    )
It worked, but now the spider doesn't use the rules to expand its crawl. It closes after crawling the links from start_urls. However, before I manually set parse_tagPage() as the callback in request_tagPage(), the rules worked. So I am thinking this may be a bug? Is there a way to use all three together: request_tagPage(), which I need to attach the cookie to the request; parse_tagPage(), which is used to parse a page; and rules, which direct the spider's crawl?
Recommended answer
I found the problem. CrawlSpider uses its default parse() to apply the rules. So when my custom parse_tagPage() is called, no parse() call follows it to keep applying the rules. The solution is simply to call the default parse() from my custom parse_tagPage(). It now looks like this:
def parse_tagPage(self, response):
    # parse the response, get the information I want...
    # save the information into a local file...
    return self.parse(response)  # call the default parse() so the rules keep being applied
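A possible alternative, offered as a sketch rather than a verified fix: instead of building a brand-new Request inside request_tagPage(), modify the request that the Rule already created and return it. As far as I understand CrawlSpider, that request carries the rule's own callback and a rule index in its meta, so keeping it intact should let both callback="parse_tagPage" and follow=True work without calling parse() by hand. The class name TagPageRequestMixin, the header value, and the SimpleNamespace stand-in are hypothetical and exist only so the snippet runs without Scrapy installed. The optional response parameter is an assumption meant to cover Scrapy versions that call process_request with (request, response).

```python
from types import SimpleNamespace


class TagPageRequestMixin:
    """Hypothetical holder; in a real project this method lives on the CrawlSpider subclass."""

    headers = {"User-Agent": "example-agent"}  # placeholder value (assumption)

    def request_tagPage(self, request, response=None):
        # Mutate the request built by the Rule instead of creating a new one,
        # so its callback and the rule bookkeeping stored in meta are preserved.
        request.meta["cookiejar"] = 1          # same cookie jar key as in the question
        request.headers.update(self.headers)   # add the spider's headers
        return request


# Stand-in for a Scrapy Request (assumption): only the attributes used above.
fake = SimpleNamespace(meta={"rule": 0}, headers={}, callback="rule-callback")
out = TagPageRequestMixin().request_tagPage(fake)
print(out.meta, out.callback)
```

With a real spider, the Rule from the question would stay unchanged, and parse_tagPage() would no longer need to return self.parse(response).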