Scrapy: handle 301/302 response codes as well as follow the target URL
Problem description
I am using Scrapy 1.0.5 to implement a crawler. I have set REDIRECT_ENABLED = False and handle_httpstatus_list = [500, 301, 302] so that I can scrape pages that return 301 and 302 responses. However, because REDIRECT_ENABLED is set to False, the spider does not go on to the target URL given in the Location response header. How can I achieve this?
Recommended answer
It has been a long time since I did anything like this, but you need to generate a Request object with url, meta and callback parameters.
But I seem to recall you can do it along the lines of:
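    from scrapy import Request

    def parse(self, response):
        # do whatever you need to do with the 301/302 response itself, then
        # follow the redirect by hand
        if response.status in (301, 302) and 'Location' in response.headers:
            # header values are bytes, so decode before building the URL
            location = response.headers['Location'].decode('utf-8')
            # response.urljoin() resolves a relative target against the
            # current URL and leaves an absolute target unchanged
            newurl = response.urljoin(location)
            yield Request(url=newurl, meta=response.meta, callback=self.parse_whatever)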
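The "absolute or relative" check in the snippet above can be tried out without running a crawl. The following is a minimal sketch using only the standard library, assuming Python 3 (on Python 2, which Scrapy 1.0.5 targets, the import would be `from urlparse import urljoin`). The helper name `resolve_redirect_target` and the example.com URLs are made up for illustration and are not part of Scrapy.

```python
from urllib.parse import urljoin


def resolve_redirect_target(base_url, location):
    """Return the absolute URL that a Location header value points to.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    # Header values often arrive as bytes, so normalise to text first.
    if isinstance(location, bytes):
        location = location.decode("utf-8")
    # urljoin returns `location` unchanged when it is already absolute
    # and resolves it against `base_url` when it is relative.
    return urljoin(base_url, location.strip())


if __name__ == "__main__":
    # Path-absolute target, given as bytes
    print(resolve_redirect_target("http://example.com/a/b", b"/new"))
    # Relative target
    print(resolve_redirect_target("http://example.com/a/b", "c"))
    # Fully absolute target on another host
    print(resolve_redirect_target("http://example.com/a/b", "https://other.example/x"))
```

Because a single urljoin call covers both cases, the spider does not need a separate branch for absolute and relative Location values.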