Scrapy: handle 301/302 response codes as well as follow the target URL


Problem description

I am using Scrapy version 1.0.5 to implement a crawler. Currently I have set REDIRECT_ENABLED = False and handle_httpstatus_list = [500, 301, 302] so that I can scrape the pages that return 301 and 302 responses. However, since REDIRECT_ENABLED is set to False, the spider does not go on to the target URL given in the Location response header. How can I achieve this?

Recommended answer

It is a long time since I did anything like this, but you need to generate a Request object with url, meta and callback parameters.

I seem to recall you can do it along the lines of:

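from scrapy import Request

# Method of your spider class.
def parse(self, response):
    # do whatever you need to do .... then
    if response.status in (301, 302) and 'Location' in response.headers:
        # Header values are bytes, so decode before building the URL.
        location = response.headers['Location'].decode('latin-1')
        # response.urljoin() handles both absolute and relative Location values.
        newurl = response.urljoin(location)
        # parse_whatever is a placeholder for the callback that handles the target page.
        yield Request(url=newurl, meta=response.meta, callback=self.parse_whatever)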
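
The URL resolution step can be tried in isolation, without Scrapy. The following is only a minimal sketch using the standard library; the helper name resolve_location and the example.com URLs are made up for illustration, and it assumes the Location value may arrive either as bytes or as text.

```python
from urllib.parse import urljoin


def resolve_location(base_url, location):
    """Return the absolute redirect target for a Location header value.

    base_url is the URL of the response that carried the redirect;
    location is the raw header value (bytes or str). Hypothetical helper,
    not part of Scrapy.
    """
    if isinstance(location, bytes):
        # latin-1 maps every byte to a character, so decoding cannot fail.
        location = location.decode("latin-1")
    # urljoin returns an absolute location unchanged and resolves a
    # relative one against base_url.
    return urljoin(base_url, location)


if __name__ == "__main__":
    print(resolve_location("http://example.com/a/b", b"/new"))
    # http://example.com/new
    print(resolve_location("http://example.com/a/b", "c"))
    # http://example.com/a/c
    print(resolve_location("http://example.com/a", b"https://other.example/x"))
    # https://other.example/x
```

Because urljoin covers both cases, there is no need to test separately whether the Location value is absolute or relative.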
