Scrapy: handle 301/302 response codes as well as follow the target URL

Views: 34 Published: 2021/7/16 22:13:13 Tags: web-scraping scrapy scrapy-spider

Question

I am using Scrapy version 1.0.5 to implement a crawler. Currently I have set REDIRECT_ENABLED = False and handle_httpstatus_list = [500, 301, 302] so that I can scrape pages that return 301 and 302 responses. However, because REDIRECT_ENABLED is set to False, the spider doesn't go to the target URL given in the Location response header. How can I achieve this?

Recommended answer

It has been a long time since I did anything like this, but you need to generate a Request object with url, meta and callback parameters.

But I seem to recall you can do it along the lines of:

from scrapy import Request

def parse(self, response):
    # do whatever you need to do .... then
    if response.status in (301, 302) and 'Location' in response.headers:
        # header values may be bytes, so decode before building the URL
        location = response.headers['Location'].decode('latin-1')
        # response.urljoin() handles both absolute and relative URLs
        newurl = response.urljoin(location)
        yield Request(url=newurl, meta=response.meta, callback=self.parse_whatever)
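
The comment about testing for an absolute or relative URL can be checked with the standard library alone, without running a crawl. The following is a minimal sketch, assuming Python 3; `resolve_location` is a hypothetical helper name used only for illustration and is not part of Scrapy, and the example.com URLs are made up.

```python
from urllib.parse import urljoin


def resolve_location(base_url, location):
    """Return the absolute redirect target for a Location header value.

    Hypothetical helper for illustration only; not a Scrapy API.
    """
    # Header values often arrive as bytes; decode them before joining.
    if isinstance(location, bytes):
        location = location.decode("latin-1")
    # urljoin returns an absolute Location as given and resolves a
    # relative one against the URL of the response that redirected.
    return urljoin(base_url, location)


print(resolve_location("http://example.com/a/b", b"/login"))
print(resolve_location("http://example.com/a/b", "c"))
print(resolve_location("http://example.com/a/b", "https://other.example/x"))
```

Because a single join covers both cases, there is no need to branch on whether the Location value is absolute before building the new request.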
