Capturing http status codes with scrapy spider


Problem Description

I am new to scrapy. I am writing a spider designed to check a long list of URLs for their server status codes and, where appropriate, the URLs they are redirected to. Importantly, if there is a chain of redirects, I need to know the status code and URL at each hop. I am using response.meta['redirect_urls'] to capture the URLs, but am unsure how to capture the status codes - there doesn't seem to be a corresponding response meta key.

I realise I may need to write some custom middleware to expose these values, but am not quite clear how to log the status codes for every hop, nor how to access those values from the spider. I've had a look but can't find an example of anyone doing this. If anyone can point me in the right direction it would be much appreciated.

For example:

    items = []
    item = RedirectItem()
    item['url'] = response.url
    item['redirected_urls'] = response.meta['redirect_urls']     
    item['status_codes'] = #????
    items.append(item)

Edit - Based on feedback from warawauk and some really proactive help from the guys on the IRC channel (freenode #scrapy), I've managed to do this. I believe it's a little hacky, so any comments for improvement are welcome:

(1) Disable the default middleware in the settings, and add your own:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'myproject.middlewares.CustomRedirectMiddleware': 100,
}

(2) Create your CustomRedirectMiddleware in your middlewares.py. It inherits from the main RedirectMiddleware class and captures the redirect:

from urlparse import urljoin  # Python 2 (Scrapy 0.14 era); urllib.parse.urljoin on Python 3

from scrapy.contrib.downloadermiddleware.redirect import RedirectMiddleware
from scrapy.http import HtmlResponse
from scrapy.utils.response import get_meta_refresh


class CustomRedirectMiddleware(RedirectMiddleware):
    """Handle redirection of requests based on response status and meta-refresh html tag"""

    def process_response(self, request, response, spider):
        # Record the status code of every hop; request.replace() copies meta,
        # so the list accumulates across the whole redirect chain.
        request.meta.setdefault('redirect_status', []).append(response.status)
        if 'dont_redirect' in request.meta:
            return response
        if request.method.upper() == 'HEAD':
            if response.status in [301, 302, 303, 307] and 'Location' in response.headers:
                redirected_url = urljoin(request.url, response.headers['location'])
                redirected = request.replace(url=redirected_url)

                return self._redirect(redirected, request, spider, response.status)
            else:
                return response

        if response.status in [302, 303] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = self._redirect_request_using_get(request, redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        if response.status in [301, 307] and 'Location' in response.headers:
            redirected_url = urljoin(request.url, response.headers['location'])
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        if isinstance(response, HtmlResponse):
            interval, url = get_meta_refresh(response)
            if url and interval < self.max_metarefresh_delay:
                redirected = self._redirect_request_using_get(request, url)
                return self._redirect(redirected, request, spider, 'meta refresh')


        return response

(3) You can now access the list of redirect status codes in your spider with:

response.meta['redirect_status']
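For reference, here is a minimal sketch of a spider callback that collects both lists. It assumes the RedirectItem from the question above and that CustomRedirectMiddleware is enabled in the settings:

# Minimal sketch of a spider callback, assuming the RedirectItem item class
# from the question; 'redirect_status' is only populated once
# CustomRedirectMiddleware is enabled in DOWNLOADER_MIDDLEWARES.
def parse(self, response):
    item = RedirectItem()
    item['url'] = response.url
    # Intermediate URLs, set by the redirect middleware (absent if no redirect)
    item['redirected_urls'] = response.meta.get('redirect_urls', [])
    # Status code of every hop, recorded by CustomRedirectMiddleware
    item['status_codes'] = response.meta.get('redirect_status', [])
    return item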

Recommended Answer

I believe it's available in

response.status

See http://doc.scrapy.org/en/0.14/topics/request-response.html#scrapy.http.Response
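For example, a minimal sketch of logging the status in a callback. Note that response.status reflects only the final response, which is why the middleware above records each intermediate hop:

# Minimal sketch: response.status is the status of the final response only,
# not the intermediate hops of a redirect chain.
def parse(self, response):
    self.log("%s returned status %d" % (response.url, response.status))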
