How to detect HTTP response status code and set a proxy accordingly in scrapy?


Question


Is there a way to set a new proxy IP (e.g. from a pool) according to the HTTP response status code? For example, start with an IP from an IP list until it gets a 503 response (or another HTTP error code), then use the next one until it gets blocked, and so on, something like:

if http_status_code in [403, 503, ..., n]:
    proxy_ip = 'new ip'
    # then keep using it till it gets another error code

Any ideas?

Answer


Scrapy has a downloader middleware which is enabled by default to handle proxies. It's called HttpProxyMiddleware, and what it does is allow you to supply the meta key proxy on your Request so that proxy is used for the request.


There are a few ways of doing this.
The first, straightforward one is to use it directly in your spider code:

from scrapy import Request

def parse(self, response):
    if response.status in range(400, 600):
        # dont_filter is needed because the dupe filter has already seen this URL
        return Request(response.url,
                       meta={'proxy': 'http://myproxy:8010'},
                       dont_filter=True)


Another, more elegant way would be to use a custom downloader middleware, which would handle this for multiple callbacks and keep your spider code cleaner:

import logging

from scrapy import Request

from project.settings import PROXY_URL


class MyDM(object):
    def process_response(self, request, response, spider):
        if response.status in range(400, 600):
            logging.debug('retrying [{}]{} with proxy: {}'.format(
                response.status, response.url, PROXY_URL))
            return Request(response.url,
                           meta={'proxy': PROXY_URL},
                           dont_filter=True)
        return response
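
For this middleware to actually run, it also has to be enabled in the project settings. A minimal sketch, assuming the class lives at project.middlewares.MyDM (both the module path and the priority number here are placeholders to adjust for your own project):

DOWNLOADER_MIDDLEWARES = {
    # hypothetical module path -- point this at wherever MyDM is defined
    'project.middlewares.MyDM': 543,
}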


Note that by default Scrapy doesn't let through any response codes other than 200. Scrapy automatically handles 3xx redirect codes with the redirect middleware, and raises request errors on 4xx and 5xx codes with the HttpError middleware. To handle responses other than 200 you need to either:

Specify it in the request meta:

Request(url, meta={'handle_httpstatus_list': [404, 505]})
# or, for all status codes:
Request(url, meta={'handle_httpstatus_all': True})


Or set project/spider-wide parameters:

HTTPERROR_ALLOW_ALL = True  # for all
HTTPERROR_ALLOWED_CODES = [404, 505]  # for specific

See http://doc.scrapy.org/en/latest/topics/spider-middleware.html#httperror-allowed-codes for details.
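
To answer the original question more directly (rotating through a pool rather than retrying a single proxy), the same middleware idea can be combined with itertools.cycle. The following is only a sketch under stated assumptions: PROXY_POOL and BLOCKED_CODES are illustrative names rather than Scrapy settings, and the error codes must still be let through (via HTTPERROR_ALLOWED_CODES or handle_httpstatus_list as above) for process_response to ever see them:

import logging
from itertools import cycle


# Illustrative values -- in a real project these would live in settings.py
PROXY_POOL = ['http://proxy1:8010', 'http://proxy2:8010', 'http://proxy3:8010']
BLOCKED_CODES = [403, 503]


class RotatingProxyDM(object):
    def __init__(self):
        # cycle() yields the proxies round-robin, restarting at the end of the list
        self._proxies = cycle(PROXY_POOL)
        self._current = next(self._proxies)

    def process_request(self, request, spider):
        # keep using the current proxy until a blocked response switches it
        request.meta.setdefault('proxy', self._current)

    def process_response(self, request, response, spider):
        if response.status in BLOCKED_CODES:
            self._current = next(self._proxies)
            logging.debug('got {} for {}, switching proxy to {}'.format(
                response.status, response.url, self._current))
            # re-issue the same request through the new proxy
            return request.replace(
                meta=dict(request.meta, proxy=self._current),
                dont_filter=True)
        return response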
