How to reschedule 403 HTTP status codes to be crawled later in scrapy?

Problem description

As per these instructions I can see that HTTP 500 errors, connection-lost errors etc. are always rescheduled, but I couldn't find anywhere whether 403 errors are rescheduled too, or whether they are simply treated as a valid response or ignored after reaching the retry limit.

Also from the same instructions:

Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages. Once there are no more failed pages to retry, this middleware sends a signal (retry_complete), so other extensions could connect to that signal.

What do these "Failed Pages" refer to? Do they include 403 errors?

Also, I can see this message being logged when scrapy encounters an HTTP 400 status:

2015-12-07 12:33:42 [scrapy] DEBUG: Ignoring response <400 http://example.com/q?x=12>: HTTP status code is not handled or not allowed

From this exception I think it's clear that HTTP 400 responses are ignored and not rescheduled.
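(That "Ignoring response" message is logged by Scrapy's HttpError spider middleware, which drops non-2xx responses before they reach the spider. As an aside, if you ever want such responses delivered to your callback instead of being discarded, you can whitelist the status codes on the spider. A minimal sketch, where the spider name, URL and callback logic are placeholders:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"                      # placeholder name
    start_urls = ["http://example.com/"]  # placeholder URL
    # Let 400/403 responses reach parse() instead of being dropped by
    # HttpErrorMiddleware (the HTTPERROR_ALLOWED_CODES setting works too)
    handle_httpstatus_list = [400, 403]

    def parse(self, response):
        if response.status == 403:
            self.logger.info("Got 403 for %s", response.url)
)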

I'm not sure if 403 HTTP status is ignored or rescheduled to be crawled at the end. So I tried rescheduling all the responses that have HTTP status 403 according to these docs. Here's what I have tried so far:

In a middlewares.py file:

class RetryOn403Middleware:  # any class name works; enable it via DOWNLOADER_MIDDLEWARES
    def process_response(self, request, response, spider):
        if response.status == 403:
            # Returning a Request tells Scrapy to schedule it for download again
            return request
        return response
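One caveat with returning the original request object: it goes back through the scheduler, where the duplicate filter can silently drop it because that URL has already been seen. The built-in RetryMiddleware avoids this by retrying a copy of the request with dont_filter set, and it caps the number of attempts. A sketch of the same method with those two safeguards (the retry_403 meta key and the limit of 5 are arbitrary choices):

    def process_response(self, request, response, spider):
        if response.status == 403:
            retries = request.meta.get("retry_403", 0)
            if retries < 5:
                retryreq = request.copy()
                retryreq.meta["retry_403"] = retries + 1
                retryreq.dont_filter = True  # keep the duplicate filter from discarding the retry
                return retryreq
        return response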

In the settings.py:

RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]
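
For the custom middleware itself to run, it also has to be enabled in the settings. A minimal sketch, assuming the class above lives in myproject/middlewares.py (the module path and priority value are illustrative):

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RetryOn403Middleware": 543,  # illustrative path; any free priority works
}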

My questions are:

  1. What do these "Failed Pages" refer to? Do they include 403 errors?
  2. Do I need to write process_response to reschedule 403 error pages, or are they automatically rescheduled by scrapy?
  3. What types of exceptions and HTTP codes are rescheduled by scrapy?
  4. If I reschedule a 404 error page, will I end up in an infinite loop, or is there a limit after which no further rescheduling is done?

Answer

  1. You can find the default statuses that are retried here.

Adding 403 to RETRY_HTTP_CODES in the settings.py file should handle that request and retry it.

The ones inside RETRY_HTTP_CODES are the ones retried; we already checked the default ones.
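
If you want to see what those defaults are on your own install, they are plain module-level constants in Scrapy's default settings; a quick sketch (the exact values depend on your Scrapy version):

from scrapy.settings.default_settings import RETRY_HTTP_CODES, RETRY_TIMES

print(RETRY_HTTP_CODES)  # HTTP status codes RetryMiddleware retries by default
print(RETRY_TIMES)       # extra download attempts made per failed request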
