Retrying a Scrapy Request even when receiving a 200 status code


Question


There is a website I'm scraping that will sometimes return a 200, but not have any text in response.body (raises an AttributeError when I try to parse it with Selector).


Is there a simple way to check to make sure the body includes text, and if not, retry the request until it does? Here is some pseudocode to outline what I'm trying to do.

def check_response(response):
    if response.body != b'':
        return response
    else:
        # resend a copy of the original request with the same
        # method, URL, payload, cookies, etc.
        return response.request.replace(callback=check_response)


Basically, is there a way I can repeat a request with the exact same properties (method, url, payload, cookies, etc.)?

Answer


Follow the EAFP principle:


Easier to ask for forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.
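As a rough illustration of EAFP in plain Python (no Scrapy involved; `first_word` is a made-up helper), the idea is to attempt the operation and catch the failure, rather than checking the input first:

```python
def first_word(text):
    """Return the first whitespace-separated word, or None.

    EAFP style: just try the attribute access and indexing,
    and handle the exceptions if `text` is None or empty,
    instead of testing those conditions up front (LBYL).
    """
    try:
        return text.split()[0]
    except (AttributeError, IndexError):
        # AttributeError: text has no .split() (e.g. None)
        # IndexError: text contained no words (e.g. "")
        return None
```

The answer below applies the same pattern to the scraping problem: parse optimistically, and treat `AttributeError` as the signal to retry.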


Handle an exception and yield a Request to the current url with dont_filter=True:


dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.

from scrapy import Request

def parse(self, response):
    try:
        ...  # parsing logic here
    except AttributeError:
        # empty body: retry the same URL, bypassing the dupe filter
        yield Request(response.url, callback=self.parse, dont_filter=True)


You can also make a copy of the current request (untested):

new_request = response.request.copy()
new_request.dont_filter = True
yield new_request


Or, make a new request using replace():

new_request = response.request.replace(dont_filter=True)
yield new_request

