How to retry the request n times when an item gets an empty field?


Problem description

I'm trying to scrape a range of web pages, but I get holes: sometimes it looks like the website fails to send the HTML response correctly. This results in empty lines in the CSV output file. How can I retry the request and the parse n times when the XPath selector on the response is empty? Note that I don't get any HTTP errors.

Recommended answer

You can do this with a custom retry middleware; you just need to override the process_response method of the built-in RetryMiddleware:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):
    """Retry on the usual HTTP error codes, and also when an expected XPath matches nothing."""

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        # Custom check: the page came back with status 200, but the XPath
        # configured on the spider selects nothing, so treat it as incomplete.
        retry_xpath = getattr(spider, 'retry_xpath', None)
        if retry_xpath and response.status == 200 and not response.xpath(retry_xpath):
            reason = 'empty result for xpath "{}"'.format(retry_xpath)
            return self._retry(request, reason, spider) or response
        return response
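
The `self._retry(...) or response` idiom works because `_retry` returns a new request while the retry budget lasts and `None` once it is spent, at which point the original response is passed on. The sketch below is a simplified, pure-Python stand-in for that counting logic, not Scrapy's actual code; the function name `retry_or_none` and the use of a bare dict for `meta` are assumptions made for illustration.

```python
def retry_or_none(meta, max_retry_times):
    """Simplified stand-in for the retry budget check (not Scrapy's code).

    Returns the meta dict for a new attempt, or None when the budget is spent.
    """
    retry_times = meta.get("retry_times", 0) + 1
    if retry_times <= max_retry_times:
        new_meta = dict(meta)
        new_meta["retry_times"] = retry_times
        return new_meta
    return None


meta = {}
attempts = 0
while True:
    attempts += 1  # one download attempt per loop iteration
    next_meta = retry_or_none(meta, max_retry_times=2)
    if next_meta is None:
        break  # budget exhausted: the middleware would return the response
    meta = next_meta

print(attempts)  # 3: the first attempt plus two retries
```

With a budget of n retries, a page that keeps coming back incomplete is therefore downloaded n + 1 times before the response is handed to the spider as-is.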

Then enable it in settings.py in place of the default RetryMiddleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}
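
The "n" in "retry n times" is not set by the middleware above; the parent RetryMiddleware takes it from Scrapy's RETRY_TIMES setting, which defaults to 2. A minimal settings.py fragment, with the value 5 chosen arbitrarily for illustration:

```python
# settings.py (fragment)
# Maximum number of retries per request, in addition to the first attempt.
# The value 5 is an arbitrary example, not a recommendation.
RETRY_TIMES = 5
```

A single request can override this limit through the `max_retry_times` key in its `meta`.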

Now you have a middleware whose XPath check you can configure from inside your spider, using the retry_xpath attribute:

from scrapy import Spider


class MySpider(Spider):
    name = "myspidername"

    # XPath checked by CustomRetryMiddleware on every 200 response
    retry_xpath = '//h2[@class="tadasdop-cat"]'
    ...

The middleware does not inspect your items, so it will not automatically retry when an item's field is empty. To get that behaviour, set retry_xpath to the same XPath you use to extract that field.
