Add a delay to a specific scrapy Request


Question


Is it possible to delay the retry of a particular scrapy Request. I have a middleware which needs to defer the request of a page until a later time. I know how to do the basic deferal (end of queue), and also how to delay all requests (global settings), but I want to just delay this one individual request. This is most important near the end of the queue, where if I do the simple deferral it immediately becomes the next request again.

Answer

Method 1

One way is to add a downloader middleware to your project:

# File: middlewares.py

from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):
    def process_request(self, request, spider):
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return

        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred

You could later use it in your Spider like this:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will be delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/', 
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here
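If you want the delay to grow with each attempt rather than stay fixed, you can compute the `meta` value per request. A minimal sketch; the `backoff_delay` helper and its parameters are hypothetical, not part of the answer above:

```python
# Hypothetical helper for computing the value passed as
# meta={'delay_request_by': ...}: exponential backoff, capped at `cap`.
def backoff_delay(retry_count, base=5, cap=60):
    """Return base * 2**retry_count seconds, never more than cap."""
    return min(base * (2 ** retry_count), cap)
```

For example, successive retries would be delayed by 5, 10, 20, 40, 60, 60, ... seconds.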


Method 2

You could do this with a custom Retry Middleware (source); you just need to override the `process_response` method of the current RetryMiddleware:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)

            # Your delay code here, for example sleep(10) or polling server until it is alive

            return self._retry(request, reason, spider) or response

        return response
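Note that a plain `sleep(10)` in the comment above would block Twisted's reactor and stall every other in-flight request. A non-blocking variant of the same idea can be sketched with `twisted.internet.task.deferLater`; this relies on the assumption that Scrapy's middleware manager wraps `process_response` results in `maybeDeferred`, so returning a Deferred works (the class and method names here are illustrative):

```python
from twisted.internet import reactor
from twisted.internet.task import deferLater

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class DelayedRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Schedule the retry 10 seconds from now without blocking
            # the reactor; the Deferred's eventual result stands in for
            # this method's return value in the middleware chain.
            return deferLater(reactor, 10, self._delayed_retry,
                              request, reason, spider, response)
        return response

    def _delayed_retry(self, request, reason, spider, response):
        return self._retry(request, reason, spider) or response
```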


Then enable it instead of the default RetryMiddleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}
