Scrapy - set delay to retry middleware


Question

I'm using Scrapy-splash and I have a problem with memory. I can clearly see that the memory used by docker python3 is gradually increasing until the PC freezes.

I can't figure out why it behaves this way, because I have CONCURRENT_REQUESTS=3 and there is no way 3 HTML pages consume 10 GB of RAM.

So there is a workaround: set maxrss to some reasonable value, and when RAM usage reaches it, restart docker so the RAM is freed.
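
For reference, this restart-based workaround can be implemented at the container level; a sketch based on the --maxrss option documented for Splash and docker's --restart policy (the memory limits here are placeholder values to tune):

docker run -it -p 8050:8050 --memory=4.5G --restart=always scrapinghub/splash --maxrss 4000

Here --maxrss asks Splash to restart its process once its RSS exceeds the given number of megabytes, and --restart=always tells docker to bring the container back up if it exits.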

But the problem is that while docker is down, scrapy keeps sending requests, so a couple of urls end up not scraped. The retry middleware retries these requests immediately and then gives up:

[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.ex.com/eiB3t/ via http://127.0.0.1:8050/execute> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-03-30 14:28:33 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.ex.com/eiB3t/

So I have two questions:

  1. Do you know a better solution?
  2. If not, how can I set Scrapy to retry the request after some time (let's say one minute, so docker has time to restart)?

Answer

One way would be to add a middleware to your Spider:

# File: middlewares.py

from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):
    """Downloader middleware that delays any request carrying a
    'delay_request_by' meta key by that many seconds."""

    def process_request(self, request, spider):
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return

        # Returning a Deferred pauses the download of this request until
        # reactor.callLater fires the callback delay_s seconds later.
        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred
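
A note on why this works: returning a Deferred from process_request is not part of the documented downloader-middleware contract, but Scrapy's middleware manager waits on whatever the method returns, so the request is effectively held until the callback fires and then continues through the chain as normal. It is a relying-on-internals trick, so it is worth re-checking against the Scrapy version you run.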

You could then use it in your Spider like this:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will have itself delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/', 
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here

A related method is described here: Method #2.
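
To tie this back to question 2, the delay middleware can be combined with the stock RetryMiddleware so that every retried request is held back long enough for docker to restart. The sketch below is untested and leans on RetryMiddleware._retry, a private Scrapy helper; the DelayedRetryMiddleware name and the 60-second delay are my own choices:

# File: middlewares.py (continued)

from scrapy.downloadermiddlewares.retry import RetryMiddleware


class DelayedRetryMiddleware(RetryMiddleware):
    def _retry(self, request, reason, spider):
        # _retry returns the retry copy of the request,
        # or None when the retry budget is exhausted.
        retry_request = super()._retry(request, reason, spider)
        if retry_request is not None:
            # Hold the retry back so docker has time to restart.
            retry_request.meta['delay_request_by'] = 60
        return retry_request

In the settings, the stock retry middleware would then be replaced by the delayed variant, keeping the DelayedRequestsMiddleware installed as well:

DOWNLOADER_MIDDLEWARES = {
    'middlewares.DelayedRequestsMiddleware': 123,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'middlewares.DelayedRetryMiddleware': 550,
}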
