Scrapy - set delay to retry middleware
Question
I'm using Scrapy-splash and I have a problem with memory. I can clearly see that the memory used by docker python3 is gradually increasing until the PC freezes.
Can't figure out why it behaves this way, because I have CONCURRENT_REQUESTS=3 and there is no way 3 HTML pages consume 10GB of RAM.
So there is a workaround: set maxrss to some reasonable value, and when RAM usage reaches it, restart docker so the RAM is freed.
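As an alternative to restarting docker by hand, Scrapy ships a memory-usage extension that can shut the crawl down cleanly when the process exceeds a limit. Note that it watches the Scrapy process itself, not the Splash container, so it only covers part of this problem. A sketch of the relevant settings (the values here are examples, not recommendations):

```python
# settings.py -- example values, adjust to your machine
MEMUSAGE_ENABLED = True      # enable the built-in memory usage extension
MEMUSAGE_WARNING_MB = 1536   # log a warning above this RSS
MEMUSAGE_LIMIT_MB = 2048     # close the spider above this RSS
```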
But the problem is that while docker is down, scrapy keeps sending requests, so a couple of URLs end up not scraped. The retry middleware retries these requests immediately and then gives up:
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.ex.com/eiB3t/ via http://127.0.0.1:8050/execute> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-03-30 14:28:33 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.ex.com/eiB3t/
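The "failed 2 times ... gave up" behaviour above comes from the retry middleware's defaults, which can be raised so requests survive the restart window. A sketch of the settings involved (the code list shown is an example, not necessarily your Scrapy version's default):

```python
# settings.py -- give the container more chances to come back up
RETRY_ENABLED = True
RETRY_TIMES = 10  # default is 2 retries per request
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```

Connection errors such as the ConnectionDone above are retried regardless of RETRY_HTTP_CODES; RETRY_TIMES still caps how often.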
So I have two questions:
- Do you know a better solution?
- If not, how can I set Scrapy to retry the request after some time (let's say one minute, so docker has time to restart)?
Answer
One way would be to add a middleware to your Spider (source, linked):
# File: middlewares.py
from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):
    def process_request(self, request, spider):
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return
        # Returning a Deferred from process_request pauses the download
        # until the Deferred fires, i.e. after delay_s seconds.
        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred
Which you could later use in your Spider like this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will have itself delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here
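To tie this back to the question: the stock retry middleware does not set delay_request_by, so retried requests would still fire immediately. One way (a sketch, with a hypothetical helper name) is to compute a growing delay from Scrapy's retry_times meta value and attach it to the retried request, so DelayedRequestsMiddleware picks it up. The delay calculation alone can be as simple as:

```python
def retry_delay(retry_times, base=60, cap=300):
    """Hypothetical helper: exponential backoff capped at `cap` seconds.

    retry_times is the counter Scrapy keeps in request.meta['retry_times']
    (0 before the first retry).
    """
    return min(base * 2 ** retry_times, cap)


if __name__ == '__main__':
    # 60s for the first retry, doubling each time, capped at 300s
    print([retry_delay(n) for n in range(4)])  # [60, 120, 240, 300]
```

A subclass of Scrapy's RetryMiddleware could then set request.meta['delay_request_by'] = retry_delay(retries) on the copied request it returns, giving docker time to restart before the retry goes out.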
A related approach is described here: Method #2.