Scrapy: Is it possible to pause Scrapy and resume after x minutes?


Problem description

I'm trying to crawl a large site that has a rate-limiting system in place. Is it possible to pause Scrapy for 10 minutes when it encounters a 403 page? I know I can set DOWNLOAD_DELAY, but I noticed that I can scrape faster by setting a small DOWNLOAD_DELAY and then pausing Scrapy for a few minutes whenever it gets a 403. That way the rate limiting only gets triggered about once an hour.
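For reference, the throttling baseline the question mentions is just a couple of settings; a minimal sketch (the values here are illustrative, not taken from the question):

# settings.py -- illustrative values only
DOWNLOAD_DELAY = 0.25                # small fixed delay between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap on parallel requests per domain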

Recommended answer

You can write your own retry middleware and put it in middlewares.py:

from time import sleep

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class SleepRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if response.status == 403:
            # Sleep for a few minutes before retrying. Note that
            # time.sleep() blocks the whole Twisted reactor, so every
            # in-flight request is paused -- which is the point here.
            sleep(120)
            reason = response_status_message(response.status)
            # _retry() returns a copy of the request to be rescheduled,
            # or None once RETRY_TIMES is exhausted (in which case the
            # 403 response is simply passed through).
            return self._retry(request, reason, spider) or response
        return super().process_response(request, response, spider)

And don't forget to update settings.py:

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in retry middleware so it does not race with ours.
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project.middlewares.SleepRetryMiddleware': 100,
}
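Because time.sleep() freezes the entire Twisted reactor, all concurrent requests stall during the pause; for backing off a site-wide rate limit that is exactly the desired behaviour. A commonly cited variant also pauses the engine explicitly; a minimal sketch, assuming the crawler engine's pause()/unpause() methods behave as in current Scrapy, with the 10-minute pause taken from the question:

from time import sleep

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class PauseRetryMiddleware(RetryMiddleware):
    # Variant of SleepRetryMiddleware that also pauses the engine,
    # making the back-off explicit and keeping the scheduler idle.
    def process_response(self, request, response, spider):
        if response.status == 403:
            spider.crawler.engine.pause()    # stop scheduling new requests
            sleep(600)                       # 10 minutes, as in the question
            spider.crawler.engine.unpause()  # resume crawling
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super().process_response(request, response, spider)

Since sleep() already blocks the reactor, pause()/unpause() mostly serves to state the intent; in practice the effect is the same as the answer above.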
