Add a delay after 500 requests (Scrapy)

This article describes how to add a longer delay after every 500 requests in Scrapy. Hopefully it is a useful reference for anyone who runs into the same problem.

Problem description

I have a list of 2000 start URLs and I'm using:

DOWNLOAD_DELAY = 0.25 

to control the speed of the requests, but I also want to add a bigger delay after every n requests. For example, I want a delay of 0.25 seconds for each request and a delay of 100 seconds after every 500 requests.

Edit:

Sample code:

import os
from os.path import join
import scrapy
import time

date = time.strftime("%d/%m/%Y").replace('/','_')

list_of_pages = {'http://www.lapatilla.com/site/':'la_patilla',                 
                 'http://runrun.es/':'runrunes',
                 'http://www.noticierodigital.com/':'noticiero_digital',
                 'http://www.eluniversal.com/':'el_universal',
                 'http://www.el-nacional.com/':'el_nacional',
                 'http://globovision.com/':'globovision',
                 'http://www.talcualdigital.com/':'talcualdigital',
                 'http://www.maduradas.com/':'maduradas',
                 'http://laiguana.tv/':'laiguana',
                 'http://www.aporrea.org/':'aporrea'}

root_dir = os.getcwd()
output_dir = join(root_dir,'data/',date)

class TestSpider(scrapy.Spider):
    name = "news_spider"
    download_delay = 1

    start_urls = list_of_pages.keys()

    def parse(self, response):
        # Create the dated output directory on first use
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        filename = list_of_pages[response.url]
        print(time.time())
        with open(join(output_dir, filename), 'wb') as f:
            f.write(response.body)

The list, in this case, is shorter, but the idea is the same. I want to have two levels of delays: one for each request and one for every 'N' requests. I'm not crawling the links, just saving the main page.

Solution

You can look into using the AutoThrottle extension, which does not give you tight control over the delays but instead has its own algorithm for slowing down the spider, adjusting it on the fly based on response times and the number of concurrent requests.
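In practice, enabling AutoThrottle is mostly a matter of a few settings. Below is a minimal sketch of a settings.py fragment, assuming a reasonably recent Scrapy version; the numbers are illustrative and not taken from the question.

# settings.py - minimal AutoThrottle setup (values are only examples)
AUTOTHROTTLE_ENABLED = True

# Initial delay used before any latency data has been collected
AUTOTHROTTLE_START_DELAY = 0.25

# Hard upper bound on the delay AutoThrottle may set
AUTOTHROTTLE_MAX_DELAY = 100

# Average number of parallel requests per remote site (newer Scrapy versions)
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Log every delay adjustment so you can watch how it evolves
AUTOTHROTTLE_DEBUG = True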

If you need more control over the delays at certain stages of the scraping process, you might need a custom middleware or a custom extension (similar to how AutoThrottle is implemented - see its source).
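As a rough sketch of the middleware route: count outgoing requests in a downloader middleware and pause after every N of them. The class name and the EXTRA_DELAY_EVERY / EXTRA_DELAY_SECONDS settings are made up for this example and are not part of Scrapy. Note that time.sleep() blocks the Twisted reactor, so the whole crawl stops during the pause - crude, but a full stop is roughly what is being asked for here.

# middlewares.py - hypothetical sketch, not an official Scrapy component
import logging
import time

logger = logging.getLogger(__name__)

class PeriodicPauseMiddleware(object):
    """Pause the whole crawl for pause_seconds after every every_n requests."""

    def __init__(self, every_n=500, pause_seconds=100):
        self.every_n = every_n
        self.pause_seconds = pause_seconds
        self.request_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # EXTRA_DELAY_EVERY / EXTRA_DELAY_SECONDS are custom settings
        # you would have to define yourself in settings.py
        return cls(
            every_n=crawler.settings.getint('EXTRA_DELAY_EVERY', 500),
            pause_seconds=crawler.settings.getfloat('EXTRA_DELAY_SECONDS', 100),
        )

    def process_request(self, request, spider):
        self.request_count += 1
        if self.request_count % self.every_n == 0:
            logger.info("Seen %d requests, sleeping for %s seconds",
                        self.request_count, self.pause_seconds)
            # Blocking sleep: nothing else is downloaded during the pause.
            time.sleep(self.pause_seconds)
        return None  # let Scrapy continue handling the request normally

It would then be enabled through the DOWNLOADER_MIDDLEWARES setting, e.g. {'myproject.middlewares.PeriodicPauseMiddleware': 543}, where the module path and priority are again placeholders.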

You can also change the .download_delay attribute of your spider on the fly. By the way, this is exactly what the AutoThrottle extension does under the hood - it updates the .download_delay value on the fly.
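A hedged sketch of that idea from inside the spider itself: count responses in the callback and bump download_delay around every 500th one. The responses_seen counter, the spider name and the numbers are invented for the example. Keep in mind that Scrapy tracks the effective delay per download slot (roughly per domain), so exactly when a changed value takes effect depends on Scrapy internals; AutoThrottle itself adjusts the slot's delay directly.

import scrapy

class DelayBumpingSpider(scrapy.Spider):
    # Illustrative only - not the asker's spider
    name = "delay_bumping_spider"
    download_delay = 0.25
    responses_seen = 0  # hypothetical counter, not a Scrapy attribute

    def parse(self, response):
        self.responses_seen += 1
        if self.responses_seen % 500 == 0:
            # Back off hard once we hit a multiple of 500 responses...
            self.download_delay = 100
        else:
            # ...and drop back to the normal pace afterwards.
            self.download_delay = 0.25
        # ... normal parsing / saving of response.body would go here ...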


This concludes the article on adding a delay after 500 requests in Scrapy. We hope the recommended answer helps, and thanks for supporting IT屋!
