Add a delay after 500 requests (Scrapy)
Question
I have a list of 2000 start URLs and I'm using:
DOWNLOAD_DELAY = 0.25
to control the speed of the requests, but I also want to add a bigger delay after every n requests. For example, I want a delay of 0.25 seconds between requests and a delay of 100 seconds after every 500 requests.
Edit:
Sample code:
import os
from os.path import join
import scrapy
import time

date = time.strftime("%d/%m/%Y").replace('/', '_')

list_of_pages = {'http://www.lapatilla.com/site/': 'la_patilla',
                 'http://runrun.es/': 'runrunes',
                 'http://www.noticierodigital.com/': 'noticiero_digital',
                 'http://www.eluniversal.com/': 'el_universal',
                 'http://www.el-nacional.com/': 'el_nacional',
                 'http://globovision.com/': 'globovision',
                 'http://www.talcualdigital.com/': 'talcualdigital',
                 'http://www.maduradas.com/': 'maduradas',
                 'http://laiguana.tv/': 'laiguana',
                 'http://www.aporrea.org/': 'aporrea'}

root_dir = os.getcwd()
output_dir = join(root_dir, 'data/', date)

class TestSpider(scrapy.Spider):
    name = "news_spider"
    download_delay = 1
    start_urls = list(list_of_pages.keys())  # materialize dict_keys for Python 3

    def parse(self, response):
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        filename = list_of_pages[response.url]
        print(time.time())
        with open(join(output_dir, filename), 'wb') as f:
            f.write(response.body)
The list in this case is shorter, yet the idea is the same: I want two levels of delay, one for each request and one for every 'N' requests. I'm not crawling the links, just saving the main page of each site.
You can look into using the AutoThrottle extension, which does not give you tight control of the delays but instead has its own algorithm for slowing down the spider, adjusting it on the fly depending on the response time and the number of concurrent requests.
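For reference, enabling AutoThrottle takes only a few lines in settings.py. The values below are a sketch tuned to the numbers in the question, not recommendations:

```python
# settings.py -- AutoThrottle sketch; tune these values for your targets
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.25   # initial delay, matching DOWNLOAD_DELAY above
AUTOTHROTTLE_MAX_DELAY = 100      # ceiling the algorithm may back off to
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = True         # log every adjusted delay while tuning
```

Keep in mind AutoThrottle adapts to server latency; it will not reproduce the exact "100 seconds every 500 requests" pattern.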
If you need more control over the delays at certain stages of the scraping process, you might need a custom middleware or a custom extension (similar to AutoThrottle - source).
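As a sketch of the custom-middleware route, a downloader middleware can count outgoing requests and pause after every Nth one. The class and setting names below (PauseEveryNMiddleware, PAUSE_EVERY, PAUSE_SECONDS) are made up for illustration. Note that time.sleep() blocks Scrapy's reactor, which here is arguably the point (the whole crawl pauses), but it is a blunt instrument:

```python
import time

class PauseEveryNMiddleware:
    """Illustrative downloader middleware: sleep after every Nth request."""

    def __init__(self, pause_every=500, pause_seconds=100):
        self.pause_every = pause_every
        self.pause_seconds = pause_seconds
        self.request_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Read the (hypothetical) settings, falling back to the defaults
        return cls(
            pause_every=crawler.settings.getint('PAUSE_EVERY', 500),
            pause_seconds=crawler.settings.getfloat('PAUSE_SECONDS', 100),
        )

    def process_request(self, request, spider):
        self.request_count += 1
        if self.request_count % self.pause_every == 0:
            spider.logger.info("Pausing for %s seconds", self.pause_seconds)
            time.sleep(self.pause_seconds)  # blocks the reactor: crawl-wide pause
        return None  # continue normal request processing
```

You would then enable it via the DOWNLOADER_MIDDLEWARES setting, pointing at wherever you put the class in your project.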
You can also change the .download_delay attribute of your spider on the fly. By the way, this is exactly what the AutoThrottle extension does under the hood - it updates the .download_delay value on the fly.
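To make the question's two-level delay concrete, a small helper can pick the delay for each request; the function name delay_for and the idea of assigning its result to self.download_delay are illustrative, not Scrapy API. One caveat: Scrapy reads download_delay when it creates the per-domain download slot, so a value changed mid-crawl may only affect domains that have not been requested yet:

```python
# Sketch: compute the two-level delay from the question
# (0.25 s normally, 100 s on every 500th request).
def delay_for(request_number, base=0.25, pause_every=500, pause=100):
    """Return the delay before the given 1-based request number."""
    if request_number % pause_every == 0:
        return pause
    return base
```

Inside the spider you would keep a counter (e.g. self.request_count) and set self.download_delay = delay_for(self.request_count) in your callback.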