Reducing crawl frequency in pyspider: how can I limit how fast the for loop in on_start runs?

Views: 340 Published: 2017/9/6 4:05:03


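Problem description

Question

According to the docs, @every limits how often the on_start method itself runs. I am currently brute-forcing my way through a range of URLs, and what I need to limit is how fast the for loop executes.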

def on_start(self):
    # Enumerate every company id and queue one task per company page
    for company_id in range(133, 160999 + 1):
        self.crawl('https://www.lagou.com/gongsi/' + str(company_id) + '.html', js_script="""function(){setTimeout("$('.text_over').click()", 1000);}""", fetch_type='js', callback=self.index_page, save={'companyId': company_id})

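In other words, how can I control the execution of the for loop, for example so that one URL is crawled per second? The loop runs so fast that it creates a huge number of tasks almost instantly, and my IP got banned as a result. My brute-force URL approach is rather clumsy, so if anyone has a better idea, please let me know.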

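I am now using a proxy IP that handles only 5 HTTP requests per second. The callback passed to self.crawl also calls self.crawl, and I don't know how long it takes before a callback runs. I don't really understand how the request queue works as a whole, or how many seconds per URL would be appropriate in on_start.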

Solution

The crawl rate is controlled by rate and burst, using a token-bucket algorithm.

rate - how many requests are made per second

burst - consider this situation: with rate/burst = 0.1/3, the spider crawls 1 page every 10 seconds. Suppose all tasks are finished and the project is checking for updated items every minute. If 3 new items are found, pyspider will "burst" and crawl those 3 tasks immediately instead of taking 3*10 seconds. The fourth task, however, has to wait 10 seconds.
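In practice this means the for loop in on_start does not need to be slowed down: self.crawl only queues tasks, and the scheduler then releases them according to the project's rate and burst (as far as I know, these are set per project from the pyspider dashboard). To make the rate/burst = 0.1/3 example concrete, here is a minimal token-bucket sketch. It is not pyspider's actual implementation; the names TokenBucket, acquire and dispatch_times are made up for illustration, and time is simulated rather than measured.

```python
class TokenBucket:
    """Illustrative token bucket: refills `rate` tokens per second, capped at `burst`."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # assumption: the bucket starts full
        self.last = 0.0             # simulated time of the last refill

    def _refill(self, now):
        # Add the tokens earned since the last refill, never exceeding `burst`.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now

    def acquire(self, now):
        """Take one token; return the earliest time >= now at which it is granted."""
        self._refill(now)
        if self.tokens < 1:
            # Not enough tokens: wait just long enough for one to accumulate.
            now += (1 - self.tokens) / self.rate
            self.tokens = 1.0
            self.last = now
        self.tokens -= 1
        return now


def dispatch_times(n_tasks, rate, burst):
    """Simulated dispatch time of each of `n_tasks` tasks that are all queued at t=0."""
    bucket = TokenBucket(rate, burst)
    now = 0.0
    times = []
    for _ in range(n_tasks):
        now = bucket.acquire(now)
        times.append(round(now, 6))
    return times


# Five tasks queued at once with rate/burst = 0.1/3:
# three go out immediately, then one every 10 seconds.
print(dispatch_times(5, rate=0.1, burst=3))  # [0.0, 0.0, 0.0, 10.0, 20.0]
```

For the question above, the proxy's limit of 5 requests per second suggests keeping rate at or below 5 with a small burst. Tasks created inside callbacks belong to the same project, so they should be throttled by the same bucket. At 5 requests per second, the 160867 company pages alone would take roughly 9 hours.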
