Python crawler - Can time.sleep() be used in pyspider?

Views: 597  Published: 2017/9/6 3:47:52  Tags: python crawler, pyspider


Problem description

Question

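I recently started writing crawlers with pyspider. Because I keep getting banned, I wanted to lower the crawl rate. I tried calling time.sleep() in the script, but the result was not what I expected.
A minimal example script: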

import time
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    @every(seconds=1)
    def on_start(self):
        # Log the time at which the list page is submitted.
        cur_time = time.ctime()
        with open('/var/www/pyspider/time.txt', 'a') as file_object:
            file_object.write("url:http://xxx/list.html time:" + cur_time + "\n")
        self.crawl('http://xxx/list.html', callback=self.index_page)

    @config(age=1)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)
            # Intended to space the requests 5 seconds apart.
            time.sleep(5)

    @config(priority=2)
    def detail_page(self, response):
        # Log the time at which each detail page is actually fetched.
        cur_time = time.ctime()
        with open('/var/www/pyspider/time.txt', 'a') as file_object:
            file_object.write("url:" + response.url + " time:" + cur_time + "\n")

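rate/burst is set to 0.5/3.
The script does not fetch one page every 5 seconds. Instead, it sleeps for 35 seconds (the for loop runs 7 times) and then still follows the rate/burst setting. The logged output is: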
url:http://xxx/list.html time:Thu Dec 29 01:46:24 2016
url:http://xxx/6.html time:Thu Dec 29 01:47:00 2016
url:http://xxx/4.html time:Thu Dec 29 01:47:00 2016
url:http://xxx/1.html time:Thu Dec 29 01:47:00 2016
url:http://xxx/2.html time:Thu Dec 29 01:47:02 2016
url:http://xxx/3.html time:Thu Dec 29 01:47:04 2016
url:http://xxx/5.html time:Thu Dec 29 01:47:06 2016
url:http://xxx/7.html time:Thu Dec 29 01:47:08 2016
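So after on_start runs, the script sleeps for 35 seconds and then fetches according to rate/burst. What mechanism is behind this? Apart from rate, is there any other way to customize the crawl rate?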

Solution

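self.crawl only submits a task to the scheduler; it does not execute the task immediately. Calling time.sleep() inside a callback therefore only delays the callback itself. The new tasks appear to reach the scheduler only once the callback has returned, which is consistent with the log: nothing is fetched during the 35 seconds of sleeping, and afterwards the tasks are released according to rate/burst (three at once for a burst of 3, then one every 2 seconds for a rate of 0.5).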
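To make the timing in the log easier to follow, here is a minimal sketch of a token bucket, the usual way a rate/burst limit is modeled. It is a simplified illustration, not pyspider's actual scheduler code. The function name release_times and its parameters are hypothetical, and it assumes that all 7 tasks are handed over together once index_page has finished its 7 x 5 = 35 seconds of sleeping.

```python
def release_times(n_tasks, rate, burst, submitted_at=0.0):
    """Simplified token-bucket model (hypothetical helper, not a pyspider API).

    All n_tasks are assumed to be submitted together at `submitted_at`.
    The bucket starts full with `burst` tokens and refills at `rate`
    tokens per second; each released task consumes one token.
    """
    tokens = float(burst)
    now = float(submitted_at)
    times = []
    for _ in range(n_tasks):
        if tokens < 1.0:
            # Wait just long enough for one whole token to accumulate.
            now += (1.0 - tokens) / rate
            tokens = 1.0
        tokens -= 1.0
        times.append(now)
    return times


# rate/burst = 0.5/3, 7 detail pages, submitted after 35 s of sleeping.
print(release_times(7, rate=0.5, burst=3, submitted_at=35.0))
# → [35.0, 35.0, 35.0, 37.0, 39.0, 41.0, 43.0]
```

The gaps match the log: three pages at the same moment, then one page every 2 seconds. The log shows the first fetch roughly 36 seconds after on_start rather than exactly 35, which this model attributes to overhead outside the sleeps. If a per-task delay is wanted instead of a project-wide rate, the pyspider documentation for self.crawl lists an exetime parameter (a Unix timestamp before which the task is not executed); passing something like exetime=time.time() + 5 * i for the i-th link may give the intended spacing, though that is untested here.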
