Why does the scrapy crawler stop?


Problem description

I have written a crawler using the Scrapy framework to parse a products site. The crawler suddenly stops partway through, without completing the full parsing process. I have researched this a lot, and most of the answers indicate that my crawler is being blocked by the website. Is there any mechanism by which I can detect whether my spider is being stopped by the website or whether it stops on its own?

Below are the INFO-level log entries of the spider:

2013-09-23 09:59:07+0000 [scrapy] INFO: Scrapy 0.18.0 started (bot: crawler)  
2013-09-23 09:59:08+0000 [spider] INFO: Spider opened  
2013-09-23 09:59:08+0000 [spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)  
2013-09-23 10:00:08+0000 [spider] INFO: Crawled 10 pages (at 10 pages/min), scraped 7 items (at 7 items/min)  
2013-09-23 10:01:08+0000 [spider] INFO: Crawled 22 pages (at 12 pages/min), scraped 19 items (at 12 items/min)  
2013-09-23 10:02:08+0000 [spider] INFO: Crawled 31 pages (at 9 pages/min), scraped 28 items (at 9 items/min)  
2013-09-23 10:03:08+0000 [spider] INFO: Crawled 40 pages (at 9 pages/min), scraped 37 items (at 9 items/min)  
2013-09-23 10:04:08+0000 [spider] INFO: Crawled 49 pages (at 9 pages/min), scraped 46 items (at 9 items/min)  
2013-09-23 10:05:08+0000 [spider] INFO: Crawled 59 pages (at 10 pages/min), scraped 56 items (at 10 items/min)  

Below is the last part of the DEBUG-level entries in the log file before the spider closed:

2013-09-25 11:33:24+0000 [spider] DEBUG: Crawled (200) <GET http://url.html> (referer: http://site_name)
2013-09-25 11:33:24+0000 [spider] DEBUG: Scraped from <200 http://url.html>

// scraped data in JSON form

2013-09-25 11:33:25+0000 [spider] INFO: Closing spider (finished)  
2013-09-25 11:33:25+0000 [spider] INFO: Dumping Scrapy stats:  
    {'downloader/request_bytes': 36754,  
     'downloader/request_count': 103,  
     'downloader/request_method_count/GET': 103,  
     'downloader/response_bytes': 390792,  
     'downloader/response_count': 103,  
     'downloader/response_status_count/200': 102,  
     'downloader/response_status_count/302': 1,  
     'finish_reason': 'finished',  
     'finish_time': datetime.datetime(2013, 9, 25, 11, 33, 25, 1359),  
     'item_scraped_count': 99,  
     'log_count/DEBUG': 310,  
     'log_count/INFO': 14,  
     'request_depth_max': 1,  
     'response_received_count': 102,  
     'scheduler/dequeued': 100,  
     'scheduler/dequeued/disk': 100,  
     'scheduler/enqueued': 100,  
     'scheduler/enqueued/disk': 100,  
     'start_time': datetime.datetime(2013, 9, 25, 11, 23, 3, 869392)}  
2013-09-25 11:33:25+0000 [spider] INFO: Spider closed (finished)  

There are still pages remaining to be parsed, but the spider stops.
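
Note that the stats dump above already records why the crawl ended: 'finish_reason': 'finished'. As a minimal sketch (assuming a recent Scrapy version; the spider name, start URL and parse stub below are placeholders, not the original code), the close reason can also be surfaced programmatically through the spider's closed() hook:

# A minimal sketch: log the reason Scrapy gives for closing the spider.
# A reason of 'finished' means the scheduler simply ran out of requests;
# a block by the site would usually show up earlier as non-200 responses,
# retries or timeouts rather than a clean 'finished' close.
import scrapy


class ProductSpider(scrapy.Spider):          # hypothetical spider name
    name = "spider"
    start_urls = ["http://site_name"]        # placeholder taken from the logs

    def parse(self, response):
        # the original parsing logic would go here
        pass

    def closed(self, reason):
        # 'finished' -> no more pending requests; 'shutdown' -> stopped
        # manually; CLOSESPIDER_* extensions report their own reasons.
        self.logger.info("Spider closed, reason: %s", reason)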

Solution

As far as I know, for a spider:

  1. There is a queue or pool of URLs to be scraped/parsed by parsing methods. You can bind a URL to a specific callback method, or let the default 'parse' do the job.
  2. From a parsing method you must return/yield further request(s) to feed that pool, and/or item(s).
  3. When the pool runs out of URLs, or a stop signal is sent, the spider stops crawling.

It would be nice if you shared your spider code so we can check whether those bindings are correct. It's easy to miss some bindings by mistake when using SgmlLinkExtractor, for example.
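
For reference, here is a minimal sketch of what those bindings typically look like, written against the Scrapy 0.18-era API that appears in the logs (CrawlSpider rules with SgmlLinkExtractor); the spider name, domain, URL patterns and XPaths are illustrative assumptions, not the asker's code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class ProductItem(Item):
    name = Field()
    price = Field()


class ProductSpider(CrawlSpider):
    name = "spider"
    allowed_domains = ["site_name"]    # placeholder domain
    start_urls = ["http://site_name"]  # placeholder start URL

    rules = (
        # Follow listing/category pages; no callback means "just follow".
        Rule(SgmlLinkExtractor(allow=(r"/category/",)), follow=True),
        # Bind product pages to the parse_product callback.
        Rule(SgmlLinkExtractor(allow=(r"/product/",)),
             callback="parse_product", follow=True),
    )

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        item = ProductItem()
        item["name"] = hxs.select("//h1/text()").extract()
        item["price"] = hxs.select("//span[@class='price']/text()").extract()
        yield item

If the allow patterns in the rules miss some of the site's URLs, those pages are never enqueued, the pool drains early, and the spider closes with finish_reason 'finished' even though pages remain, which matches the stats shown above.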
