How to limit scrapy request objects?


Problem description


So I have a spider that I thought was leaking memory; it turns out it is just grabbing too many links from link-rich pages (sometimes upwards of 100,000 requests), which is what I see when I check the telnet console with >>> prefs()

Now I have been over the docs and Google again and again and I can't find a way to limit the requests that the spider takes in. What I want is to be able to tell it to hold back on taking in new requests once a certain number are in the scheduler. I have tried setting DEPTH_LIMIT, but that only lets it grab a large amount and then run the callback on the ones it has grabbed.
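For reference, the setting I tried looks roughly like this (a minimal sketch; the value is just an example). DEPTH_LIMIT caps how many links deep the crawl goes, but every link found within that depth is still enqueued immediately, so the scheduler backlog can still grow huge:

    # settings.py -- a minimal sketch; the value is just an example.
    # DEPTH_LIMIT bounds crawl depth, not the number of pending requests,
    # so link-rich pages can still flood the scheduler.
    DEPTH_LIMIT = 3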

It seems like a fairly straightforward thing to do and I am sure people have run into this problem before, so I know there must be a way to get it done. Any ideas?

EDIT: Here is the output from MEMUSAGE_ENABLE = True

     {'downloader/request_bytes': 105716,
     'downloader/request_count': 315,
     'downloader/request_method_count/GET': 315,
     'downloader/response_bytes': 10066538,
     'downloader/response_count': 315,
     'downloader/response_status_count/200': 313,
     'downloader/response_status_count/301': 1,
     'downloader/response_status_count/302': 1,
     'dupefilter/filtered': 32444,
     'finish_reason': 'memusage_exceeded',
     'finish_time': datetime.datetime(2015, 1, 14, 14, 2, 38, 134402),
     'item_scraped_count': 312,
     'log_count/DEBUG': 946,
     'log_count/ERROR': 2,
     'log_count/INFO': 9,
     'memdebug/gc_garbage_count': 0,
     'memdebug/live_refs/EnglishWikiSpider': 1,
     'memdebug/live_refs/Request': 70194,
     'memusage/limit_notified': 1,
     'memusage/limit_reached': 1,
     'memusage/max': 422600704,
     'memusage/startup': 34791424,
     'offsite/domains': 316,
     'offsite/filtered': 18172,
     'request_depth_max': 3,
     'response_received_count': 313,
     'scheduler/dequeued': 315,
     'scheduler/dequeued/memory': 315,
     'scheduler/enqueued': 70508,
     'scheduler/enqueued/memory': 70508,
     'start_time': datetime.datetime(2015, 1, 14, 14, 1, 31, 988254)}

Solution

I solved my problem. The answer was really hard to track down, so I'm posting it here in case anyone else comes across the same problem.

After sifting through the scrapy code and referring back to the docs, I could see that scrapy keeps all requests in memory (which I had already deduced), but the code also checks whether there is a job directory in which to write pending requests to disk (in core.scheduler).
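As a rough illustration of that check (a simplified, hypothetical sketch, not the actual scrapy.core.scheduler code): when a job directory is configured, pending requests go to a disk-backed queue; otherwise they stay in an in-memory queue.

    # Simplified, hypothetical sketch of the idea -- not real Scrapy code.
    import os
    from collections import deque

    class TinyScheduler(object):
        def __init__(self, jobdir=None):
            self.jobdir = jobdir
            self.memory_queue = deque()
            if jobdir:
                if not os.path.isdir(jobdir):
                    os.makedirs(jobdir)
                self.disk_queue_path = os.path.join(jobdir, "requests.queue")

        def enqueue_request(self, request_url):
            if self.jobdir:
                # Persist the pending request instead of holding it in RAM.
                with open(self.disk_queue_path, "a") as f:
                    f.write(request_url + "\n")
            else:
                self.memory_queue.append(request_url)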

So, if you run the scrapy spider with a job directory, it will write pending requests to disk and then retrieve them from disk instead of storing them all in memory.

    $ scrapy crawl spider -s JOBDIR=somedirname
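The same thing can also be set in settings.py instead of on the command line (a sketch; the directory name is just an example, any writable path works):

    # settings.py -- equivalent to passing -s JOBDIR=... on the command line.
    JOBDIR = "somedirname"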

When I do this, if I enter the telnet console, I can see that the number of requests in memory is always about 25, while 100,000+ have been written to disk, which is exactly how I wanted it to run.
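For what it's worth, prefs() in the telnet console is a shortcut for Scrapy's live-object tracking (scrapy.utils.trackref), so the same counts can be checked from Python as well, for example from a pdb session or an extension while the crawl is running (a sketch, assuming the trackref helpers are available in your Scrapy version):

    # Sketch: inspecting live object counts outside the telnet console.
    # Assumes scrapy.utils.trackref backs the prefs() shortcut.
    from scrapy.utils.trackref import print_live_refs

    print_live_refs()  # prints counts of live Request/Response/Spider objects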

It seems like this would be a common problem, given that one would be crawling a large site that has multiple extractable links on every page. I am surprised it is not better documented or easier to find.

The scrapy docs at http://doc.scrapy.org/en/latest/topics/jobs.html state that the main purpose of this feature is pausing a crawl and resuming it later, but it works this way as well.
