How to limit scrapy request objects?


Problem description


So I have a spider that I thought was leaking memory; it turns out it is just grabbing too many links from link-rich pages (sometimes upwards of 100,000 requests), which is what I see when I check the telnet console with >>> prefs()

Now I have been over the docs and Google again and again and I can't find a way to limit the requests that the spider takes in. What I want is to be able to tell it to hold back on taking in new requests once a certain number are in the scheduler. I have tried setting DEPTH_LIMIT, but that only lets it grab a large amount and then run the callback on the ones it has grabbed.
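For reference, the setting I tried looks roughly like this (a minimal sketch; the value is just an example). DEPTH_LIMIT caps how many links deep the crawl goes, but every link found within that depth is still enqueued immediately, so the scheduler backlog can still grow huge:

    # settings.py -- a minimal sketch; the value is just an example.
    # DEPTH_LIMIT bounds crawl depth, not the number of pending requests,
    # so link-rich pages can still flood the scheduler.
    DEPTH_LIMIT = 3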

It seems like a fairly straightforward thing to do and I am sure people have run into this problem before, so I know there must be a way to get it done. Any ideas?

EDIT: Here is the output from MEMUSAGE_ENABLE = True

     {'downloader/request_bytes': 105716,
     'downloader/request_count': 315,
     'downloader/request_method_count/GET': 315,
     'downloader/response_bytes': 10066538,
     'downloader/response_count': 315,
     'downloader/response_status_count/200': 313,
     'downloader/response_status_count/301': 1,
     'downloader/response_status_count/302': 1,
     'dupefilter/filtered': 32444,
     'finish_reason': 'memusage_exceeded',
     'finish_time': datetime.datetime(2015, 1, 14, 14, 2, 38, 134402),
     'item_scraped_count': 312,
     'log_count/DEBUG': 946,
     'log_count/ERROR': 2,
     'log_count/INFO': 9,
     'memdebug/gc_garbage_count': 0,
     'memdebug/live_refs/EnglishWikiSpider': 1,
     'memdebug/live_refs/Request': 70194,
     'memusage/limit_notified': 1,
     'memusage/limit_reached': 1,
     'memusage/max': 422600704,
     'memusage/startup': 34791424,
     'offsite/domains': 316,
     'offsite/filtered': 18172,
     'request_depth_max': 3,
     'response_received_count': 313,
     'scheduler/dequeued': 315,
     'scheduler/dequeued/memory': 315,
     'scheduler/enqueued': 70508,
     'scheduler/enqueued/memory': 70508,
     'start_time': datetime.datetime(2015, 1, 14, 14, 1, 31, 988254)}

Solution

I solved my problem. The answer was really hard to track down, so I'm posting it here in case anyone else comes across the same problem.

After sifting through the scrapy code and referring back to the docs, I could see that scrapy keeps all requests in memory (which I had already deduced), but the code also checks whether there is a job directory in which to write pending requests to disk (in core.scheduler).
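As a rough illustration of that check (a simplified, hypothetical sketch, not the actual scrapy.core.scheduler code): when a job directory is configured, pending requests go to a disk-backed queue; otherwise they stay in an in-memory queue.

    # Simplified, hypothetical sketch of the idea -- not real Scrapy code.
    import os
    from collections import deque

    class TinyScheduler(object):
        def __init__(self, jobdir=None):
            self.jobdir = jobdir
            self.memory_queue = deque()
            if jobdir:
                if not os.path.isdir(jobdir):
                    os.makedirs(jobdir)
                self.disk_queue_path = os.path.join(jobdir, "requests.queue")

        def enqueue_request(self, request_url):
            if self.jobdir:
                # Persist the pending request instead of holding it in RAM.
                with open(self.disk_queue_path, "a") as f:
                    f.write(request_url + "\n")
            else:
                self.memory_queue.append(request_url)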

So, if you run the scrapy spider with a job directory, it will write pending requests to disk and then retrieve them from disk instead of storing them all in memory.

    $ scrapy crawl spider -s JOBDIR=somedirname
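The same thing can also be set in settings.py instead of on the command line (a sketch; the directory name is just an example, any writable path works):

    # settings.py -- equivalent to passing -s JOBDIR=... on the command line.
    JOBDIR = "somedirname"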

When I do this, if I enter the telnet console, I can see that the number of requests in memory is always about 25, while 100,000+ have been written to disk, which is exactly how I wanted it to run.
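For what it's worth, prefs() in the telnet console is a shortcut for Scrapy's live-object tracking (scrapy.utils.trackref), so the same counts can be checked from Python as well, for example from a pdb session or an extension while the crawl is running (a sketch, assuming the trackref helpers are available in your Scrapy version):

    # Sketch: inspecting live object counts outside the telnet console.
    # Assumes scrapy.utils.trackref backs the prefs() shortcut.
    from scrapy.utils.trackref import print_live_refs

    print_live_refs()  # prints counts of live Request/Response/Spider objects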

It seems like this would be a common problem, given that one would be crawling a large site that has multiple extractable links on every page. I am surprised it is not better documented or easier to find.

The scrapy docs at http://doc.scrapy.org/en/latest/topics/jobs.html state that the main purpose of this feature is pausing a crawl and resuming it later, but it works this way as well.
