How do I stop all spiders and the engine immediately after a condition in a pipeline is met?


Problem Description


We have a system written with Scrapy to crawl a few websites. There are several spiders, and a few cascaded pipelines through which all items from all crawlers pass. One of the pipeline components queries the Google servers to geocode addresses. Google imposes a limit of 2,500 requests per day per IP address, and threatens to ban an IP address that keeps querying even after Google has responded with the warning message 'OVER_QUERY_LIMIT'.
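
For reference, the limit shows up in the 'status' field of the geocoding response, which is where a pipeline would detect it. A minimal sketch of such a check, written against the Python 2-era stack that matches the imports later in this post (the helper name and the error handling are illustrative assumptions):

import json
import urllib

def geocode_address(address):
    # Hypothetical helper: query the v3 geocoding endpoint and surface
    # the quota condition to the caller.
    url = ('http://maps.googleapis.com/maps/api/geocode/json'
           '?sensor=false&address=' + urllib.quote(address))
    response = json.load(urllib.urlopen(url))
    if response['status'] == 'OVER_QUERY_LIMIT':
        # Continuing to query after this point risks an IP ban; this is
        # the condition on which all crawling should stop.
        raise RuntimeError('Google geocoding quota exceeded')
    return response['results']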

Hence I want to know of a mechanism I can invoke from within the pipeline that will completely and immediately stop all further crawling/processing by all spiders, as well as the main engine.

I have checked other similar questions and their answers have not worked:

from scrapy.project import crawler
crawler._signal_shutdown(9,0) #Run this if the cnxn fails.

This does not work, because it takes time for the spider to stop executing, and in the meantime many more requests are made to Google (which could get my IP address banned).

import sys
sys.exit("SHUT DOWN EVERYTHING!")

This one doesn't work at all: items keep getting generated and passed to the pipeline, although the log vomits sys.exit() -> exceptions.SystemExit raised (to no effect).

crawler.engine.close_spider(self, 'log message')

This one has the same problem as the first case mentioned above.

I tried:

scrapy.project.crawler.engine.stop()

To no avail.

EDIT: If I do this in the pipeline:

from scrapy.contrib.closespider import CloseSpider

what should I pass as the 'crawler' argument to CloseSpider's __init__() from the scope of my pipeline?

Solution

You can raise a CloseSpider exception to close down a spider. However, I don't think this will work from a pipeline.

EDIT: avaleske notes in the comments to this answer that he was able to raise a CloseSpider exception from a pipeline. The wisest course would be to use that approach.
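
A minimal sketch of that approach, assuming the quota condition is detected inside process_item (the pipeline class and the quota check are illustrative; the exception itself lives at scrapy.exceptions.CloseSpider):

from scrapy.exceptions import CloseSpider

class GeocodePipeline(object):
    # Hypothetical pipeline; only the CloseSpider mechanics matter here.

    def geocode(self, item):
        # Stand-in for the real Google lookup; would return the API's
        # 'status' field ('OK', 'OVER_QUERY_LIMIT', ...).
        return 'OK'

    def process_item(self, item, spider):
        if self.geocode(item) == 'OVER_QUERY_LIMIT':
            # Raising CloseSpider from a pipeline works, per avaleske's
            # comment; Scrapy then shuts the spider down gracefully.
            raise CloseSpider('Google geocoding quota exceeded')
        return item

Note that the shutdown is graceful rather than instantaneous: requests already scheduled or in flight will still complete, so a handful of further geocoding calls may go out before the spider fully stops.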

A similar situation has been described on the Scrapy Users group, in this thread.

I quote:

To close a spider from any part of your code, you should use the engine.close_spider method. See this extension for a usage example: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/closespider.py#L61

You could write your own extension, using closespider.py as an example, that shuts a spider down once a certain condition has been met.
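
A minimal sketch of such an extension, written against the contrib-era API the thread refers to (the class name, the quota_exceeded flag, and the settings snippet are assumptions, not from the thread):

from scrapy import signals

class CloseSpiderOnQuota(object):
    # Hypothetical extension: closes the spider via engine.close_spider
    # once some component (e.g. a pipeline) sets spider.quota_exceeded.

    def __init__(self, crawler):
        self.crawler = crawler
        crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def item_scraped(self, item, spider):
        if getattr(spider, 'quota_exceeded', False):
            self.crawler.engine.close_spider(spider, 'geocoding quota exceeded')

It would then be enabled through the standard EXTENSIONS setting, e.g. EXTENSIONS = {'myproject.extensions.CloseSpiderOnQuota': 500}.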

Another "hack" would be to set a flag on the spider in the pipeline. For example:

pipeline:

def process_item(self, item, spider):
    if some_flag:
        spider.close_down = True
    return item  # a pipeline must return the item (or raise DropItem)

spider:

def parse(self, response):
    # CloseSpider comes from scrapy.exceptions; note the flag is only
    # seen the next time parse() runs, so shutdown is not instantaneous.
    if getattr(self, 'close_down', False):
        raise CloseSpider(reason='API usage exceeded')
