Scrapy Spider: Restart spider when it finishes


Problem description

I'm trying to make my Scrapy spider launch again when the close reason is my internet connection (at night the internet goes down for 5 minutes). When the internet goes down, the spider closes after 5 retries.

I'm trying to use this function inside my spider definition to restart the spider when it is closed:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def handle_spider_closed(spider, reason):
    relaunch = False
    # Stats keys look like
    # 'downloader/exception_type_count/twisted.internet.error.DNSLookupError'
    for key in spider.crawler.stats.get_stats():
        if 'DNSLookupError' in key:
            relaunch = True
            break

    if relaunch:
        # Try to start a brand-new crawl -- this is the part that fails
        # with "reactor already running".
        spider = mySpider()
        settings = get_project_settings()
        crawlerProcess = CrawlerProcess(settings)
        crawlerProcess.crawl(spider)
        crawlerProcess.start()
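For context, a close handler like this is usually connected to the spider_closed signal from the spider's from_crawler classmethod. Below is a minimal sketch of that wiring, not the asker's actual project code: the DNSLookupError check mirrors the snippet above, and starting a new crawl from inside the handler is exactly where the reactor error appears.

import scrapy
from scrapy import signals

class mySpider(scrapy.Spider):
    name = "myspider"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Run handle_spider_closed whenever this spider shuts down.
        crawler.signals.connect(spider.handle_spider_closed,
                                signal=signals.spider_closed)
        return spider

    def handle_spider_closed(self, spider, reason):
        # Stats keys look like
        # 'downloader/exception_type_count/twisted.internet.error.DNSLookupError'
        relaunch = any('DNSLookupError' in key
                       for key in spider.crawler.stats.get_stats())
        if relaunch:
            self.logger.info("Closed after DNS errors (%s), want to relaunch", reason)
            # Creating and starting a new CrawlerProcess here is what raises
            # "reactor already running" -- the original reactor is still active.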

I tried a lot of things, like re-instantiating the spider, but I got the error "Reactor is already running" or something similar.

I thought about executing the spider from a script and calling it again when the spider finishes, but that didn't work either, because the reactor is still in use.
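For reference, the pattern the Scrapy docs give for scheduling crawls from a script without restarting the reactor is CrawlerRunner plus chained Deferreds, so the Twisted reactor is started only once. A rough sketch follows; the helper names and the "always re-run after 60 seconds" condition are purely illustrative.

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

def crawl():
    # runner.crawl() returns a Deferred that fires when the spider closes.
    d = runner.crawl('myspider')
    d.addCallback(schedule_next)
    return d

def schedule_next(_result):
    # Illustrative: always re-run the spider 60 seconds after it closes.
    # A real script could inspect the crawl stats here and stop instead.
    reactor.callLater(60, crawl)

crawl()
reactor.run()  # blocks here; the reactor is started exactly once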

  • My goal is to reset the spider after it closes (it closed because it lost the internet connection)

Does anyone know a good and easy way to do this?

Recommended answer

I found the solution to my issue! What was I trying to do?


  • Handle the spider when it fails or closes
  • Try to re-execute the spider after it closes

I managed it by handling the spider's errors like this:

import time

import scrapy

class mySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["google.com"]
    start_urls = [
        "http://www.google.com",
    ]

    def handle_error(self, failure):
        self.log("Error Handle: %s" % failure.request)
        self.log("Sleeping 60 seconds")
        time.sleep(60)  # note: this blocks the whole reactor while waiting
        url = 'http://www.google.com'
        yield scrapy.Request(url, self.parse, errback=self.handle_error, dont_filter=True)

    def start_requests(self):
        url = 'http://www.google.com'
        yield scrapy.Request(url, self.parse, errback=self.handle_error)

    def parse(self, response):
        pass  # normal parsing logic for the page goes here

  • I used dont_filter=True to allow the spider to duplicate a request, but only when it goes through the error handler.
  • errback=self.handle_error makes the spider go through the custom handle_error function (a variation that only retries on network failures is sketched below).
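A small follow-up on the original goal of retrying only when the connection is the problem: the errback can check the failure type before re-issuing the request. This is a sketch of that variant of the spider above, under the assumption that the overnight outages surface as DNS lookup or TCP timeout errors; the parse body is just a stub.

import time

import scrapy
from twisted.internet.error import DNSLookupError, TCPTimedOutError

class mySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://www.google.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, errback=self.handle_error)

    def parse(self, response):
        pass  # normal parsing goes here

    def handle_error(self, failure):
        # Only retry when the failure is a network-level problem.
        if failure.check(DNSLookupError, TCPTimedOutError):
            self.logger.info("Connection problem, retrying %s in 60s",
                             failure.request.url)
            time.sleep(60)  # blocks the reactor while sleeping, as in the answer above
            yield failure.request.replace(dont_filter=True)
        else:
            self.logger.error("Non-network failure: %r", failure)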