ReactorNotRestartable with scrapy when using Google Cloud Functions

Problem description

I am trying to send multiple crawl requests with Google Cloud Functions. However, I seem to be getting the ReactorNotRestartable error. From other posts on StackOverflow, such as this one, I understand that this happens because it is not possible to restart the reactor, in particular when crawling in a loop.
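
To illustrate the failing pattern (this snippet is not from the original post; the spider and URL are placeholders): Twisted's reactor can only be started once per Python process, so a second call to CrawlerProcess.start() in the same process fails.

from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class DummySpider(Spider):
    name = "dummy"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        pass


for _ in range(2):
    process = CrawlerProcess()
    process.crawl(DummySpider)
    # The first iteration runs the crawl and stops the reactor; the second
    # raises twisted.internet.error.ReactorNotRestartable.
    process.start()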

The way to solve this is to put start() outside the for loop. However, with Cloud Functions this is not possible, as each request should technically be independent.

Is the CrawlerProcess somehow cached with Cloud Functions? And if so, how can we remove this behaviour?
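
For context (not part of the original question): Cloud Functions may reuse the same Python process for consecutive invocations, so module-level state survives between requests. A minimal sketch with a hypothetical counter shows this reuse, which is the same mechanism that leaves the Twisted reactor in an already-started state:

# main.py
invocation_count = 0  # module-level state, shared across invocations on one instance


def run_single_crawl(data, context):
    global invocation_count
    invocation_count += 1
    # On a warm (reused) instance this prints 2, 3, ... rather than always 1.
    print(f"invocation {invocation_count} on this instance")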

For instance, I tried putting the import and initialization inside the function, instead of at module level, to prevent the imports from being cached, but that did not work:

# main.py

def run_single_crawl(data, context):
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()

    # MySpider is defined elsewhere in the module (not shown here).
    process.crawl(MySpider)
    process.start()  # raises ReactorNotRestartable on subsequent requests served by the same instance

Recommended answer

By default, the asynchronous nature of scrapy is not going to work well with Cloud Functions, as we'd need a way to block on the crawl to prevent the function from returning early and the instance being killed before the process terminates.

Instead, we can use scrapydo to run your existing spider in a blocking fashion:

requirements.txt:

scrapydo

main.py:

import scrapy
import scrapydo

scrapydo.setup()


class MyItem(scrapy.Item):
    url = scrapy.Field()


class MySpider(scrapy.Spider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        yield MyItem(url=response.url)


def run_single_crawl(data, context):
    # run_spider blocks until the crawl finishes and returns the scraped items.
    results = scrapydo.run_spider(MySpider)

This also shows a simple example of how to yield one or more scrapy.Item from the spider and collect the results from the crawl, which would also be challenging to do without scrapydo.
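
As a hypothetical follow-up (not part of the original answer): since a scrapy.Item behaves like a dict, the collected results can be serialized, for example to log them or hand them off elsewhere. This sketch assumes the MySpider class and scrapydo.setup() call from main.py above.

import json


def run_single_crawl(data, context):
    results = scrapydo.run_spider(MySpider)
    # Each scraped item is dict-like, so it can be converted and serialized.
    print(json.dumps([dict(item) for item in results]))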

Also: make sure that you have billing enabled for your project. By default, Cloud Functions cannot make outbound requests, so the crawler will succeed but return no results.
