Running Scrapy spiders in a Celery task

Question

I have a Django site where a scrape happens when a user requests it, and my code kicks off a standalone Scrapy spider script in a new process. Naturally, this doesn't hold up as the number of users increases.

Something like this:

from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.spider import BaseSpider as Spider

class StandAloneSpider(Spider):
    # a regular spider
    name = "standalone"

settings.overrides['LOG_ENABLED'] = True
# more settings can be changed...

# Set up the crawler in the current process and run the spider.
crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = StandAloneSpider()

crawler.crawl(spider)
crawler.start()

I've decided to use Celery, with workers to queue up the crawl requests.

However, I'm running into issues with the Twisted reactor not being able to restart. The first and second spider runs succeed, but subsequent spiders throw a ReactorNotRestartable error.
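
For context, the error comes from Twisted itself: the default reactor can only be started once per process. A minimal illustration of that behaviour (not the asker's code):

from twisted.internet import reactor

# Schedule an immediate stop so the first run() returns.
reactor.callLater(0, reactor.stop)
reactor.run()

# The default Twisted reactor cannot be started a second time in the
# same process; this raises twisted.internet.error.ReactorNotRestartable.
reactor.run()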

Can anyone share any tips on running spiders within the Celery framework?

Answer

Okay, here is how I got Scrapy working with my Django project that uses Celery for queuing up what to crawl. The actual workaround came primarily from joehillen's code, located here: http://snippets.scrapy.org/snippets/13/

First, the tasks.py file:

from celery import task

@task()
def crawl_domain(domain_pk):
    # Import inside the task so Scrapy and the crawler module are only
    # loaded when a worker actually executes the task.
    from crawl import domain_crawl
    return domain_crawl(domain_pk)
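
For completeness, here is a hypothetical call site (not part of the original answer) showing how a Django view might queue the task; crawl_domain.delay() simply puts the job on the Celery queue for a worker to pick up:

from django.http import HttpResponse

from tasks import crawl_domain

def start_crawl(request, domain_pk):
    # Enqueue the crawl and return immediately; a Celery worker runs it.
    crawl_domain.delay(domain_pk)
    return HttpResponse("Crawl queued")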

Then the crawl.py file:

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings
from spider import DomainSpider
from models import Domain

class DomainCrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        self.crawler.install()
        self.crawler.configure()

    def _crawl(self, domain_pk):
        # Collect the URLs for this domain and run the spider to completion.
        domain = Domain.objects.get(pk=domain_pk)
        urls = []
        for page in domain.pages.all():
            urls.append(page.url())
        self.crawler.crawl(DomainSpider(urls))
        self.crawler.start()
        self.crawler.stop()

    def crawl(self, domain_pk):
        # Run _crawl in a fresh child process so the Twisted reactor starts
        # and stops there, not in the long-lived Celery worker process.
        p = Process(target=self._crawl, args=[domain_pk])
        p.start()
        p.join()

crawler = DomainCrawlerScript()

def domain_crawl(domain_pk):
    crawler.crawl(domain_pk)

The trick here is the "from multiprocessing import Process"; it gets around the "ReactorNotRestartable" issue in the Twisted framework. So basically the Celery task calls the "domain_crawl" function, which reuses the "DomainCrawlerScript" object over and over to interface with your Scrapy spider. (I am aware that my example is a little redundant, but I did do this for a reason in my setup with multiple versions of Python: my Django web server is actually running Python 2.4 and my worker servers run Python 2.7.)

In my example here, "DomainSpider" is just a modified Scrapy spider that takes in a list of URLs and sets them as its "start_urls".
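
A minimal sketch of what such a spider might look like, using the same pre-1.0 Scrapy API as the imports above; the class body here is an assumption, not the answerer's actual spider:

from scrapy.spider import BaseSpider

class DomainSpider(BaseSpider):
    name = "domain_spider"

    def __init__(self, urls):
        super(DomainSpider, self).__init__()
        # The list of URLs passed in from crawl.py becomes start_urls.
        self.start_urls = urls

    def parse(self, response):
        # Page-processing logic would go here.
        pass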

Hope this helps!
