Running Scrapy spiders in a Celery task


Problem Description



I have a Django site where a scrape happens when a user requests it, and my code kicks off a standalone Scrapy spider script in a new process. Naturally, this doesn't scale as the number of users increases.

Something like this:

class StandAloneSpider(Spider):
    #a regular spider

settings.overrides['LOG_ENABLED'] = True
#more settings can be changed...

crawler = CrawlerProcess( settings )
crawler.install()
crawler.configure()

spider = StandAloneSpider()

crawler.crawl( spider )
crawler.start()

I've decided to use Celery and use workers to queue up the crawl requests.

However, I'm running into issues with the Twisted reactor not being able to restart. The first and second spiders run successfully, but subsequent spiders throw the ReactorNotRestartable error.

Can anyone share any tips on running spiders within the Celery framework?

Solution

Okay, here is how I got Scrapy working with my Django project, which uses Celery for queuing up what to crawl. The actual workaround came primarily from joehillen's code, located here: http://snippets.scrapy.org/snippets/13/

First, the tasks.py file:

from celery import task

@task()
def crawl_domain(domain_pk):
    # Imported inside the task so the Scrapy-side code only needs to be
    # importable on the Celery worker, not on the web server that queues it.
    from crawl import domain_crawl
    return domain_crawl(domain_pk)
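
For completeness, here is roughly how the task could be queued from the Django side. The view below is a hypothetical sketch, not part of the original answer; only the crawl_domain task itself comes from the code above.

# views.py (hypothetical): queue a crawl without blocking the request/response cycle
from django.http import HttpResponse

from tasks import crawl_domain

def start_crawl(request, domain_pk):
    # .delay() puts the job on the Celery queue; a worker runs crawl_domain later
    crawl_domain.delay(domain_pk)
    return HttpResponse("Crawl queued for domain %s" % domain_pk)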

Then, the crawl.py file:

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings
from spider import DomainSpider
from models import Domain

class DomainCrawlerScript():

    def __init__(self):
        # Set up the Scrapy crawler once; this object is reused for every task.
        self.crawler = CrawlerProcess(settings)
        self.crawler.install()
        self.crawler.configure()

    def _crawl(self, domain_pk):
        # Look up the domain and collect the URLs of its pages to crawl.
        domain = Domain.objects.get(
            pk = domain_pk,
        )
        urls = []
        for page in domain.pages.all():
            urls.append(page.url())
        self.crawler.crawl(DomainSpider(urls))
        self.crawler.start()
        self.crawler.stop()

    def crawl(self, domain_pk):
        # Run the crawl in a child process so each run gets a fresh Twisted
        # reactor; this is what avoids the ReactorNotRestartable error.
        p = Process(target=self._crawl, args=[domain_pk])
        p.start()
        p.join()

crawler = DomainCrawlerScript()

def domain_crawl(domain_pk):
    # Called by the Celery task; reuses the module-level DomainCrawlerScript.
    crawler.crawl(domain_pk)

The trick here is the "from multiprocessing import Process": it gets around the "ReactorNotRestartable" issue in the Twisted framework. So basically, the Celery task calls the "domain_crawl" function, which reuses the "DomainCrawlerScript" object over and over to interface with your Scrapy spider. (I am aware that my example is a little redundant, but I did this for a reason in my setup with multiple versions of Python [my Django webserver is actually using Python 2.4 and my worker servers use Python 2.7].)

In my example here, "DomainSpider" is just a modified Scrapy spider that takes in a list of URLs and sets them as its "start_urls".
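
For reference, a minimal sketch of what such a spider might look like with the old-style Scrapy API used above (the class body, spider name, and parse method here are assumptions, not the author's actual DomainSpider):

# spider.py: hypothetical sketch of a DomainSpider, assuming the old BaseSpider API
from scrapy.spider import BaseSpider

class DomainSpider(BaseSpider):
    name = "domain_spider"

    def __init__(self, urls, *args, **kwargs):
        super(DomainSpider, self).__init__(*args, **kwargs)
        # The list of page URLs collected in crawl.py becomes start_urls.
        self.start_urls = urls

    def parse(self, response):
        # Real item extraction would go here.
        pass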

Hope this helps!
