Easiest way to run a Scrapy crawler so it doesn't block the script

Problem Description

The official docs give several ways to run a Scrapy crawler from code:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

But all of them block the script until crawling has finished. What is the easiest way in Python to run the crawler in a non-blocking, asynchronous manner?
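
For context, one of those documented variants is CrawlerRunner, whose crawl() returns a Twisted Deferred instead of blocking. A minimal sketch, reusing MySpider from above:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

d = runner.crawl(MySpider)           # returns a Deferred immediately
d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl ends

reactor.run()  # the script still blocks here while the reactor runs

The Deferred lets you chain further work inside the event loop, but reactor.run() still blocks the script itself, which is exactly the problem being asked about.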

Recommended Answer

I tried every solution I could find, and the only one that worked for me was this one. But to make it work with Scrapy 1.1rc1, I had to tweak it a little:

from scrapy.crawler import Crawler
from scrapy import signals
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from billiard import Process  # billiard is a fork of multiprocessing (used by Celery)

class CrawlerScript(Process):
    def __init__(self, spider):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(spider.__class__, settings)
        # stop the child process's reactor once the spider closes
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider

    def run(self):
        # runs in the child process: start the crawl and spin up the reactor there
        self.crawler.crawl(self.spider)
        reactor.run()

def crawl_async():
    spider = MySpider()
    crawler = CrawlerScript(spider)
    crawler.start()  # fork a child process and run the crawl there
    crawler.join()   # wait for the child process to finish

So now when I call crawl_async, it starts crawling without blocking my current thread. I'm absolutely new to Scrapy, so this may not be a very good solution, but it worked for me.
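
One caveat worth noting: billiard's Process.join() waits for the child process to exit, so calling it inside crawl_async still makes the caller wait for the crawl. A fire-and-forget variant would skip (or defer) the join. A minimal sketch, assuming the CrawlerScript class above (crawl_in_background is a hypothetical helper name, not part of the original answer):

def crawl_in_background():
    crawler = CrawlerScript(MySpider())
    crawler.start()  # the crawl now runs in a separate process
    return crawler   # keep a handle so the caller can join() later

job = crawl_in_background()
# ... do other work in the current process ...
job.join()  # optional: block only when you actually need the crawl to be done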

I used these versions of the libraries:

cffi==1.5.0
Scrapy==1.1rc1
Twisted==15.5.0
billiard==3.3.0.22
