CrawlerProcess vs CrawlerRunner


Question

The Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script: CrawlerProcess and CrawlerRunner.

What is the difference between the two? When should I use CrawlerProcess and when CrawlerRunner?

Answer

Scrapy's documentation does a pretty poor job of giving examples of real applications of both.

CrawlerProcess assumes that Scrapy is the only thing that is going to use Twisted's reactor. If you are using threads in Python to run other code, this isn't always true. Let's take this as an example.

from scrapy.crawler import CrawlerProcess
import scrapy

def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
notThreadSafe(3)  # it will only get executed when the crawlers stop

Now, as you can see, the function will only get executed once the crawlers stop. What if I want the function to be executed while the crawlers crawl in the same reactor?

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import scrapy

def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.callFromThread(notThreadSafe, 3)
reactor.run()  # it will run both crawlers and the code inside the function

The CrawlerRunner class is not limited to this functionality; you may want some custom setup on your reactor (deferreds, threads, getPage, custom error reporting, etc.).

