How to run several versions of one single spider at one time with Scrapy?


Question

My problem is as follows:

To save time, I would like to run several versions of one single spider at once. The process (the parsing definitions) is the same, the items are the same, and the collection in the database is the same. The only thing that changes is the start_url variable. It looks like this:

"https://www.website.com/details/{0}-{1}-{2}/{3}/meeting".format(year,month,day,type_of_meeting)

Assuming the date is the same, for instance 2018-10-24, I would like to launch two versions at the same time:

  • Version 1 with type_of_meeting = pmu
  • Version 2 with type_of_meeting = pmh

This is the first part of my problem. Here I wonder whether I must create two different classes in one single spider file, like class SpiderPmu(scrapy.Spider): and class SpiderPmh(scrapy.Spider): in spider.py. But if that is the best way, I don't know how to implement it with respect to settings.py and pipelines.py. I have already read about CrawlerProcess from the scrapy.crawler module (stack subject, scrapy doc), but I don't understand well how to implement it in my project. I am not sure whether the snippet process = CrawlerProcess(); process.crawl(MySpider1); process.crawl(MySpider2); process.start() must go in the spider.py file and, above all, I am not sure it answers my problem.
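For reference, here is a minimal sketch of that CrawlerProcess approach, assuming one parameterised spider class rather than two separate classes; the import path myproject.spiders.spider is hypothetical. It belongs in its own script at the project root, not in spider.py:

# run.py - a minimal sketch, not the actual project layout
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.spider import MySpider  # hypothetical import path

process = CrawlerProcess(get_project_settings())  # loads settings.py, so pipelines still apply
process.crawl(MySpider, type_of_meeting='pmu')  # keyword arguments reach the spider's __init__
process.crawl(MySpider, type_of_meeting='pmh')
process.start()  # blocks until both crawls are finished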

The second part is how to launch several versions with different date intervals.

I already created some ranges in my spider class, like:

  • year = range(2005,2019)
  • month = range(1,13)
  • day = range(1,32)

and put them in a loop. That works well.

But to save time, I would like to launch several spiders with different intervals of years:

  • First version with year = range(2005,2007)
  • Second version with year = range(2007,2009)
  • and so on, up to year = range(2017,2019)

Seven versions running at the same time would mean roughly seven times faster.

I could create 7 different projects, one for each range of years, but I don't think that is the smartest way... and I am not sure whether using the same database collection from 7 different projects running at the same time would create a conflict.
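As an aside, concurrent writes from several processes into one MongoDB collection are not a problem in themselves; a typical pymongo pipeline sketch (database and collection names here are illustrative, not from the original project) opens one client per spider process:

# pipelines.py - illustrative sketch; 'mydb'/'meetings' are assumed names
import pymongo

class MongoPipeline(object):
    def open_spider(self, spider):
        # one client per crawl process, so parallel runs never share a connection
        self.client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.client['mydb']['meetings']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))  # concurrent inserts into one collection are safe
        return item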

I expect to do something like opening 7 command windows:

  1. scrapy crawl spiderpmu for the version type_of_race = pmu
  2. "Enter a range of years": with raw_input = 2010, 2012 ==> range(2010,2012)
  3. the spider crawls

and in parallel, if that is necessary:

  1. scrapy crawl spiderpmh for the version type_of_race = pmh
  2. "Enter a range of years": with raw_input = 2010, 2012 ==> range(2010,2012)
  3. the spider crawls

Possibly using one single spider, or one single project if needed.

How should I proceed?

PS: I have already made arrangements with prolipo as a proxy, the Tor network to rotate IPs, and a constantly changing USER_AGENT, so I avoid getting banned while crawling with multiple spiders at the same time. My spider is also "polite", with AUTOTHROTTLE_ENABLED = True. I want to keep it polite, but faster.
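For completeness, the throttling side of that lives in settings.py; a sketch with illustrative values (raising AUTOTHROTTLE_TARGET_CONCURRENCY is the usual way to stay polite but go faster):

# settings.py - illustrative values, not the asker's actual configuration
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound on the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote site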

Scrapy version: 1.5.0, Python version: 2.7.9, MongoDB version: 3.6.4, Pymongo version: 3.6.1

Accepted answer

So, I found a solution inspired by scrapy crawl -a variable=value.

The spider concerned, in the "spiders" folder, was transformed into:

import scrapy

class MySpider(scrapy.Spider):
    name = "arg"
    allowed_domains = ['www.website.com']

    # e.g. lo_lim = 2017, up_lim = 2019, type_of_race = pmu
    def __init__(self, lo_lim=None, up_lim=None, type_of_race=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # arguments passed with -a arrive as strings, so convert the limits to integers
        year  = range(int(lo_lim), int(up_lim))  # lower limit, upper limit
        month = range(1, 13)  # 12 months
        day   = range(1, 32)  # 31 days
        url = []
        for y in year:
            for m in month:
                for d in day:
                    url.append("https://www.website.com/details/{}-{}-{}/{}/meeting".format(y, m, d, type_of_race))

        # url = ["https://www.website.com/details/2017-1-1/pmu/meeting",
        #        "https://www.website.com/details/2017-1-2/pmu/meeting",
        #        ...
        #        "https://www.website.com/details/2018-12-31/pmu/meeting"]
        self.start_urls = url

    def parse(self, response):
        ...

This answers my problem: keep one single spider, and run several versions of it through several commands at one time, without trouble.
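Concretely, running several versions at once then just means opening one terminal per command, e.g.:

scrapy crawl arg -a lo_lim=2005 -a up_lim=2007 -a type_of_race=pmu
scrapy crawl arg -a lo_lim=2007 -a up_lim=2009 -a type_of_race=pmu
scrapy crawl arg -a lo_lim=2005 -a up_lim=2007 -a type_of_race=pmh

(Each -a name=value pair is forwarded by Scrapy to the spider's __init__.)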

Without a def __init__ it did not work for me. I tried many approaches; this is the imperfect but working code I ended up with.

Scrapy version: 1.5.0, Python version: 2.7.9, MongoDB version: 3.6.4, Pymongo version: 3.6.1

