How to pass parameters to scrapy spiders in a program?

Question

I am new to Python and Scrapy. I used the method from the blog post Running multiple scrapy spiders programmatically to run my spiders from a Flask app. Here is the code:

from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from twisted.internet import reactor

# list of crawlers (the spider classes are imported from the project)
TO_CRAWL = [DmozSpider, EPGDspider, GDSpider]

# crawlers that are running 
RUNNING_CRAWLERS = []

def spider_closing(spider):
    """
    Activates on spider closed signal
    """
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

# start logger
log.start(loglevel=log.DEBUG)

# set up the crawler and start to crawl one spider at a time
for spider in TO_CRAWL:
    settings = Settings()

    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)

    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks process; so always keep as the last statement
reactor.run()

And here is my spider code:

import scrapy
from scrapy.selector import Selector
from scrapy.http import Request
# EPGD is the item class defined in the project's items module


class EPGDspider(scrapy.Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=" + term + "&submit=Feeling+Lucky"]
    MONGODB_DB = name + "_" + term
    MONGODB_COLLECTION = name + "_" + term

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url + map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url + map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        # pagination: find the "#" entry and follow the link right after it
        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i + 1 < len(url_list[0]):
                    print url_list[0][i + 1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i + 1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

As you can see, there is a parameter term = 'man' in my code, and it is part of my start URLs. I don't want this parameter to be fixed, so how can I supply the start URL or the term parameter dynamically from my program? When running a spider from the command line there is a way to pass a parameter, as below:

class MySpider(BaseSpider):

    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

And start it like: scrapy crawl my_spider -a start_url="http://some_url"

Can anybody tell me how to deal with this?

Answer

First of all, to run multiple spiders in a script, the recommended way is to use scrapy.crawler.CrawlerProcess, where you pass spider classes and not spider instances.
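
For reference, here is a minimal sketch of such a runner (assuming the three spider classes from the question are importable in this script):

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

settings = Settings()
settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")

# pass the spider classes; CrawlerProcess instantiates and schedules them
process = CrawlerProcess(settings)
for spider_cls in [DmozSpider, EPGDspider, GDSpider]:
    process.crawl(spider_cls)

# blocks here and stops the reactor once all crawls are finished
process.start()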

To pass arguments to your spider with CrawlerProcess, you just have to add the arguments to the .crawl() call, after the spider subclass, e.g.

    process.crawl(DmozSpider, term='someterm', someotherterm='anotherterm')

Arguments passed this way are then available as spider attributes (the same as with -a term=someterm on the command line).

Finally, instead of building start_urls in __init__, you can achieve the same thing with start_requests and build the initial request there using self.term:

def start_requests(self):
    yield Request("http://epgd.biosino.org/"
                  "EPGD/search/textsearch.jsp?"
                  "textquery={}"
                  "&submit=Feeling+Lucky".format(self.term))
