scrapy passing custom_settings to spider from script using CrawlerProcess.crawl()


Question

I am trying to programmatically call a spider from a script. I am unable to override the settings through the constructor using CrawlerProcess. Let me illustrate this with the default spider for scraping quotes from the official scrapy site (last code snippet at the official scrapy quotes example spider).

from scrapy import Spider, Request


class QuotesSpider(Spider):
    name = "quotes"

    def __init__(self, somestring, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.somestring = somestring
        self.custom_settings = kwargs


    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Here is the script through which I try to run the quotes spider

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

def main():

    proc = CrawlerProcess(get_project_settings())

    custom_settings_spider = {
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    }
    proc.crawl('quotes', 'dummyinput', **custom_settings_spider)
    proc.start()

Answer

Scrapy Settings are a bit like Python dicts. So you can update the settings object before passing it to CrawlerProcess:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

def main():

    s = get_project_settings()
    s.update({
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    })
    proc = CrawlerProcess(s)

    proc.crawl('quotes', 'dummyinput')
    proc.start()
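The "a bit like" matters: unlike a plain dict, each Scrapy setting carries a priority, and an update only overwrites a value when the new priority is at least as high as the stored one. Here is a scrapy-free sketch of that idea (a simplified illustrative model, not the real `scrapy.settings.Settings` API):

```python
# Simplified, illustrative model of priority-aware settings
# (not the real Scrapy class; names and levels are assumptions).
PRIORITIES = {'default': 0, 'project': 20, 'spider': 30, 'cmdline': 40}


class PrioritySettings:
    def __init__(self):
        self._data = {}  # name -> (value, numeric priority)

    def set(self, name, value, priority='project'):
        p = PRIORITIES[priority]
        # Only overwrite when the new priority is >= the stored one.
        if name not in self._data or p >= self._data[name][1]:
            self._data[name] = (value, p)

    def update(self, values, priority='project'):
        for name, value in values.items():
            self.set(name, value, priority)

    def get(self, name, default=None):
        return self._data.get(name, (default, None))[0]


s = PrioritySettings()
s.set('FEED_URI', 'default.csv', priority='cmdline')
s.update({'FEED_URI': 'quotes.csv', 'LOG_FILE': 'quotes.log'})  # 'project' level
# FEED_URI keeps the higher-priority cmdline value; LOG_FILE is new, so it is stored.
```

This is why `s.update({...})` on the object returned by `get_project_settings()` works in the answer above: the values land at a priority high enough to take effect.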

---

Edit following OP's comments:

Here's a variation using CrawlerRunner, with a new CrawlerRunner for each crawl and re-configuring logging at each iteration to write to different files each time:

import logging
from twisted.internet import reactor, defer

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging, _get_handler
from scrapy.utils.project import get_project_settings


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        page = getattr(self, 'page', 1)
        yield scrapy.Request('http://quotes.toscrape.com/page/{}/'.format(page),
                             self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }


@defer.inlineCallbacks
def crawl():
    s = get_project_settings()
    for i in range(1, 4):
        s.update({
            'FEED_URI': 'quotes%03d.csv' % i,
            'LOG_FILE': 'quotes%03d.log' % i
        })

        # manually configure logging for LOG_FILE
        configure_logging(settings=s, install_root_handler=False)
        logging.root.setLevel(logging.NOTSET)
        handler = _get_handler(s)
        logging.root.addHandler(handler)

        runner = CrawlerRunner(s)
        yield runner.crawl(QuotesSpider, page=i)

        # reset root handler
        logging.root.removeHandler(handler)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
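A related caveat (my note, not part of the original answer): Scrapy consults `custom_settings` as a class attribute, through the classmethod `update_settings()`, before any spider instance exists, so assigning `self.custom_settings` in `__init__`, as in the question's spider, comes too late to affect the crawl. A scrapy-free sketch of the mechanism (simplified, not the real API):

```python
# Simplified sketch of how custom_settings is applied (illustrative, not real Scrapy).
class Settings(dict):
    pass


class Spider:
    custom_settings = None  # consulted at *class* level

    @classmethod
    def update_settings(cls, settings):
        # The crawler calls this before any spider instance exists.
        settings.update(cls.custom_settings or {})

    def __init__(self, **kwargs):
        # Too late: update_settings() has already run by now.
        self.custom_settings = kwargs


class QuotesSpider(Spider):
    custom_settings = {'FEED_URI': 'quotes.csv'}  # this one is honored


settings = Settings({'LOG_FILE': 'default.log'})
QuotesSpider.update_settings(settings)         # class-level merge happens first
spider = QuotesSpider(FEED_URI='ignored.csv')  # instance attribute has no effect
```

So the two reliable places to put per-run settings are the settings object passed to `CrawlerProcess`/`CrawlerRunner` (as shown above) or a `custom_settings` class attribute.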
