How to pass custom settings through CrawlerProcess in scrapy?


Problem description

I have two CrawlerProcesses, each calling a different spider. I want to pass custom settings to one of these processes to save the spider's output to CSV. I thought I could do this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

storage_settings = {'FEED_FORMAT': 'csv', 'FEED_URI': 'foo.csv'}
process = CrawlerProcess(get_project_settings())
process.crawl('ABC', crawl_links=main_links, custom_settings=storage_settings)
process.start()

and in my spider I read them as an argument:

    def __init__(self, crawl_links=None, allowed_domains=None, custom_settings=None, *args, **kwargs):
        self.start_urls = crawl_links
        self.allowed_domains = allowed_domains
        self.custom_settings = custom_settings
        self.rules = ......
        super(mySpider, self).__init__(*args, **kwargs)

But how can I tell my project settings file "settings.py" about these custom settings? I don't want to hard-code them; rather, I want them to be read automatically.

Recommended answer

You cannot tell your settings file about these settings. You are perhaps confusing crawler settings with spider settings. In Scrapy, as of this writing, the feed parameters need to be passed to the crawler process, not to the spider. I have the same use case as you. What you do is read the current project settings and then override them for each crawler process. See the example code below:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Read the project settings, then override per-process values.
s = get_project_settings()
s['FEED_FORMAT'] = 'csv'
s['LOG_LEVEL'] = 'INFO'
s['FEED_URI'] = 'Q1.csv'
s['LOG_FILE'] = 'Q1.log'

proc = CrawlerProcess(s)

Also, your call to process.crawl() is not correct. The name of the spider should be passed as the first argument, as a string, like this: process.crawl('MySpider', crawl_links=main_links). Of course, MySpider should be the value given to the name attribute in your spider class.

