Getting Scrapy project settings when the script is outside of the root directory

Question

I have made a Scrapy spider that runs successfully from a script located in the project's root directory. Because I need to run multiple spiders from different projects from the same script (a Django app will call the script at the user's request), I moved the script from the root of one of the projects to the parent directory. For some reason, the script can no longer load the project's custom settings, which it needs in order to pipeline the scraped results into the database tables. Here is the code from the Scrapy docs that I'm using to run the spider from a script:

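from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def spiderCrawl():
    # Load the project's settings, then override the user agent
    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()  # blocks until the crawl is finished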

Is there some extra module that needs to be imported in order to get the project settings from outside of the project? Or do some additions need to be made to this code? Below is the code for the script that runs the spiders. Thanks.

from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider
from vividseats_scraper.vividseats_scraper import *
from vividseats_scraper.vividseats_scraper.spiders import vs_spider 

tc_spider.spiderCrawl()
vs_spider.spiderCrawl()


Recommended answer

Thanks to some of the answers already provided here, I realised Scrapy wasn't actually importing the settings.py file. This is how I fixed it.

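TL;DR: Make sure the SCRAPY_SETTINGS_MODULE environment variable points at your actual settings.py file, given as a dotted module path. I do this in the __init__() method of the Scraper class.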

Consider a project with the following structure.

my_project/
    main.py                 # Where we are running scrapy from
    scraper/
        run_scraper.py               # Call from main goes here
        scrapy.cfg                   # deploy configuration file
        scraper/                     # project's Python module, you'll import your code from here
            __init__.py
            items.py                 # project items definition file
            pipelines.py             # project pipelines file
            settings.py              # project settings file
            spiders/                 # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py     # Contains the QuotesSpider class

Basically, I ran the command scrapy startproject scraper in the my_project folder, then added a run_scraper.py file to the outer scraper folder, a main.py file to the root folder, and quotes_spider.py to the spiders folder.

My main.py file:

from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()

My run_scraper.py file:

from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # Dotted path as seen from the root, i.e. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
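
One detail worth keeping in mind when adapting this: os.environ.setdefault only assigns the variable if it is not already set, so in a process that has already picked a settings module, a second call with a different path is silently ignored. Here is a minimal sketch of that behaviour using only the standard library (other_project.settings is a hypothetical second project, not part of the layout above):

```python
import os

KEY = "SCRAPY_SETTINGS_MODULE"

# Start from a clean slate so the demonstration is deterministic.
os.environ.pop(KEY, None)

# First call: the variable is unset, so the value is stored.
os.environ.setdefault(KEY, "scraper.scraper.settings")
first = os.environ[KEY]

# Second call: the variable already exists, so this is a no-op.
# "other_project.settings" is a hypothetical path used for illustration.
os.environ.setdefault(KEY, "other_project.settings")
second = os.environ[KEY]

# Plain assignment always overrides the previous value.
os.environ[KEY] = "other_project.settings"
third = os.environ[KEY]

print(first)   # scraper.scraper.settings
print(second)  # scraper.scraper.settings
print(third)   # other_project.settings
```

If the same process needs to load settings for more than one project, plain assignment before each get_project_settings() call is probably the safer choice.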

Also, note that your settings may need a look-over, since the module paths in them must be given relative to the root folder (my_project, not scraper). So in my case:

SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'

Repeat this for every other setting that refers to a module path, such as the entries in ITEM_PIPELINES.
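
To see why these paths have to start at my_project, the following sketch rebuilds a throwaway copy of the layout in a temporary directory and imports the settings module by its dotted path, which is, as far as I understand it, roughly what Scrapy does with the value of SCRAPY_SETTINGS_MODULE. It uses only the standard library; the file contents, the BOT_NAME value and the try_import helper are placeholders for illustration, and Scrapy itself is not involved.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Recreate a minimal, throwaway version of the layout (placeholder contents).
root = Path(tempfile.mkdtemp()) / "my_project"
inner = root / "scraper" / "scraper"
inner.mkdir(parents=True)
(inner / "__init__.py").write_text("")
(inner / "settings.py").write_text("BOT_NAME = 'scraper'\n")


def try_import(dotted_path):
    """Return the imported module, or None if the path does not resolve."""
    try:
        return importlib.import_module(dotted_path)
    except ImportError:
        return None


# Running main.py puts my_project/ on sys.path; emulate that here.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Path as seen from my_project/: resolves to the inner settings.py.
from_root = try_import("scraper.scraper.settings")

# Path as seen from inside the Scrapy project: does not resolve from the root.
from_project = try_import("scraper.settings")

print(from_root.BOT_NAME)
print(from_project)
```

The second lookup fails because, seen from my_project, the settings module sits one package deeper than it does when Scrapy is run from inside the project.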
