Getting scrapy project settings when script is outside of root directory


Question


I have made a Scrapy spider that can be successfully run from a script located in the root directory of the project. As I need to run multiple spiders from different projects from the same script (this will be a django app calling the script upon the user's request), I moved the script from the root of one of the projects to the parent directory. For some reason, the script is no longer able to get the project's custom settings in order to pipeline the scraped results into the database tables. Here is the code from the scrapy docs I'm using to run the spider from a script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def spiderCrawl():
    # MySpider3 is imported from this project's spiders module
    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()


Is there some extra module that needs to be imported in order to get the project settings from outside of the project? Or does there need to be some additions made to this code? Below I also have the code for the script running the spiders, thanks.

from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider
from vividseats_scraper.vividseats_scraper import *
from vividseats_scraper.vividseats_scraper.spiders import vs_spider 

tc_spider.spiderCrawl()
vs_spider.spiderCrawl()

Answer


Thanks to some of the answers already provided here, I realised scrapy wasn't actually importing the settings.py file. This is how I fixed it.

TL;DR: Make sure you set the SCRAPY_SETTINGS_MODULE environment variable to the dotted path of your actual settings.py module. I do this in the Scraper class's __init__() method.

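This is why the answer below calls os.environ.setdefault before constructing the CrawlerProcess: setdefault only writes the value when the variable is not already set, so a path exported in the shell still takes precedence. A minimal stdlib-only sketch (no Scrapy required) of that behaviour:

```python
import os

# Start from a clean slate so the demonstration is deterministic.
os.environ.pop('SCRAPY_SETTINGS_MODULE', None)

# setdefault writes the value because the variable is currently unset.
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scraper.scraper.settings')
print(os.environ['SCRAPY_SETTINGS_MODULE'])  # scraper.scraper.settings

# A later setdefault with a different path is ignored:
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'other.settings')
print(os.environ['SCRAPY_SETTINGS_MODULE'])  # still scraper.scraper.settings
```

Scrapy reads this variable when get_project_settings() is called, which is why it must be set first.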

Consider a project with the following structure.

my_project/
    main.py                 # Where we are running scrapy from
    scraper/
        run_scraper.py               #Call from main goes here
        scrapy.cfg                   # deploy configuration file
        scraper/                     # project's Python module, you'll import your code from here
            __init__.py
            items.py                 # project items definition file
            pipelines.py             # project pipelines file
            settings.py              # project settings file
            spiders/                 # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py     # Contains the QuotesSpider class
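Given the layout above, the dotted module name Scrapy needs can be derived mechanically from the file path as seen from the run directory. A small hypothetical helper (not part of the original answer) illustrates the mapping:

```python
from pathlib import PurePosixPath

def settings_module_from_path(path):
    """Convert a settings.py path, relative to the directory you run from,
    into the dotted module name expected in SCRAPY_SETTINGS_MODULE."""
    return '.'.join(PurePosixPath(path).with_suffix('').parts)

# settings.py sits two package levels below my_project/, where main.py runs:
print(settings_module_from_path('scraper/scraper/settings.py'))
# scraper.scraper.settings
```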


Basically, the command scrapy startproject scraper was executed in the my_project folder, I've added a run_scraper.py file to the outer scraper folder, a main.py file to my root folder, and quotes_spider.py to the spiders folder.

My main.py file:

from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()

My run_scraper.py file:

from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings' # The path seen from root, ie. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished


Also, note that the settings themselves may need a look-over, since any module paths in them must be written relative to the root folder (my_project, not scraper). So in my case:

SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'


And repeat for all the settings variables you have!
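For example, a pipeline entry would need the same root-relative dotted path. A sketch of what the adjusted settings.py might look like (the QuotesPipeline name is an assumption for illustration, not from the original project):

```python
# scraper/scraper/settings.py -- all dotted paths start from my_project/,
# the directory main.py runs in, not from the inner scraper/ package.
BOT_NAME = 'scraper'
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'

# Hypothetical pipeline entry (QuotesPipeline is an assumed class name):
ITEM_PIPELINES = {
    'scraper.scraper.pipelines.QuotesPipeline': 300,
}
```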

