Getting Scrapy project settings when the script is outside of the root directory
Question
I have made a Scrapy spider that runs successfully from a script located in the root directory of the project. Because I need to run multiple spiders from different projects from the same script (this will be a Django app calling the script upon the user's request), I moved the script from the root of one of the projects to the parent directory. Since the move, the script no longer picks up the project's custom settings, so the scraped results are not passed through the pipeline into the database tables. Here is the code from the Scrapy docs that I'm using to run the spider from a script:
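from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def spiderCrawl():
    # MySpider3 is the spider class defined in this same module
    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()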
Is there an extra module that needs to be imported in order to get the project settings from outside the project, or does something need to be added to this code? The script that runs the spiders is below. Thanks.
from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider
from vividseats_scraper.vividseats_scraper import *
from vividseats_scraper.vividseats_scraper.spiders import vs_spider
tc_spider.spiderCrawl()
vs_spider.spiderCrawl()
Recommended answer
Thanks to some of the answers already provided here, I realised Scrapy wasn't actually importing the settings.py file. This is how I fixed it.
TL;DR: Make sure the SCRAPY_SETTINGS_MODULE environment variable points to your actual settings.py file, written as a dotted module path. I do this in the __init__() function of Scraper.
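For context, here is a rough sketch of why the settings get lost. As far as I understand it, when SCRAPY_SETTINGS_MODULE is not set, Scrapy looks for a scrapy.cfg file starting in the current working directory and walking up through its parents. A script started from the parent of the project never passes through the folder that holds scrapy.cfg, so no project is found. The helper find_scrapy_cfg below is a hypothetical stand-in written with the standard library only; it is not Scrapy's own code.

```python
from pathlib import Path


def find_scrapy_cfg(start_dir):
    """Return the nearest scrapy.cfg at or above start_dir, or None.

    Hypothetical illustration of an upward search; not Scrapy's implementation.
    """
    current = Path(start_dir).resolve()
    # Check the starting folder first, then each ancestor up to the filesystem root.
    for folder in (current, *current.parents):
        candidate = folder / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None
```

Under this assumption, starting from inside the project finds the file, while starting from the folder above it finds nothing, which matches the behaviour described in the question.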
Consider a project with the following structure.
my_project/
    main.py                 # Where we are running scrapy from
    scraper/
        run_scraper.py      # Call from main goes here
        scrapy.cfg          # deploy configuration file
        scraper/            # project's Python module, you'll import your code from here
            __init__.py
            items.py        # project items definition file
            pipelines.py    # project pipelines file
            settings.py     # project settings file
            spiders/        # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py    # Contains the QuotesSpider class
Basically, the command scrapy startproject scraper was executed in the my_project folder. I then added a run_scraper.py file to the outer scraper folder, a main.py file to the root folder, and quotes_spider.py to the spiders folder.
My main.py file:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
My run_scraper.py file:
from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # The path seen from root, i.e. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
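A quick way to check the dotted path before handing it to Scrapy is to ask Python whether the module can be found from the folder that holds main.py. The sketch below uses only the standard library. The name settings_module_is_importable is a hypothetical helper, and root_dir is assumed to be the my_project folder.

```python
import importlib
import importlib.util
import sys


def settings_module_is_importable(dotted_path, root_dir):
    """Return True if dotted_path can be located with root_dir on sys.path.

    Hypothetical diagnostic helper. Note that find_spec imports the parent
    packages of dotted_path, so their __init__.py files are executed.
    """
    sys.path.insert(0, root_dir)
    try:
        importlib.invalidate_caches()
        try:
            return importlib.util.find_spec(dotted_path) is not None
        except ImportError:
            # Raised when a parent package in the dotted path cannot be imported.
            return False
    finally:
        sys.path.remove(root_dir)
```

For the layout above, a check of 'scraper.scraper.settings' against the my_project folder should succeed, whereas the project-relative 'scraper.settings' should not resolve from that root.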
Also note that the settings themselves may need a look-over, since module paths must now be written relative to the root folder (my_project, not scraper). So in my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
And repeat for every other setting that holds a module path!
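To make that concrete, here is a small sketch that rewrites module paths in a plain dictionary of settings. The helper names reroot and reroot_settings, the ITEM_PIPELINES entry, and the ScraperPipeline class name are illustrative assumptions; in practice you would simply edit settings.py by hand.

```python
def reroot(path, old_prefix, new_prefix):
    """Swap old_prefix for new_prefix when path starts with that package name."""
    if path == old_prefix or path.startswith(old_prefix + "."):
        return new_prefix + path[len(old_prefix):]
    return path


def reroot_settings(settings, keys, old_prefix, new_prefix):
    """Return a copy of settings with module paths under the given keys rewritten."""
    result = dict(settings)
    for key in keys:
        value = settings.get(key)
        if isinstance(value, str):
            result[key] = reroot(value, old_prefix, new_prefix)
        elif isinstance(value, list):
            result[key] = [reroot(v, old_prefix, new_prefix) for v in value]
        elif isinstance(value, dict):
            # Settings such as ITEM_PIPELINES map a dotted path to a priority.
            result[key] = {reroot(k, old_prefix, new_prefix): v for k, v in value.items()}
    return result


# Hypothetical project-relative settings, as generated inside the scraper project.
project_relative = {
    "BOT_NAME": "scraper",
    "SPIDER_MODULES": ["scraper.spiders"],
    "NEWSPIDER_MODULE": "scraper.spiders",
    "ITEM_PIPELINES": {"scraper.pipelines.ScraperPipeline": 300},
}

# Only the keys listed here are treated as module paths; BOT_NAME is left alone.
root_relative = reroot_settings(
    project_relative,
    keys=("SPIDER_MODULES", "NEWSPIDER_MODULE", "ITEM_PIPELINES"),
    old_prefix="scraper",
    new_prefix="scraper.scraper",
)
```

Passing the keys explicitly avoids touching settings such as BOT_NAME that happen to share the package name but are not import paths.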