Scrapy: store returned items in variables to use in the main script
Question
I am quite new to Scrapy and want to try the following: extract some values from a webpage, store them in a variable, and use them in my main script. I followed the official tutorial and changed the code for my purposes:
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        global title  # This would work, but there should be a better way
        title = response.css('title::text').extract_first()


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(QuotesSpider)
process.start()  # the script will block here until the crawling is finished

print(title)  # Verify if it works and do some other actions later on...
This works so far, but I am pretty sure it is not good style, and may even have bad side effects, to define the title variable as global. If I skip that line, I get an "undefined variable" error, of course :/ So I am searching for a way to return the variable and use it in my main script.
I have read about the item pipeline, but I was not able to make it work.
Any help/ideas are heavily appreciated :) Thanks in advance!
Answer
Using global, as you know, is not good style, especially when you need to extend your requirements later.
My suggestion is to store the title in a file or a list and use it in your main process; or, if you want to handle the title from another script, just open the file and read the titles there.
spider.py
import scrapy
from scrapy.crawler import CrawlerProcess

namefile = 'namefile.txt'
current_title_session = []  # titles stored in the current session

# open in append mode so titles survive across runs;
# 'a' also creates the file if it does not exist yet
file_append = open(namefile, 'a', encoding='utf-8')
try:
    with open(namefile, 'r', encoding='utf-8') as f:
        title_in_file = f.readlines()
except FileNotFoundError:
    title_in_file = []


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        title = response.css('title::text').extract_first()
        # only write titles not already in the file or in this session
        if title + '\n' not in title_in_file and title not in current_title_session:
            file_append.write(title + '\n')
            current_title_session.append(title)


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(QuotesSpider)
    process.start()  # the script will block here until the crawling is finished
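To pick the titles up again in your main script (or any other script), read namefile.txt back after the crawl has finished. A minimal sketch, assuming namefile.txt was written by the spider above:

```python
import os

namefile = 'namefile.txt'  # the file spider.py appends titles to

# read all non-empty lines back into a list of titles
titles = []
if os.path.exists(namefile):
    with open(namefile, 'r', encoding='utf-8') as f:
        titles = [line.strip() for line in f if line.strip()]

for title in titles:
    print(title)  # ...or do some other actions with each title
```

Because the titles live in a file rather than a global variable, any process can consume them without caring how the spider was run.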