Scrapy: store returned items in variables to use in the main script


Question

I am quite new to Scrapy and want to try the following: extract some values from a web page, store them in a variable, and use that variable in my main script. I followed the Scrapy tutorial and changed the code for my purposes:

import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        global title # This would work, but there should be a better way
        title = response.css('title::text').extract_first()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start() # the script will block here until the crawling is finished

print(title) # Verify if it works and do some other actions later on...

This works so far, but I am pretty sure that defining the title variable as global is not good style and may even have bad side effects. If I skip that line, I of course get an "undefined variable" error. So I am looking for a way to return the variable and use it in my main script.

I have read about item pipelines, but I was not able to make one work.

Any help or ideas are greatly appreciated :) Thanks in advance!

Recommended answer

As you know, using global is not good style, especially once your requirements grow.

My suggestion is to store the title in a file or a list and use it in your main process. If you want to handle the title in another script, just open the file and read the titles there.
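Reading the titles back in another script could look like the sketch below. It assumes the spider writes one title per line to a UTF-8 text file called `namefile.txt` (the name used in the spider code in this answer); the helper name `read_titles` is made up for illustration.

```python
from pathlib import Path


def read_titles(path="namefile.txt"):
    """Return the titles stored one per line in `path`, or [] if the file is missing."""
    file = Path(path)
    if not file.exists():
        return []
    with file.open(encoding="utf-8") as f:
        # Strip the trailing newline from each line and skip blank lines.
        return [line.strip() for line in f if line.strip()]


if __name__ == "__main__":
    # Run this after the spider has finished writing the file.
    print(read_titles())
```

Returning an empty list for a missing file lets the calling script run before the first crawl without special handling.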

(Note: please ignore any indentation issues in the code below)

spider.py

import scrapy
from scrapy.crawler import CrawlerProcess

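namefile = 'namefile.txt'
current_title_session = []  # titles stored during the current session

# Load titles saved by earlier runs; start empty if the file does not exist yet.
try:
    with open(namefile, 'r', encoding='utf-8') as f:
        title_in_file = f.readlines()
except FileNotFoundError:
    title_in_file = []

# Opened in append mode so new titles are added after the existing ones.
file_append = open(namefile, 'a', encoding='utf-8')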
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        title = response.css('title::text').extract_first()
        # Skip a missing title and titles already saved in the file or in this session.
        if title and title + '\n' not in title_in_file and title not in current_title_session:
            file_append.write(title + '\n')
            current_title_session.append(title)


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(QuotesSpider)
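    process.start() # the script will block here until the crawling is finished

    file_append.close()  # flush the titles written during the crawl
    print(current_title_session)  # titles collected in this run, ready for further use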
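If the file is not needed, the list variant can also be done without module-level state by listening to Scrapy's `item_scraped` signal. This is only a sketch of a different technique from the code above, not the answer's own method. It assumes the spider's `parse` yields items such as `{'title': title}` instead of writing to a file; the helper names `make_collector` and `crawl_and_collect` are hypothetical.

```python
def make_collector():
    """Return (items, handler); the handler appends each scraped item to items."""
    items = []

    def collect(item, response, spider):
        # Parameter names match the arguments Scrapy sends with item_scraped.
        items.append(item)

    return items, collect


def crawl_and_collect(spider_cls):
    """Run spider_cls once and return the items it yielded as a list."""
    # Scrapy is imported here so the collector above can be used on its own.
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    items, collect = make_collector()
    process = CrawlerProcess()
    crawler = process.create_crawler(spider_cls)
    # item_scraped fires for every item that passes the item pipelines.
    crawler.signals.connect(collect, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl is finished
    return items
```

With a spider that yields `{'title': ...}` from `parse`, calling `crawl_and_collect(QuotesSpider)` in the main script should return a list of those dicts. Note that `process.start()` can only be run once per Python process, so this helper is meant to be called a single time.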
