将抓取的项目导出到不同的文件 [英] Export scrapy items to different files

查看:67
本文介绍了将抓取的项目导出到不同的文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在从Moocs抓取评论,例如

I'm scraping review from moocs likes this one

从那里,我获得了所有课程详细信息,每条评论本身的5项和另外6项.

From there I'm getting all the course details, 5 items and another 6 items from each review itself.

这是我用于课程详细信息的代码:

This is the code I have for the course details:

def parse_reviews(self, response):
    l = ItemLoader(item=MoocsItem(), response=response)
    l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
    l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
    l.add_xpath('course_instructors', '//*[@class="course-info__instructors__names"]//text()')
    l.add_xpath('course_key_concepts', '//*[@class="key-concepts__labels"]//text()')
    l.add_value('course_link', response.url)
    return l.load_item()

现在,我想包括评论详细信息,每个评论还有5个项目. 由于所有评论的课程数据都是通用的,因此我想将其存储在其他文件中,然后使用课程名称/id关联数据.

Now I want to include the review details, another 5 items for each review. Since the course data is common for all the reviews I want to store it in a different file and use course name/id to relate the data afterward.

这是我为评论内容提供的代码:

This is the code I have for the review's items:

for review in response.xpath('//*[@class="review-body"]'):
    review_body = review.xpath('.//div[@class="review-body__content"]//text()').extract()
    course_stage =  review.xpath('.//*[@class="review-body-info__course-stage--completed"]//text()').extract()
    user_name =  review.xpath('.//*[@class="review-body__username"]//text()').extract()
    review_date =  review.xpath('.//*[@itemprop="datePublished"]/@datetime').extract()
    score =  review.xpath('.//*[@class="sr-only"]//text()').extract()

我尝试使用一种临时解决方案,返回每种情况下的所有项目,但均不起作用:

I tried to work with a temporary solution, returning all the items for each case but is not working either:

def parse_reviews(self, response):
    #print response.body
    l = ItemLoader(item=MoocsItem(), response=response)
    #l = MyItemLoader(selector=response)
    l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
    l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
    l.add_xpath('course_instructors', '//*[@class="course-info__instructors__names"]//text()')
    l.add_xpath('course_key_concepts', '//*[@class="key-concepts__labels"]//text()')
    l.add_value('course_link', response.url)

    for review in response.xpath('//*[@class="review-body"]'):
        l.add_xpath('review_body', './/div[@class="review-body__content"]//text()')
        l.add_xpath('course_stage', './/*[@class="review-body-info__course-stage--completed"]//text()')
        l.add_xpath('user_name', './/*[@class="review-body__username"]//text()')
        l.add_xpath('review_date', './/*[@itemprop="datePublished"]/@datetime')
        l.add_xpath('score', './/*[@class="sr-only"]//text()')

        yield l.load_item()

该脚本的输出文件已损坏,单元已移位并且字段大小不正确.

The output file for that script is corrupted, cells are displaced and the size of the fields is not correct.

我想在输出中有两个文件:

I want to have two files at the output:

第一个包含:

course_title,course_description,course_instructors,course_key_concepts,course_link

第二个是:

course_title,review_body,course_stage,user_name,review_date,score

推荐答案

问题是您将所有内容混合到一个项目中,这不是正确的方法.您应该创建两个项目MoocsItemMoocsReviewItem

The issue is you are mixing everything up into a single item, which is not the right way to do it. You should created two items MoocsItem and MoocsReviewItem

然后更新如下代码

def parse_reviews(self, response):
    #print response.body
    l = ItemLoader(item=MoocsItem(), response=response)
    l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
    l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
    l.add_xpath('course_instructors', '//*[@class="course-info__instructors__names"]//text()')
    l.add_xpath('course_key_concepts', '//*[@class="key-concepts__labels"]//text()')
    l.add_value('course_link', response.url)

    item = l.load_item()

    for review in response.xpath('//*[@class="review-body"]'):
        r = ItemLoader(item=MoocsReviewItem(), response=response, selector=review)
        r.add_value('course_title', item['course_title'])
        r.add_xpath('review_body', './/div[@class="review-body__content"]//text()')
        r.add_xpath('course_stage', './/*[@class="review-body-info__course-stage--completed"]//text()')
        r.add_xpath('user_name', './/*[@class="review-body__username"]//text()')
        r.add_xpath('review_date', './/*[@itemprop="datePublished"]/@datetime')
        r.add_xpath('score', './/*[@class="sr-only"]//text()')

        yield r.load_item()

    yield item

现在,您要的是将不同的项目类型放入不同的csv文件中.下面的SO线程回答了什么

Now what you want is that different item type goes in different csv files. Which is what the below SO thread answers

如何抓取导出项目以进行分离每个项目的csv文件

尚未测试以下内容,但是代码将变为以下内容

Have not tested the below, but the code will become something like below

from scrapy.exporters import CsvItemExporter
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher


def item_type(item):
    return type(item).__name__.replace('Item','').lower()  # TeamItem => team

class MultiCSVItemPipeline(object):
    SaveTypes = ['moocs','moocsreview']

    def __init__(self):
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self, spider):
        self.files = dict([ (name, open(CSVDir+name+'.csv','w+b')) for name in self.SaveTypes ])
        self.exporters = dict([ (name,CsvItemExporter(self.files[name])) for name in self.SaveTypes])
        [e.start_exporting() for e in self.exporters.values()]

    def spider_closed(self, spider):
        [e.finish_exporting() for e in self.exporters.values()]
        [f.close() for f in self.files.values()]

    def process_item(self, item, spider):
        what = item_type(item)
        if what in set(self.SaveTypes):
            self.exporters[what].export_item(item)
        return item

您需要确保将ITEM_PIPELINES更新为使用此MultiCSVItemPipeline

You need make sure the ITEM_PIPELINES is updated to use this MultiCSVItemPipeline class

ITEM_PIPELINES = {
    'mybot.pipelines.MultiCSVItemPipeline': 300,
}

这篇关于将抓取的项目导出到不同的文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆