Export scrapy items to different files
Problem description
I'm scraping reviews from MOOCs like this one
From there I'm getting all the course details (5 items) and another 6 items from each review itself.
This is the code I have for the course details:
def parse_reviews(self, response):
    l = ItemLoader(item=MoocsItem(), response=response)
    l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
    l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
    l.add_xpath('course_instructors', '//*[@class="course-info__instructors__names"]//text()')
    l.add_xpath('course_key_concepts', '//*[@class="key-concepts__labels"]//text()')
    l.add_value('course_link', response.url)
    return l.load_item()
Now I want to include the review details, another 5 items for each review. Since the course data is common for all the reviews I want to store it in a different file and use course name/id to relate the data afterward.
This is the code I have for the review's items:
for review in response.xpath('//*[@class="review-body"]'):
    review_body = review.xpath('.//div[@class="review-body__content"]//text()').extract()
    course_stage = review.xpath('.//*[@class="review-body-info__course-stage--completed"]//text()').extract()
    user_name = review.xpath('.//*[@class="review-body__username"]//text()').extract()
    review_date = review.xpath('.//*[@itemprop="datePublished"]/@datetime').extract()
    score = review.xpath('.//*[@class="sr-only"]//text()').extract()
I tried a temporary solution, returning all the items for each case, but it is not working either:
def parse_reviews(self, response):
    #print response.body
    l = ItemLoader(item=MoocsItem(), response=response)
    #l = MyItemLoader(selector=response)
    l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
    l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
    l.add_xpath('course_instructors', '//*[@class="course-info__instructors__names"]//text()')
    l.add_xpath('course_key_concepts', '//*[@class="key-concepts__labels"]//text()')
    l.add_value('course_link', response.url)
    for review in response.xpath('//*[@class="review-body"]'):
        l.add_xpath('review_body', './/div[@class="review-body__content"]//text()')
        l.add_xpath('course_stage', './/*[@class="review-body-info__course-stage--completed"]//text()')
        l.add_xpath('user_name', './/*[@class="review-body__username"]//text()')
        l.add_xpath('review_date', './/*[@itemprop="datePublished"]/@datetime')
        l.add_xpath('score', './/*[@class="sr-only"]//text()')
        yield l.load_item()
The output file from that script is corrupted: cells are displaced and the field sizes are not correct.
I want two output files:
The first containing:
course_title,course_description,course_instructors,course_key_concepts,course_link
The second containing:
course_title,review_body,course_stage,user_name,review_date,score
Recommended answer
The issue is that you are mixing everything into a single item, which is not the right way to do it. You should create two items, MoocsItem and MoocsReviewItem.
Then update the code as follows:
def parse_reviews(self, response):
    #print response.body
    l = ItemLoader(item=MoocsItem(), response=response)
    l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
    l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
    l.add_xpath('course_instructors', '//*[@class="course-info__instructors__names"]//text()')
    l.add_xpath('course_key_concepts', '//*[@class="key-concepts__labels"]//text()')
    l.add_value('course_link', response.url)
    item = l.load_item()
    for review in response.xpath('//*[@class="review-body"]'):
        r = ItemLoader(item=MoocsReviewItem(), response=response, selector=review)
        r.add_value('course_title', item['course_title'])
        r.add_xpath('review_body', './/div[@class="review-body__content"]//text()')
        r.add_xpath('course_stage', './/*[@class="review-body-info__course-stage--completed"]//text()')
        r.add_xpath('user_name', './/*[@class="review-body__username"]//text()')
        r.add_xpath('review_date', './/*[@itemprop="datePublished"]/@datetime')
        r.add_xpath('score', './/*[@class="sr-only"]//text()')
        yield r.load_item()
    yield item
Now what you want is for each item type to go into its own CSV file, which is what the SO thread below answers.
I have not tested the code below, but it will look something like this:
from scrapy.exporters import CsvItemExporter

# Assumption: CSVDir is not defined in the original snippet. Set it to the
# output directory prefix (with a trailing slash), or leave it empty to write
# into the current working directory.
CSVDir = ''


def item_type(item):
    # MoocsItem => moocs, MoocsReviewItem => moocsreview
    return type(item).__name__.replace('Item', '').lower()


class MultiCSVItemPipeline(object):
    SaveTypes = ['moocs', 'moocsreview']

    # Scrapy calls open_spider/close_spider on every enabled pipeline, so the
    # scrapy.xlib.pydispatch signal wiring (no longer shipped with recent
    # Scrapy releases) is not needed.
    def open_spider(self, spider):
        # CsvItemExporter writes bytes, so the files are opened in binary mode.
        self.files = {name: open(CSVDir + name + '.csv', 'wb') for name in self.SaveTypes}
        self.exporters = {name: CsvItemExporter(self.files[name]) for name in self.SaveTypes}
        for exporter in self.exporters.values():
            exporter.start_exporting()

    def close_spider(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        # Route each item to the exporter that matches its class name.
        what = item_type(item)
        if what in self.exporters:
            self.exporters[what].export_item(item)
        return item
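The names in SaveTypes have to match what item_type() returns for each item class, otherwise process_item() skips the item without any error. A minimal check of that mapping, using plain stand-in classes instead of real Scrapy items (the stand-ins are an assumption for illustration; only the class names matter here):

```python
def item_type(item):
    # Same rule as in the pipeline: drop 'Item' from the class name and lowercase it.
    return type(item).__name__.replace('Item', '').lower()


# Stand-in classes for illustration only; in the project these would be Scrapy items.
class MoocsItem(dict):
    pass


class MoocsReviewItem(dict):
    pass


names = [item_type(MoocsItem()), item_type(MoocsReviewItem())]
print(names)  # ['moocs', 'moocsreview']
```

If an item class is renamed, the matching entry in SaveTypes needs the same change, and the CSV file name changes with it.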
You need to make sure that ITEM_PIPELINES is updated to use this MultiCSVItemPipeline class:
ITEM_PIPELINES = {
    'mybot.pipelines.MultiCSVItemPipeline': 300,
}
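Once both files exist, the reviews can be related back to their course through the shared course_title column, as the question intends. A minimal sketch using only the standard library: the helper name join_reviews_to_courses and the sample rows are hypothetical, and the in-memory buffers stand in for moocs.csv and moocsreview.csv (the names that SaveTypes would produce):

```python
import csv
import io


def join_reviews_to_courses(courses_file, reviews_file):
    """Attach each review row to its course row via the shared course_title column."""
    courses = {row['course_title']: row for row in csv.DictReader(courses_file)}
    joined = []
    for review in csv.DictReader(reviews_file):
        # A review whose course is missing keeps only its own columns.
        course = courses.get(review['course_title'], {})
        joined.append({**course, **review})
    return joined


# Made-up sample data standing in for the two exported files.
courses_csv = io.StringIO(
    "course_title,course_link\n"
    "Intro to Testing,http://example.com/intro\n"
)
reviews_csv = io.StringIO(
    "course_title,user_name,score\n"
    "Intro to Testing,alice,5\n"
    "Intro to Testing,bob,4\n"
)

rows = join_reviews_to_courses(courses_csv, reviews_csv)
print(len(rows))
```

With real files, open them with newline='' as the csv module recommends and pass the file objects in place of the StringIO buffers. This only works if course_title is unique per course; a course id would be a safer key where titles can repeat.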