How can I export scraped data to a CSV file in the right format?
Question
I improved my code based on input from @paultrmbrth.
But my code's CSV output is a little messy, like this:
I have two questions. First, is there any way to make the CSV output look like the first picture? Second, I want the movie title to be scraped too. Please give me a hint, or some code I can use, to scrape the movie title along with the contents.
UPDATE
Tarun Lalwani solved the original problem perfectly. Now, however, the CSV file's header only contains the categories from the first scraped URL. For example, suppose I scrape this webpage, which has the categories References, Referenced in, Features, Featured in and Spoofed in, and this webpage, which has the categories Follows, Followed by, Edited from, Edited into, Spin-off, References, Referenced in, Features, Featured in, Spoofs and Spoofed in. The CSV header then contains only the first webpage's categories, i.e. References, Referenced in, Features, Featured in and Spoofed in. Categories that appear only on the second webpage, such as Follows, Followed by, Edited from, Edited into and Spoofs, are missing from the header, and so are their contents.
Here is the code I used:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["imdb.com"]
    start_urls = (
        'http://www.imdb.com/title/tt0093777/trivia?tab=mc&ref_=tt_trv_cnn',
        'http://www.imdb.com/title/tt0096874/trivia?tab=mc&ref_=tt_trv_cnn',
    )

    def parse(self, response):
        item = {}
        for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
            item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()
            key = h4.xpath('normalize-space()').get().strip()
            if key in ['Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                       'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                       'Features']:
                values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]', cnt=cnt).xpath(
                    'string(.//a)').getall(),
                item[key] = values
        yield item
Here is the exporters.py file:
try:
    from itertools import zip_longest
except ImportError:
    # Python 2 names this function izip_longest
    from itertools import izip_longest as zip_longest

from scrapy.exporters import CsvItemExporter


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        # The header row is written once, based on the first exported item
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        # Make every column a sequence so the columns can be zipped into rows
        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            # Python 2 only: `unicode` does not exist on Python 3,
            # where self.csv_writer.writerow(row) is enough
            self.csv_writer.writerow([unicode(s).encode("utf-8") for s in row])
What I'm trying to achieve is to have all of these categories in the CSV output header:
'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from', 'Features'
Any help would be appreciated.
Recommended answer
You can extract the title as follows:
item = {}
item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()
For the CSV part, you need to create a custom feed exporter that splits each item into multiple rows:
from itertools import zip_longest

from scrapy.exporters import CsvItemExporter


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        # Write the header row once, before the first item
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        # Make every column a sequence so the columns can be zipped into rows
        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        # Shorter columns are padded with '' so all rows have the same width
        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            self.csv_writer.writerow(row)
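To see what the row-splitting step does in isolation, here is a minimal sketch that uses only the standard library. The helper name split_into_rows and the sample values are hypothetical; the function mirrors the normalisation and zip_longest steps of export_item and does not touch Scrapy's internals.

```python
from itertools import zip_longest


def split_into_rows(values):
    """Turn one item's column values into several CSV rows.

    Scalars become one-element tuples, and a list wrapped in a
    one-element tuple (the result of a trailing comma) is unwrapped.
    """
    normalised = []
    for val in values:
        if isinstance(val, (list, tuple)):
            if len(val) == 1 and isinstance(val[0], (list, tuple)):
                val = val[0]
        else:
            val = (val,)
        normalised.append(val)
    # Pad shorter columns with '' so every row has the same width.
    return [list(row) for row in zip_longest(*normalised, fillvalue='')]


# Hypothetical item: a title, a three-value column, and a wrapped two-value column.
rows = split_into_rows(['Some Movie', ['A', 'B', 'C'], (['X', 'Y'],)])
for row in rows:
    print(row)
```

Each item therefore occupies as many CSV rows as its longest column has values, with the title appearing only on the first of them.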
Then register the exporter in your settings:
FEED_EXPORTERS = {
    'csv': '<yourproject>.exporters.NewLineRowCsvItemExporter',
}
This assumes you put the code in an exporters.py file. The output will then be as desired.
Edit 1
To set the fields and their order, define FEED_EXPORT_FIELDS in your settings.py:
FEED_EXPORT_FIELDS = ['Title', 'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                      'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                      'Features']
https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_FIELDS
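The effect of a fixed field list can be illustrated without Scrapy. The sketch below uses csv.DictWriter from the standard library as a stand-in for the exporter; the FIELDS list is shortened and the two sample items are made up. Because the header comes from a predefined list and not from the first item's keys, categories that only appear in later items still get a column.

```python
import csv
import io

# Shortened, hypothetical stand-in for FEED_EXPORT_FIELDS.
FIELDS = ['Title', 'Follows', 'References', 'Spoofs']

# Made-up items: the second one has keys that the first one lacks.
items = [
    {'Title': 'Movie A', 'References': 'Film 1'},
    {'Title': 'Movie B', 'Follows': 'Film 2', 'Spoofs': 'Film 3'},
]

buffer = io.StringIO()
# restval='' fills in any field that an item does not define.
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval='', lineterminator='\n')
writer.writeheader()
writer.writerows(items)

output = buffer.getvalue()
print(output)
```

If the header were derived from the first item alone, the Follows and Spoofs columns would have no place in the file, which is the symptom described in the question's update.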