How can I export scraped data to a CSV file in the right format?


Problem description

I made an improvement to my code according to a suggestion from @paultrmbrth.

But my code's CSV output is a little messy, like this:

I have two questions. Is there any way that the CSV output can look like the first picture? And my second question is: I want the movie title to be scraped too. Please give me a hint, or provide code I can use to scrape the movie title along with the contents.

UPDATE
The problem was solved perfectly by Tarun Lalwani. But now the CSV file's header only contains the categories of the first scraped URL. For example, when I try to scrape this webpage, which has the References, Referenced in, Features, Featured in and Spoofed in categories, and this webpage, which has the Follows, Followed by, Edited from, Edited into, Spin-off, References, Referenced in, Features, Featured in, Spoofs and Spoofed in categories, the CSV output file header will only contain the first webpage's categories, i.e. References, Referenced in, Features, Featured in and Spoofed in. So some categories from the second webpage, like Follows, Followed by, Edited from, Edited into and Spoofs, will not be on the output CSV file header, and neither will their contents.
Here is the code I used:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["imdb.com"]
    start_urls = (
        'http://www.imdb.com/title/tt0093777/trivia?tab=mc&ref_=tt_trv_cnn',
        'http://www.imdb.com/title/tt0096874/trivia?tab=mc&ref_=tt_trv_cnn',
    )

    def parse(self, response):
        item = {}
        # The title is the same for every category on the page, so set it once.
        item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()
        for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
            key = h4.xpath('normalize-space()').get().strip()
            if key in ['Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                       'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                       'Features']:
                # Collect the link texts from the div that follows this h4.
                # Note: no trailing comma here -- a trailing comma would wrap
                # the list in a one-element tuple.
                values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]', cnt=cnt).xpath(
                    'string(.//a)').getall()
                item[key] = values
        yield item

Here is the exporters.py file:

try:
    # Python 3
    from itertools import zip_longest
except ImportError:
    # Python 2
    from itertools import izip_longest as zip_longest
from scrapy.exporters import CsvItemExporter


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            # On Python 2 this was: unicode(s).encode("utf-8")
            self.csv_writer.writerow([str(s) for s in row])

What I'm trying to achieve is that I want all of these categories to be on the CSV output header:

'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from', 'Features'   

Any help would be appreciated.

Answer

You can extract the title using the below:

item = {}
item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()

For the CSV part, you would need to create a feed exporter which can split each row into multiple rows:

from itertools import zip_longest
from scrapy.exporters import CsvItemExporter


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            self.csv_writer.writerow(row)

Then you need to assign the feed exporter in your settings:

FEED_EXPORTERS = {
    'csv': '<yourproject>.exporters.NewLineRowCsvItemExporter',
}

Assuming you put the code in an exporters.py file, the output will be as desired.
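The heart of the exporter is itertools.zip_longest, which transposes each field's list of values into CSV rows, padding the shorter lists with empty strings. A minimal standalone sketch of that step (the titles and categories here are made up for illustration):

```python
from itertools import zip_longest

# One scraped item: each field holds a single value or a list of values.
values = [
    ("The Terminator",),                      # Title: single value
    ["Alien", "Mad Max 2"],                   # References
    ["Wayne's World", "Last Action Hero"],    # Referenced in
]

# Transpose the columns into rows, padding the shorter lists with ''.
rows = list(zip_longest(*values, fillvalue=''))
# rows[0] -> ("The Terminator", 'Alien', "Wayne's World")
# rows[1] -> ('', 'Mad Max 2', 'Last Action Hero')
```

This is why each category's values appear on consecutive lines under one header row instead of being joined into a single cell.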

Edit 1

To set the fields and their order, you will need to define FEED_EXPORT_FIELDS in your settings.py:

FEED_EXPORT_FIELDS = ['Title', 'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                       'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                       'Features']

https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_FIELDS
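The reason FEED_EXPORT_FIELDS fixes the missing-header problem is that the exporter then writes every row against one fixed field list instead of inferring the columns from the first item it sees. The same idea can be illustrated with the standard library's csv.DictWriter, using restval='' for categories a given page does not have (the titles and categories below are invented for the sketch):

```python
import csv
import io

# Fixed header: every column appears even if some items lack that key.
fieldnames = ['Title', 'Follows', 'References']
items = [
    {'Title': 'Movie A', 'References': 'Alien'},   # no 'Follows' key
    {'Title': 'Movie B', 'Follows': 'Movie A'},    # no 'References' key
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(items)
print(buf.getvalue())
```

Missing categories simply come out as empty cells, so both pages' categories survive in one consistent header.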

