如何以正确的格式将抓取的数据导出到 csv 文件? [英] How can i export scraped data to csv file in the right format?

查看:25
本文介绍了如何以正确的格式将抓取的数据导出到 csv 文件?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我根据@paultrmbrth 的

但是我的代码的 csv 输出有点乱,像这样:

我有两个问题,无论如何,csv输出可以像第一张图片一样吗?我的第二个问题是,我也希望电影标题也被废弃,请给我一个提示或提供给我一个代码,我可以用它来抓取电影标题和内容.

更新
这个问题已经被 Tarun Lalwani 完美的解决了.但是现在,csv 文件的标题仅包含第一个抓取的 url 类别.例如,当我尝试抓取

编辑-1

要设置字段及其顺序,您需要在 settings.py

中定义 FEED_EXPORT_FIELDS

FEED_EXPORT_FIELDS = ['Title', 'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in','精选','欺骗','参考','欺骗','版本','重制为','编辑自','特征']

https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_FIELDS

I made an improvement to my code according to

But my code's csv output is little messy, like this:

I have two questions, Is there anyway that the csv output can be like the first picture? and my second question is, i want the movie tittle to be scrapped too, Please give me a hint or provide to me a code that i can use to scrape the movie title and the contents.

UPDATE
The problem has been solved by Tarun Lalwani perfectly. But Now, the csv File's Header only contains the first scraped url categories. for example when i try to scrape this webpage which has References, Referenced in, Features, Featured in and Spoofed in categories and this webpage which has Follows, Followed by, Edited from, Edited into, Spin-off, References, Referenced in, Features, Featured in, Spoofs and Spoofed in categories then the csv output file header will only contain the first webpage's categories i.e References, Referenced in, Features, Featured in and Spoofed in so some categories from the 2nd webpage like Follows, Followed by, Edited from, Edited into and Spoofswill not be on the output csv file header so is its contents.
Here is the code i used:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["imdb.com"]
    start_urls = (
        'http://www.imdb.com/title/tt0093777/trivia?tab=mc&ref_=tt_trv_cnn',
        'http://www.imdb.com/title/tt0096874/trivia?tab=mc&ref_=tt_trv_cnn',
    )

    def parse(self, response):
        item = {}
        for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
            item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()
            key = h4.xpath('normalize-space()').get().strip()
            if key in ['Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                       'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                       'Features']:
                values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]', cnt=cnt).xpath(
                    'string(.//a)').getall(),
                item[key] = values
        yield item

and here is exporters.py file:

try:
    from itertools import zip_longest as zip_longest
except:
    from itertools import izip_longest as zip_longest
from scrapy.exporters import CsvItemExporter
from scrapy.conf import settings


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            self.csv_writer.writerow([unicode(s).encode("utf-8") for s in row])

What I'm trying to achieve is i want all these categories to be on the csv output header.

'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from', 'Features'   

Any help would be appreciated.

解决方案

You can extract the title using below

item = {}
item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()

For the CSV part you would need to create a FeedExports which can split each row into multiple rows

from itertools import zip_longest
from scrapy.contrib.exporter import CsvItemExporter


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            self.csv_writer.writerow(row)

Then you need to assign the feed exporter in your settings

FEED_EXPORTERS = {
    'csv': '<yourproject>.exporters.NewLineRowCsvItemExporter',
}

Assuming you put the code in exporters.py file. The output will be as desired

Edit-1

To set the fields and their order you will need to define FEED_EXPORT_FIELDS in your settings.py

FEED_EXPORT_FIELDS = ['Title', 'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                       'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                       'Features']

https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_FIELDS

这篇关于如何以正确的格式将抓取的数据导出到 csv 文件?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆