How can I export scraped data to a CSV file in the right format?
I made an improvement to my code according to @paultrmbrth's suggestion.
But my code's CSV output is a little messy, like this:
I have two questions. First, is there any way the CSV output can look like the first picture? Second, I want the movie title to be scraped too. Please give me a hint or some code that I can use to scrape the movie title along with the contents.
UPDATE
Tarun Lalwani solved the problem perfectly. But now the CSV file's header contains only the categories of the first scraped URL. For example, suppose I scrape this webpage, which has the categories References, Referenced in, Features, Featured in and Spoofed in,
and this webpage, which has the categories Follows, Followed by, Edited from, Edited into, Spin-off, References, Referenced in, Features, Featured in, Spoofs and Spoofed in.
The header of the output CSV file then contains only the first webpage's categories, i.e. References, Referenced in, Features, Featured in and Spoofed in.
As a result, categories that appear only on the second webpage, such as Follows, Followed by, Edited from, Edited into and Spoofs,
are missing from the CSV header, and so are their contents.
Here is the code I used:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["imdb.com"]
    start_urls = (
        'http://www.imdb.com/title/tt0093777/trivia?tab=mc&ref_=tt_trv_cnn',
        'http://www.imdb.com/title/tt0096874/trivia?tab=mc&ref_=tt_trv_cnn',
    )

    def parse(self, response):
        item = {}
        for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
            item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()
            # Each h4 holds a category name such as "Follows" or "References".
            key = h4.xpath('normalize-space()').get().strip()
            if key in ['Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                       'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                       'Features']:
                # The trailing comma wraps the list in a one-element tuple;
                # the exporter below unwraps it again.
                values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]', cnt=cnt).xpath(
                    'string(.//a)').getall(),
                item[key] = values
        yield item
And here is the exporters.py file:
try:
    from itertools import zip_longest as zip_longest
except ImportError:
    # Python 2 fallback
    from itertools import izip_longest as zip_longest

from scrapy.exporters import CsvItemExporter
from scrapy.conf import settings  # unused here; scrapy.conf is deprecated in newer Scrapy releases


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        # The header is written once, based on the first exported item.
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        # Unwrap one-element tuples that hold a list; wrap scalars in a tuple.
        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            # unicode() exists only on Python 2.
            self.csv_writer.writerow([unicode(s).encode("utf-8") for s in row])
What I'm trying to achieve is to have all of these categories in the CSV output header:
'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from', 'Features'
Any help would be appreciated.
You can extract the title using the code below:
item = {}
item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()
For the CSV part, you would need to create a custom feed exporter that splits each item into multiple rows:
from itertools import zip_longest

# scrapy.contrib.exporter is the old import path; scrapy.exporters replaces it.
from scrapy.exporters import CsvItemExporter


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        # Write the header once, before the first item.
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        # Make every column a sequence: unwrap one-element tuples that hold
        # a list, and wrap scalar values in a one-element tuple.
        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        # Transpose the columns into rows, padding short columns with ''.
        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            self.csv_writer.writerow(row)
Then you need to assign the feed exporter in your settings:
FEED_EXPORTERS = {
    'csv': '<yourproject>.exporters.NewLineRowCsvItemExporter',
}
This assumes you put the code in an exporters.py file. The output will then be as desired.
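To make the row-splitting step more concrete, here is a minimal standalone sketch of the same zip_longest idea using only the standard library, outside Scrapy. The helper name item_to_rows and the sample data are hypothetical and made up for illustration; this is not the exporter itself, only the transposition it relies on.

```python
import csv
import io
from itertools import zip_longest


def item_to_rows(item, fields):
    """Spread an item's list values over several rows, one value per cell.

    Hypothetical helper for illustration only; it mirrors the idea used in
    the exporter above, not Scrapy's own API.
    """
    columns = []
    for name in fields:
        value = item.get(name, '')
        # Wrap scalars so every column is a sequence zip_longest can walk.
        if not isinstance(value, (list, tuple)):
            value = (value,)
        columns.append(value)
    # Shorter columns are padded with empty strings.
    return [list(row) for row in zip_longest(*columns, fillvalue='')]


# Made-up sample data, not real scraped output.
fields = ['Title', 'Follows', 'References']
item = {
    'Title': 'Movie A',
    'Follows': ['Movie B'],
    'References': ['Movie C', 'Movie D'],
}

rows = item_to_rows(item, fields)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(fields)
writer.writerows(rows)
print(buffer.getvalue())
```

With this sample, the title appears only in the first row, and the second row carries the remaining value of the longest column.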
Edit-1
To set the fields and their order, you will need to define FEED_EXPORT_FIELDS in your settings.py:
FEED_EXPORT_FIELDS = ['Title', 'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                      'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                      'Features']
https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_FIELDS
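The effect of a fixed field list can be illustrated outside Scrapy with the standard library's csv.DictWriter. The sketch below is only an analogy under stated assumptions: EXPORT_FIELDS plays the role of FEED_EXPORT_FIELDS, write_csv is a hypothetical helper, and the items are made-up data. Because the columns are declared up front, the header no longer depends on which item is exported first, and keys missing from an item become empty cells.

```python
import csv
import io

# Made-up items standing in for two scraped pages with different categories.
items = [
    {'Title': 'Movie A', 'References': 'Movie C'},
    {'Title': 'Movie B', 'Follows': 'Movie A', 'Spoofs': 'Movie D'},
]

# Hypothetical fixed column list, playing the role of FEED_EXPORT_FIELDS.
EXPORT_FIELDS = ['Title', 'Follows', 'References', 'Spoofs']


def write_csv(items, fieldnames):
    """Write items under a fixed header; missing keys become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='',
                            lineterminator='\n')
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


output = write_csv(items, EXPORT_FIELDS)
print(output)
```

Both items end up under the same four-column header, even though neither of them has every category.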