Creating list of dictionaries from big csv


Question

I have a very big CSV file (10 GB) and I'd like to read it and create a list of dictionaries, where each dictionary represents a line in the CSV. Something like:

[{'value1': '20150302', 'value2': '20150225','value3': '5', 'IS_SHOP': '1', 'value4': '0', 'value5': 'GA321D01H-K12'},
{'value1': '20150302', 'value2': '20150225', 'value3': '1', 'value4': '0', 'value5': '1', 'value6': 'GA321D01H-K12'}]

I'm trying to achieve this using a generator in order to avoid any memory issues; my current code is the following:

import csv

def csv_reader():
    with open('export.csv') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield {key: value for key, value in row.items()}

generator = csv_reader()
list = []
for i in generator:
    list.append(i)

The problem is that it basically runs out of memory because the list becomes too big, and the process is killed. Is there a way to achieve the same result (a list of dictionaries) in an efficient way? I'm very new to generators/yield, so I don't even know if I'm using them correctly.

I also tried using a virtual environment with PyPy, but it still runs out of memory (a little later, though).

Basically, the reason I want a list of dictionaries is that I want to try converting the CSV into Avro format using fastavro (https://pypi.python.org/pypi/fastavro), so any hints on how to use fastavro without creating a list of dictionaries would be appreciated.

Answer

If the goal is to convert from CSV to Avro, there is no reason to store a complete list of the input values; that defeats the whole purpose of using a generator. It looks like, after setting up a schema, fastavro's writer is designed to take an iterable and write it out one record at a time, so you can pass it the generator directly. For example, your code would simply omit the step of creating the list (side note: naming a variable list is a bad idea, since it shadows the builtin name list) and write from the generator directly:

import csv
from fastavro import writer

def csv_reader():
    with open('export.csv') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

# On Python 3.3+, the generator body could be simplified to just:
#     with open('export.csv') as f:
#         yield from csv.DictReader(f)

# The schema could be built from the keys of the first row (which gets
# written manually), or you can provide an explicit schema with
# documentation for each field.
schema = {...}

with open('export.avro', 'wb') as out:
    writer(out, schema, csv_reader())

The generator then produces one row at a time, and writer writes one row at a time. The input rows are discarded after writing, so memory usage remains minimal.
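For reference, an explicit Avro record schema for the sample rows above might look like the following sketch. The field names come from the question's example; treating every column as a string is an assumption that matches what csv.DictReader yields (cast in the generator first if you want numeric Avro types):

```python
# A minimal Avro record schema for the sample CSV above.
# "ExportRow" is a made-up record name; every column is declared
# as a string because csv.DictReader produces string values.
schema = {
    "type": "record",
    "name": "ExportRow",
    "fields": [
        {"name": "value1", "type": "string"},
        {"name": "value2", "type": "string"},
        {"name": "value3", "type": "string"},
        {"name": "value4", "type": "string"},
        {"name": "value5", "type": "string"},
    ],
}
```

Passing this dict as the schema argument to fastavro's writer lets it validate and encode each yielded row.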

If you need to modify the rows, you'd modify the row in the csv_reader generator before yield-ing it.
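As a sketch of that, assuming you want to cast one column to an integer (it uses an in-memory CSV via io.StringIO so it runs standalone; the column names are taken from the question's example):

```python
import csv
import io

# Stand-in for open('export.csv'); same shape as the question's data.
sample = io.StringIO("value1,value3\n20150302,5\n20150302,1\n")

def csv_reader(f):
    for row in csv.DictReader(f):
        # Modify the row before yielding it, e.g. cast a numeric
        # column from str to int so Avro can store it as an int.
        row['value3'] = int(row['value3'])
        yield row

rows = list(csv_reader(sample))
# rows is [{'value1': '20150302', 'value3': 5},
#          {'value1': '20150302', 'value3': 1}]
```

The same pattern works for dropping columns, renaming keys to match the Avro schema, or any other per-row transformation, and it keeps the one-row-at-a-time memory profile intact.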
