Creating list of dictionaries from big csv
Question
I have a very big csv file (10 GB) and I'd like to read it and create a list of dictionaries where each dictionary represents a line in the csv. Something like
```python
[{'value1': '20150302', 'value2': '20150225', 'value3': '5', 'IS_SHOP': '1', 'value4': '0', 'value5': 'GA321D01H-K12'},
 {'value1': '20150302', 'value2': '20150225', 'value3': '1', 'value4': '0', 'value5': '1', 'value6': 'GA321D01H-K12'}]
```
I'm trying to achieve it using a generator in order to avoid any memory issues. My current code is the following:
```python
import csv

def csv_reader():
    with open('export.csv') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield {key: value for key, value in row.items()}

generator = csv_reader()
list = []
for i in generator:
    list.append(i)
```
The problem is that it basically runs out of memory because the list becomes too big, and the process is killed. Is there a way to achieve the same result (a list of dictionaries) in an efficient way? I'm very new to generators/yield, so I don't even know if I'm using them correctly.
I also tried using PyPy in a virtual environment, but it still runs out of memory (a little later, though).
Basically, the reason I want a list of dictionaries is that I want to try to convert the csv into Avro format using fastavro, so any hints on how to use fastavro (https://pypi.python.org/pypi/fastavro) without creating a list of dictionaries would be appreciated.
Answer
If the goal is to convert from csv to avro, there is no reason to store a complete list of the input values. That defeats the whole purpose of using the generator. It looks like, after setting up a schema, fastavro's `writer` is designed to take an iterable and write it out one record at a time, so you can just pass it the generator directly. For example, your code would simply omit the step of creating the `list` (side note: naming a variable `list` is a bad idea, since it shadows/stomps the builtin name `list`) and write from the generator directly:
```python
import csv
from fastavro import writer

def csv_reader():
    with open('export.csv') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

# If this is Python 3.3+, you could simplify the function body further to just:
#     with open('export.csv') as f:
#         yield from csv.DictReader(f)

# schema could be built from the keys of the first row, which gets written
# manually, or you can provide an explicit schema with documentation for
# each field
schema = {...}

with open('export.avro', 'wb') as out:
    writer(out, schema, csv_reader())
```
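For illustration, a concrete schema for rows like the sample in the question might look like the following. This is only a sketch: the record name is invented, and the field names and types are assumptions based on the question's first sample row (every value there is a string, so every field is declared `"string"`):

```python
# Hypothetical Avro record schema for the sample rows in the question.
# The record name "ExportRecord" and the all-string types are assumptions,
# not something the question specifies.
schema = {
    "name": "ExportRecord",
    "type": "record",
    "fields": [
        {"name": "value1", "type": "string"},
        {"name": "value2", "type": "string"},
        {"name": "value3", "type": "string"},
        {"name": "IS_SHOP", "type": "string"},
        {"name": "value4", "type": "string"},
        {"name": "value5", "type": "string"},
    ],
}
```

In practice you would adjust the types (e.g. declare numeric columns as `"int"` or `"long"`) and convert the values in the generator accordingly, since `csv.DictReader` always yields strings.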
The generator then produces one row at a time, and `writer` writes one row at a time. The input rows are discarded after writing, so memory usage remains minimal.
If you need to modify the rows, you'd modify the `row` in the `csv_reader` generator before `yield`-ing it.
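As a sketch of that, the generator below applies a per-row transformation before yielding. It reads from an in-memory CSV so the example is self-contained; the column names and the `value3`-to-int conversion are invented for illustration:

```python
import csv
import io

def csv_reader(f):
    """Yield one dict per CSV row, transforming each row before yielding it."""
    for row in csv.DictReader(f):
        # Example transformation: convert the value3 column to an int.
        row['value3'] = int(row['value3'])
        yield row

# Small in-memory CSV standing in for export.csv
data = io.StringIO("value1,value2,value3\n20150302,20150225,5\n")
rows = list(csv_reader(data))
```

Because the transformation happens inside the generator, each row is modified and written one at a time, so the streaming behavior (and the low memory footprint) is preserved.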