How to stream in and manipulate a large data file in python
Question
I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:
Geography AgeGroup Gender Race Count
County1 1 M 1 12
County1 2 M 1 3
County1 2 M 2 0
To:
Geography Count
County1 15
County2 23
This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives a MemoryError. So I have been looking into other methods, and there appear to be many options: HDF5? itertools (which seems complicated; generators?)? Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write it out before loading in another 70 lines.
Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming the data in, particularly because I can think of many other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.
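As a concrete sketch of that basic-streaming idea (an illustration added here, not from the original post): assuming the file is tab-delimited and the rows for each geography are contiguous, as in the sample above, itertools.groupby can aggregate one geography at a time while holding only a handful of rows in memory:

import csv
from itertools import groupby
from operator import itemgetter

with open('my_file.csv', newline='') as fin, \
     open('my_output.csv', 'w', newline='') as fout:
    reader = csv.reader(fin, delimiter='\t')
    writer = csv.writer(fout, delimiter='\t')
    next(reader)  # skip the header row
    writer.writerow(['Geography', 'Count'])
    # groupby yields consecutive runs of rows sharing the same Geography,
    # so only one geography's rows are materialized at a time.
    for geo, rows in groupby(reader, key=itemgetter(0)):
        writer.writerow([geo, sum(int(row[-1]) for row in rows)])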
EDIT: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add two columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.
Answer
You can use dask.dataframe, which is syntactically similar to pandas but performs manipulations out-of-core, so memory shouldn't be an issue:
import dask.dataframe as dd

# Reads lazily; dask only pulls chunks of the file into memory as needed.
df = dd.read_csv('my_file.csv')

# Sum Count within each Geography; nothing is computed yet.
df = df.groupby('Geography')['Count'].sum().to_frame()

# Writing the output triggers the out-of-core computation. single_file=True
# (available in recent dask versions) writes one CSV rather than one file
# per partition.
df.to_csv('my_output.csv', single_file=True)
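As a usage note (an addition, not part of the original answer): because dask evaluates lazily, the aggregation can also be pulled into an ordinary pandas DataFrame with .compute(), assuming the summed result fits in memory:

# Materialize only the small aggregated result in memory.
result = df.compute()
print(result.head())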
Alternatively, if pandas is a requirement, you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.
import pandas as pd

# Operate on chunks: aggregate each chunk down to one row per Geography
# before keeping it, so only the reduced pieces accumulate in memory.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data. A second groupby is needed because one
# geography's rows may straddle a chunk boundary.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
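This chunked pattern also generalizes to the arbitrary-function case raised in the question's edit. A hypothetical sketch (process_chunks and its names are illustrative, not an established API): apply any per-chunk function and append each result to the output before the next chunk is read:

import pandas as pd

def process_chunks(in_path, out_path, func, chunksize=10**5):
    """Apply func to each chunk and append the result to out_path."""
    first = True
    for chunk in pd.read_csv(in_path, chunksize=chunksize):
        result = func(chunk)
        # Write the header only once, then append.
        result.to_csv(out_path, mode='w' if first else 'a',
                      header=first, index=False)
        first = False

# Example: per-chunk max of Count by Geography. As with the sums above,
# a second pass would be needed to combine results across chunks.
process_chunks('my_file.csv', 'my_output.csv',
               lambda df: df.groupby('Geography', as_index=False)['Count'].max())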