How to stream in and manipulate a large data file in python


Question

I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:

Geography AgeGroup Gender Race Count
County1   1        M      1    12
County1   2        M      1    3
County1   2        M      2    0

To:

Geography Count
County1   15
County2   23

This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives a MemoryError. So I have been looking into other methods, and there appear to be many options - HDF5? Using itertools (which seems complicated - generators?)? Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write out before loading in another 70 lines.

Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, particularly because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.
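
Something along these lines is what I had in mind - a rough sketch using only the standard library, where the file names, column names, and delimiter are placeholders for the real file:

import csv
from collections import defaultdict

# Running totals per geography; the file is streamed one row at a time,
# so only this dict has to fit in memory.
totals = defaultdict(int)

with open('my_file.csv', newline='') as f:
    reader = csv.DictReader(f)  # adjust delimiter= if the file is not comma-separated
    for row in reader:
        totals[row['Geography']] += int(row['Count'])

with open('my_output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Geography', 'Count'])
    for geography, count in sorted(totals.items()):
        writer.writerow([geography, count])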

Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add two columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.

Answer

You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:

import dask.dataframe as dd

# Lazily read the CSV; nothing is pulled into memory yet.
df = dd.read_csv('my_file.csv')

# Sum counts per geography; the work runs out-of-core when the result is written.
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv', single_file=True)  # single_file=True writes one output file (dask >= 2.x)

Alternatively, if pandas is a requirement you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.

import pandas as pd

# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data and aggregate once more, since the same
# geography can appear in several chunks.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
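
To address the edit: the same chunked pattern generalizes to other per-geography functions, as long as the partial results can be combined again at the end (sum, max, min, and count all qualify; mean would need extra care). A hypothetical sketch with the aggregation passed in as a parameter (the function name and defaults are assumptions, not part of the original answer):

import pandas as pd

def aggregate_in_chunks(path, column, how='sum', chunksize=10**5):
    """Stream `path` in chunks and aggregate `column` by Geography.

    `how` must be an aggregation that can be re-applied to partial
    results, e.g. 'sum', 'max', or 'min'.
    """
    parts = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        parts.append(chunk.groupby('Geography', as_index=False)[column].agg(how))

    combined = pd.concat(parts, ignore_index=True)
    return combined.groupby('Geography')[column].agg(how).to_frame()

# For example, the per-geography maximum instead of the sum:
result = aggregate_in_chunks('my_file.csv', 'Count', how='max')
result.to_csv('my_output.csv')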
