How to stream in and manipulate a large data file in python


Question

I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:

Geography AgeGroup Gender Race Count
County1   1        M      1    12
County1   2        M      1    3
County1   2        M      2    0

To:

Geography Count
County1   15
County2   23

This would be a simple matter if the whole file could fit in memory, but using pandas.read_csv() gives a MemoryError. So I have been looking into other methods, and there appear to be many options - HDF5? Using itertools (which seems complicated - generators?)? Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write out before loading in another 70 lines.

Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, particularly because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.
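
Something along these lines is what I had in mind - a rough sketch using only the standard library, where the file names, column names, and delimiter are placeholders for the real file:

import csv
from collections import defaultdict

# Running totals per geography; the file is streamed one row at a time,
# so only this dict has to fit in memory.
totals = defaultdict(int)

with open('my_file.csv', newline='') as f:
    reader = csv.DictReader(f)  # adjust delimiter= if the file is not comma-separated
    for row in reader:
        totals[row['Geography']] += int(row['Count'])

with open('my_output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Geography', 'Count'])
    for geography, count in sorted(totals.items()):
        writer.writerow([geography, count])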

Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add two columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.

Answer

You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:

import dask.dataframe as dd

# Lazily read the CSV; nothing is pulled into memory yet.
df = dd.read_csv('my_file.csv')

# Sum counts per geography; the work runs out-of-core when the result is written.
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv', single_file=True)  # single_file=True writes one output file (dask >= 2.x)

Alternatively, if pandas is a requirement you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.

import pandas as pd

# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data and aggregate once more, since the same
# geography can appear in several chunks.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')
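
To address the edit: the same chunked pattern generalizes to other per-geography functions, as long as the partial results can be combined again at the end (sum, max, min, and count all qualify; mean would need extra care). A hypothetical sketch with the aggregation passed in as a parameter (the function name and defaults are assumptions, not part of the original answer):

import pandas as pd

def aggregate_in_chunks(path, column, how='sum', chunksize=10**5):
    """Stream `path` in chunks and aggregate `column` by Geography.

    `how` must be an aggregation that can be re-applied to partial
    results, e.g. 'sum', 'max', or 'min'.
    """
    parts = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        parts.append(chunk.groupby('Geography', as_index=False)[column].agg(how))

    combined = pd.concat(parts, ignore_index=True)
    return combined.groupby('Geography')[column].agg(how).to_frame()

# For example, the per-geography maximum instead of the sum:
result = aggregate_in_chunks('my_file.csv', 'Count', how='max')
result.to_csv('my_output.csv')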
