Chunking, processing & merging dataset in Pandas/Python


Problem Description

There is a large dataset containing strings. I just want to open it via read_fwf using widths, like this:

widths = [3, 7, ..., 9, 7]
tp = pandas.read_fwf(file, widths=widths, header=None)

It would help me to mark the data, but the system crashes (it works with nrows=20000). So I decided to do it in chunks (e.g. 20000 rows), like this:

cs = 20000
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # <some code using chunk>

My question is: what should I do in the loop to merge (concatenate?) the chunks back into a .csv file after some processing of each chunk (marking rows, dropping or modifying columns)? Or is there another way?

Recommended Answer

I'm going to assume that, since reading the entire file

tp = pandas.read_fwf(file, widths=widths, header=None)

fails but reading it in chunks works, the file is too big to be read at once and you encountered a MemoryError.

In that case, if you can process the data in chunks and then concatenate the results into a CSV, you can use chunk.to_csv to write the CSV in chunks:

filename = ...
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # process the chunk
    chunk.to_csv(filename, mode='a')

Note that mode='a' opens the file in append mode, so the output of each chunk.to_csv call is appended to the same file.
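
As a fuller sketch (not from the original answer), assuming a hypothetical input path data.txt, a hypothetical output path output.csv, and example widths: writing the header only for the first chunk and skipping the index keeps the appended pieces consistent, so the resulting CSV reads back as a single table.

import pandas as pd

file = 'data.txt'        # hypothetical path to the fixed-width input file
widths = [3, 7, 9, 7]    # hypothetical column widths; replace with your real ones
cs = 20000               # chunk size small enough to fit in memory

first = True
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # process the chunk here: mark rows, drop or modify columns, etc.
    chunk.to_csv('output.csv', mode='a', header=first, index=False)
    first = False

With header=None on the read side, pandas assigns integer column names, so the header written for the first chunk is just those integers; pass names= to read_fwf if you want meaningful column names in the output.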

