pandas data frame - select rows and clear memory?

Question

I have a large pandas dataframe (size = 3 GB):

import pandas as pd

x = pd.read_table('big_table.txt', sep='\t', header=0, index_col=0)

Because I'm working under memory constraints, I subset the dataframe:

rows = calculate_rows() # a function that calculates what rows I need
cols = calculate_cols() # a function that calculates what cols I need
x = x.iloc[rows, cols]

The functions that calculate the rows and columns are not important, but they definitely return a smaller subset of the original rows and columns. However, when I do this operation, memory usage increases a lot. The goal was to shrink the memory footprint to less than 3 GB, but instead memory usage goes well over 6 GB.

I'm guessing this is because Python creates a local copy of the dataframe in memory but doesn't clean it up. There may also be other things happening. So my question is: how do I subset a large dataframe and free the space afterwards? I can't find a function that selects rows/cols in place.
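The copy the question suspects can be made visible on a small scale. The following is only a sketch under assumptions: the frame is a random stand-in for the real table, and `rows` and `cols` are made-up selections in place of `calculate_rows()` and `calculate_cols()`. It shows that the subset is a separate, smaller object and that both objects are alive until the original is released. It may be only part of the explanation for the 6 GB peak.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the large table: 100,000 rows x 20 float columns.
x = pd.DataFrame(np.random.rand(100_000, 20))

rows = np.arange(0, 100_000, 2)  # assumed selection: every other row
cols = np.arange(0, 20, 2)       # assumed selection: every other column

full_bytes = x.memory_usage(deep=True).sum()

# Selecting with integer arrays builds a new frame holding copied data.
subset = x.iloc[rows, cols]
subset_bytes = subset.memory_usage(deep=True).sum()

# Until `x` is rebound or deleted, both frames exist at the same time.
print(full_bytes, subset_bytes, full_bytes + subset_bytes)
```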

I have read a lot of Stack Overflow but can't find much on this topic. It could be that I'm not using the right keywords, so suggestions on that would also help. Thanks!

Recommended answer

You are much better off doing something like this:

Pass usecols to read_csv so that only the columns you want are read in the first place.

Then read the file in chunks. From each chunk, select the rows you want and set them aside, and finally concatenate the results.

Pseudo-code-ish:

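# the_columns_i_want_to_use and rows_that_I_want_ are placeholders.
reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     index_col=0, usecols=the_columns_i_want_to_use,
                     chunksize=10000)

# iloc positions here are relative to each chunk, not to the whole file.
df = pd.concat([chunk.iloc[rows_that_I_want_] for chunk in reader])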
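One way to turn the pseudo-code into something runnable is sketched below. It assumes the wanted rows are given as 0-based positions in the whole file (as with the `iloc` call in the question), so each chunk needs those positions shifted by a running offset. The function name `read_subset`, its parameters, and the tiny in-memory table are all hypothetical and only serve the illustration.

```python
import io

import numpy as np
import pandas as pd


def read_subset(source, wanted_rows, wanted_cols, chunksize=10000):
    """Read only the wanted rows and columns of a tab-separated file.

    wanted_rows: 0-based positions of data rows in the whole file.
    wanted_cols: column names to keep; include the file's first column,
                 which becomes the index.
    Raises ValueError (from pd.concat) if no row is selected.
    """
    wanted_rows = np.asarray(sorted(wanted_rows))
    reader = pd.read_csv(source, sep='\t', header=0, index_col=0,
                         usecols=wanted_cols, chunksize=chunksize)
    parts = []
    offset = 0  # whole-file position of the current chunk's first row
    for chunk in reader:
        # Find the wanted positions that fall inside this chunk ...
        lo = np.searchsorted(wanted_rows, offset)
        hi = np.searchsorted(wanted_rows, offset + len(chunk))
        # ... and convert them to positions relative to the chunk.
        local = wanted_rows[lo:hi] - offset
        if len(local):
            parts.append(chunk.iloc[local])
        offset += len(chunk)
    return pd.concat(parts)


# Usage on a small made-up table with 25 rows, read 10 rows at a time.
text = "id\ta\tb\tc\n" + "".join(
    f"r{i}\t{i}\t{i * 2}\t{i * 3}\n" for i in range(25))
df = read_subset(io.StringIO(text), [3, 12, 24], ['id', 'a', 'c'],
                 chunksize=10)
```

Only one chunk plus the rows kept so far are held in memory while the file is being read.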

This will have a constant memory usage (the size of a chunk) plus about twice the memory of the selected rows, which happens at the moment you concat them. After the concat, usage drops to that of the selected rows alone.
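The "x 2" figure can be checked with a rough estimate. The sketch below assumes made-up pieces of random data in place of the rows kept from each chunk, and it counts only column data (not the index). At the moment of the concat, the list of pieces and the combined frame both exist.

```python
import numpy as np
import pandas as pd

# Hypothetical selected pieces, standing in for the rows kept from each chunk.
parts = [pd.DataFrame(np.random.rand(1_000, 5), columns=list("abcde"))
         for _ in range(10)]

parts_bytes = sum(p.memory_usage(index=False).sum() for p in parts)

df = pd.concat(parts)                   # parts and df both exist at this point
df_bytes = df.memory_usage(index=False).sum()
peak_estimate = parts_bytes + df_bytes  # roughly twice the selected rows

del parts                               # drop the pieces so only df remains
```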
