Pandas GroupBy memory deallocation

Question

I noticed that memory allocated while iterating through a Pandas GroupBy object is not deallocated after iteration. I use resource.getrusage(resource.RUSAGE_SELF).ru_maxrss (second answer in this post for details) to measure the total amount of active memory used by the Python process.

import gc
import resource

import numpy as np
import pandas as pd

# 4000 rows labelled with 100 repeating index values, 20,000 columns of ones
labels = np.random.choice(list(range(100)), 4000)
cols = list(range(int(2e4)))

df = pd.DataFrame(1, index=labels, columns=cols)

gb = df.groupby(level=0)
# gb = list(gb)
for _ in range(3):
    # ru_maxrss is in kilobytes on Linux, so dividing by 1e6 gives roughly GB
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
    for idx, x in enumerate(gb):
        if idx == 0:
            print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)
    # del idx, x
    # gc.collect()

This prints the following total active memory (in GB):

0.671732
1.297424
1.297952
1.923288
1.923288
2.548624
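
The measurement used in the question can be wrapped in a small helper. This is only a sketch: the name peak_rss_gb is hypothetical, and it assumes a Unix-like system where the resource module is available. Note that ru_maxrss reports the peak resident set size, in kilobytes on Linux and in bytes on macOS, so the value can show that the peak keeps climbing but can never go back down after memory is freed.

```python
import resource
import sys


def peak_rss_gb() -> float:
    """Return this process's peak resident set size in gigabytes (hypothetical helper)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    return peak * bytes_per_unit / 1e9


before = peak_rss_gb()
buffer = bytearray(50 * 1024 * 1024)  # request roughly 50 MB
after_alloc = peak_rss_gb()
del buffer
after_free = peak_rss_gb()  # a peak value does not decrease after freeing
```

Because the figure is a high-water mark, flat readings across passes mean that later passes never needed more memory than earlier ones, which is the pattern the fixes in this thread aim for.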

Solutions

Uncommenting del idx, x and gc.collect() fixes the problem. I do, however, have to del every variable that references the DataFrames returned by iterating over the groupby (which can be a pain, depending on the code in the inner for loop). The printed memory usage becomes:

0.671768
1.297412
1.297992
1.297992
1.297992
1.297992
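
One reason del can matter is that Python for-loop variables stay bound after the loop ends, so the last object yielded remains referenced. The sketch below illustrates this with plain Python objects instead of DataFrames; Chunk and make_groups are hypothetical stand-ins, and the immediate release after del relies on CPython's reference counting.

```python
import gc
import weakref


class Chunk:
    """Hypothetical stand-in for a per-group DataFrame."""

    def __init__(self, key):
        self.key = key


def make_groups(n):
    """Yield (key, chunk) pairs lazily, similar in shape to iterating a GroupBy."""
    for key in range(n):
        yield key, Chunk(key)


last_ref = None
for key, chunk in make_groups(3):
    last_ref = weakref.ref(chunk)  # track the object without keeping it alive

# The loop has finished, but the name `chunk` still references the last object
alive_after_loop = last_ref() is not None

del key, chunk  # drop the names that outlived the loop
gc.collect()    # also clear anything reachable only through reference cycles
alive_after_del = last_ref() is not None
```

In this sketch alive_after_loop is True and alive_after_del is False. Whether pandas holds additional internal references to the group frames is not something this example shows.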

Alternatively, I can uncomment gb = list(gb). The resulting memory usage is roughly the same as with the previous solution:

1.32874
1.32874
1.32874
1.32874
1.32874
1.32874
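
A small sketch of what list(gb) does, using a tiny made-up frame rather than the one from the question: the (key, group) pairs are built once and then reused on every pass. Presumably this is why the readings stay flat, at the cost of holding every group in memory at the same time.

```python
import pandas as pd

# Small hypothetical frame: six rows sharing three index labels
df = pd.DataFrame({"value": range(6)}, index=[0, 1, 0, 2, 1, 2])

# Materialize every (key, group) pair once; all groups now live in memory together
groups = list(df.groupby(level=0))

first_pass = [group for _, group in groups]
second_pass = [group for _, group in groups]

# Both passes see the very same DataFrame objects, so no new groups are built
same_objects = all(a is b for a, b in zip(first_pass, second_pass))
```

This trades a fixed, up-front memory cost for predictable behaviour on repeated passes, which may or may not be acceptable for very large frames.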

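Questions

  1. Why is the memory for the DataFrames produced by iterating over the groupby not deallocated once iteration completes?
  2. Is there a better solution than the two above? If not, which of the two is "better"?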

Recommended answer

Memory Weirdness

This is very interesting! You do not need del idx, x. Using only gc.collect() was enough to keep memory constant for me. This is much cleaner than having the del statements inside the loop.
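
One possible explanation, which is an assumption and not something the answer states, is that the objects are kept alive by reference cycles. Objects in a cycle are not released by reference counting alone; they are freed only when the cycle collector runs, either on its own schedule or through an explicit gc.collect(). The sketch below shows that behaviour with a hypothetical Node class rather than pandas objects.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can point at another Node."""

    def __init__(self):
        self.other = None


gc.disable()  # keep the automatic collector from running during the demo
a, b = Node(), Node()
a.other, b.other = b, a  # a and b now reference each other (a cycle)
probe = weakref.ref(a)

del a, b
alive_before_collect = probe() is not None  # the cycle keeps both objects alive

gc.collect()
alive_after_collect = probe() is not None  # the cycle collector has freed them
gc.enable()
```

Here alive_before_collect is True and alive_after_collect is False, which matches the observation that gc.collect() on its own can be enough.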
