Memory leak in Pandas.groupby.apply()?


Question


I'm currently using Pandas for a project with csv source files of around 600mb. During the analysis I read the csv into a dataframe, group on some column, and apply a simple function to the grouped dataframe. I noticed that I was going into swap memory during this process, so I carried out a basic test:
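For context, that workflow looks roughly like this (the file name, grouping column, and per-group function here are placeholders for illustration, not the actual ones):

import pandas as pd

# placeholder file name, column, and per-group function
df = pd.read_csv('source.csv')                 # ~600mb csv source
grouped = df.groupby('some_column')            # group on some column
result = grouped.apply(lambda g: g.head(1))    # simple function applied per group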


I first created a fairly large dataframe in the shell:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3000000, 3),index=range(3000000),columns=['a', 'b', 'c'])


I defined a pointless function called do_nothing():

def do_nothing(group):
    return group

and ran the following command:

df = df.groupby('a').apply(do_nothing)


My system has 16gb of RAM and is running Debian (Mint). After creating the dataframe I was using ~600mb of RAM. As soon as the apply method began to execute, that value started to soar. It steadily climbed to around 7gb(!) before the command finished and settled back down to 5.4gb (while the shell was still active). The problem is that my real work does more than the 'do_nothing' method, so when executing the real program I hit my 16gb RAM cap and start swapping, which makes the program unusable. Is this intended? I can't see why Pandas should need 7gb of RAM to effectively 'do nothing', even if it has to store the grouped object.
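For what it's worth, one way to put a number on that peak (rather than watching swap) is the memory_profiler package. A rough sketch, assuming it is installed (pip install memory_profiler); the exact return form of memory_usage varies by version:

import numpy as np
import pandas as pd
from memory_profiler import memory_usage

df = pd.DataFrame(np.random.randn(3000000, 3), columns=['a', 'b', 'c'])

def run():
    # the step under suspicion
    return df.groupby('a').apply(lambda g: g)

# peak memory of this process while run() executes, in MiB
# (returned as a float or a one-element list depending on memory_profiler version)
peak = memory_usage((run, (), {}), max_usage=True)
print(peak)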


Any ideas on what's causing this/how to fix it?

Cheers,

.P

Answer


Using 0.14.1, I don't think there is a memory leak (1/3 size of your frame).

In [79]: df = DataFrame(np.random.randn(100000,3))

In [77]: %memit -r 3 df.groupby(df.index).apply(lambda x: x)
maximum of 3: 1365.652344 MB per loop

In [78]: %memit -r 10 df.groupby(df.index).apply(lambda x: x)
maximum of 10: 1365.683594 MB per loop
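(For anyone reproducing this: %memit is not built into IPython; it comes from the memory_profiler extension, which has to be loaded in the session first.)

%load_ext memory_profiler
%memit -r 3 df.groupby(df.index).apply(lambda x: x)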


Two general comments on how to approach a problem like this:


1) Use the cython-level functions if at all possible; they will be MUCH faster and will use much less memory. In other words, it is almost always worth decomposing a groupby expression and avoiding a user-defined function (if possible; some things are just too complicated, but that's the point, you want to break things down). e.g.

Instead of:

df.groupby(...).apply(lambda x: x.sum() / x.mean())

it is better to do:

g = df.groupby(...)
g.sum() / g.mean()
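To make the difference concrete, here is a small self-contained sketch (the frame and column names are illustrative, not from the question); both paths give the same per-group result:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.random.randint(0, 100, 1000000),
                   'val': np.random.randn(1000000)})

# apply: a Python-level function is called once per group
res_apply = df.groupby('key').apply(lambda x: x['val'].sum() / x['val'].mean())

# cython-level aggregations, combined afterwards: same result, far less overhead
g = df.groupby('key')['val']
res_fast = g.sum() / g.mean()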


2) You can easily 'control' the groupby by doing your aggregation manually (additionally this will allow periodic output and garbage collection if needed).

import gc

results = []
for i, (g, grp) in enumerate(df.groupby(...)):

    # periodic progress output and garbage collection
    if i % 500 == 0:
        print("checkpoint: %s" % i)
        gc.collect()

    results.append(func(g, grp))

# final result
pd.concat(results)
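A runnable version of that pattern, with a hypothetical per-group function standing in for func and illustrative data:

import gc
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.random.randint(0, 1000, 1000000),
                   'val': np.random.randn(1000000)})

def func(name, grp):
    # hypothetical aggregation: one small row per group
    return pd.DataFrame({'key': [name], 'total': [grp['val'].sum()]})

results = []
for i, (name, grp) in enumerate(df.groupby('key')):
    if i % 500 == 0:
        print("checkpoint: %s" % i)
        gc.collect()
    results.append(func(name, grp))

result = pd.concat(results, ignore_index=True)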
