Pandas: where's the memory leak here?
Question
I'm facing a memory leak problem using the pandas library in Python. I create pandas.DataFrame objects in my class, and I have a method that changes the DataFrame's size according to my conditions. After changing the size and creating the new pandas object, I overwrite the original pandas.DataFrame in my class. But memory usage stays very high even after significantly reducing the initial table. Some code for a short example (I didn't write a process manager; see the task manager):
import time, string, pandas, numpy, gc

class temp_class():
    def __init__(self, nrow=1000000, ncol=4, timetest=5):
        self.nrow = nrow
        self.ncol = ncol
        self.timetest = timetest

    def createDataFrame(self):
        print('Check memory before dataframe creating')
        time.sleep(self.timetest)
        self.df = pandas.DataFrame(numpy.random.randn(self.nrow, self.ncol),
                                   index=numpy.random.randn(self.nrow),
                                   columns=list(string.letters[0:self.ncol]))
        print('Check memory after dataFrame creating')
        time.sleep(self.timetest)

    def changeSize(self, from_=0, to_=100):
        df_new = self.df[from_:to_].copy()
        print('Check memory after changing size')
        time.sleep(self.timetest)
        print('Check memory after deleting initial pandas object')
        del self.df
        time.sleep(self.timetest)
        print('Check memory after deleting copy of reduced pandas object')
        del df_new
        gc.collect()
        time.sleep(self.timetest)

if __name__ == '__main__':
    a = temp_class()
    a.createDataFrame()
    a.changeSize()
Before creating the DataFrame I have approx. 15 MB of memory usage.
After creating: 67 MB.
After changing size: 67 MB.
After deleting the original DataFrame: 35 MB.
After deleting the reduced table: 31 MB.
That leaves 16 MB above the starting point — where did it go?
I use Python 2.7.2 (x32) on a Windows 7 (x64) machine; pandas.version is 0.7.3 and numpy.version is 1.6.1.
Answer
A couple of points to note:
In "Check memory after changing size", you haven't deleted the original DataFrame yet, so at that point the process is using strictly more memory: the original frame and the copied slice are both alive.
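To see that overlap concretely, here's a minimal sketch (written for Python 3 and a current pandas, unlike the question's Python 2.7 code) comparing the footprints of the original frame and the sliced copy with `DataFrame.memory_usage`:

```python
import numpy as np
import pandas as pd

# A frame shaped like the question's, but smaller for a quick run.
df = pd.DataFrame(np.random.randn(100_000, 4), columns=list("abcd"))

# Slicing with .copy() allocates a second, independent block of memory;
# until `del df` runs, BOTH frames are alive, so peak usage is their sum.
df_new = df[0:100].copy()

full_bytes = df.memory_usage(index=True).sum()
small_bytes = df_new.memory_usage(index=True).sum()
print(full_bytes, small_bytes)  # the copy is tiny next to the original
```

The copy is what lets the original be freed later — a plain slice without `.copy()` can keep the parent frame's data alive.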
The Python interpreter is a bit greedy about holding onto OS memory.
I looked into this and can assure you that pandas is not leaking memory. I'm using the memory_profiler (http://pypi.python.org/pypi/memory_profiler) package:
import time, string, pandas, numpy, gc
from memory_profiler import LineProfiler, show_results
import memory_profiler as mprof

prof = LineProfiler()

@prof
def test(nrow=1000000, ncol=4, timetest=5):
    from_ = nrow // 10
    to_ = 9 * nrow // 10
    df = pandas.DataFrame(numpy.random.randn(nrow, ncol),
                          index=numpy.random.randn(nrow),
                          columns=list(string.letters[0:ncol]))
    df_new = df[from_:to_].copy()
    del df
    del df_new
    gc.collect()

test()
# for _ in xrange(10):
#     print mprof.memory_usage()
show_results(prof)
And here's the output:
10:15 ~/tmp $ python profmem.py
Line #    Mem usage  Increment   Line Contents
==============================================
     7                           @prof
     8     28.77 MB    0.00 MB   def test(nrow=1000000, ncol = 4, timetest = 5):
     9     28.77 MB    0.00 MB       from_ = nrow // 10
    10     28.77 MB    0.00 MB       to_ = 9 * nrow // 10
    11     59.19 MB   30.42 MB       df = pandas.DataFrame(numpy.random.randn(nrow, ncol),
    12     66.77 MB    7.58 MB           index = numpy.random.randn(nrow),
    13     90.46 MB   23.70 MB           columns = list(string.letters[0:ncol]))
    14    114.96 MB   24.49 MB       df_new = df[from_:to_].copy()
    15    114.96 MB    0.00 MB       del df
    16     90.54 MB  -24.42 MB       del df_new
    17     52.39 MB  -38.15 MB       gc.collect()
So indeed, there is more memory in use than when we started. But is it leaking?
for _ in xrange(20):
    test()
    print mprof.memory_usage()
And the output:
10:19 ~/tmp $ python profmem.py
[52.3984375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59375]
[122.59765625]
[122.59765625]
[122.59765625]
So what has actually happened is that the Python process is holding on to a pool of memory, sized by what it has been using, to avoid having to keep requesting more memory from (and then returning it to) the host OS. I don't know all the technical details behind this, but that is at least what is going on.
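One way to confirm the "no leak at the Python level" conclusion without installing memory_profiler is the standard-library tracemalloc module (Python 3 only, so an anachronism relative to the question's 2.7 setup, used here purely to illustrate the point): traced Python allocations fall back to near zero after `del` plus `gc.collect()`, so the residual number the task manager shows is the allocator's retained pool, not live pandas objects.

```python
import gc
import tracemalloc

import numpy as np
import pandas as pd

def make_and_drop():
    # Build, slice-copy, and destroy, as in the question's changeSize().
    df = pd.DataFrame(np.random.randn(100_000, 4), columns=list("abcd"))
    df_new = df[0:100].copy()
    del df, df_new
    gc.collect()

tracemalloc.start()
make_and_drop()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# `peak` reflects the ~3+ MB frame while it was alive; `current` drops
# back close to zero because every pandas object really was freed --
# any leftover RSS is the interpreter's retained pool, not a leak.
print(f"peak: {peak / 1e6:.1f} MB, leftover: {current / 1e6:.3f} MB")
```

Note that tracemalloc measures allocations made through Python's allocator, which is exactly why it can distinguish "objects still alive" from "memory the process keeps pooled".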