Pandas: where's the memory leak here?


Problem description

I'm running into what looks like a memory leak when using the pandas library in Python. I create pandas.DataFrame objects in my class, and I have a method that changes the DataFrame's size according to my conditions. After resizing and creating the new pandas object, I overwrite the original pandas.DataFrame in my class. But memory usage stays very high even after significantly reducing the initial table. Some code for a short example (I didn't write a process monitor — see Task Manager):

import time, string, pandas, numpy, gc
class temp_class ():

    def __init__(self, nrow = 1000000, ncol = 4, timetest = 5):

        self.nrow = nrow
        self.ncol = ncol
        self.timetest = timetest

    def createDataFrame(self):

        print('Check memory before dataframe creating')
        time.sleep(self.timetest)
        self.df = pandas.DataFrame(numpy.random.randn(self.nrow, self.ncol),
            index = numpy.random.randn(self.nrow), columns = list(string.letters[0:self.ncol]))
        print('Check memory after dataFrame creating')
        time.sleep(self.timetest)

    def changeSize(self, from_ = 0, to_ = 100):

        df_new = self.df[from_:to_].copy()
        print('Check memory after changing size')
        time.sleep(self.timetest)

        print('Check memory after deleting initial pandas object')
        del self.df
        time.sleep(self.timetest)

        print('Check memory after deleting copy of reduced pandas object')
        del df_new
        gc.collect()
        time.sleep(self.timetest)

if __name__== '__main__':

    a = temp_class()
    a.createDataFrame()
    a.changeSize()

  • Before creating the DataFrame, I have approx. 15 MB of memory usage

  • After creating it: 67 MB

  • After changing the size: 67 MB

  • After deleting the original DataFrame: 35 MB

  • After deleting the reduced table: 31 MB

  16 MB?

I use Python 2.7.2 (x32) on a Windows 7 (x64) machine; pandas.version is 0.7.3, numpy.version is 1.6.1.
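The per-step readings above came from Task Manager. For reference, peak process memory can also be sampled from inside Python itself. A minimal sketch using the standard-library resource module (Unix-only — it does not exist on Windows, where a third-party package such as psutil would be needed instead):

```python
import resource

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS),
    # so this conversion assumes a Linux system.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print('peak RSS: %.1f MB' % peak_rss_mb())
```

Sampling this before and after each step gives the same kind of trace as watching Task Manager, without alt-tabbing.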

Accepted answer

A couple of points to make here:

1. In "Check memory after changing size", you haven't deleted the original DataFrame yet, so at that point you will be using strictly more memory.

2. The Python interpreter is a bit greedy about holding onto OS memory.
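This interpreter-level pooling can be observed with the standard-library tracemalloc module (Python 3.4+, so newer than the interpreter used in the question): it counts Python-level allocations, which do drop after `del` even when the OS-level numbers shown in Task Manager do not.

```python
import tracemalloc

tracemalloc.start()
data = [0.0] * 5_000_000  # a list holding ~40 MB of references
peak_before_free = tracemalloc.get_traced_memory()[1]  # peak traced bytes

del data  # Python releases the objects immediately...
current_after_free = tracemalloc.get_traced_memory()[0]  # current traced bytes
tracemalloc.stop()

# ...even though the process may keep the freed pages pooled for reuse,
# which is why OS-level tools still show a larger footprint.
print(peak_before_free, current_after_free)
```

So the interpreter genuinely freed the memory; it just may not hand the pages back to the OS right away.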

I looked into this and can assure you that pandas is not leaking memory. I'm using the memory_profiler (http://pypi.python.org/pypi/memory_profiler) package:

      import time, string, pandas, numpy, gc
      from memory_profiler import LineProfiler, show_results
      import memory_profiler as mprof
      
      prof = LineProfiler()
      
      @prof
      def test(nrow=1000000, ncol = 4, timetest = 5):
          from_ = nrow // 10
          to_ = 9 * nrow // 10
          df = pandas.DataFrame(numpy.random.randn(nrow, ncol),
                                index = numpy.random.randn(nrow),
                                columns = list(string.letters[0:ncol]))
          df_new = df[from_:to_].copy()
          del df
          del df_new
          gc.collect()
      
      test()
      # for _ in xrange(10):
      #     print mprof.memory_usage()
      
      show_results(prof)
      

Here's the output:

      10:15 ~/tmp $ python profmem.py 
      Line #    Mem usage  Increment   Line Contents
      ==============================================
           7                           @prof
           8     28.77 MB    0.00 MB   def test(nrow=1000000, ncol = 4, timetest = 5):
           9     28.77 MB    0.00 MB       from_ = nrow // 10
          10     28.77 MB    0.00 MB       to_ = 9 * nrow // 10
          11     59.19 MB   30.42 MB       df = pandas.DataFrame(numpy.random.randn(nrow, ncol),
          12     66.77 MB    7.58 MB                             index = numpy.random.randn(nrow),
          13     90.46 MB   23.70 MB                             columns = list(string.letters[0:ncol]))
          14    114.96 MB   24.49 MB       df_new = df[from_:to_].copy()
          15    114.96 MB    0.00 MB       del df
          16     90.54 MB  -24.42 MB       del df_new
          17     52.39 MB  -38.15 MB       gc.collect()
      

So indeed, there is more memory in use than when we started. But is it leaking?

      for _ in xrange(20):
          test()
          print mprof.memory_usage()
      

And the output:

      10:19 ~/tmp $ python profmem.py 
      [52.3984375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59375]
      [122.59765625]
      [122.59765625]
      [122.59765625]
      

So what's actually going on is that the Python process is holding on to a pool of memory, sized by what it has been using, to avoid having to keep requesting more memory from the host OS (and then freeing it again). I don't know all the technical details behind this, but that is at least what is happening.
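As a cross-check of this explanation, modern pandas versions (well after the 0.7.3 used in the question) expose DataFrame.memory_usage, which reports the bytes held by the DataFrame objects themselves rather than by the process. A minimal sketch, assuming a current pandas/NumPy install:

```python
import numpy
import pandas

nrow, ncol = 1_000_000, 4
df = pandas.DataFrame(numpy.random.randn(nrow, ncol),
                      columns=list('abcd'))

# Per-column footprint in bytes; deep=True also counts any Python objects.
full_bytes = df.memory_usage(deep=True).sum()

# A copy of the first 1000 rows should be roughly 1000x smaller.
df_small = df[:1000].copy()
small_bytes = df_small.memory_usage(deep=True).sum()

print(full_bytes, small_bytes)
```

The object-level numbers shrink exactly as expected after slicing, which confirms the gap seen in Task Manager comes from the interpreter's memory pooling, not from pandas retaining the data.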

