Why does pickle eat memory?

Problem description

I am trying to deal with writing a huge amount of pickled data to disk in small pieces. Here is the example code:

from cPickle import dumps       # cPickle: the C implementation of pickle (Python 2)
from gc import collect

PATH = r'd:\test.dat'

@profile                        # decorator provided by memory_profiler
def func(item):
    for e in item:
        f = open(PATH, 'a', 0)  # unbuffered append, so the file object caches nothing
        f.write(dumps(e))
        f.flush()
        f.close()
        del f
        collect()               # force garbage collection on every iteration

if __name__ == '__main__':
    k = [x for x in xrange(9999)]
    func(k)

open() and close() are placed inside the loop to rule out accumulation of data in memory as a possible cause.

To illustrate the problem, I attach the results of memory profiling obtained with the third-party Python module memory_profiler:

   Line #    Mem usage  Increment   Line Contents
==============================================
    14                           @profile
    15      9.02 MB    0.00 MB   def func(item):
    16      9.02 MB    0.00 MB       path= r'd:\test.dat'
    17
    18     10.88 MB    1.86 MB       for e in item:
    19     10.88 MB    0.00 MB           f = open(path, 'a', 0)
    20     10.88 MB    0.00 MB           f.write(dumps(e))
    21     10.88 MB    0.00 MB           f.flush()
    22     10.88 MB    0.00 MB           f.close()
    23     10.88 MB    0.00 MB           del f
    24                                   collect()

During execution of the loop, strange memory usage growth occurs. How can it be eliminated? Any thoughts?

When the amount of input data increases, the volume of this additional data can grow to a size much greater than the input (update: in the real task I get 300+ MB).

And a broader question: what ways exist to properly work with large amounts of IO data in Python?

Update: I rewrote the code leaving only the loop body to see specifically when the growth happens; here are the results:

Line #    Mem usage  Increment   Line Contents
==============================================
    14                           @profile
    15      9.00 MB    0.00 MB   def func(item):
    16      9.00 MB    0.00 MB       path= r'd:\test.dat'
    17
    18                               #for e in item:
    19      9.02 MB    0.02 MB       f = open(path, 'a', 0)
    20      9.23 MB    0.21 MB       d = dumps(item)
    21      9.23 MB    0.00 MB       f.write(d)
    22      9.23 MB    0.00 MB       f.flush()
    23      9.23 MB    0.00 MB       f.close()
    24      9.23 MB    0.00 MB       del f
    25      9.23 MB    0.00 MB       collect()

It seems like dumps() eats the memory. (Though I actually thought it would be write().)
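One way to see this directly (a minimal sketch added here for illustration, not part of the original post; the printed sizes vary by platform) is to measure the intermediate string that dumps() returns before anything reaches the file:

import sys
from cPickle import dumps

item = [x for x in xrange(9999)]
d = dumps(item)            # the complete pickle is materialized here, in one string
print len(d)               # total serialized size in bytes
print sys.getsizeof(d)     # memory held by the intermediate buffer itself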

Recommended answer

Pickle consumes a lot of RAM; see the explanation here: http://www.shocksolution.com/2010/01/storing-large-numpy-arrays-on-disk-python-pickle-vs-hdf5adsf/

Why does Pickle consume so much more memory? The reason is that HDF is a binary data pipe, while Pickle is an object serialization protocol. Pickle actually consists of a simple virtual machine (VM) that translates an object into a series of opcodes and writes them to disk. To unpickle something, the VM reads and interprets the opcodes and reconstructs the object. The downside of this approach is that the VM has to construct a complete copy of the object in memory before it writes it to disk.
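If that in-memory copy is the problem, one common workaround (a minimal sketch in the question's Python 2 style, not taken from the linked article) is to hand the pickler the file object directly, so opcodes stream to disk instead of into one big string:

from cPickle import Pickler

PATH = r'd:\test.dat'                # same path as in the question

def write_records(items):
    f = open(PATH, 'wb')
    p = Pickler(f, 2)                # binary protocol; opcodes go straight to the file
    for e in items:
        p.dump(e)                    # appends one complete pickle per element
        p.clear_memo()               # drop memo references between records
    f.close()

if __name__ == '__main__':
    write_records([x for x in xrange(9999)])

Reading the records back is symmetric: call load() on a cPickle.Unpickler over the same file repeatedly until EOFError signals the end.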

Pickle is great for small use cases or testing because in most cases the memory consumption doesn't matter a lot.

For intensive work where you have to dump and load many files and/or big files, you should consider using another way to store your data (e.g. HDF, or writing your own serialize/deserialize methods for your objects, ...).
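For example, numeric data can go into an HDF5 dataset instead of a pickle (a minimal sketch assuming the third-party h5py and numpy packages; the file and dataset names are made up):

import numpy as np
import h5py

data = np.arange(9999)                   # stand-in for the real payload

f = h5py.File('test.h5', 'w')            # hypothetical file name
f.create_dataset('data', data=data, compression='gzip')
f.close()

f = h5py.File('test.h5', 'r')
restored = f['data'][:]                  # read back as a NumPy array
f.close()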
