How to persist a Python class whose member variables are also Python classes holding large `numpy` arrays (so `pickle` is no longer efficient)?

Problem description

The use case: a Python class stores large numpy arrays (large, but small enough that working with them in memory is a breeze) in a useful structure. Here's a cartoon of the situation:

main class: Environment; stores useful information pertinent to all balls

"child" class: Ball; stores information pertinent to one particular ball

Environment member variable: balls_in_environment (list of Balls)

Ball member variable: large_numpy_array (an NxN numpy array that is large, but still easy to work with in memory)
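The structure described above can be sketched roughly as follows. This is only an illustration: the constructor arguments and the `label` and `gravity` attributes are assumptions, not part of the original question.

```python
import numpy as np


class Ball:
    """Hypothetical 'child' class holding one large N x N array."""

    def __init__(self, n, label):
        self.label = label  # small, non-numpy attribute (assumed)
        self.large_numpy_array = np.zeros((n, n))


class Environment:
    """Hypothetical main class that owns every Ball."""

    def __init__(self, n_balls, n):
        self.gravity = 9.81  # information shared by all balls (assumed)
        self.balls_in_environment = [
            Ball(n, "ball-%d" % i) for i in range(n_balls)
        ]


env = Environment(n_balls=3, n=1000)

# Nearly all of the data to persist lives in the arrays, not in the objects.
total_bytes = sum(b.large_numpy_array.nbytes for b in env.balls_in_environment)
print("array payload: %.1f MB" % (total_bytes / 1e6))
```

The point of the sketch is that the object graph itself is tiny; whatever persistence scheme is chosen mainly has to handle the array payload well.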

I would prefer to persist Environment as a whole.

Some options:

  • pickle: too slow, and it produces output that takes up a lot of space on the hard drive

  • database: too much work; I could store the important information from the class in a database (which requires writing functions to take the info from the class and put it into the DB) and later rebuild the class by creating a new instance and refilling it with data from the DB (which requires writing functions to do the rebuilding)

  • JSON: I am not very familiar with JSON, but Python has a standard library module for it, and it is the solution recommended by this article. I don't see how JSON would be more compact than pickle, though; more importantly, it doesn't deal nicely with numpy

  • MessagePack: another package recommended by the same article; however, I have never heard of it, and don't want to strike out into the unknown for what seems to be a standard problem

  • numpy.save + something else: store the numpy arrays associated with each Ball using numpy.save, and store the non-numpy data separately somehow (tedious)?
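For a sense of how tedious the last option is, here is a minimal sketch of it, assuming the non-numpy state is small and JSON-serializable. The class layout, the helper names `save_environment` and `load_environment`, and the file names are all hypothetical; `np.savez` is used in place of one `np.save` call per Ball so that all arrays land in a single file.

```python
import json
import os
import tempfile

import numpy as np


class Ball:
    def __init__(self, label, array):
        self.label = label
        self.large_numpy_array = array


class Environment:
    def __init__(self, name, balls):
        self.name = name
        self.balls_in_environment = balls


def save_environment(env, directory):
    """Write arrays to arrays.npz and everything else to meta.json."""
    os.makedirs(directory, exist_ok=True)
    arrays = {"ball_%d" % i: b.large_numpy_array
              for i, b in enumerate(env.balls_in_environment)}
    np.savez(os.path.join(directory, "arrays.npz"), **arrays)
    meta = {"name": env.name,
            "labels": [b.label for b in env.balls_in_environment]}
    with open(os.path.join(directory, "meta.json"), "w") as f:
        json.dump(meta, f)


def load_environment(directory):
    """Rebuild an Environment from the two files written above."""
    with open(os.path.join(directory, "meta.json")) as f:
        meta = json.load(f)
    with np.load(os.path.join(directory, "arrays.npz")) as data:
        balls = [Ball(label, data["ball_%d" % i])
                 for i, label in enumerate(meta["labels"])]
    return Environment(meta["name"], balls)


env = Environment("demo", [Ball("a", np.random.randn(200, 200)),
                           Ball("b", np.random.randn(200, 200))])
out_dir = os.path.join(tempfile.mkdtemp(), "env")
save_environment(env, out_dir)
restored = load_environment(out_dir)
```

The cost is that every new attribute on either class has to be added to both helper functions by hand, which is exactly the tedium the question mentions.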

What is the best option for my use case?

Recommended answer

As I mentioned in the comments, joblib.dump might be a good option. It uses np.save to store numpy arrays efficiently, and cPickle for everything else:

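try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3: the C implementation is used automatically
import os

import joblib
import numpy as np


class SerializationTest(object):
    def __init__(self):
        self.array = np.random.randn(1000, 1000)

st = SerializationTest()
fnames = ['cpickle.pkl', 'numpy_save.npy', 'joblib.pkl']

# using cPickle (pickle files must be opened in binary mode)
with open(fnames[0], 'wb') as f:
    pickle.dump(st, f)

# using np.save (st is not an array, so numpy wraps it in a 0-d object
# array and pickles it)
np.save(fnames[1], st)

# using joblib.dump (without compression)
joblib.dump(st, fnames[2])

# check file sizes
for fname in fnames:
    print('%15s: %8.2f KB' % (fname, os.stat(fname).st_size / 1E3))
# Sizes reported in the original answer (Python 2; the tiny joblib.pkl is
# likely because older joblib releases wrote the array to a separate
# companion file):
#     cpickle.pkl: 23695.56 KB
#  numpy_save.npy:  8000.33 KB
#      joblib.pkl:     0.18 KB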
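To complete the round trip, a dumped object can be read back with joblib.load. The following is a minimal sketch applied to the Environment/Ball layout from the question, assuming a reasonably recent joblib; the class definitions, the file name, and the choice of `compress=3` are illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np


class Ball:
    def __init__(self, n):
        self.large_numpy_array = np.random.randn(n, n)


class Environment:
    def __init__(self, n_balls, n):
        self.balls_in_environment = [Ball(n) for _ in range(n_balls)]


env = Environment(n_balls=3, n=500)
path = os.path.join(tempfile.mkdtemp(), "environment.joblib")

# joblib.dump returns the list of file names it wrote.
# compress=3 is an arbitrary middle-of-the-road compression level.
written = joblib.dump(env, path, compress=3)

# The whole object graph comes back in one call.
restored = joblib.load(path)

print("wrote %s (%.1f KB)" % (written[0], os.stat(path).st_size / 1e3))
```

As with plain pickle, the class definitions must be importable in the process that calls joblib.load, and files from untrusted sources should not be loaded.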

One potential downside is that, because joblib.dump uses cPickle to serialize Python objects, the resulting files are not portable from Python 2 to 3. For better portability you could look into using HDF5, e.g. here.
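As a rough idea of what the HDF5 route could look like, here is a sketch using the h5py package. It is not taken from the original answer: the group and dataset names, the helper functions, and the class layout are all assumptions, and it presumes the non-array attributes are simple strings or numbers that fit in HDF5 attributes.

```python
import os
import tempfile

import h5py
import numpy as np


class Ball:
    def __init__(self, label, array):
        self.label = label
        self.large_numpy_array = array


class Environment:
    def __init__(self, name, balls):
        self.name = name
        self.balls_in_environment = balls


def save_environment(env, path):
    """One HDF5 group per Ball; small values go into attributes."""
    with h5py.File(path, "w") as f:
        f.attrs["name"] = env.name
        for i, ball in enumerate(env.balls_in_environment):
            grp = f.create_group("ball_%d" % i)
            grp.attrs["label"] = ball.label
            grp.create_dataset("large_numpy_array",
                               data=ball.large_numpy_array,
                               compression="gzip")


def load_environment(path):
    """Rebuild the Environment by walking the groups in index order."""
    with h5py.File(path, "r") as f:
        balls = []
        for i in range(len(f)):
            grp = f["ball_%d" % i]
            # [()] reads the whole dataset into an in-memory numpy array
            balls.append(Ball(grp.attrs["label"],
                              grp["large_numpy_array"][()]))
        return Environment(f.attrs["name"], balls)


env = Environment("demo", [Ball("a", np.random.randn(200, 200)),
                           Ball("b", np.random.randn(200, 200))])
h5_path = os.path.join(tempfile.mkdtemp(), "environment.h5")
save_environment(env, h5_path)
restored = load_environment(h5_path)
```

Unlike a pickle-based file, the result does not depend on the Python version or on the class definitions, at the cost of writing the save and load functions by hand.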
