Why is dill much faster and more disk-efficient than pickle for NumPy arrays?


Question

I'm using Python 2.7 and NumPy 1.11.2 on Ubuntu 16.04, together with the latest version of dill (I just ran pip install dill).

When I store a NumPy array with pickle, I find that pickle is very slow and writes the array at almost three times the "necessary" size.

For example, in the following code, pickle is approximately 50 times slower than dill (50s versus 1s) and creates a file that is 2.2GB instead of 800MB.

import numpy
import pickle
import dill

# 10000 x 10000 float64 values, roughly 800MB of raw data
B = numpy.random.rand(10000, 10000)
with open('dill', 'wb') as fp:
    dill.dump(B, fp)
with open('pickle', 'wb') as fp:
    pickle.dump(B, fp)

I thought dill was just a wrapper around pickle. If that is true, is there a way I can improve the performance of pickle myself? Is it generally inadvisable to use pickle for NumPy arrays?

With Python 3, I get the same performance from pickle and dill.

PS: I know about numpy.save, but I am working in a framework where I store lots of different objects, all residing in a dictionary, to a single file.

Recommended answer

This ought to be a comment, but I don't have enough reputation... My guess is that this is due to the pickle protocol used.

On Python 2, the default protocol is 0 and the highest supported protocol is 2. On Python 3, the default protocol is 3 and the highest supported protocol is 4 (as of Python 3.6).
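
To check which protocols a given interpreter uses, something like the following minimal sketch should work. It assumes Python 3, where the pickle module exposes DEFAULT_PROTOCOL alongside HIGHEST_PROTOCOL; the values printed depend on the Python version, so none are shown here.

```python
import pickle

# Protocol used when pickle.dump() is called without a protocol argument.
default_protocol = pickle.DEFAULT_PROTOCOL
# Newest protocol this interpreter is able to write.
highest_protocol = pickle.HIGHEST_PROTOCOL

print("default protocol:", default_protocol)
print("highest protocol:", highest_protocol)
```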

Each protocol version improves on the previous one, but protocol 0 is especially slow for largish objects. It should be avoided in most cases, unless you need to read your pickles with extremely old versions of Python. Protocol 2 is already much better.
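
A rough way to see the difference is to compare the size of the same data pickled with protocol 0 and protocol 2. The sketch below is only an illustration: it uses a plain list of Python floats as a stand-in for array data so that it needs nothing beyond the standard library, and the names values, text_pickle and binary_pickle are made up for this example. NumPy arrays are pickled through their own mechanism, so the exact ratio for an array will differ.

```python
import pickle
import random

random.seed(0)
# Stand-in data: 10000 floats (hypothetical example, not the asker's array).
values = [random.random() for _ in range(10000)]

# Protocol 0 is a text format; protocol 2 is a binary format.
text_pickle = pickle.dumps(values, protocol=0)
binary_pickle = pickle.dumps(values, protocol=2)

print("protocol 0:", len(text_pickle), "bytes")
print("protocol 2:", len(binary_pickle), "bytes")

# Both pickles decode back to the same list.
assert pickle.loads(text_pickle) == pickle.loads(binary_pickle) == values
```

Protocol 0 writes each float as its decimal text representation, while protocol 2 stores it as 8 binary bytes plus an opcode, which is why the protocol 0 output is larger here.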

Now, I suppose dill uses pickle.HIGHEST_PROTOCOL by default, and if that is indeed the case, it would probably account for a good deal of the speed difference. You could try passing pickle.HIGHEST_PROTOCOL explicitly to see whether dill and standard pickle then give similar performance.

with open('dill', 'wb') as fp:
    dill.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)
with open('pickle', 'wb') as fp:
    pickle.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)
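
For the case mentioned in the question, where many different objects live in one dictionary, the same protocol argument can be applied to the whole dictionary. The following is a minimal sketch under that assumption: the helper dump_highest and the contents of store are hypothetical, and an in-memory buffer stands in for the file.

```python
import io
import pickle


def dump_highest(obj, fp):
    """Pickle obj to the binary file object fp using the newest protocol."""
    pickle.dump(obj, fp, protocol=pickle.HIGHEST_PROTOCOL)


# Hypothetical dictionary holding several kinds of objects.
store = {"name": "run-1", "weights": [0.1, 0.2, 0.3], "meta": {"epochs": 5}}

buffer = io.BytesIO()  # stands in for open('some_file', 'wb')
dump_highest(store, buffer)

# Rewind and read the dictionary back to confirm the round trip.
buffer.seek(0)
restored = pickle.load(buffer)
print(restored == store)
```

No protocol needs to be given when loading, since pickle detects it from the data.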
