Efficient serialization of numpy boolean arrays


Problem description

I have hundreds of thousands of NumPy boolean arrays that I would like to use as keys to a dictionary. (The values of this dictionary are the number of times we've observed each of these arrays.) Since NumPy arrays are not hashable, they can't be used as keys themselves. I would like to serialize these arrays as efficiently as possible.

There are two notions of efficiency to address here:

  1. Efficiency in memory usage; smaller is better
  2. Efficiency in computational time serializing and reconstituting the array; less time is better

I'm looking to strike a good balance between these two competing interests; however, efficient memory usage is more important to me, and I'm willing to sacrifice computing time.

There are two properties that I hope will make this task easier:

  1. I can guarantee that all arrays have the same size and shape
  2. The arrays are boolean, which means that they can be represented simply as a sequence of 1s and 0s, i.e. a bit sequence

Is there an efficient Python (2.7, or, if possible, 2.6) data structure that I could serialize these to (perhaps some sort of bytes structure), and could you provide an example of the conversion between an array and this structure, and from the structure back to the original array?

Note that it is not necessary to store information about whether each index was True or False; a structure that simply stored indices where the array was True would be sufficient to reconstitute the array.

A sufficient solution would work for a 1-dimensional array, but a good solution would also work for a 2-dimensional array, and a great solution would work for arrays of even higher dimensions.

Solution

Initially, I suggested using bitarray. However, as rightly pointed out by @senderle, since bitarray is mutable, it can't be used directly as a dict key.

Here is a revised solution (still based on bitarray internally):

import bitarray
import numpy as np

class BoolArray(object):

  # create from an ndarray
  def __init__(self, array):
    ba = bitarray.bitarray()
    ba.pack(array.tostring())   # each one-byte boolean becomes a single bit
    self.arr = ba.tostring()    # immutable string, so it can be hashed
    self.shape = array.shape
    self.size = array.size

  # convert back to an ndarray
  def to_array(self):
    ba = bitarray.bitarray()
    ba.fromstring(self.arr)
    # slice off the padding bits that fill up the last byte
    ret = np.fromstring(ba.unpack(), dtype=np.bool)[:self.size]
    return ret.reshape(self.shape)

  # Python 2 falls back to __cmp__ to decide whether two dict keys are equal
  def __cmp__(self, other):
    return cmp(self.arr, other.arr)

  def __hash__(self):
    return hash(self.arr)

# example: b1 and b2 wrap the same array, so they address the same dict entry
x = (np.random.random((2,3,2))>0.5)
b1 = BoolArray(x)
b2 = BoolArray(x)
d = {b1: 12}
d[b2] += 1
print d
print b1.to_array()

This works with Python 2.5+, requires one bit per array element (padded up to a whole number of bytes) and supports arrays of any shape/dimensions.
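The one-bit-per-element packing can also be sketched with NumPy alone, which may help where bitarray is not installed. This is a minimal sketch for Python 3, not the method given in the answer: it swaps bitarray for `np.packbits` and `np.unpackbits`, and the helper names `to_key` and `from_key` are hypothetical. It leans on the guarantee from the question that all arrays share one shape, so the shape is kept once instead of inside every key.

```python
from collections import Counter

import numpy as np


def to_key(array):
    """Pack a boolean ndarray into an immutable bytes object, 1 bit per element."""
    return np.packbits(array.ravel()).tobytes()


def from_key(key, shape):
    """Rebuild the boolean ndarray of the given shape from its packed bytes."""
    size = int(np.prod(shape))
    bits = np.unpackbits(np.frombuffer(key, dtype=np.uint8))
    # packbits pads the last byte with zeros, so keep only the first `size` bits
    return bits[:size].astype(bool).reshape(shape)


shape = (2, 3, 2)                  # assumed common shape of all arrays
rng = np.random.default_rng(0)

# count how often each distinct array is observed
counts = Counter()
for _ in range(1000):
    x = rng.random(shape) > 0.5
    counts[to_key(x)] += 1

key = to_key(x)
print(len(key), x.nbytes)          # 2 bytes packed vs. 12 bytes unpacked
print(from_key(key, shape))
```

Plain `bytes` keys avoid the per-object overhead of a wrapper class, which matters when memory is the priority; the cost is that the caller has to remember the shape.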

EDIT: In recent versions of bitarray, replace ba.tostring and ba.fromstring with ba.tobytes and ba.frombytes (the old names have been deprecated since version 0.4.0).
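A port to Python 3 needs one further change beyond the renamed methods: Python 3 ignores `__cmp__` and has no `cmp` builtin, so dict-key equality has to come from `__eq__`. The following is a hedged sketch of such a port, not the answer's code. It assumes NumPy's `np.packbits` as a stand-in for bitarray so that it has no extra dependency, and unlike the original it also compares shapes.

```python
import numpy as np


class BoolArray:
    """Hashable snapshot of a boolean ndarray, stored as 1 bit per element."""

    __slots__ = ("data", "shape")

    def __init__(self, array):
        array = np.asarray(array, dtype=bool)
        self.data = np.packbits(array).tobytes()   # immutable, hence hashable
        self.shape = array.shape

    def to_array(self):
        size = int(np.prod(self.shape))
        bits = np.unpackbits(np.frombuffer(self.data, dtype=np.uint8))
        # drop the zero padding added to fill the last byte
        return bits[:size].astype(bool).reshape(self.shape)

    # Python 3 uses __eq__ (not __cmp__) to compare dict keys
    def __eq__(self, other):
        if not isinstance(other, BoolArray):
            return NotImplemented
        return self.shape == other.shape and self.data == other.data

    def __hash__(self):
        return hash((self.shape, self.data))


x = np.random.random((2, 3, 2)) > 0.5
b1 = BoolArray(x)
b2 = BoolArray(x)
d = {b1: 12}
d[b2] += 1           # b2 equals b1, so the existing entry is updated
print(d[b1])         # 13
print(b1.to_array())
```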
