Efficient serialization of numpy boolean arrays
Question
I have hundreds of thousands of NumPy boolean arrays that I would like to use as keys to a dictionary. (The values of this dictionary are the number of times we've observed each of these arrays.) Since NumPy arrays are not hashable, they can't be used as keys themselves, so I would like to serialize these arrays as efficiently as possible.
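As a quick illustration of the constraint, the sketch below shows that hashing an ndarray raises TypeError, while an immutable bytes snapshot of the same data can serve as a dict key. The variable names are illustrative only, and `tobytes` assumes a NumPy release that provides it (older releases spell it `tostring`).

```python
import numpy as np

mask = np.array([True, False, True, True])

# ndarrays are mutable, so hashing one raises TypeError
try:
    hash(mask)
    hashable = True
except TypeError:
    hashable = False

# An immutable bytes copy of the buffer is hashable.
# Note this naive form still spends one byte per element.
counts = {}
key = mask.tobytes()
counts[key] = counts.get(key, 0) + 1

# A second, equal array maps to the same key
same_key = mask.copy().tobytes()
counts[same_key] = counts.get(same_key, 0) + 1
```

This naive key is the baseline that a bit-packed representation improves on by roughly a factor of eight.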
There are two notions of efficiency to address here:
- Efficiency in memory usage; smaller is better
- Efficiency in computational time serializing and reconstituting the array; less time is better
I'm looking to strike a good balance between these two competing interests. However, efficient memory usage is more important to me, and I'm willing to sacrifice computing time for it.
There are two properties that I hope will make this task easier:
- I can guarantee that all arrays have the same size and shape
- The arrays are boolean, which means they can be represented simply as a sequence of 1s and 0s, i.e. a bit sequence
Is there an efficient Python (2.7, or, if possible, 2.6) data structure that I could serialize these to (perhaps some sort of bytes structure), and could you provide an example of the conversion between an array and this structure, and from the structure back to the original array?
Note that it is not necessary to store information about whether each index was True or False; a structure that simply stored the indices where the array was True would be sufficient to reconstitute the array.
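The index-only idea can be sketched as follows. This is a minimal illustration, not part of the original post: the helper names `to_index_key` and `from_index_key` are hypothetical, and the shape is assumed to be known separately, since all arrays share it.

```python
import numpy as np

def to_index_key(arr):
    """Return a hashable tuple of the flat positions where arr is True."""
    return tuple(np.flatnonzero(arr).tolist())

def from_index_key(key, shape):
    """Rebuild the boolean array from the stored positions and the known shape."""
    flat = np.zeros(int(np.prod(shape)), dtype=bool)
    flat[np.array(key, dtype=np.intp)] = True
    return flat.reshape(shape)

x = np.array([[True, False, False],
              [False, True, True]])
key = to_index_key(x)                    # flat (row-major) positions of True
restored = from_index_key(key, x.shape)
```

Each stored index costs far more than one bit, so this form tends to pay off only when few entries are True; for dense arrays a packed bit sequence is smaller.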
A sufficient solution would work for a 1-dimensional array, but a good solution would also work for a 2-dimensional array, and a great solution would work for arrays of even higher dimensions.
Initially, I suggested using bitarray. However, as rightly pointed out by @senderle, since bitarray is mutable, it can't be used directly as a dict key.
Here is a revised solution (still based on bitarray internally):
    import bitarray
    import numpy as np

    class BoolArray(object):

        # create from an ndarray
        def __init__(self, array):
            ba = bitarray.bitarray()
            ba.pack(array.tostring())
            self.arr = ba.tostring()
            self.shape = array.shape
            self.size = array.size

        # convert back to an ndarray
        def to_array(self):
            ba = bitarray.bitarray()
            ba.fromstring(self.arr)
            ret = np.fromstring(ba.unpack(), dtype=np.bool)[:self.size]
            return ret.reshape(self.shape)

        def __cmp__(self, other):
            return cmp(self.arr, other.arr)

        def __hash__(self):
            return hash(self.arr)

    x = (np.random.random((2,3,2))>0.5)
    b1 = BoolArray(x)
    b2 = BoolArray(x)
    d = {b1: 12}
    d[b2] += 1
    print d
    print b1.to_array()
This works with Python 2.5+, requires one bit per array element and supports arrays of any shape/dimensions.
EDIT: In recent versions of bitarray, you have to replace ba.tostring and ba.fromstring with ba.tobytes and ba.frombytes (the old names are deprecated since version 0.4.0).
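For readers without bitarray, or on Python 3 where `cmp` and `__cmp__` no longer exist, a similar one-bit-per-element key can be sketched with NumPy alone. This swaps in `np.packbits` and `np.unpackbits` in place of bitarray, so it is an alternative rather than the answer's method; the helper names `pack_bool_array` and `unpack_bool_array` are hypothetical, and the shape is assumed to be stored once, since all arrays share it.

```python
import numpy as np

def pack_bool_array(arr):
    """Pack a boolean ndarray into an immutable bytes object, 8 elements per byte."""
    return np.packbits(arr.ravel()).tobytes()

def unpack_bool_array(key, shape):
    """Invert pack_bool_array, given the original shape."""
    size = int(np.prod(shape))
    # unpackbits yields padded bits up to a whole byte; drop the padding
    bits = np.unpackbits(np.frombuffer(key, dtype=np.uint8))[:size]
    return bits.astype(bool).reshape(shape)

x = np.random.random((2, 3, 2)) > 0.5

# Count observations; equal arrays produce equal bytes keys
counts = {}
for arr in (x, x.copy()):
    key = pack_bool_array(arr)
    counts[key] = counts.get(key, 0) + 1

restored = unpack_bool_array(key, x.shape)
```

Because the key is a plain bytes object, it is hashable and comparable without a wrapper class, and the 12-element example array fits in 2 bytes.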