How to hash a large object (dataset) in Python?
Problem description
I would like to calculate a hash of a Python class containing a dataset for machine learning. The hash is meant to be used for caching, so I was thinking of md5 or sha1.
The problem is that most of the data is stored in NumPy arrays, which do not provide a __hash__() member. Currently I do a pickle.dumps() for each member and calculate a hash based on these strings. However, I found the following links indicating that the same object could lead to different serialization strings:
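The pickling approach described above can be sketched as follows. This is a minimal illustration, not the author's exact code; `pickle_hash` is a hypothetical helper name. Note that pickle output is not guaranteed to be stable across Python or library versions, which is exactly the caching hazard mentioned:

```python
import hashlib
import pickle

import numpy as np

def pickle_hash(obj):
    """Hash an object via its pickle serialization.
    Caveat: pickle bytes are only reproducible within one
    Python/NumPy version, so cache keys may silently change
    after an upgrade."""
    return hashlib.md5(pickle.dumps(obj)).hexdigest()

a = np.arange(10, dtype=np.float64)
print(pickle_hash(a))  # a 32-character md5 hex digest
```

Within a single interpreter session this is deterministic, but two different environments may serialize the same array differently.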
What would be the best method to calculate a hash for a Python class containing NumPy arrays?
Thanks to John Montgomery I think I have found a solution, one with less overhead than converting every number in a possibly huge array to a string:

I can create a byte view of the arrays and use it to update the hash. Somehow this seems to give the same digest as updating directly with the array:
>>> import hashlib
>>> import numpy
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> print(a.dtype, b.dtype)  # a and b have different data types
float64 uint8
>>> hashlib.sha1(a).hexdigest()  # array sha1
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
>>> hashlib.sha1(b).hexdigest()  # byte-view sha1
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
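The byte-view idea above generalizes to a whole dataset: feed each array's raw bytes into one incremental digest. The sketch below is an assumed extension, not part of the original answer; `dataset_sha1` is a hypothetical helper. Two caveats it handles explicitly: raw bytes carry no shape or dtype information (arrays with identical bytes but different shapes would otherwise collide), and the buffer trick requires C-contiguous memory:

```python
import hashlib

import numpy as np

def dataset_sha1(*arrays):
    """Compute one SHA-1 over several NumPy arrays by streaming
    their raw bytes into an incremental digest."""
    h = hashlib.sha1()
    for arr in arrays:
        # np.ascontiguousarray copies only if the array is not
        # already C-contiguous (e.g. a transposed view).
        c = np.ascontiguousarray(arr)
        # Mix in dtype and shape so that, e.g., a (2, 3) and a
        # (3, 2) array with identical bytes hash differently.
        h.update(str(c.dtype).encode())
        h.update(str(c.shape).encode())
        # The uint8 view exposes the raw bytes without copying.
        h.update(c.view(np.uint8))
    return h.hexdigest()

X = np.random.rand(10, 100)
y = np.arange(10)
print(dataset_sha1(X, y))  # a 40-character sha1 hex digest
```

For a dataset class, one could call this with each array member in a fixed order, since the digest depends on update order.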