How to hash a large object (dataset) in Python?

Problem description

I would like to calculate a hash of a Python class containing a dataset for Machine Learning. The hash is meant to be used for caching, so I was thinking of md5 or sha1. The problem is that most of the data is stored in NumPy arrays; these do not provide a __hash__() member. Currently I do a pickle.dumps() for each member and calculate a hash based on these strings. However, I found the following links indicating that the same object could lead to different serialization strings:

What would be the best method to calculate a hash for a Python class containing NumPy arrays?

Solution

Thanks to John Montgomery I think I have found a solution, and I think it has less overhead than converting every number in possibly huge arrays to strings:

I can create a byte view of the arrays and use these to update the hash. This gives the same digest as updating directly with the array, because the view shares the array's memory and hashlib reads that underlying buffer either way:

>>> import hashlib
>>> import numpy
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> print(a.dtype, b.dtype)  # a and b have different data types
float64 uint8
>>> hashlib.sha1(a).hexdigest()  # sha1 of the array
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
>>> hashlib.sha1(b).hexdigest()  # sha1 of the byte view
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
