How to hash a large object (dataset) in Python?


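Problem description

I would like to calculate a hash of a Python class that contains a dataset for machine learning. The hash is meant to be used for caching, so I was thinking of md5 or sha1.
The problem is that most of the data is stored in NumPy arrays, and these do not provide a __hash__() member. Currently I call pickle.dumps() on each member and calculate a hash from the resulting strings. However, I found the following links, which indicate that the same object can produce different serialization strings:


  • Hash of None varies per machine

  • Pickle.dumps not suitable for hashing



What would be the best method to calculate a hash for a Python class containing NumPy arrays?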

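Solution

Thanks to John Montgomery, I think I have found a solution, and I think it has less overhead than converting every number in possibly huge arrays to strings.

I can create a byte view of the arrays and use it to update the hash. This gives the same digest as updating directly with the array, because hashlib reads either object through the buffer protocol and so hashes the same underlying bytes: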

    >>> import hashlib
    >>> import numpy
    >>> a = numpy.random.rand(10, 100)
    >>> b = a.view(numpy.uint8)
    >>> print(a.dtype, b.dtype)  # a and b have different data types
    float64 uint8
    >>> hashlib.sha1(a).hexdigest()  # sha1 of the array
    '794de7b1316b38d989a9040e6e26b9256ca3b5eb'
    >>> hashlib.sha1(b).hexdigest()  # sha1 of the byte view
    '794de7b1316b38d989a9040e6e26b9256ca3b5eb'
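To extend the byte-view idea to a whole dataset, something like the following might work. It is only a sketch under a few assumptions: the dataset is represented as a dict mapping names to numeric arrays, and the helper names `update_with_array` and `dataset_digest` are hypothetical, not part of hashlib or NumPy. It also mixes the dtype and shape into the hash, since the raw bytes alone cannot tell a 2x3 array from a 3x2 one, and it makes non-contiguous arrays contiguous first.

```python
import hashlib
import numpy as np


def update_with_array(hasher, arr):
    """Feed dtype, shape, and raw bytes of a numeric array into `hasher`.

    Hypothetical helper for illustration; assumes a numeric (non-object) dtype.
    """
    arr = np.asarray(arr)
    # Copies only if the array is not already C-contiguous.
    data = np.ascontiguousarray(arr)
    # Mix in dtype and shape so equal bytes with a different layout
    # (for example 2x3 versus 3x2) give different digests.
    hasher.update(str(arr.dtype).encode("utf-8"))
    hasher.update(str(arr.shape).encode("utf-8"))
    # Flatten to a 1-D uint8 view of the same memory before hashing.
    hasher.update(data.reshape(-1).view(np.uint8))


def dataset_digest(arrays):
    """Return a SHA-1 hex digest for a dict mapping names to arrays."""
    hasher = hashlib.sha1()
    # Sort by name so the digest does not depend on insertion order.
    for name in sorted(arrays):
        hasher.update(name.encode("utf-8"))
        update_with_array(hasher, arrays[name])
    return hasher.hexdigest()


features = np.arange(12, dtype=np.float64).reshape(3, 4)
labels = np.array([0, 1, 1], dtype=np.int64)
digest = dataset_digest({"features": features, "labels": labels})
print(digest)
```

The digest depends on the in-memory byte representation, so it is suitable as a cache key on one machine but may differ across platforms with a different byte order.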


