Get the same hash value for a Pandas DataFrame each time

Question

My goal is to get a unique hash value for a DataFrame that I load from a .csv file. The whole point is to get the same hash every time I call hash() on it.

My idea was to write this function

def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False
    hash_ = hash(arr_hashable.data)
    return hash_

which takes the underlying NumPy array, marks it read-only, and hashes its buffer.

INLINE UPDATE.

As of 08.11.2016, this version of the function no longer works. Instead, you should use

hash(df.values.tobytes())

See the comments on "Most efficient property to hash for numpy array".

END OF INLINE UPDATE.
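
One caveat for the stated goal of a hash that survives application restarts: Python's built-in hash() is salted per process for bytes and str objects unless PYTHONHASHSEED is fixed, so hash(df.values.tobytes()) is only stable within one interpreter session. The following is a minimal sketch, not the asker's method, that swaps in hashlib. It assumes an all-numeric DataFrame (for object columns the raw bytes are pointers rather than values) and ignores index and column labels. The helper name numeric_frame_digest is hypothetical.

```python
import hashlib

import pandas as pd


def numeric_frame_digest(df: pd.DataFrame) -> str:
    """Return a SHA-256 hex digest of the values of an all-numeric DataFrame."""
    values = df.values
    if values.dtype == object:
        # Assumption of this sketch: raw bytes of object arrays are not meaningful.
        raise TypeError("object-dtype data cannot be hashed from its raw bytes")
    digest = hashlib.sha256()
    # Include shape and dtype so that reshaped or re-typed data hashes differently.
    digest.update(repr(values.shape).encode("utf-8"))
    digest.update(str(values.dtype).encode("utf-8"))
    # tobytes() copies the data out in C order, independent of memory layout.
    digest.update(values.tobytes())
    return digest.hexdigest()


data = pd.DataFrame({'A': [0], 'B': [1]})
print(numeric_frame_digest(data))
```

Because SHA-256 has no per-process salt, the printed digest should be the same on every launch as long as the values, shape and dtype are unchanged.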

The original function works for a regular pandas DataFrame:

In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})

In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165

In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165 

But then I try to apply it to a DataFrame obtained from a .csv file:

In [15]: fpath = 'foo/bar.csv'

In [16]: data_from_file = pd.read_csv(fpath)

In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085

In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730

Can somebody explain to me how that is possible?

I can create a new DataFrame out of it, like

new_data = pd.DataFrame(data=data_from_file.values, 
            columns=data_from_file.columns, 
            index=data_from_file.index)

and it works again:

In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241

In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241

But my goal is to preserve the same hash value for a DataFrame across application launches, so that I can retrieve some value from a cache.

Recommended answer

As of Pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object, which was recently made public in pandas.util. It returns one hash value for each row of the DataFrame (and works on Series etc. too).

import pandas as pd
import numpy as np

np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)

print(df)
#      0    1   2    3
# 0   42  foo  42   42
# 1  foo  foo  42  bar
# 2   42   42  42   42

from pandas.util import hash_pandas_object
h = hash_pandas_object(df)

print(h)
# 0     5559921529589760079
# 1    16825627446701693880
# 2     7171023939017372657
# dtype: uint64

You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
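
Note that a sum does not depend on the order of its terms, so two frames whose rows hash to the same values in a different order get the same total. If row order matters for the cache key, one possible variation (a sketch under that assumption, not part of the original answer) is to feed the per-row hashes to hashlib. The helper name frame_digest is hypothetical.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def frame_digest(df: pd.DataFrame) -> str:
    """Return a SHA-256 hex digest built from the per-row hashes of df."""
    # One uint64 per row; index=True folds the index labels into each row hash.
    row_hashes = hash_pandas_object(df, index=True)
    # Hash the raw bytes of the uint64 array; unlike sum(), this depends on row order.
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()


df = pd.DataFrame({'A': [0, 1], 'B': ['x', 'y']})
print(frame_digest(df))
```

The resulting hex string can be used directly as a cache key, for example as a file name or dictionary key.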
