Shortest hash in python to name cache files


Question

What is the shortest hash (in filename-usable form, like a hexdigest) available in python? My application wants to save cache files for some objects. The objects must have unique repr() so they are used to 'seed' the filename. I want to produce a possibly unique filename for each object (not that many). They should not collide, but if they do my app will simply lack cache for that object (and will have to reindex that object's data, a minor cost for the application).

So, if there is one collision we lose one cache file, but the collected savings from caching all objects make application startup much faster, so it does not matter much.

Right now I'm actually using abs(hash(repr(obj))); that's right, the string hash! Haven't found any collisions yet, but I would like to have a better hash function. hashlib.md5 is available in the python library, but the hexdigest is really long if put in a filename. Alternatives, with reasonable collision resistance?

Edit: Use case is like this: The data loader gets a new instance of a data-carrying object. Unique types have unique repr, so if a cache file for hash(repr(obj)) exists, I unpickle that cache file and replace obj with the unpickled object. If there was a collision and the cache was a false match, I notice. So if we don't have a cache, or have a false match, I instead init obj (reloading its data).
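A minimal sketch of that workflow (my own illustration; the helper names, the cache layout, and the false-match check are hypothetical, not from the question):

import os
import pickle

def load_or_init(obj, cache_dir):
    path = os.path.join(cache_dir, str(abs(hash(repr(obj)))))
    if os.path.exists(path):
        with open(path, "rb") as f:
            cached = pickle.load(f)
        if type(cached) is type(obj):       # stand-in for the app's false-match check
            return cached                   # cache hit: replace obj with the unpickled object
    obj.init_data()                         # no cache, or a collision/false match: reload the data
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj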

Conclusions (?)

The str hash in python may be good enough; I was only worried about its collision resistance. But if I can hash 2**16 objects with it, it's going to be more than good enough.

I found out how to take a hex hash (from any hash source) and store it compactly with base64:

import base64

# 'h' is a string of hex digits
bytes = "".join(chr(int(h[i:i+2], 16)) for i in xrange(0, len(h), 2))  # pair up hex digits -> raw bytes (Python 2)
hashstr = base64.urlsafe_b64encode(bytes).rstrip("=")                  # drop the '=' padding
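In Python 3 the same conversion is shorter, since bytes.fromhex replaces the manual loop (a sketch; the md5 input is just an example):

import base64
import hashlib

h = hashlib.md5(b"some repr string").hexdigest()     # 'h' is a string of hex digits
raw = bytes.fromhex(h)                               # hex digits -> raw bytes
hashstr = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
print(hashstr)                                       # 22 characters instead of 32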

Solution

The birthday paradox applies: given a good hash function, the expected number of hashes before a collision occurs is about sqrt(N), where N is the number of different values that the hash function can take. (The Wikipedia entry on the birthday problem gives the exact formula.) So, for example, if you want to use no more than 32 bits, your collision worries are serious for around 64K objects (i.e., 2**16 objects -- the square root of the 2**32 different values your hash function can take). How many objects do you expect to have, as an order of magnitude?
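As a rough illustration of that bound, the standard birthday-problem approximation can be computed directly (a sketch; the function name is mine, not from the answer):

import math

def collision_probability(num_objects, hash_bits):
    # P(at least one collision) ~= 1 - exp(-k*(k-1) / (2*N)), with N = 2**bits possible values
    n = 2.0 ** hash_bits
    k = num_objects
    return 1.0 - math.exp(-k * (k - 1) / (2.0 * n))

print(collision_probability(2 ** 16, 32))   # about 0.39 -- 64K objects strain a 32-bit hash
print(collision_probability(1000, 32))      # about 0.0001 -- a thousand objects are fine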

Since you mention that a collision is a minor annoyance, I recommend you aim for a hash whose number of possible values is roughly the square of the number of objects you'll have, or a bit less, but not MUCH less than that.

You want to make a filename -- is that on a case-sensitive filesystem, as is typical on Unix, or do you have to cater for case-insensitive systems too? This matters because you're aiming for short filenames, but the number of bits per character you can use to represent your hash as a filename changes dramatically between case-sensitive and case-insensitive systems.

On a case-sensitive system, you can use the standard library's base64 module (I recommend the "urlsafe" version of the encoding, i.e. base64.urlsafe_b64encode, since avoiding the '/' characters that could be present in plain base64 is important in Unix filenames). This gives you 6 usable bits per character, much better than the 4 bits/char of hex.
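For instance, a minimal sketch of turning repr(obj) into a short, Unix-safe filename this way (md5 is just one choice of hash here; the function name is my own):

import base64
import hashlib

def cache_name(obj):
    digest = hashlib.md5(repr(obj).encode("utf-8")).digest()              # 16 raw bytes
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")  # 22 chars instead of 32 hex

print(cache_name(("some", "object", 42)))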

Even on a case-insensitive system, you can still do better than hex -- use base64.b32encode and get 5 bits per character.
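A sketch of the case-insensitive variant (same raw digest, base32 instead of base64):

import base64
import hashlib

digest = hashlib.md5(b"some repr string").digest()            # 16 raw bytes
name = base64.b32encode(digest).rstrip(b"=").decode("ascii")  # 26 characters, A-Z and 2-7 only
print(name.lower())                                           # safe to lowercase on such systems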

These functions take and return strings; use the struct module to turn numbers into strings if your chosen hash function generates numbers.
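For example, the builtin hash() returns an integer, which struct.pack can turn into raw bytes before encoding (a sketch; '>q' assumes the value fits in a signed 64-bit integer, which holds for CPython's builtin hash):

import base64
import struct

h = hash("some repr string")                       # an integer
packed = struct.pack(">q", h)                      # 8 raw bytes, big-endian
print(base64.urlsafe_b64encode(packed).rstrip(b"="))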

If you do have a few tens of thousands of objects I think you'll be fine with builtin hash (32 bits, so 6-7 characters depending on your chosen encoding). For a million objects you'd want 40 bits or so (7 or 8 characters) -- you can fold (xor, don't truncate;-) a sha256 down to a long with a reasonable number of bits, say 128 or so, and use the % operator to cut it further to your desired length before encoding.
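A sketch of that fold-then-cut idea (my own code for the approach described above; the bit counts are just examples):

import hashlib

def short_hash(data, bits=40):
    n = int(hashlib.sha256(data).hexdigest(), 16)    # the digest as a 256-bit integer
    folded = (n >> 128) ^ (n & ((1 << 128) - 1))     # xor the two 128-bit halves, don't truncate
    return folded % (1 << bits)                      # cut further to the desired length

print(short_hash(b"some repr string"))               # an int below 2**40, ready for encoding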
