Uniquely identifying URLs with one 64-bit number


Question


This is basically a math problem, but very programming related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what kind of collision frequency should I expect?

How does the answer change if I only have 100 million URLs?

It seems to me that collisions will be extremely rare, but these things tend to be confusing.

Would I be better off using something other than MD5? Mind you, I'm not looking for security, just a good fast hash function. Also, native support in MySQL is nice.

EDIT: not quite a duplicate

Solution

If the first 64 bits of the MD5 constituted a hash with ideal distribution, the birthday paradox would still mean you'd expect collisions once you have on the order of 2^32 (4,294,967,296) URLs. More precisely, the probability of at least one collision among n URLs is approximately 1 - exp(-n^2 / 2^65): about 2.7% for 1 billion URLs and about 0.027% for 100 million. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for details.
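
To put numbers on the two cases in the question, here is a minimal sketch of the standard birthday approximation. It assumes the 64-bit values are uniformly distributed; the function name collision_probability is just for illustration.

```python
import math

def collision_probability(n: int, bits: int = 64) -> float:
    """Approximate P(at least one collision) among n uniformly random hashes."""
    space = 2.0 ** bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-n * (n - 1) / (2.0 * space))

if __name__ == "__main__":
    for n in (100_000_000, 1_000_000_000, 2**32):
        print(f"{n:>13,} URLs: p ~= {collision_probability(n):.6f}")
```

Under that assumption, 1 billion URLs give roughly a 2.7% chance of at least one collision, 100 million give roughly 0.027%, and the chance is around 39% at 2^32 URLs.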

I wouldn't feel comfortable just throwing away half the bits in MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I wouldn't bother with it at all. If you want blinding speed with good distribution, but no pretence of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details and code.
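
As a rough illustration of the XOR idea, the sketch below folds the 128-bit MD5 digest into 64 bits using Python's standard hashlib. The function names md5_fold64 and md5_first64, the UTF-8 encoding, and the big-endian byte order are assumptions made for the example, not part of the original answer.

```python
import hashlib

def md5_first64(url: str) -> int:
    """First 64 bits of the MD5 digest, as in the question."""
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 16 bytes
    return int.from_bytes(digest[:8], "big")

def md5_fold64(url: str) -> int:
    """XOR the high and low 64-bit halves of the MD5 digest."""
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 16 bytes
    high = int.from_bytes(digest[:8], "big")
    low = int.from_bytes(digest[8:], "big")
    return high ^ low

if __name__ == "__main__":
    url = "http://en.wikipedia.org/wiki/MurmurHash"
    print(f"first 64 bits: {md5_first64(url):016x}")
    print(f"XOR-folded:    {md5_fold64(url):016x}")
```

Either value fits in an unsigned 64-bit integer column. If MurmurHash is used instead, as the answer suggests, the same storage applies, but that needs a third-party implementation.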
