64 位哈希码冲突的概率 [英] Probability of 64bit Hash Code Collisions

查看:70
本文介绍了64 位哈希码冲突的概率的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Numerical Recipes 一书提供了一种计算 64 位哈希码以减少冲突次数的方法.

The book Numerical Recipes offers a method to calculate 64bit hash codes in order to reduce the number of collisions.

该算法显示在 http://www.javamex.com/tutorials/collections/strong_hash_code_implementation_2.shtml 复制到这里供参考:

The algorithm is shown at http://www.javamex.com/tutorials/collections/strong_hash_code_implementation_2.shtml and is copied here for reference:

private static final createLookupTable() {
  byteTable = new long[256];
  long h = 0x544B2FBACAAF1684L;
  for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 31; j++) {
      h = (h >>> 7) ^ h;
      h = (h << 11) ^ h;
      h = (h >>> 10) ^ h;
    }
    byteTable[i] = h;
  }
  return byteTable;
}

public static long hash(CharSequence cs) {
  long h = HSTART;
  final long hmult = HMULT;
  final long[] ht = byteTable;
  final int len = cs.length();
  for (int i = 0; i < len; i++) {
    char ch = cs.charAt(i);
    h = (h * hmult) ^ ht[ch & 0xff];
    h = (h * hmult) ^ ht[(ch >>> 8) & 0xff];
  }
  return h;
}

我的问题:

1) 有没有考虑到所谓的生日悖论来估计碰撞概率的公式?

1) Is there a formula to estimate the probability of collisions taking into account the so-called Birthday Paradox?

2) 你能估计碰撞的概率(即散列到相同值的两个键)吗?假设有 1,000 个密钥和 10,000 个密钥?

2) Can you estimate the probability of a collision (i.e two keys that hash to the same value)? Let's say with 1,000 keys and with 10,000 keys?

编辑:改写/更正问题 3

3) 假设合理数量的键(例如,少于 10,000 个键)发生冲突是不可能的,因此如果 2 个哈希码相同,我们可以说键是相同的,而没有任何键,这是否安全?进一步检查?例如

3) Is it safe to assume that a collision of a reasonable number of keys (say, less than 10,000 keys) is so improbable so that if 2 hash codes are the same we can say that the keys are the same without any further checking? e.g.

static boolean equals(key1, key2) {

  if (key1.hash64() == key2.hash64())
    return true;  // probability of collision so low we don't need further check

  return false;
}

这不是为了安全,但执行速度是必要的,因此避免进一步检查密钥将节省时间.如果概率如此之低,比如小于(100,000 个密钥的十亿分之一),它可能是可以接受的.

This is not for security, but execution speed is imperative so avoiding further checks of the keys will save time. If the probability is so low, say less than (1 in 1 billion for 100,000 keys) it will probably be acceptable.

TIA!

推荐答案

考虑到所谓的生日悖论,是否有估算碰撞概率的公式?

Is there a formula to estimate the probability of collisions taking into account the so-called Birthday Paradox?

使用生日悖论公式只是告诉您什么时候需要开始担心发生碰撞.这大约在 Sqrt[n] 处,其中 n 是可能的哈希值的总数.在这种情况下 n = 2^64 所以生日悖论公式告诉你只要键的数量明显小于 Sqrt[n] = Sqrt[2^64] = 2^32 或大约 40 亿,您无需担心冲突.n 越高,这种估计就越准确.事实上,当 n 变大时,k 键发生碰撞的概率 p(k) 接近阶梯函数在 k=Sqrt[n].

Using the Birthday Paradox formula simply tells you at what point you need to start worrying about a collision happening. This is at around Sqrt[n] where n is the total number of possible hash values. In this case n = 2^64 so the Birthday Paradox formula tells you that as long as the number of keys is significantly less than Sqrt[n] = Sqrt[2^64] = 2^32 or approximately 4 billion, you don't need to worry about collisions. The higher the n, the more accurate this estimation. In fact the probability p(k) that a collision will occur with k keys approaches a step function as n gets larger, where the step occurs at k=Sqrt[n].

您能估计碰撞的概率(即散列为相同值的两个键)吗?假设有 1,000 个密钥和 10,000 个密钥?

Can you estimate the probability of a collision (i.e two keys that hash to the same value)? Let's say with 1,000 keys and with 10,000 keys?

假设散列函数是均匀分布的,则可以直接推导出公式.

Assuming the hash function is uniformly distributed it's straightforward to derive the formula.

p(no collision for k keys) = 1 * (n-1)/n * (n-2)/n * (n-3)/n * ... * (n-(k-1))/n

那个公式直接从1个key开始:1个key不冲突的概率当然是1,2个key不冲突的概率是1 * (n-1)/n.对于所有 k 键,依此类推.方便的是,Mathematica 有一个 Pochhammer[] 函数用于此目的来简洁地表达这一点:

That formula directly follows from starting with 1 key: The probability of no collision with 1 key is of course 1. The probability of no collision with 2 keys is 1 * (n-1)/n. And so on for all k keys. Conveniently, Mathematica has a Pochhammer[] function for this purpose to express this succinctly:

p(no collision for k keys) = Pochhammer[n-(k-1),k]/n^k

然后,要计算 k 个键至少发生 1 次碰撞的概率,将其从 1 中减去:

Then, to calculate the probability that there is at least 1 collision for k keys, subtract it from 1:

p(k) = 1 - p(no collision for k keys) = 1 - Pochhammer[n-(k-1),k]/n^k

使用 Mathematica,可以计算 n=2^64:

Using Mathematica, one can calculate for n=2^64:

假设合理数量的键(例如,少于 10,000 个键)发生冲突是不可能的,因此如果 2 个哈希码相同,我们可以说键相同而无需进一步检查,这是否安全??

Is it safe to assume that a collision of a reasonable number of keys (say, less than 10,000 keys) is so improbable so that if 2 hash codes are the same we can say that the keys are the same without any further checking?

要准确回答这个问题,取决于 10,000 个密钥中有 2 个相同的概率.我们正在寻找的是:

To answer this precisely depends upon the probability that 2 of the 10,000 keys were identical. What we are looking for is:

p(a=b|h(a)=h(b)) = The probability that a=b given h(a)=h(b)

其中 ab 是键(可能相同),h() 是散列函数.我们可以直接应用贝叶斯定理:

where a and b are keys (possibly identical) and h() is the hashing function. We can apply Bayes' Theorem directly:

p(a=b|h(a)=h(b)) = p(h(a)=h(b)|a=b) * p(a=b) / p(h(a)=h(b))

我们立即看到 p(h(a)=h(b)|a=b) = 1 (如果 a=b 那么当然 h(a)=h(b)) 所以我们得到

We immediately see that p(h(a)=h(b)|a=b) = 1 (if a=b then of course h(a)=h(b)) so we get

p(a=b|h(a)=h(b)) = p(a=b) / p(h(a)=h(b))

正如你所看到的,这取决于 p(a=b) 这是 ab 实际上是同一个键的概率.这取决于最初是如何选择 10,000 个键组的.前两个问题的计算假设所有键都是不同的,因此需要有关此场景的更多信息才能完整回答.

As you can see this depends upon p(a=b) which is the probability that a and b are actually the same key. This depends upon how the group of 10,000 keys were selected in the first place. The calculations for the previous two questions assume all keys are distinct, so more information on this scenario is needed to fully answer it.

这篇关于64 位哈希码冲突的概率的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆