Many iterations on a hash: doesn't it reduce entropy?


Problem Description


I see this technique recommended in many places (including Stack Overflow), and I can't get it out of my head that this would reduce entropy! After all, you are hashing something again that has already been hashed and already has a collision chance. Wouldn't collision chance on top of collision chance result in even more collision chances? After researching, it seems I'm wrong, but why?
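For reference, the technique being asked about is just feeding a digest back into the hash over and over. Here is a minimal sketch of that idea, assuming plain repeated MD5 (the round count and input are arbitrary examples):

    import hashlib

    def iterated_md5(data: bytes, rounds: int) -> str:
        """Hash repeatedly: each round's hex digest becomes the next round's input."""
        digest = hashlib.md5(data).hexdigest()
        for _ in range(rounds - 1):
            digest = hashlib.md5(digest.encode()).hexdigest()
        return digest

    print(iterated_md5(b"some password", 1000))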

Solution

Since you tagged md5, I'll use that as an example. From Wikipedia:

if two prefixes with the same hash can be constructed, a common suffix can be added to both to make the collision more likely to be accepted as valid data by the application using it. Furthermore, current collision-finding techniques allow to specify an arbitrary prefix: an attacker can create two colliding files that both begin with the same content. All the attacker needs to generate two colliding files is a template file with a 128-byte block of data, aligned on a 64-byte boundary that can be changed freely by the collision-finding algorithm. An example MD5 collision, with the two messages differing in 6 bits, is:

And then the example plaintexts they give are 256 bytes long. Since the collision attack relies on a 128 byte block of data, and the hash digest is only 128 bits, there really isn't an increased risk of a collision attack succeeding beyond the first iteration - that is to say that you can't really influence the likelihood of a collision beyond the first hash.

Also consider that the entropy of the hash is the aforementioned 128 bits. Even taking the collision chance to be as high as 1 in 2^20.96 (again from Wikipedia), it would take a great number of iterations to cause two inputs to collide. The first-glance reasoning that I think you're falling victim to is:

  • Any two arbitrary inputs have a chance of collision x%.
  • The outputs of the first hash are two such inputs themselves.
  • Therefore, every iteration increases the chance of collision by x%.

This can be disproven by counterexample fairly easily. Consider again MD5:

  • The chance for collision of two inputs is 1:2^21 (taking the worst-case scenario from Wikipedia's cryptanalysis of MD5)
  • Hashing again causes an equally likely chance of collision to compound, therefore the chance of collision at the second round is 1:2^20
  • Therefore, any two inputs hashed a number of times equal to the entropy of the digest are guaranteed to collide.

MD5 any two inputs 128 times in a row and you will see that this is not true. You probably won't find a single repeated hash between them - after all, you've only created 256 of the 2^128 possible hash values, leaving essentially the entire space untouched. The probability of a collision at each round is independent of all other rounds.
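You can check that claim directly. A minimal sketch of the experiment, with two arbitrary seed strings (only the 128-round count comes from the argument above):

    import hashlib

    def md5_chain(seed: bytes, rounds: int) -> list:
        """Return the digests produced by repeatedly MD5-ing the previous digest."""
        digests, current = [], seed
        for _ in range(rounds):
            current = hashlib.md5(current).digest()
            digests.append(current)
        return digests

    chain_a = md5_chain(b"input one", 128)
    chain_b = md5_chain(b"input two", 128)

    # Same-round matches: the compounding argument predicts a guaranteed hit by round 128.
    print(sum(x == y for x, y in zip(chain_a, chain_b)))   # 0 in practice
    # Any repeat at all among the 256 digests produced?
    print(len(set(chain_a + chain_b)))                     # 256 in practice - no repeats

With 256 samples drawn from a space of 2^128 values, the birthday bound puts the chance of even a single repeat at roughly 2^-113.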

There are two approaches to understand why this is so. The first is that each iteration is essentially trying to hit a moving target. I think you could construct a proof based on the birthday paradox that there is a surprisingly low number of iterations of hashing where you will likely see one hash digest from one input match the hash digest of a different input. But they would almost certainly have occurred at different steps of the iteration. And once that occurs, they can never have the same output on the same iteration because the hash algorithm itself is deterministic.
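To actually watch both effects, you can shrink the digest until birthday collisions become visible within a few thousand rounds. The 16-bit truncation of MD5 below is purely a toy device for illustration, not something the argument above depends on:

    import hashlib

    def tiny_hash(data: bytes) -> bytes:
        """Toy 16-bit 'digest': the first two bytes of MD5, so collisions show up quickly."""
        return hashlib.md5(data).digest()[:2]

    def tiny_chain(seed: bytes, rounds: int) -> list:
        out, cur = [], seed
        for _ in range(rounds):
            cur = tiny_hash(cur)
            out.append(cur)
        return out

    a = tiny_chain(b"first input", 2000)
    b = tiny_chain(b"second input", 2000)

    # Cross-step matches (same value at *different* iterations) appear readily - the birthday effect.
    print(len(set(a) & set(b)) > 0)     # almost certainly True with a 16-bit digest

    # If the chains ever agree at the *same* step, determinism locks them together from then on.
    merge = next((i for i, (x, y) in enumerate(zip(a, b)) if x == y), None)
    if merge is not None:
        print(all(x == y for x, y in zip(a[merge:], b[merge:])))   # True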

The other approach is to realize that the hash function actually adds entropy while it runs. Consider that an empty string has a 128-bit digest just like any other input; that cannot occur without entropy being added during the algorithm steps. This is actually a necessary part of a cryptographic hash function: data must be destroyed, or else the input could be recovered from the digest. For inputs longer than the digest, yes, entropy is lost on the whole; it has to be, in order to fit into the digest length. But some entropy is also added.
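A small illustration of both halves of that point (the long input is arbitrary):

    import hashlib

    # The empty string still yields a full 128-bit digest.
    print(hashlib.md5(b"").hexdigest())           # d41d8cd98f00b204e9800998ecf8427e
    # A 10,000-byte input is squeezed into the same 128 bits, so entropy is lost overall.
    print(hashlib.md5(b"x" * 10_000).hexdigest())
    print(hashlib.md5(b"").digest_size * 8)       # 128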

I don't have numbers this exact for other hash algorithms, but I think all the points I've made generalize to other hash functions and one-way / mapping functions.
