How do I assess the hash collision probability?

Question

I'm developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names, then passes the temporary files' names to my application. My application must process each file within a limited period of time; otherwise it is shut down, which is a watchdog-like security measure. Processing a file is likely to take a long time, so I need to design the application to handle this scenario. If my application gets shut down, then the next time the search system wants to index the same file it will likely give it a different temporary name.

The obvious solution is to provide an intermediate layer between the search system and the backend. It will queue each request to the backend and wait for the result to arrive. If the request times out in the intermediate layer, that is not a problem: the backend continues working, only the intermediate layer is restarted, and it can retrieve the result from the backend when the search system later repeats the request.

The problem is how to identify the files, since their names change randomly. I intend to use a hash function like MD5 to hash the file contents. I'm well aware of the birthday paradox and used an estimate from the linked article to compute the probability. If I assume I have no more than 100,000 files, the probability of two files having the same MD5 (128-bit) hash is about 1.47 × 10^-29.
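The figure above can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 · 2^b)). The following is a minimal sketch, assuming hash values are uniformly distributed; the function name `birthday_collision_probability` is made up for illustration and is not taken from the linked article.

```python
import math


def birthday_collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items hashes collide.

    Assumes each hash is an independent, uniformly random hash_bits-bit value.
    """
    pairs = num_items * (num_items - 1) / 2
    # p = 1 - exp(-pairs / 2^bits); expm1 keeps precision when p is tiny.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


p = birthday_collision_probability(100_000, 128)
print(f"{p:.3g}")  # about 1.47e-29 for 100,000 files and a 128-bit hash
```

Under these assumptions the result agrees with the estimate quoted in the question.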

Should I care about such a collision probability, or just assume that equal hash values mean equal file contents?

Recommended answer

Equal hashes mean equal files, unless someone malicious is tampering with your files and injecting collisions (which could be the case if the files are being downloaded from the internet). If that is a concern, use a SHA-2 based function.
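As an illustration of the SHA-2 option, here is a minimal sketch that hashes a file's contents with SHA-256 from Python's standard `hashlib` module. The helper name `sha256_of_file` and the 1 MiB read size are assumptions chosen for the example; reading in chunks simply avoids loading a large file into memory at once.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on the contents, the same file copied under a different temporary name yields the same value.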

Accidental MD5 collisions are not something you will see in practice: 1.47 × 10^-29 is an extremely small number.

To avoid rehashing big files, I would use a three-phase identity scheme:

  1. File size only
  2. File size + a hash of four 64K chunks taken from different positions in the file
  3. Full hash

So if you see a file with a new size, you know for certain that you do not have a duplicate. And so on: each later phase is needed only when the previous one matches.
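The three phases could be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's implementation: the class `SeenFiles`, its methods, the evenly spaced sample offsets, and the choice of SHA-256 are all made up for the example. It also assumes every hash is computed when a file is registered, since the temporary copy may no longer exist when a later candidate arrives.

```python
import hashlib
import os

SAMPLE_SIZE = 64 * 1024  # 64K per sample, as in phase 2
SAMPLE_COUNT = 4


def sampled_hash(path, size):
    """Hash SAMPLE_COUNT chunks of SAMPLE_SIZE bytes spread across the file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for i in range(SAMPLE_COUNT):
            # Evenly spaced offsets (an assumption); small files start at 0.
            offset = max(0, (size - SAMPLE_SIZE) * i // (SAMPLE_COUNT - 1))
            f.seek(offset)
            digest.update(f.read(SAMPLE_SIZE))
    return digest.hexdigest()


def full_hash(path):
    """Hash the entire file contents in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


class SeenFiles:
    """Hypothetical in-memory index of already processed files."""

    def __init__(self):
        self._sizes = set()    # phase 1: known file sizes
        self._partial = set()  # phase 2: (size, sampled hash) pairs
        self._full = {}        # phase 3: full hash -> stored result

    def lookup(self, path):
        """Return the stored result for identical content, or None."""
        size = os.path.getsize(path)
        if size not in self._sizes:
            return None  # new size: certainly not a duplicate
        if (size, sampled_hash(path, size)) not in self._partial:
            return None  # same size, but the sampled chunks differ
        return self._full.get(full_hash(path))

    def register(self, path, result):
        """Record all three identities while the file is still available."""
        size = os.path.getsize(path)
        self._sizes.add(size)
        self._partial.add((size, sampled_hash(path, size)))
        self._full[full_hash(path)] = result
```

With this layout, `lookup` reads the whole file only when both the size and the sampled hash already match a registered file, so most new files are rejected after a size check or a few short reads.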
