Commutative, accumulator-based function for calculating a digest of multiple hashes


Problem description



I'm writing something that summarizes the files in a file system by hashing a sample of their contents. It constructs a tree of directories and files. Each file entry has the hash of the file contents. For each directory entry, I want to store a hash of the contents of all files in the directory, including those in sub-directories - I'll call this the directory content hash.

The tricky thing about the directory content hash is that I want it to be independent of the structure of the directory, i.e. the hash should be the same if two directories contain the same files but organized with a different sub-directory structure.

The only two methods I can think of are:

  1. Calculate the MD5 of the concatenation of all file content hashes. In order to get the desired hash properties, I would have to list all of the files in the directory, sort them by their hash, concatenate the sorted hashes, and then run MD5 on the concatenation. This seems slower than I would like. I can do the sorting pretty efficiently by using merge sort while calculating directory content hashes throughout a tree, but I can't get around calculating a lot of MD5 hashes on large inputs.

  2. Combine file content hashes using XOR. Each directory would only need to XOR the file content hashes and directory content hashes of its immediate children. This is very fast and simple, but not very collision resistant. It can't even tell the difference between a directory which contains one instance of a file and a directory which contains three instances of the same file. (Both approaches are sketched in code after this list.)
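For concreteness, here is a minimal sketch of both methods in Python. It assumes each file content hash is a raw 16-byte MD5 digest; the function names are made up for illustration and are not part of the original question:

```python
import hashlib

def directory_hash_sorted_md5(file_hashes):
    """Method #1: sort the per-file digests, concatenate them, and MD5 the result.

    Sorting makes the digest independent of directory structure, at the cost
    of an MD5 pass over the full concatenation of digests for each directory.
    """
    return hashlib.md5(b"".join(sorted(file_hashes))).digest()

def directory_hash_xor(file_hashes, digest_size=16):
    """Method #2: XOR the per-file digests together.

    Fast and order-independent, but weak: a digest XORed with itself cancels
    out, so duplicate files contribute nothing and collisions are easy.
    """
    acc = bytes(digest_size)
    for h in file_hashes:
        acc = bytes(a ^ b for a, b in zip(acc, h))
    return acc
```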

It would be nice if there were a function that could be used the way XOR is used in method #2, but that is more collision resistant. I think method #1 would be fast enough for this specific case, but in the interest of exploring-all-the-options/intellectual-curiosity/future-applications, I'd like to know whether there's a function that satisfies the description in the title (I have a vague memory of wanting a function like that several times in the past).

Thanks.

Solution

Order-independent hashing of collections of hashes is essentially what you're looking for, non?

It sounds like any order-independent operation (like addition or multiplication) will do the trick for you. Addition has the benefit of overflowing in a nice way. I don't recall whether multiplication will work as well.

In short: add all of your values, ignoring the overflow, and you should get something useful. Any other similar function should do the trick if addition isn't sufficiently collision resistant.
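As a rough illustration of that suggestion (not code from the answer), the sketch below treats each digest as an unsigned integer and adds them modulo 2**128, i.e. "ignoring the overflow" for 16-byte hashes; the names and the choice of MD5-sized digests are assumptions:

```python
import hashlib

DIGEST_BITS = 128           # assuming 16-byte (MD5-sized) file content hashes
MODULUS = 1 << DIGEST_BITS  # "ignore the overflow" == reduce modulo 2**128

def combine(digests):
    """Order-independent combiner: sum the digests as integers, wrapping on overflow."""
    acc = 0
    for d in digests:
        acc = (acc + int.from_bytes(d, "big")) % MODULUS
    return acc.to_bytes(DIGEST_BITS // 8, "big")

# The result is the same regardless of order, and a sub-directory's
# accumulator can be folded into its parent's incrementally:
a = hashlib.md5(b"file one").digest()
b = hashlib.md5(b"file two").digest()
c = hashlib.md5(b"file three").digest()
assert combine([a, b, c]) == combine([c, a, b])
assert combine([a, combine([b, c])]) == combine([combine([a, b]), c])
```

Unlike XOR, a file that appears twice does change the result here, though any fixed-width accumulator of this kind is still far weaker against deliberate collisions than hashing the sorted concatenation.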

