Calculating a hash code for a large file in parallel


Problem Description


      I would like to improve the performance of hashing large files, say for example in the tens of gigabytes in size.

      Normally, you sequentially hash the bytes of the file using a hash function (say, for example, SHA-256, although I will most likely use Skein, so hashing will be slower when compared to the time it takes to read the file from a [fast] SSD). Let's call this Method 1.
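
      For concreteness, Method 1 amounts to a chunked read feeding a single hash object. A minimal Python sketch, assuming hashlib's SHA-256 stands in for Skein (which is not in the standard library) and an illustrative 1 MB read size:

import hashlib

def hash_file_sequential(path, chunk_size=1024 * 1024):
    # Method 1: a single hash object consumes the file front to back.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()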

      The idea is to hash multiple 1 MB blocks of the file in parallel on 8 CPUs and then hash the concatenated hashes into a single final hash. Let's call this Method 2.

      A picture depicting this method follows:

      [Diagram: the file is split into 1 MB blocks, each block is hashed in parallel, and the concatenated block hashes are hashed once more into a single final hash.]



      I would like to know if this idea is sound and how much "security" is lost (in terms of collisions being more probable) vs doing a single hash over the span of the entire file.

      For example:

      Let's use the SHA-256 variant of SHA-2 and set the file size to 2^35 = 34,359,738,368 bytes. Therefore, using a simple single pass (Method 1), I would get a 256-bit hash for the entire file.

      Compare this with:

      Using the parallel hashing (i.e., Method 2), I would break the file into 32,768 blocks of 1 MB, hash those blocks using SHA-256 into 32,768 hashes of 256 bits (32 bytes), concatenate the hashes, and do a final hash of the resultant concatenated 1,048,576-byte data set to get my final 256-bit hash for the entire file.
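
      A minimal sketch of Method 2 under the same assumptions (hashlib's SHA-256 in place of Skein, a process pool of 8 workers, and illustrative helper names such as hash_block and hash_file_parallel):

import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

BLOCK_SIZE = 1024 * 1024  # 1 MB = 2^20 bytes

def hash_block(args):
    # Worker: hash one block, identified by its byte offset in the file.
    path, offset = args
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha256(f.read(BLOCK_SIZE)).digest()

def hash_file_parallel(path, workers=8):
    # For a 2^35-byte (34,359,738,368-byte) file this yields 2^35 / 2^20 = 32,768
    # blocks, whose 32-byte digests concatenate to 32,768 * 32 = 1,048,576 bytes.
    size = os.path.getsize(path)
    tasks = [(path, offset) for offset in range(0, size, BLOCK_SIZE)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        block_hashes = list(pool.map(hash_block, tasks, chunksize=64))
    # Final pass: hash the concatenated block digests into one 256-bit hash.
    return hashlib.sha256(b"".join(block_hashes)).hexdigest()

      On platforms that spawn worker processes (Windows, and macOS by default on recent Python versions), the call to hash_file_parallel needs to sit under an if __name__ == "__main__": guard so that the module can be re-imported by the workers.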

      Is Method 2 any less secure than Method 1, in terms of collisions being more possible and/or probable? Perhaps I should rephrase this question as: Does Method 2 make it easier for an attacker to create a file that hashes to the same hash value as the original file, except of course for the trivial fact that a brute-force attack would be cheaper since the hash can be calculated in parallel on N CPUs?

      Update: I have just discovered that my construction in Method 2 is very similar to the notion of a hash list. However, the Wikipedia article referenced by the link in the preceding sentence does not go into detail about a hash list's superiority or inferiority with regard to the chance of collisions as compared to Method 1, a plain old hashing of the file, when only the top hash of the hash list is used.

      Solution

      Block-based hashing (your Method 2) is a well-known technique that is used in practice:

      • Hash trees, Merkle trees, Tiger tree hashes
      • eDonkey2000 file hash (a single-level tree with roughly 9 MiB block size)

      Just like what you're doing, these methods take the list of block hashes and hash it again, down to a single short hash. Since this is a well-established practice, I would assume that it is as secure as sequential hashing.
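
      For the hash-tree (Merkle-tree) entries in the list above, the block hashes are combined pairwise level by level rather than concatenated all at once. A minimal sketch, assuming SHA-256 and the convention of duplicating the last node on odd-sized levels (conventions differ between implementations):

import hashlib

def merkle_root(block_hashes):
    # Combine block digests pairwise, level by level, into a single root digest.
    level = list(block_hashes)
    if not level:
        return hashlib.sha256(b"").digest()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

      The eDonkey2000 hash in the list is the flat, single-level variant of the same idea: essentially Method 2 with roughly 9 MiB blocks.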
