Parallel hash computing via multiple TransformBlocks results in a disarray

Problem description

I'm trying to compute hashes for a whole directory, in order to monitor changes later. It's relatively easy. However, if there are big files, the computing takes too much time, so I wound up using some multithreading.

Because of I/O bottlenecks, I read each file with a single thread, but I can compute the hash for that file on multiple threads by calling TransformBlock concurrently. The problem is that the result differs from run to run: all the threads update a single HashAlgorithm instance, and they do so in an unpredictable order.

  public delegate void CalculateHashDelegate(byte[] buffer);
  private MD5 md5;
  private long completed_threads_hash;
  private object lock_for_hash = new object();
  // missing in the original listing, but used below:
  private object completed_threads_lock = new object();

  private string getMd5Hash(string file_path)
  {
        string file_to_be_hashed = file_path;
        byte[] hash;

        try
        {
            CalculateHashDelegate CalculateHash = AsyncCalculateHash;
            md5 = MD5.Create();

            using (Stream input = File.OpenRead(file_to_be_hashed))
            {
                int buffer_size = 4096; // the original listing had 0x4096, i.e. 16,534 bytes
                byte[] buffer = new byte[buffer_size];

                long part_count = 0;
                completed_threads_hash = 0;
                int bytes_read;
                while ((bytes_read = input.Read(buffer, 0, buffer.Length)) == buffer_size)
                {
                    part_count++;
                    IAsyncResult ar_hash = CalculateHash.BeginInvoke(buffer, CalculateHashCallback, CalculateHash);
                }

                // Wait for completing all the threads
                while (true)
                {
                    lock (completed_threads_lock)
                    {
                        if (completed_threads_hash == part_count)
                        {  
                            md5.TransformFinalBlock(buffer, 0, bytes_read);
                            break;
                        }
                    }
                }

                hash = md5.Hash;

            }

            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < hash.Length; i++)
            {
                sb.Append(hash[i].ToString("x2"));
            }
            md5.Clear();
            return sb.ToString();
        }
        catch (Exception ex)
        {
            Console.WriteLine("An exception was encountered during hashing file {0}. {1}.", file_to_be_hashed, ex.Message);
            return ex.Message;
        }
    }

    public void AsyncCalculateHash(byte[] buffer)
    {
        lock (lock_for_hash)
        {
            md5.TransformBlock(buffer, 0, buffer.Length, null, 0);
        }
    }

    private void CalculateHashCallback(IAsyncResult ar_hash)
    {
        try
        {
            CalculateHashDelegate CalculateHash = ar_hash.AsyncState as CalculateHashDelegate;
            CalculateHash.EndInvoke(ar_hash);
        }
        catch (Exception ex)
        {
            Console.WriteLine("Callback exception: {0}", ex.Message);
        }
        finally
        {
            lock (completed_threads_lock)
            {
                completed_threads_hash++;
            }
        }
    }



Is there a way to organize the hashing process? I can't use .NET newer than 3.5, nor such classes as BackgroundWorker and ThreadPool. Or maybe there is another method for calculating hashes in parallel?

Recommended answer

Generally you cannot use cryptographic objects from multiple threads. The problem with hash functions is that they are fully sequential: each block of hashing depends on the current state, and the state is computed from all the previous blocks. So, basically, you cannot do this with MD5.
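A quick way to see why: feeding the same blocks to MD5 in a different order produces a different digest, so unordered concurrent TransformBlock calls cannot reproduce the file's true hash. A minimal sketch (the block contents are made up for illustration):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

class OrderMatters
{
    static string HashInOrder(byte[][] blocks)
    {
        using (MD5 md5 = MD5.Create())
        {
            foreach (byte[] block in blocks)
                md5.TransformBlock(block, 0, block.Length, null, 0);
            // Finalize with an empty block; all data went through TransformBlock.
            md5.TransformFinalBlock(new byte[0], 0, 0);
            return BitConverter.ToString(md5.Hash);
        }
    }

    static void Main()
    {
        byte[] a = Encoding.ASCII.GetBytes("first block ");
        byte[] b = Encoding.ASCII.GetBytes("second block");

        // Same blocks, different order: the two digests differ.
        Console.WriteLine(HashInOrder(new[] { a, b }));
        Console.WriteLine(HashInOrder(new[] { b, a }));
    }
}
```

This is exactly what happens in the question's code: the threads race to call TransformBlock, so the effective block order (and hence the digest) changes on every run.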

There is another approach that can be used, called a hash tree or Merkle tree. Basically you decide on a block size and compute the hash of each block. These hashes are concatenated and hashed again. If you have a very large number of block hashes, you can actually build a full tree, as described in the Wikipedia article on Merkle trees. Of course the resulting hash differs from a plain MD5 of the file and depends on the configuration parameters of the hash tree.
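The two-level variant of this scheme could look roughly as follows (a sketch, not the answerer's code; the method names and block size are illustrative). Because each block hash is independent, the per-block ComputeHash calls are the part that could be handed to worker threads, provided each worker gets its own MD5 instance and its own copy of the buffer:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class BlockListHash
{
    // Hash each fixed-size block independently, then hash the
    // concatenated block hashes to produce one top-level digest.
    static byte[] ComputeTopHash(string path, int blockSize)
    {
        List<byte[]> blockHashes = new List<byte[]>();
        using (Stream input = File.OpenRead(path))
        {
            byte[] buffer = new byte[blockSize];
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                using (MD5 md5 = MD5.Create())   // one instance per block
                    blockHashes.Add(md5.ComputeHash(buffer, 0, read));
            }
        }

        using (MD5 top = MD5.Create())
        {
            foreach (byte[] h in blockHashes)
                top.TransformBlock(h, 0, h.Length, null, 0);
            top.TransformFinalBlock(new byte[0], 0, 0);
            return top.Hash;
        }
    }
}
```

Unlike hashing the raw stream, the order of the top-level input is fixed by block index, so the result is deterministic no matter which thread finished first.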

Note that MD5 has been broken. You should use SHA-256 or SHA-512/xxx (faster on 64-bit processors) instead. Also note that I/O speed is often more of a bottleneck than the hash algorithm itself, which negates any speed advantage of hash trees. If you have many files, you could also parallelize the hashing at the file level.
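Since the question rules out frameworks newer than .NET 3.5 as well as ThreadPool, file-level parallelism can be done with plain Thread objects, one hash instance per thread. A sketch (starting one thread per file is a simplification; a real version would cap the thread count):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Threading;

class PerFileHashing
{
    static void Main(string[] args)
    {
        Dictionary<string, string> results = new Dictionary<string, string>();
        object resultsLock = new object();
        List<Thread> threads = new List<Thread>();

        foreach (string file in args)
        {
            string path = file;                  // capture a fresh variable per iteration
            Thread t = new Thread(delegate()
            {
                using (SHA256 sha = SHA256.Create())       // one instance per thread
                using (Stream s = File.OpenRead(path))
                {
                    byte[] hash = sha.ComputeHash(s);
                    string hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
                    lock (resultsLock)
                        results[path] = hex;
                }
            });
            threads.Add(t);
            t.Start();
        }

        foreach (Thread t in threads)
            t.Join();

        foreach (KeyValuePair<string, string> kv in results)
            Console.WriteLine("{0}  {1}", kv.Value, kv.Key);
    }
}
```

Each thread owns its HashAlgorithm instance outright, so no locking around TransformBlock is needed; the only shared state is the results dictionary.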
