Performance issues while creating file checksums


Problem description

I am writing a console application which iterates through a binary tree and searches for new or changed files based on their MD5 checksums. The whole process is acceptably fast (14 s for ~70,000 files), but generating the checksums takes about 5 minutes, which is far too slow...

Any suggestions for improving this process? My hash function is the following:

private string getMD5(string filename)
{
    using (var md5 = new MD5CryptoServiceProvider())
    {
        if (File.Exists(filename))
        {
            try
            {
                // Reads the entire file into memory before hashing
                var buffer = md5.ComputeHash(File.ReadAllBytes(filename));
                var sb = new StringBuilder();
                for (var i = 0; i < buffer.Length; i++)
                {
                    sb.Append(buffer[i].ToString("x2"));
                }
                return sb.ToString();
            }
            catch (Exception)
            {
                Program.logger.log("Error while creating checksum!", Program.logger.LOG_ERROR);
                return "";
            }
        }
        else
        {
            return "";
        }
    }
}

Recommended answer

Well, the accepted answer is not valid, because there are, of course, ways to improve your code's performance. It is valid for some other thoughts, however.

The main bottleneck here, apart from disk I/O, is memory allocation. Here are some thoughts that should improve speed:

  • Do not read the entire file into memory for the calculation; that is slow, and it produces a lot of memory pressure via LOH (large object heap) objects. Instead, open the file as a stream and calculate the hash in chunks.
  • The reason you see a slowdown when using the ComputeHash stream override is that it internally uses a very small buffer (4 KB), so choose an appropriate buffer size (256 KB or more; find the optimal value by experimenting).
  • Use the TransformBlock and TransformFinalBlock functions to calculate the hash value. You can pass null for the outputBuffer parameter; a minimal sketch follows this list.
  • Reuse that buffer for the following files' hash calculations, so there is no need for additional allocations.
  • Additionally, you can reuse the MD5CryptoServiceProvider instance, but the benefits are questionable.
  • And last, you can apply an async pattern for reading chunks from the stream, so the OS reads the next chunk from disk while you are still calculating the partial hash of the previous chunk. Such code is more difficult to write, and you will need at least two buffers (reuse them as well), but it can have a great impact on speed; see the second sketch below.
  • As a minor improvement, do not check for file existence. I believe your function is called from some enumeration, and there is very little chance that a file is deleted in the meantime.
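
A minimal sketch of the chunked approach described above; the ChunkedHasher name, the 256 KB size, and MD5.Create() in place of new MD5CryptoServiceProvider() are illustrative choices, not code from the question:

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class ChunkedHasher : IDisposable
{
    private const int BufferSize = 256 * 1024;              // tune by experiment
    private readonly byte[] _buffer = new byte[BufferSize]; // allocated once, reused per file
    private readonly MD5 _md5 = MD5.Create();               // reused across files

    public string GetMD5(string filename)
    {
        _md5.Initialize();                                  // reset state from the previous file
        using (var stream = new FileStream(filename, FileMode.Open,
               FileAccess.Read, FileShare.Read, BufferSize))
        {
            int read;
            while ((read = stream.Read(_buffer, 0, _buffer.Length)) > 0)
            {
                // outputBuffer may be null: we do not need the input copied anywhere
                _md5.TransformBlock(_buffer, 0, read, null, 0);
            }
            _md5.TransformFinalBlock(_buffer, 0, 0);        // finalize with an empty block
        }

        var sb = new StringBuilder(_md5.Hash.Length * 2);
        foreach (var b in _md5.Hash)
            sb.Append(b.ToString("x2"));
        return sb.ToString();
    }

    public void Dispose() => _md5.Dispose();
}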

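And a sketch of the async double-buffer idea: while one chunk is being hashed, the next one is already being read from disk. This assumes .NET 4.5+ for ReadAsync and uses the same usings as the previous sketch plus System.Threading.Tasks:

static async Task<string> GetMD5Async(string filename)
{
    const int BufferSize = 256 * 1024;
    var buffers = new[] { new byte[BufferSize], new byte[BufferSize] };

    using (var md5 = MD5.Create())
    using (var stream = new FileStream(filename, FileMode.Open, FileAccess.Read,
           FileShare.Read, BufferSize, useAsync: true))
    {
        int index = 0;
        var pending = stream.ReadAsync(buffers[index], 0, BufferSize);
        while (true)
        {
            int read = await pending;
            if (read == 0)
                break;

            // Start reading the next chunk before hashing the current one,
            // so disk I/O and CPU work overlap.
            int next = 1 - index;
            pending = stream.ReadAsync(buffers[next], 0, BufferSize);

            md5.TransformBlock(buffers[index], 0, read, null, 0);
            index = next;
        }
        md5.TransformFinalBlock(buffers[index], 0, 0);

        var sb = new StringBuilder(md5.Hash.Length * 2);
        foreach (var b in md5.Hash)
            sb.Append(b.ToString("x2"));
        return sb.ToString();
    }
}
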
All of the above is valid for medium to large files. If you instead have a lot of very small files, you can speed up the calculation by processing files in parallel. Parallelization can also help with large files, but that has to be measured.
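
As an illustration, a minimal parallel sketch, assuming the ChunkedHasher from the first sketch and a hypothetical filenames sequence; each worker thread gets its own hasher, so the reused buffer is never shared between threads:

using System.Collections.Concurrent;
using System.Threading.Tasks;

var checksums = new ConcurrentDictionary<string, string>();
Parallel.ForEach(
    filenames,                          // hypothetical IEnumerable<string> of paths
    () => new ChunkedHasher(),          // one hasher (and buffer) per worker thread
    (file, loopState, hasher) =>
    {
        checksums[file] = hasher.GetMD5(file);
        return hasher;                  // thread-local state flows to the next iteration
    },
    hasher => hasher.Dispose());        // dispose each thread's hasher at the end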

And last, if collisions do not bother you too much, you can choose a less expensive hash algorithm, CRC for example.
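
For illustration, a sketch using the Crc32 type from the System.IO.Hashing NuGet package; the package choice and Convert.ToHexString (.NET 5+) are assumptions here, since the answer names no specific CRC implementation:

using System;
using System.IO;
using System.IO.Hashing;   // NuGet: System.IO.Hashing (assumed available)

static string GetCrc32(string filename, byte[] buffer)
{
    var crc = new Crc32();
    using (var stream = File.OpenRead(filename))
    {
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            crc.Append(buffer.AsSpan(0, read));   // non-cryptographic, much cheaper than MD5
    }
    return Convert.ToHexString(crc.GetCurrentHash());
}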

