Checking file hashes for large batches of files


Question

I'm working on software that needs to check a large number of files against a list of hashes to see whether any of the files have changed since the list was generated. I'm currently using the following:


Imports System.IO ' needed for Stream

Public Shared Function md5(ByVal data As Stream) As String
    ' Hash the whole stream with MD5 and return the digest as Base64 text.
    Dim encryptor As New System.Security.Cryptography.MD5CryptoServiceProvider()
    Dim ByteHash() As Byte = encryptor.ComputeHash(data)
    Return Convert.ToBase64String(ByteHash)
End Function



And my code to generate the hash looks like this:

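' FileStream has no path-only constructor, so pass the mode and access explicitly.
Dim myfile As New System.IO.FileStream(filename, System.IO.FileMode.Open, System.IO.FileAccess.Read)
Dim hash As String = md5(myfile)
myfile.Close()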




This works just fine except that it's slow. Many of the files are in the 2GB range and take 20 seconds apiece or more. I know this is pretty fast, but I'm curious whether there's anything like an API call or something else I can do to make it faster. I figured if anyone knew, it'd be the fine folks at The Code Project.

Thanks,
Ray
www.StudyX.com
www.PlazBackup.com

Recommended answer

Well, you will gain a very small advantage by not instantiating a new MD5CryptoServiceProvider every time you hash a file.

Create one instance and then call it multiple times, e.g.:
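' Shared so the Shared function below can reach it; the class lives in System.Security.Cryptography.
' Note: a single hasher instance should not be used from several threads at once.
Private Shared _hasher As New System.Security.Cryptography.MD5CryptoServiceProvider()

Public Shared Function md5(ByVal data As Stream) As String
   Return Convert.ToBase64String(_hasher.ComputeHash(data))
End Function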


The problem is that, in order to get the hash for each file, you have to completely read each and every file all over again just to see if it changed. You're already doing it about as fast as it's going to go.

Instead, for each file you don't have in the database yet, get the Last Modified datetime from the file system, hash the file, and store both in the database. Then, on subsequent passes, check the current Last Modified time against the one stored in the database. If the times are the same, skip the file. If they're different, generate a new hash from the file and compare it to the one in the database. If the hashes are the same, the file's content never changed, so just update the Last Modified data in the database. If the hashes are different, do whatever you need to and save the new hash and Last Modified data.
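
The pass described above could be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory Dictionary stands in for the database, and the names FileRecord, FileStatus, ChangeChecker, HashFile and Check are hypothetical, not part of the original answer.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography

' Hypothetical result of checking one file.
Public Enum FileStatus
    Added
    Unchanged
    Modified
End Enum

' Hypothetical record: what the "database" keeps per file.
Public Class FileRecord
    Public Property LastWriteUtc As DateTime
    Public Property Hash As String
End Class

Public Module ChangeChecker
    ' Assumption: an in-memory dictionary stands in for the real database.
    Private ReadOnly _records As New Dictionary(Of String, FileRecord)(StringComparer.OrdinalIgnoreCase)

    ' Reads the whole file and returns its MD5 digest as Base64 text.
    Private Function HashFile(ByVal filePath As String) As String
        Using hasher As MD5 = MD5.Create()
            Using stream As New FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read)
                Return Convert.ToBase64String(hasher.ComputeHash(stream))
            End Using
        End Using
    End Function

    ' Hashes a file only when it is new or its Last Modified time has moved.
    Public Function Check(ByVal filePath As String) As FileStatus
        Dim stamp As DateTime = File.GetLastWriteTimeUtc(filePath)
        Dim record As FileRecord = Nothing

        If Not _records.TryGetValue(filePath, record) Then
            ' Not in the "database" yet: hash it and store both values.
            _records(filePath) = New FileRecord With {.LastWriteUtc = stamp, .Hash = HashFile(filePath)}
            Return FileStatus.Added
        End If

        ' Same timestamp: skip the expensive read entirely.
        If record.LastWriteUtc = stamp Then Return FileStatus.Unchanged

        ' Timestamp moved: re-hash and compare with the stored hash.
        Dim newHash As String = HashFile(filePath)
        record.LastWriteUtc = stamp
        If newHash = record.Hash Then Return FileStatus.Unchanged

        record.Hash = newHash
        Return FileStatus.Modified
    End Function
End Module
```

Calling ChangeChecker.Check(path) for every file on each pass then reads only the files whose timestamp moved. The trade-off is that a file whose content changed while its Last Modified time was preserved would go unnoticed, so an occasional full re-hash may still be worthwhile.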


Maybe the problem is that you have to recalculate the hash too often? You can use System.IO.FileSystemWatcher to get notified of any changes. When a file is modified, you can trigger recalculation of its hash.
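
A minimal sketch of that idea, assuming the changed paths are simply queued for a later re-hash. The names WatchExample, PendingPaths and StartWatching are hypothetical; only the FileSystemWatcher members come from .NET.

```vbnet
Imports System
Imports System.Collections.Concurrent
Imports System.IO

Public Module WatchExample
    ' Hypothetical queue of paths waiting to be re-hashed.
    ' Watcher events arrive on background threads, hence a concurrent collection.
    Public ReadOnly PendingPaths As New ConcurrentQueue(Of String)()

    ' Records the path of any file that was created or written to.
    Private Sub OnChanged(ByVal sender As Object, ByVal e As FileSystemEventArgs)
        PendingPaths.Enqueue(e.FullPath)
    End Sub

    ' Starts watching a folder tree; keep the returned watcher alive
    ' for as long as notifications are wanted.
    Public Function StartWatching(ByVal folder As String) As FileSystemWatcher
        Dim watcher As New FileSystemWatcher(folder)
        watcher.IncludeSubdirectories = True
        watcher.NotifyFilter = NotifyFilters.LastWrite Or NotifyFilters.Size Or NotifyFilters.FileName
        AddHandler watcher.Changed, AddressOf OnChanged
        AddHandler watcher.Created, AddressOf OnChanged
        watcher.EnableRaisingEvents = True
        Return watcher
    End Function
End Module
```

A single save can raise several events for the same file, so the consumer of PendingPaths would probably want to de-duplicate paths before hashing. The watcher also only reports changes made while the program is running, so it complements a periodic check rather than replacing it.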

Do you have a chance to use a similar cryptographic hash function from the SHA family instead? You could use the SHA-256 function available in .NET, see http://msdn.microsoft.com/en-us/library/system.security.cryptography.sha256.aspx.

(By the way, the MD5 algorithm is considered "broken" and should not be used for any security purposes, see http://en.wikipedia.org/wiki/MD5. For the SHA family, see http://en.wikipedia.org/wiki/SHA-2.)
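
For illustration, switching the question's function to SHA-256 might look like the sketch below. The module and function names (Sha256Example, Sha256OfFile) are hypothetical; SHA256.Create and ComputeHash are the standard .NET calls.

```vbnet
Imports System
Imports System.IO
Imports System.Security.Cryptography

Public Module Sha256Example
    ' Reads the whole file and returns its SHA-256 digest as Base64 text,
    ' mirroring the output format of the md5 function in the question.
    Public Function Sha256OfFile(ByVal filePath As String) As String
        Using hasher As SHA256 = SHA256.Create()
            Using stream As New FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read)
                Return Convert.ToBase64String(hasher.ComputeHash(stream))
            End Using
        End Using
    End Function
End Module
```

This addresses the security concern rather than the read time: every byte of the file still has to be read, and hashes already stored as MD5 would need to be regenerated once.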


—SA

