C# Threading - Reading and hashing multiple files concurrently, easiest method?


Problem description


I've been trying to get what I believe to be the simplest possible form of threading to work in my application but I just can't do it.

What I want to do: I have a main form with a status strip and a progress bar on it. I have to read something between 3 and 99 files and add their hashes to a string[] which I want to add to a list of all files with their respective hashes. Afterwards I have to compare the items on that list to a database (which comes in text files). Once all that is done, I have to update a textbox in the main form and the progressbar to 33%; mostly I just don't want the main form to freeze during processing.

The files I'm working with always sum up to 1.2GB (+/- a few MB), meaning I should be able to read them into byte[]s and process them from there (I have to calculate CRC32, MD5 and SHA1 of each of those files so that should be faster than reading all of them from a HDD 3 times).

Also I should note that some files may be 1MB while another one may be 1GB. I initially wanted to create 99 threads for 99 files but that seems not wise, I suppose it would be best to reuse threads of small files while bigger file threads are still running. But that sounds pretty complicated to me so I'm not sure if that's wise either.

So far I've tried workerThreads and backgroundWorkers but neither seem to work too well for me; at least the backgroundWorkers worked SOME of the time, but I can't even figure out why they won't the other times... either way the main form still froze. Now I've read about the Task Parallel Library in .NET 4.0 but I thought I should better ask someone who knows what he's doing before wasting more time on this.

What I want to do looks something like this (without threading):

List<string[]> fileSpecifics = new List<string[]>();

int fileMaxNumber = 42; // something between 3 and 99, depending on file set

for (int i = 1; i <= fileMaxNumber; i++)
{
    string fileName = "C:\\path\\to\\file" + i.ToString("D2") + ".ext"; // file01.ext - file99.ext
    string fileSize = new FileInfo(fileName).Length.ToString();
    byte[] file = File.ReadAllBytes(fileName);
    // hash calculations (using SHA1CryptoServiceProvider() etc., no problems with that so I'll spare you that, return strings)
    file = null; // I didn't yet check if this made any actual difference but I figured it couldn't hurt
    fileSpecifics.Add(new string[] { fileName, fileSize, fileCRC, fileMD5, fileSHA1 });
}

// look for files in text database mentioned above, i.e. first check for "file bundles" with the same amount of files I have here; then compare file sizes, then hashes
// again, no problems with that so I'll spare you that; the database text files are pretty small so parsing them doesn't need to be done in an extra thread.

Would anybody be kind enough to point me in the right direction? I'm looking for the easiest way to read and hash those files quickly (I believe the hashing takes some time in which other files could already be read) and save the output to a string[], without the main form freezing, nothing more, nothing less.

I'm thankful for any input.

EDIT to clarify: by "backgroundWorkers working some of the time" I meant that (for the very same set of files), maybe the first and fourth execution of my code produces the correct output and the UI unfreezes within 5 seconds, for the second, third and fifth execution it freezes the form (and after 60 seconds I get an error message saying some thread didn't respond within that time frame) and I have to stop execution via VS.

Thanks for all your suggestions and pointers, as you all have correctly guessed I'm completely new to threading and will have to read up on the great links you guys posted. Then I'll give those methods a try and flag the answer that helped me the most. Thanks again!

Solution

With .NET Framework 4.X

  1. Use the Directory.EnumerateFiles method for efficient, lazy file enumeration
  2. Use Parallel.For() to delegate the parallelism work to the Task Parallel Library (TPL), or use TPL tasks directly, with a single Task per pipeline stage
  3. Use the Pipelines pattern to pipeline the following stages: calculating hash codes, comparing with the pattern, updating the UI
  4. To avoid freezing the UI, use the appropriate technique: for WPF use Dispatcher.BeginInvoke(), for WinForms use Invoke() (see this SO answer)
  5. Since all of this has a UI, it may be useful to add a cancellation feature to abandon a long-running operation if needed; take a look at the CancellationTokenSource.CreateLinkedTokenSource method, which allows triggering a CancellationToken from the "external scope"

I can try adding an example, but it is worth doing it yourself so that you learn all this stuff rather than simply copy/paste -> got it working -> forgot about it.
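As a rough illustration of points 1 and 2, the sketch below combines Directory.EnumerateFiles with Parallel.ForEach (the enumerable-based sibling of Parallel.For) to hash every file in a directory. The class name ParallelHashSketch and the method HashDirectory are hypothetical names introduced here, and only SHA1 is computed to keep the example short; treat it as one possible starting point rather than the answer's own implementation.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

public static class ParallelHashSketch
{
    // Hypothetical helper: maps each file path to its upper-case hex SHA1 digest.
    public static ConcurrentDictionary<string, string> HashDirectory(string directoryPath)
    {
        var results = new ConcurrentDictionary<string, string>();

        // Assumption: capping parallelism at the core count is a reasonable default;
        // on a single spinning disk a lower value may well be faster.
        var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

        Parallel.ForEach(Directory.EnumerateFiles(directoryPath), options, path =>
        {
            // Each iteration gets its own hash object; HashAlgorithm instances are not thread-safe.
            using (var sha1 = SHA1.Create())
            using (var stream = File.OpenRead(path))
            {
                byte[] digest = sha1.ComputeHash(stream); // streams the file instead of loading it whole
                results[path] = BitConverter.ToString(digest).Replace("-", "");
            }
        });

        return results;
    }
}
```

Parallel.ForEach blocks its caller until all files are processed, so in a UI application it would have to be started from a background task, for example Task.Factory.StartNew(() => ParallelHashSketch.HashDirectory(dir)), rather than called directly from an event handler.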

PS: Must read - Pipelines paper at MSDN


TPL specific pipeline implementation

  • Pipeline pattern implementation: three stages: calculate hash, match, update UI
  • Three tasks, one per stage
  • Two Blocking Queues

//

// 1) CalculateHashesImpl() should store all calculated hashes here
// 2) CompareMatchesImpl() should read input hashes from this queue
// Tuple.Item1 - hash, Tuple.Item2 - file path
var calculatedHashes = new BlockingCollection<Tuple<string, string>>();


// 1) CompareMatchesImpl() should store all pattern matching results here
// 2) SyncUiImpl() method should read from this collection and update 
//    UI with available results
var comparedMatches = new BlockingCollection<string>();

var factory = new TaskFactory(TaskCreationOptions.LongRunning,
                              TaskContinuationOptions.None);


var calculateHashesWorker = factory.StartNew(() => CalculateHashesImpl(...));
var comparedMatchesWorker = factory.StartNew(() => CompareMatchesImpl(...));
var syncUiWorker = factory.StartNew(() => SyncUiImpl(...));

// Note: Task.WaitAll blocks the calling thread, so do not call it from the UI thread
Task.WaitAll(calculateHashesWorker, comparedMatchesWorker, syncUiWorker);

CalculateHashesImpl():

private void CalculateHashesImpl(string directoryPath)
{
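   foreach (var file in Directory.EnumerateFiles(directoryPath))
   {
       // EnumerateFiles yields full path strings, so 'file' is already the path
       var hash = CalculateHashTODO(file);
       calculatedHashes.Add(new Tuple<string, string>(hash, file));
   }

   // Signal that no more items will arrive so the consumer's foreach can finish
   calculatedHashes.CompleteAdding();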
}
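CalculateHashTODO is left open above. One way to fill it in, given that the question wants several digests per file without reading the file several times, is to feed each buffer to more than one hash object in a single pass. The sketch below does this for MD5 and SHA1 using HashAlgorithm.TransformBlock; the class MultiHashSketch and the method ComputeMd5AndSha1 are hypothetical names, and CRC32 is omitted because the .NET Framework class library does not ship a CRC32 HashAlgorithm, so that part would need a custom or third-party implementation.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class MultiHashSketch
{
    // Hypothetical helper: reads the stream once and returns { md5Hex, sha1Hex }.
    public static string[] ComputeMd5AndSha1(Stream input)
    {
        using (var md5 = MD5.Create())
        using (var sha1 = SHA1.Create())
        {
            var buffer = new byte[81920]; // buffer size is an arbitrary choice
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Feed the same chunk to both algorithms; a null output buffer skips the copy.
                md5.TransformBlock(buffer, 0, read, null, 0);
                sha1.TransformBlock(buffer, 0, read, null, 0);
            }

            // An empty final block finishes the computation and makes Hash available.
            md5.TransformFinalBlock(buffer, 0, 0);
            sha1.TransformFinalBlock(buffer, 0, 0);

            return new[] { ToHex(md5.Hash), ToHex(sha1.Hash) };
        }
    }

    private static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", "");
    }
}
```

Because the file is streamed in chunks, a 1GB file never needs to sit in a single byte[], which avoids the File.ReadAllBytes call from the question. Usage would look like: using (var s = File.OpenRead(path)) { var hashes = MultiHashSketch.ComputeMd5AndSha1(s); }.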

CompareMatchesImpl():

private void CompareMatchesImpl()
{
   foreach (var hashEntry in calculatedHashes.GetConsumingEnumerable())
   {
      // TODO: obviously return type is up to you
      string matchResult = GetMatchResultTODO(hashEntry.Item1, hashEntry.Item2);
      comparedMatches.Add(matchResult);
   }

   // Signal that no more results will arrive so the UI stage can finish
   comparedMatches.CompleteAdding();
}

SyncUiImpl():

private void SyncUiImpl()
{
    foreach (var matchResult in comparedMatches.GetConsumingEnumerable())
    {
        // TODO: track progress in the UI using UI-framework-specific features
        // so that it does not freeze
    }
}

TODO: Consider passing a CancellationToken to all GetConsumingEnumerable() calls so that you can easily stop the pipeline execution when needed.
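A minimal sketch of that cancellation idea follows, assuming a two-stage producer/consumer over a BlockingCollection<int> in place of the real hash pipeline. The class CancellablePipelineSketch, the method Run and the userCancel source are hypothetical; the point is only to show GetConsumingEnumerable(token) together with CancellationTokenSource.CreateLinkedTokenSource, which combines an externally supplied token with a local one.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class CancellablePipelineSketch
{
    // Hypothetical demo: returns how many items the consumer stage processed.
    public static int Run(CancellationToken externalToken, int itemCount)
    {
        var queue = new BlockingCollection<int>();

        // userCancel stands in for a source that a Cancel button would trigger;
        // the linked source fires when either it or the external token is cancelled.
        using (var userCancel = new CancellationTokenSource())
        using (var linked = CancellationTokenSource.CreateLinkedTokenSource(externalToken, userCancel.Token))
        {
            int consumed = 0;

            var producer = Task.Factory.StartNew(() =>
            {
                try
                {
                    for (int i = 0; i < itemCount; i++)
                        queue.Add(i, linked.Token); // throws OperationCanceledException on cancel
                }
                catch (OperationCanceledException) { /* stop producing */ }
                finally { queue.CompleteAdding(); } // always release the consumer
            }, TaskCreationOptions.LongRunning);

            var consumer = Task.Factory.StartNew(() =>
            {
                try
                {
                    foreach (var item in queue.GetConsumingEnumerable(linked.Token))
                        consumed++;
                }
                catch (OperationCanceledException) { /* stop consuming */ }
            }, TaskCreationOptions.LongRunning);

            // Blocking wait is fine in this demo; a UI application would not wait on the UI thread.
            Task.WaitAll(producer, consumer);
            return consumed;
        }
    }
}
```

With an uncancelled token every item flows through; with a token that is already cancelled, both stages exit immediately without processing anything.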
