Multi-threaded file processing with .NET

Question

There is a folder that contains 1000s of small text files. I aim to parse and process all of them while more files are being populated into the folder. My intention is to multithread this operation as the single threaded prototype took six minutes to process 1000 files.

I'd like to have reader and writer thread(s) as follows. While the reader thread(s) are reading the files, I'd like the writer thread(s) to process them. Once a reader has started reading a file, I'd like to mark it as being processed, for example by renaming it. Once it has been read, it would be renamed again to mark it as completed.

How do I approach such a multithreaded application?

Is it better to use a distributed hash table or a queue?

Which data structure do I use that would avoid locks?

Is there a better approach to this scheme?

Recommended answer

Since there's curiosity in the comments about how .NET 4 works with this, here's that approach. Sorry, it's likely not an option for the OP. Disclaimer: This is not a highly scientific analysis, just a demonstration that there's a clear performance benefit. Depending on your hardware, your mileage may vary widely.

Here's a quick test (if you see a big mistake in this simple test, it's just an example. Please comment, and we can fix it to be more useful/accurate). For this, I just dropped 12,000 ~60 KB files into a directory as a sample (fire up LINQPad; you can play with it yourself, for free! - be sure to get LINQPad 4 though):

var files = Directory.GetFiles("C:\\temp", "*.*", SearchOption.AllDirectories).ToList();

var sw = Stopwatch.StartNew(); //start timer
files.ForEach(f => File.ReadAllBytes(f).GetHashCode()); //do work - serial
sw.Stop(); //stop
sw.ElapsedMilliseconds.Dump("Run MS - Serial"); //display the duration

sw.Restart();
files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode()); //parallel
sw.Stop();
sw.ElapsedMilliseconds.Dump("Run MS - Parallel");

Slightly changing your loop to parallelize the query is all that's needed in most simple situations. By "simple" I mostly mean that the result of one action doesn't affect the next. The thing to keep in mind most often is that some collections, for example our handy List<T>, are not thread safe, so using them in a parallel scenario isn't a good idea :) Luckily, thread-safe concurrent collections were added in .NET 4. Also keep in mind that if you're using a locking collection, it may become a bottleneck as well, depending on the situation.
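
As a minimal sketch of the thread-safety point above, the following collects results from a parallel loop into a ConcurrentBag<T> (from System.Collections.Concurrent, added in .NET 4) instead of a List<T>. The helper name MeasureFiles and the throwaway temp files are assumptions made for illustration; they are not part of the original test:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ConcurrentCollectSketch
{
    // Hypothetical helper (not from the original answer): reads every file in
    // parallel and records its length in a thread-safe bag.
    public static ConcurrentBag<long> MeasureFiles(IEnumerable<string> files)
    {
        var sizes = new ConcurrentBag<long>();
        // ConcurrentBag<T>.Add is safe to call from many threads at once;
        // List<T>.Add is not.
        files.AsParallel().ForAll(f => sizes.Add(File.ReadAllBytes(f).LongLength));
        return sizes;
    }

    static void Main()
    {
        // Create 20 small files (1..20 bytes) so the sketch is self-contained.
        var dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        try
        {
            for (int i = 1; i <= 20; i++)
                File.WriteAllBytes(Path.Combine(dir, i + ".bin"), new byte[i]);

            var sizes = MeasureFiles(Directory.GetFiles(dir));
            Console.WriteLine(sizes.Count); // 20: one entry per file
            Console.WriteLine(sizes.Sum()); // 210: 1 + 2 + ... + 20 bytes
        }
        finally
        {
            Directory.Delete(dir, true);
        }
    }
}
```

A ConcurrentBag<T> does not preserve order, so if the order of results matters, a ConcurrentDictionary keyed by file path may be a better fit.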

This uses the .AsParallel<T>(IEnumerable<T>) (http://msdn.microsoft.com/en-us/library/dd413602.aspx) and .ForAll<T>(ParallelQuery<T>) (http://msdn.microsoft.com/en-us/library/dd383744%28v=VS.100%29.aspx) extension methods available in .NET 4.0. The .AsParallel() call wraps the IEnumerable<T> in a ParallelEnumerableWrapper<T> (an internal class), which derives from ParallelQuery<T> (http://msdn.microsoft.com/en-us/library/dd383736.aspx). This lets you use the parallel extension methods (http://msdn.microsoft.com/en-us/library/dd268389%28v=VS.100%29.aspx); in this case we're using .ForAll().

.ForAll() (http://msdn.microsoft.com/en-us/library/dd383744%28v=VS.100%29.aspx) internally creates a ForAllOperator<T>(query, action) and runs it synchronously. This handles the threading and the merging of the threads after it has run... There's quite a bit going on in there. The MSDN documentation for these methods is a reasonable place to start if you want to learn more, including the additional options.
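
One of those additional options is .WithDegreeOfParallelism(), which sets the maximum number of concurrently executing tasks PLINQ will use for the query. A small sketch follows; the cap of 2 and the counter are arbitrary choices for illustration, and whether a lower cap helps an I/O-bound workload is something to measure rather than assume:

```csharp
using System;
using System.Linq;
using System.Threading;

static class DegreeOfParallelismSketch
{
    // Hypothetical helper: runs a trivial action over `count` items with a
    // capped degree of parallelism and returns how many items were processed.
    public static int CountProcessed(int count, int maxTasks)
    {
        int processed = 0;
        Enumerable.Range(0, count)
            .AsParallel()
            .WithDegreeOfParallelism(maxTasks) // upper bound on concurrent tasks
            .ForAll(i => Interlocked.Increment(ref processed)); // thread-safe increment
        // ForAll blocks until every item has been handled, so the count is final here.
        return processed;
    }

    static void Main()
    {
        Console.WriteLine(CountProcessed(1000, 2)); // 1000
    }
}
```

Interlocked.Increment is used because a plain processed++ from several threads could lose updates.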


  • Serial: 1288 - 1333 ms
  • Parallel: 461 - 503 ms

Computer specs - for comparison:

  • Quad Core i7 920 @ 2.66 GHz
  • 12 GB RAM (DDR 1333)
  • 300 GB 10k rpm WD VelociRaptor

  • Serial: 545 - 601 ms
  • Parallel: 248 - 278 ms

Computer specifications - for comparison:

  • Quad Core 2 Quad Q9100 @ 2.26 GHz
  • 8 GB RAM (DDR 1333)
  • 120 GB OCZ Vertex SSD (Standard Version - 1.4 Firmware)

I don't have links for the CPU/RAM this time; these came installed. This is a Dell M6400 laptop (here's a link to the M6500: http://www.dell.com/us/en/slgov/notebooks/precision-m6500/pd.aspx?refid=precision-m6500; Dell's own links to the M6400 are broken).

These numbers are from 10 runs, taking the min/max of the inner 8 results (removing the original min/max of each as possible outliers). We hit an I/O bottleneck here, especially on the physical drive, but think about what the serial method does. It reads, processes, reads, processes, rinse, repeat. With the parallel approach, you are (even with an I/O bottleneck) reading and processing simultaneously. In the worst bottleneck situation, you're processing one file while reading the next. That alone (on any current computer!) should result in some performance gain. You can see from the results above that we get a bit more than one going at a time, giving us a healthy boost.
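
The trimming described above (10 runs, drop the fastest and slowest, report the min/max of the remaining 8) could be sketched roughly as follows. The helper names TimeRuns and TrimmedRange are made up for this illustration, and the sample durations are synthetic values, not measurements:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class TrimmedTimingSketch
{
    // Hypothetical helper: drops the single fastest and single slowest run,
    // then returns the (min, max) of what is left.
    public static Tuple<long, long> TrimmedRange(IEnumerable<long> runsMs)
    {
        var sorted = runsMs.OrderBy(ms => ms).ToList();
        if (sorted.Count < 3)
            throw new ArgumentException("Need at least three runs to trim.");
        return Tuple.Create(sorted[1], sorted[sorted.Count - 2]);
    }

    // Hypothetical helper: times `work` the given number of times.
    public static List<long> TimeRuns(Action work, int runs)
    {
        var results = new List<long>();
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            work();
            sw.Stop();
            results.Add(sw.ElapsedMilliseconds);
        }
        return results;
    }

    static void Main()
    {
        // Synthetic durations (not measurements) to show the trimming.
        var sample = new long[] { 50, 10, 30, 20, 40, 100, 60, 90, 70, 80 };
        var range = TrimmedRange(sample);
        Console.WriteLine(range.Item1 + " - " + range.Item2 + " ms"); // 20 - 90 ms

        // Timing a stand-in workload 10 times, as in the test above.
        var timed = TimeRuns(() => Enumerable.Range(0, 1000).Sum(), 10);
        Console.WriteLine(timed.Count); // 10
    }
}
```

In a real comparison the Action passed to TimeRuns would be the serial or parallel file loop, and the first run may be skewed by a cold file cache.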

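Another disclaimer: a quad core plus .NET 4 parallelism isn't going to give you four times the performance; it doesn't scale linearly... There are other considerations and bottlenecks in play.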

I hope this was of interest in showing the approach and its possible benefits. Feel free to criticize or improve... This answer exists solely for those curious, as indicated in the comments :)
