Multi threaded file processing with .NET



There is a folder that contains 1000s of small text files. I aim to parse and process all of them while more files are being populated into the folder. My intention is to multithread this operation as the single threaded prototype took six minutes to process 1000 files.

I like to have reader and writer thread(s) as the following. While the reader thread(s) are reading the files, I'd like to have writer thread(s) to process them. Once the reader is started reading a file, I d like to mark it as being processed, such as by renaming it. Once it's read, rename it to completed.
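The rename-marking scheme above could be sketched as follows (a minimal sketch; the `.processing`/`.done` suffixes and the `TryClaim`/`MarkDone` helper names are illustrative assumptions, not from any library — the point is that a rename on the same local volume is effectively atomic, so it can serve as a per-file lock):

```csharp
using System.IO;

static class FileClaim
{
    // "Claim" a file by renaming it; returns the new path,
    // or null if another worker already claimed (moved) it.
    public static string TryClaim(string path)
    {
        string claimed = path + ".processing";
        try
        {
            File.Move(path, claimed);   // rename marks the file as in progress
            return claimed;
        }
        catch (IOException)             // source gone: someone else got there first
        {
            return null;
        }
    }

    // After processing succeeds, rename again to mark completion.
    public static string MarkDone(string claimedPath)
    {
        // strip the ".processing" extension, then append ".done"
        string done = Path.ChangeExtension(claimedPath, null) + ".done";
        File.Move(claimedPath, done);
        return done;
    }
}
```

A worker that fails to claim a file simply skips it and moves on to the next one.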

How do I approach such a multithreaded application?

Is it better to use a distributed hash table or a queue?

Which data structure do I use that would avoid locks?

Is there a better approach to this scheme?

Solution

Since there's curiosity on how .NET 4 works with this in comments, here's that approach. Sorry, it's likely not an option for the OP. Disclaimer: This is not a highly scientific analysis, just showing that there's a clear performance benefit. Based on hardware, your mileage may vary widely.

Here's a quick test (if you see a big mistake in this simple test, it's just an example. Please comment, and we can fix it to be more useful/accurate). For this, I just dropped 12,000 ~60 KB files into a directory as a sample (fire up LINQPad; you can play with it yourself, for free! - be sure to get LINQPad 4 though):

var files = 
Directory.GetFiles("C:\\temp", "*.*", SearchOption.AllDirectories).ToList();

var sw = Stopwatch.StartNew(); //start timer
files.ForEach(f => File.ReadAllBytes(f).GetHashCode()); //do work - serial
sw.Stop(); //stop
sw.ElapsedMilliseconds.Dump("Run MS - Serial"); //display the duration

sw.Restart();
files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode()); //parallel
sw.Stop();
sw.ElapsedMilliseconds.Dump("Run MS - Parallel");

Slightly changing your loop to parallelize the query is all that's needed in most simple situations. By "simple" I mostly mean that the result of one action doesn't affect the next. Something to keep in mind most often is that some collections, for example our handy List<T>, are not thread safe, so using them in a parallel scenario isn't a good idea :) Luckily, thread-safe concurrent collections were added in .NET 4. Also keep in mind that if you're using a locking collection, this may be a bottleneck as well, depending on the situation.
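For the reader/processor split the question asks about, those .NET 4 concurrent collections fit naturally. A minimal sketch using BlockingCollection<string> (the queue capacity of 100 and the default of 4 workers are arbitrary example values; `Run` is a hypothetical helper, parameterized so the item source could be `File.ReadAllText` over a directory):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class Pipeline
{
    // One reader task produces items; N worker tasks consume and process them.
    public static void Run(IEnumerable<string> items, Action<string> process, int workers = 4)
    {
        // Bounded queue: the reader blocks if the processors fall behind.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 100))
        {
            var reader = Task.Factory.StartNew(() =>
            {
                foreach (var item in items)
                    queue.Add(item);        // e.g. item = File.ReadAllText(path)
                queue.CompleteAdding();     // signal "no more work"
            });

            var tasks = new Task[workers];
            for (int i = 0; i < workers; i++)
                tasks[i] = Task.Factory.StartNew(() =>
                {
                    // GetConsumingEnumerable blocks until items arrive and
                    // ends cleanly once CompleteAdding has been called.
                    foreach (var item in queue.GetConsumingEnumerable())
                        process(item);      // parse/process here
                });

            Task.WaitAll(tasks);
            reader.Wait();
        }
    }
}
```

No explicit locks are needed; BlockingCollection handles the synchronization and the shutdown handshake.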

This uses the .AsParallel<T>(IEnumerable<T>) and .ForAll<T>(ParallelQuery<T>) extensions available in .NET 4.0. The .AsParallel() call wraps the IEnumerable<T> in a ParallelEnumerableWrapper<T> (internal class) which implements ParallelQuery<T>. This now allows you to use the parallel extension methods; in this case we're using .ForAll().

.ForAll() internally creates a ForAllOperator<T>(query, action) and runs it synchronously. This handles the threading and the merging of threads after it's running... There's quite a bit going on in there; I'd suggest starting here if you want to learn more, including additional options.
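One knob worth knowing when applying this: PLINQ sizes its worker pool to the core count by default, which suits CPU-bound work. For I/O-bound file reads, where workers mostly sit blocked on the disk, you can override it with WithDegreeOfParallelism (a sketch; the value 8 is an arbitrary example, and `HashAll` is a hypothetical helper — measure on your own hardware):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class Tuning
{
    // Hash every file's contents in parallel and return how many were read.
    public static int HashAll(IEnumerable<string> paths)
    {
        return paths.AsParallel()
                    .WithDegreeOfParallelism(8)   // default is the core count
                    .Select(p => File.ReadAllBytes(p).GetHashCode())
                    .Count();
    }
}
```

Whether a higher value helps depends entirely on the drive; on a single spinning disk, more concurrent readers can just as easily cause seek thrashing.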


The results (Computer 1 - Physical Hard Disk):

  • Serial: 1288 - 1333ms
  • Parallel: 461 - 503ms

Computer specs - for comparison:

The results (Computer 2 - Solid State Drive):

  • Serial: 545 - 601 ms
  • Parallel: 248 - 278 ms

Computer specifications - for comparison:

  • Quad core: Core 2 Quad Q9100 @ 2.26 GHz
  • 8 GB RAM (DDR 1333)
  • 120 GB OCZ Vertex SSD (Standard Version - 1.4 Firmware)

I don't have links for the CPU/RAM this time; these came installed. This is a Dell M6400 Laptop (here's a link to the M6500... Dell's own links to the 6400 are broken).


These numbers are from 10 runs, taking the min/max of the inner 8 results (removing the original min/max for each as possible outliers). We hit an I/O bottleneck here, especially on the physical drive, but think about what the serial method does. It reads, processes, reads, processes, rinse, repeat. With the parallel approach, you are (even with an I/O bottleneck) reading and processing simultaneously. In the worst bottleneck situation, you're processing one file while reading the next. That alone (on any current computer!) should result in some performance gain. You can see from the results above that we can get a bit more than one going at a time, giving us a healthy boost.

Another disclaimer: Quad core + .NET 4 parallel isn't going to give you four times the performance; it doesn't scale linearly... There are other considerations and bottlenecks in play.

I hope this was of interest in showing the approach and possible benefits. Feel free to criticize or improve... This answer exists solely for those curious, as indicated in the comments :)
