Multi threaded file processing with .NET

Problem description

There is a folder that contains 1000s of small text files. I aim to parse and process all of them while more files are being populated into the folder. My intention is to multithread this operation as the single threaded prototype took six minutes to process 1000 files.

I'd like to have reader and writer thread(s) as follows. While the reader thread(s) are reading the files, I'd like writer thread(s) to process them. Once a reader has started reading a file, I'd like to mark it as being processed, for example by renaming it. Once it has been read, rename it again to mark it as completed.

How do I approach such a multithreaded application?

Is it better to use a distributed hash table or a queue?

Which data structure do I use that would avoid locks?

Is there a better approach to this scheme?

Recommended answer

Since there's curiosity on how .NET 4 works with this in comments, here's that approach. Sorry, it's likely not an option for the OP. Disclaimer: This is not a highly scientific analysis, just showing that there's a clear performance benefit. Based on hardware, your mileage may vary widely.

Here's a quick test. It's only an example, so if you see a big mistake in it, please comment and we can fix it to be more useful/accurate. For this, I just dropped 12,000 files of roughly 60 KB each into a directory as a sample (fire up LINQPad; you can play with it yourself, for free! Be sure to get LINQPad 4, though):

// LINQPad "C# Statements" script; .Dump() is a LINQPad extension method.
var files = Directory.GetFiles("C:\\temp", "*.*", SearchOption.AllDirectories).ToList();

var sw = Stopwatch.StartNew(); //start timer
files.ForEach(f => File.ReadAllBytes(f).GetHashCode()); //do work - serial
sw.Stop(); //stop
sw.ElapsedMilliseconds.Dump("Run MS - Serial"); //display the duration

sw.Restart();
files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode()); //parallel
sw.Stop();
sw.ElapsedMilliseconds.Dump("Run MS - Parallel");

Slightly changing your loop to parallelize the query is all that's needed in most simple situations. By "simple" I mostly mean that the result of one action doesn't affect the next. The thing to keep in mind most often is that some collections, for example our handy List<T>, are not thread safe, so using them in a parallel scenario isn't a good idea :) Luckily, thread-safe concurrent collections were added in .NET 4. Also keep in mind that if you're using a locking collection, it may become a bottleneck as well, depending on the situation.
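As a rough illustration of the thread-safety point, here is a minimal sketch that collects results from a parallel loop into a ConcurrentBag<T> rather than a List<T>. The class name, the FileLengths helper, and the scratch directory are made up for this example, and reading file lengths stands in for whatever real work you would do per file.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;

public static class ConcurrentCollectDemo
{
    // Hypothetical helper: returns the byte length of every file, computed in parallel.
    public static long[] FileLengths(string[] paths)
    {
        // ConcurrentBag<T> accepts Add calls from many threads at once.
        // Calling List<T>.Add from the same lambda would be a data race.
        var lengths = new ConcurrentBag<long>();
        paths.AsParallel().ForAll(p => lengths.Add(File.ReadAllBytes(p).LongLength));
        return lengths.ToArray(); // order is not guaranteed
    }

    public static void Main()
    {
        // Assumed scratch directory with a few small sample files.
        string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        for (int i = 0; i < 20; i++)
            File.WriteAllText(Path.Combine(dir, i + ".txt"), new string('x', i));

        long[] lengths = FileLengths(Directory.GetFiles(dir));
        Console.WriteLine("{0} files, {1} bytes in total", lengths.Length, lengths.Sum());

        Directory.Delete(dir, true);
    }
}
```

If you only need an aggregate (a sum or a count), a PLINQ query that returns the value directly avoids the shared collection altogether.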

This uses the .AsParallel<T>(IEnumerable<T>) and .ForAll<T>(ParallelQuery<T>) extensions available in .NET 4.0. The .AsParallel() call wraps the IEnumerable<T> in a ParallelEnumerableWrapper<T> (an internal class), which derives from ParallelQuery<T>. This now allows you to use the parallel extension methods; in this case we're using .ForAll().
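To make the type change visible, the sketch below spells out the ParallelQuery<T> that .AsParallel() returns and chains one of the optional settings, WithDegreeOfParallelism. The class and method names are invented for illustration, and the arithmetic is just a stand-in workload.

```csharp
using System;
using System.Linq;

public static class ParallelQueryDemo
{
    // Hypothetical helper: sums the squares of the inputs with a PLINQ query
    // that uses at most maxThreads concurrent tasks.
    public static long SumOfSquares(int[] values, int maxThreads)
    {
        ParallelQuery<int> query = values
            .AsParallel()                          // IEnumerable<int> -> ParallelQuery<int>
            .WithDegreeOfParallelism(maxThreads);  // optional upper bound on concurrency

        // On a ParallelQuery<T>, Select and Sum bind to the ParallelEnumerable
        // overloads rather than the sequential Enumerable ones.
        return query.Select(v => (long)v * v).Sum();
    }

    public static void Main()
    {
        int[] values = Enumerable.Range(1, 1000).ToArray();
        Console.WriteLine(SumOfSquares(values, 2));
    }
}
```

Capping the degree of parallelism can be worth trying when the work is I/O bound, since more threads do not necessarily mean more throughput from a single drive.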

.ForAll() internally creates a ForAllOperator<T>(query, action) and runs it synchronously. This handles the threading and the merging of the threads once it has run... There's quite a bit going on in there; I'd suggest starting here if you want to learn more, including additional options.

  • Serial: 1288 - 1333 ms
  • Parallel: 461 - 503 ms

Computer specs - for comparison:

  • Serial: 545 - 601 ms
  • Parallel: 248 - 278 ms

Computer specifications - for comparison:

  • Quad Core 2 Quad Q9100 @ 2.26 GHz
  • 8 GB RAM (DDR 1333)
  • 120 GB OCZ Vertex SSD (Standard Version - 1.4 Firmware)

I don't have links for the CPU/RAM this time, these came installed. This is a Dell M6400 Laptop (here's a link to the M6500... Dell's own links to the 6400 are broken).

These numbers are from 10 runs, taking the min/max of the inner 8 results (removing the original min/max of each as possible outliers). We hit an I/O bottleneck here, especially on the physical drive, but think about what the serial method does. It reads, processes, reads, processes, rinse and repeat. With the parallel approach, you are (even with an I/O bottleneck) reading and processing simultaneously. In the worst bottleneck situation, you're processing one file while reading the next. That alone (on any current computer!) should result in some performance gain. You can see in the results above that we get a bit more than one going at a time, giving us a healthy boost.
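The "processing one file while reading the next" idea can also be written out explicitly as a producer/consumer pipeline, which is closer to the reader/writer threads the question asks about. This is only a sketch under assumptions, not what the benchmark above ran: BlockingCollection<T> is one of the .NET 4 concurrent collections, while the class name, the queue bound of 16, and the byte counting that stands in for real parsing are all made up for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ReadProcessPipeline
{
    // Hypothetical helper: one task reads files while workerCount tasks process them.
    // Returns the total number of bytes processed.
    public static long Run(string[] paths, int workerCount)
    {
        // Bounded queue (assumed capacity 16): the reader blocks when it gets too far ahead.
        using (var queue = new BlockingCollection<byte[]>(16))
        {
            long totalBytes = 0;

            Task reader = Task.Factory.StartNew(() =>
            {
                try
                {
                    foreach (string path in paths)
                        queue.Add(File.ReadAllBytes(path));
                }
                finally
                {
                    queue.CompleteAdding(); // lets the workers' loops finish
                }
            });

            var workers = new Task[workerCount];
            for (int i = 0; i < workerCount; i++)
            {
                workers[i] = Task.Factory.StartNew(() =>
                {
                    // Blocks until an item is available; ends after CompleteAdding.
                    foreach (byte[] contents in queue.GetConsumingEnumerable())
                        Interlocked.Add(ref totalBytes, contents.LongLength); // stand-in for parsing
                });
            }

            Task.WaitAll(workers);
            reader.Wait(); // surfaces any read error as an AggregateException
            return totalBytes;
        }
    }

    public static void Main()
    {
        // Assumed scratch directory with a few small sample files.
        string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        for (int i = 0; i < 50; i++)
            File.WriteAllText(Path.Combine(dir, i + ".txt"), new string('x', 100));

        Console.WriteLine(Run(Directory.GetFiles(dir), 3));
        Directory.Delete(dir, true);
    }
}
```

The bounded queue keeps memory use in check when reading outpaces processing; whether a pipeline like this beats the one-line PLINQ version depends on the hardware and the per-file work, so it is worth measuring both.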

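Another disclaimer: a quad core plus .NET 4 parallelism isn't going to give you four times the performance; it doesn't scale linearly... There are other considerations and bottlenecks in play.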

I hope this was of interest in showing the approach and its possible benefits. Feel free to criticize or improve... This answer exists solely for those who were curious, as indicated in the comments :)
