Multithreaded File Compare Performance

Problem description

I just stumbled onto this SO question and was wondering if there would be any performance improvement if:


  1. the files were compared in blocks no larger than the hard disk sector size (1/2 KB, 2 KB, or 4 KB)

  2. and the comparison was done with multiple threads (or perhaps even in parallel with .NET 4)

I imagine there being 2 threads: one that reads from the beginning of the file and another that reads from the end until they meet in the middle.
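
For context, a plain single-threaded block-by-block compare (the baseline any of these ideas would have to beat) might look like the following C# sketch; the 4 KB block size and the names are illustrative, not from the question.

    using System.IO;

    static class BaselineComparer
    {
        // Plain sequential compare: read both files in fixed-size blocks
        // and stop at the first difference.
        public static bool AreEqual(string pathA, string pathB, int blockSize = 4096)
        {
            if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
                return false;                               // different sizes can never match

            using (var a = new FileStream(pathA, FileMode.Open, FileAccess.Read))
            using (var b = new FileStream(pathB, FileMode.Open, FileAccess.Read))
            {
                var bufA = new byte[blockSize];
                var bufB = new byte[blockSize];
                int read;
                while ((read = a.Read(bufA, 0, blockSize)) > 0)
                {
                    int got = 0;
                    while (got < read)                      // FileStream may return short reads
                    {
                        int n = b.Read(bufB, got, read - got);
                        if (n == 0) return false;           // unexpected end of the second file
                        got += n;
                    }
                    for (int i = 0; i < read; i++)
                        if (bufA[i] != bufB[i]) return false;
                }
                return true;
            }
        }
    }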

I understand in this situation the disk IO is going to be the slowest part, but if the reads never have to cross sector boundaries (which in my twisted imagination somehow eliminates any possible fragmentation overhead) then it may potentially reduce head moves, hence resulting in better performance (maybe?).

Of course other factors could play in as well, such as single vs. multiple processors/cores or SSD vs. non-SSD, but with those aside: is the disk IO speed plus the potentially shared processor time insurmountable? Or perhaps my concept of computer theory is completely off-base...

Recommended answer

If you're comparing two files that are on the same drive, the only benefit you could receive from multi-threading is to have one thread reading--populating the next buffers--while another thread is comparing the previously-read buffers.
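
A minimal sketch of that overlap in C# (class name, block size, and the bounded-queue approach are my own illustration, not from the answer) uses a producer/consumer queue from .NET 4's System.Collections.Concurrent: one task keeps reading the next pair of blocks while the calling thread compares the pair that was read before.

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    static class PipelinedComparer
    {
        // A reader task fills the next pair of buffers while the calling
        // thread compares the previously read pair, so I/O and comparison overlap.
        public static bool AreEqual(string pathA, string pathB, int blockSize = 256 * 1024)
        {
            if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
                return false;

            // Bounded capacity of 2 means at most two block pairs are in
            // flight at once: effectively double buffering.
            var blocks = new BlockingCollection<Tuple<byte[], byte[], int>>(2);

            var reader = Task.Factory.StartNew(() =>
            {
                try
                {
                    using (var a = new FileStream(pathA, FileMode.Open, FileAccess.Read))
                    using (var b = new FileStream(pathB, FileMode.Open, FileAccess.Read))
                    {
                        while (true)
                        {
                            // Fresh buffers each round, so the comparer owns the ones it received.
                            var bufA = new byte[blockSize];
                            var bufB = new byte[blockSize];
                            int read = a.Read(bufA, 0, blockSize);
                            if (read == 0) break;
                            int got = 0;
                            while (got < read)
                            {
                                int n = b.Read(bufB, got, read - got);
                                if (n == 0) break;          // not expected: lengths matched up front
                                got += n;
                            }
                            blocks.Add(Tuple.Create(bufA, bufB, got));
                        }
                    }
                }
                finally
                {
                    blocks.CompleteAdding();                // lets the consumer finish even on error
                }
            }, TaskCreationOptions.LongRunning);

            bool equal = true;
            foreach (var pair in blocks.GetConsumingEnumerable())
            {
                if (!equal) continue;                       // keep draining so the reader never blocks forever
                for (int i = 0; i < pair.Item3; i++)
                    if (pair.Item1[i] != pair.Item2[i]) { equal = false; break; }
            }
            reader.Wait();
            return equal;
        }
    }

The bounded capacity is what turns this into double buffering: the reader can only get one block pair ahead of the comparer before it blocks.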

If the files you're comparing are on different physical drives, then you can have two asynchronous reads going concurrently--one on each drive.
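
If the files really do sit on separate physical drives, the overlap can be sketched with two outstanding asynchronous reads, one per file. FileStream.ReadAsync needs .NET 4.5 or later (on .NET 4 the same idea works with BeginRead/EndRead); the names and block size are again just placeholders.

    using System.IO;
    using System.Threading.Tasks;

    static class DualDriveComparer
    {
        // Starts the next read on both files before awaiting either, so each
        // drive can be busy at the same time; then compares the two blocks.
        public static async Task<bool> AreEqualAsync(string pathA, string pathB, int blockSize = 256 * 1024)
        {
            if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
                return false;

            using (var a = new FileStream(pathA, FileMode.Open, FileAccess.Read, FileShare.Read, blockSize, useAsync: true))
            using (var b = new FileStream(pathB, FileMode.Open, FileAccess.Read, FileShare.Read, blockSize, useAsync: true))
            {
                var bufA = new byte[blockSize];
                var bufB = new byte[blockSize];
                while (true)
                {
                    // Kick off both reads, then wait for both.
                    Task<int> readA = a.ReadAsync(bufA, 0, blockSize);
                    Task<int> readB = b.ReadAsync(bufB, 0, blockSize);
                    int nA = await readA;
                    int nB = await readB;

                    // For local files a read only comes up short at end of file,
                    // and the lengths already matched, so nA == nB is expected here.
                    if (nA != nB) return false;
                    if (nA == 0) return true;
                    for (int i = 0; i < nA; i++)
                        if (bufA[i] != bufB[i]) return false;
                }
            }
        }
    }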

But your idea of having one thread reading from the beginning and another reading from the end will make things slower because seek time is going to kill you. The disk drive heads will continually be seeking from one end of the file to the other. Think of it this way: do you think it would be faster to read a file sequentially from the start, or would it be faster to read 64K from the front, then read 64K from the end, then seek back to the start of the file to read the next 64K, etc?

Fragmentation is an issue, to be sure, but excessive fragmentation is the exception, not the rule. Most files are going to be unfragmented, or only partially fragmented. Reading alternately from either end of the file would be like reading a file that's pathologically fragmented.

Remember, a typical disk drive can only satisfy one I/O request at a time.

Making single-sector reads will probably slow things down. In my tests of .NET I/O speed, reading 32K at a time was significantly faster (between 10 and 20 percent) than reading 4K at a time. As I recall (it's been some time since I did this), on my machine at the time, the optimum buffer size for sequential reads was 256K. That will undoubtedly differ for each machine, based on processor speed, disk controller, hard drive, and operating system version.
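
That kind of measurement is easy to repeat. A rough sketch (buffer sizes are examples; as the answer says, the absolute numbers depend on the machine, and the OS file cache needs to be cold for them to mean much):

    using System;
    using System.Diagnostics;
    using System.IO;

    static class ReadBenchmark
    {
        // Times one full sequential read of a file at several buffer sizes.
        // For meaningful numbers the file should be much larger than the OS
        // file cache, or the cache should be cleared between runs.
        public static void Run(string path)
        {
            foreach (int size in new[] { 4 * 1024, 32 * 1024, 256 * 1024, 1024 * 1024 })
            {
                var buffer = new byte[size];
                long total = 0;
                var sw = Stopwatch.StartNew();
                using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                               FileShare.Read, size, FileOptions.SequentialScan))
                {
                    int n;
                    while ((n = fs.Read(buffer, 0, size)) > 0)
                        total += n;
                }
                sw.Stop();
                double mbPerSec = total / (1024.0 * 1024.0) / sw.Elapsed.TotalSeconds;
                Console.WriteLine("{0,9:N0}-byte buffer: {1:F1} MB/s", size, mbPerSec);
            }
        }
    }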
