Many small files or one big file? (Or, Overhead of opening and closing file handles) (C++)

Problem description

I have created an application that does the following:

  1. Make some calculations and write the calculated data to a file - repeat 500,000 times (overall, 500,000 files are written one after the other) - then repeat 2 more times (overall, 1.5 million files are written).
  2. Read the data from a file and make some intense calculations with it - repeat for 1,500,000 iterations (iterating over all the files written in step 1).
  3. Repeat step 2 for 200 iterations.

Each file is ~212 KB, so overall I have ~300 GB of data. It looks like the entire process takes ~40 days on a 2.8 GHz Core 2 Duo CPU.

My problem is (as you can probably guess) the time it takes to complete the entire process. All the calculations are serial (each calculation depends on the one before it), so I can't parallelize this process across different CPUs or PCs. I'm trying to think of how to make the process more efficient, and I'm pretty sure most of the overhead goes to file system access (duh...). Every time I access a file I open a handle to it and then close it once I finish reading the data.
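
For reference, the per-file access pattern described above looks roughly like the sketch below. This is only an illustration: the ~212 KB read size comes from the numbers above, while the function name and file handling details are assumptions, not code from the original post.

#include <fstream>
#include <string>
#include <vector>

// Sketch of the current pattern: a fresh handle is opened and closed for every file.
std::vector<char> readOneFile(const std::string& path)
{
    std::vector<char> data(212 * 1024);                // each file is ~212 KB
    std::ifstream in(path, std::ios::binary);          // open a new handle...
    in.read(data.data(), static_cast<std::streamsize>(data.size()));
    data.resize(static_cast<std::size_t>(in.gcount()));
    return data;                                        // ...and it is closed here (RAII)
}

Repeated 1.5 million times per pass, each call pays for a path lookup and a pair of system calls on top of the actual read.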

One of my ideas to improve the run time is to use one big 300 GB file (or several big files of 50 GB each); then I would only need one open file handle, and I could simply seek to each relevant piece of data and read it. But I'm not sure what the overhead of opening and closing file handles is. Can someone shed some light on this?
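
A minimal sketch of that idea, assuming fixed-size records addressed by index (the record size, class name, and layout are placeholders, not part of the original question):

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Sketch: one persistent handle into a single big file, seeking to fixed-size
// records instead of opening and closing 1.5 million small files.
class BigFileReader {
public:
    explicit BigFileReader(const std::string& path) : in_(path, std::ios::binary) {}

    // Read record i, assuming every record occupies exactly kRecordSize bytes.
    std::vector<char> readRecord(std::uint64_t i)
    {
        std::vector<char> buf(kRecordSize);
        in_.seekg(static_cast<std::streamoff>(i * kRecordSize), std::ios::beg);
        in_.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        return buf;
    }

private:
    static constexpr std::uint64_t kRecordSize = 212 * 1024;  // assumed fixed record size
    std::ifstream in_;
};

If the records are not all the same size, an index of file offsets would have to be written alongside the data before seeks like this can work.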

Another idea I had was to group the files into bigger ~100 MB files; then I would read 100 MB each time instead of doing many 212 KB reads. But this is much more complicated to implement than the idea above.
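
That variant could look something like the sketch below, assuming each bundle is simply the small files concatenated back to back (the layout and function name are assumptions for illustration):

#include <fstream>
#include <string>
#include <vector>

// Sketch: read a whole ~100 MB bundle with one call, then slice the ~212 KB
// records out of it in memory instead of issuing many small reads.
std::vector<char> readBundle(const std::string& path)
{
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    const std::streamoff size = in ? static_cast<std::streamoff>(in.tellg()) : 0;
    std::vector<char> bundle(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    in.read(bundle.data(), static_cast<std::streamsize>(bundle.size()));
    return bundle;   // records are then walked in memory by the caller
}

The caller still needs an index mapping each logical file to its offset inside the bundle, which is where most of the extra implementation complexity mentioned above would go.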

Anyway, if anyone can give me some advice on this or has any idea how to improve the run time, I would appreciate it!

Thanks.

Profiler update:

I ran a profiler on the process; it looks like the calculations take 62% of the runtime and the file reads take 34%. That means even if I miraculously cut file I/O costs by a factor of 34, I'm still left with 24 days, which is quite an improvement, but still a long time :)
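
(To put rough numbers on that: of the ~40 days, about 40 × 0.62 ≈ 25 days are computation and 40 × 0.34 ≈ 13.6 days are file reads, so even shrinking the read time to nearly nothing still leaves roughly the 24-25 days of computation quoted above.)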

Recommended answer

Opening a file handle probably isn't the bottleneck; the actual disk I/O is. If you can parallelize disk access (e.g. by using multiple disks, faster disks, a RAM disk, ...) you may benefit far more. Also, make sure the I/O doesn't block the application: read from disk, and process while waiting for the I/O, e.g. with a reader thread and a processor thread.
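
A minimal sketch of that reader/processor split, assuming the files can be prefetched in the same order they are consumed; the queue depth, file names, and the process() stub are placeholders, not code from the answer:

#include <condition_variable>
#include <deque>
#include <fstream>
#include <functional>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

using Blob = std::vector<char>;

// Bounded queue shared by the reader (producer) and the processor (consumer).
std::deque<Blob> g_queue;
std::mutex g_mutex;
std::condition_variable g_cv;
bool g_done = false;
constexpr std::size_t kMaxQueued = 16;   // assumed prefetch depth, bounds memory use

Blob readFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    const std::streamoff size = in ? static_cast<std::streamoff>(in.tellg()) : 0;
    Blob data(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    in.read(data.data(), static_cast<std::streamsize>(data.size()));
    return data;
}

void readerThread(const std::vector<std::string>& paths)
{
    for (const auto& p : paths) {
        Blob b = readFile(p);                                  // disk I/O happens here
        std::unique_lock<std::mutex> lk(g_mutex);
        g_cv.wait(lk, [] { return g_queue.size() < kMaxQueued; });
        g_queue.push_back(std::move(b));
        g_cv.notify_all();
    }
    std::lock_guard<std::mutex> lk(g_mutex);
    g_done = true;
    g_cv.notify_all();
}

// Placeholder for the intense per-file computation from the question.
void process(const Blob& data)
{
    std::cout << "processed " << data.size() << " bytes\n";
}

void processorThread()
{
    for (;;) {
        Blob b;
        {
            std::unique_lock<std::mutex> lk(g_mutex);
            g_cv.wait(lk, [] { return !g_queue.empty() || g_done; });
            if (g_queue.empty()) return;                       // reader finished, queue drained
            b = std::move(g_queue.front());
            g_queue.pop_front();
        }
        g_cv.notify_all();                                     // let the reader refill the queue
        process(b);                                            // compute while the next read is in flight
    }
}

int main()
{
    std::vector<std::string> paths = { "calc_000001.dat", "calc_000002.dat" };  // placeholder names
    std::thread reader(readerThread, std::cref(paths));
    std::thread processor(processorThread);
    reader.join();
    processor.join();
}

On a single spinning disk the reads are still serialized, so the gain comes mainly from overlapping the 34% read time with the 62% compute time rather than from extra read bandwidth.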

Another thing: if the next step depends on the current calculation, why go through the effort of saving it to disk at all? Maybe with another look at the process's dependencies you can rework the data flow and get rid of a lot of the I/O.

Oh, and yes: measure :)
