Opening many small files on NTFS is way too slow


Question

I am writing a program that should process many small files, say thousands or even millions. I've been testing that part on 500k files, and the first step was just to iterate a directory tree containing around 45k directories (including subdirs of subdirs, etc.) and 500k small files. Traversing all directories and files, including getting file sizes and calculating the total size, takes about 6 seconds. Now, if I try to open each file while traversing and close it immediately, it looks like it never stops. In fact, it takes way too long (hours...). Since I am doing this on Windows, I tried opening the files with CreateFileW, _wfopen and _wopen. I didn't read or write anything to the files, although in the final implementation I will only need to read. However, I didn't see a noticeable improvement in any of the attempts.
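
For reference, this is roughly the kind of loop described above: a recursive Win32 traversal that accumulates sizes from the find data and then opens and closes each file. It is only a sketch of the pattern, not code from the question; names like Traverse and TotalSize are illustrative.

```cpp
#include <windows.h>
#include <cstdint>
#include <string>

static uint64_t TotalSize = 0;  // running total of file sizes, illustrative

static void Traverse(const std::wstring& dir)
{
    WIN32_FIND_DATAW fd;
    HANDLE hFind = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (hFind == INVALID_HANDLE_VALUE)
        return;

    do {
        const std::wstring name = fd.cFileName;
        if (name == L"." || name == L"..")
            continue;

        const std::wstring path = dir + L"\\" + name;

        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            Traverse(path);  // recurse into the subdirectory
        } else {
            // The size is already in the find data, which is why the
            // traversal alone finishes in seconds.
            TotalSize += (uint64_t(fd.nFileSizeHigh) << 32) | fd.nFileSizeLow;

            // Opening every file is the expensive part: each CreateFileW
            // has to locate and read that file's MFT record.
            HANDLE h = CreateFileW(path.c_str(), GENERIC_READ, FILE_SHARE_READ,
                                   nullptr, OPEN_EXISTING,
                                   FILE_ATTRIBUTE_NORMAL, nullptr);
            if (h != INVALID_HANDLE_VALUE)
                CloseHandle(h);
        }
    } while (FindNextFileW(hFind, &fd));

    FindClose(hFind);
}
```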

I wonder if there is a more efficient way to open the files with any of the available functions, whether in C, C++, or the Windows API, or whether the only more efficient way is to read the MFT and disk blocks directly, which I am trying to avoid.

Update: The application I am working on takes backup snapshots with versioning, so it also does incremental backups. The test with 500k files was done on a huge source-code repository in order to do versioning, something like an SCM. So, the files are not all in one directory; there are around 45k directories as well (mentioned above).

So, the proposed solution of zipping the files doesn't help, because the backup is exactly when all of the files are accessed. Hence, I would see no benefit from it, and it would even incur some performance cost.

Answer

What you are trying to do is intrinsically difficult for any operating system to do efficiently. 45,000 subdirectories requires a lot of disk access no matter how it is sliced.

Any file over about 1,000 bytes is "big" as far as NTFS is concerned. If there were a way to make most data files less than about 900 bytes, you could realize a major efficiency by having the file data stored inside the MFT. Then it would be no more expensive to obtain the data than it is to obtain the file's timestamps or size.
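
The exact resident-data threshold depends on the MFT record size and on how much space the file's other attributes consume, so the ~900 byte figure is approximate. If the data does end up resident, a pattern like the sketch below (one open, one ReadFile for the whole file; the ReadSmallFile name and the 1 MB cap are made up for illustration) retrieves the contents for roughly the same cost as reading the file's metadata:

```cpp
#include <windows.h>
#include <vector>

// Read an entire small file with a single ReadFile call.
static bool ReadSmallFile(const wchar_t* path, std::vector<char>& out)
{
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING,
                           FILE_FLAG_SEQUENTIAL_SCAN,  // hint: read once, front to back
                           nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return false;

    LARGE_INTEGER size;
    if (!GetFileSizeEx(h, &size) || size.QuadPart > (1 << 20)) {  // skip files over 1 MB (illustrative cap)
        CloseHandle(h);
        return false;
    }

    out.resize(static_cast<size_t>(size.QuadPart));
    DWORD read = 0;
    bool ok = out.empty() ||
              ReadFile(h, out.data(), static_cast<DWORD>(out.size()), &read, nullptr) != 0;
    CloseHandle(h);
    return ok && read == out.size();
}
```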

I doubt there is any way to optimize the program's parameters, process options, or even the operating system's tuning parameters to make this application work well. You are faced with a multi-hour operation unless you can rearchitect it in a radically different way.

One strategy would be to distribute the files across multiple computers (perhaps thousands of them) and have a sub-application on each machine process the local files, feeding whatever results to a master application.

Another strategy would be to re-architect all the files into a few larger files, like the big .zip files suggested by @felicepollano, effectively virtualizing your set of files. Random access to a 4000 GB file is an inherently far more efficient use of resources than accessing 4 million 1 MB files. Moving all the data into a suitable database manager (MySQL, SQL Server, etc.) would also accomplish this, and it might provide other benefits such as easy searches and an easy archival strategy.
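
As a rough illustration of that virtualization idea (assuming a simple custom container format rather than an actual .zip or a database; PackWriter, PackEntry, and ReadFromPack are hypothetical names, not an existing API), the point is that once the small files live inside one big file, each access becomes a seek and a read on a handle that is opened only once:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

struct PackEntry { uint64_t offset; uint64_t size; };  // where a file's bytes live in the container

class PackWriter {
public:
    explicit PackWriter(const std::string& path)
        : out_(path, std::ios::binary), offset_(0) {}

    // Append one small file's contents and record its location in the index.
    void Add(const std::string& name, const std::vector<char>& data) {
        out_.write(data.data(), static_cast<std::streamsize>(data.size()));
        index_[name] = PackEntry{offset_, data.size()};
        offset_ += data.size();
    }

    const std::unordered_map<std::string, PackEntry>& Index() const { return index_; }

private:
    std::ofstream out_;
    uint64_t offset_;
    std::unordered_map<std::string, PackEntry> index_;
};

// Reading a file back is one seek plus one read inside an already-open
// container, so the per-file NTFS open cost is paid only once overall.
static std::vector<char> ReadFromPack(std::ifstream& pack, const PackEntry& e) {
    std::vector<char> buf(static_cast<size_t>(e.size));
    pack.seekg(static_cast<std::streamoff>(e.offset));
    pack.read(buf.data(), static_cast<std::streamsize>(e.size));
    return buf;
}
```

In practice the index would also need to be persisted alongside the data, which is exactly what a .zip central directory or a database table provides for you.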
