Quicker (quickest?) way to get number of files in a directory with over 200,000 files

Question

I have some directories containing test data, typically over 200,000 small (~4k) files per directory.

I am using the following C# code to get the number of files in a directory:

int fileCount = System.IO.Directory.GetFiles(@"C:\SomeDirectory").Length;

This is very, very slow however - are there any alternatives that I can use?

Each folder contains data for one day, and we will have around 18 months of directories (~550 directories). I am also very interested in performance enhancements people have found by reworking flat directory structures to more nested ones.

Answer

Not using the System.IO.Directory namespace, there isn't. You'll have to find a way of querying the directory that doesn't involve creating a massive list of files.

This seems like a bit of an oversight from Microsoft; the Win32 APIs have always had functions that could count files in a directory.
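
The Win32 route means calling FindFirstFile / FindNextFile yourself via P/Invoke and counting entries as you walk the directory, so nothing is ever collected into a big array of path strings. The sketch below is only an illustration of that idea; the NativeFileCounter / CountFiles names are made up for the example, and the declarations are the usual kernel32 signatures:

using System;
using System.Runtime.InteropServices;

static class NativeFileCounter
{
    private const int MAX_PATH = 260;

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    private struct WIN32_FIND_DATA
    {
        public uint dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = MAX_PATH)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    private static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    private static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FindClose(IntPtr hFindFile);

    private static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
    private const uint FILE_ATTRIBUTE_DIRECTORY = 0x10;

    // Counts the files (not subdirectories) in a directory by walking it one
    // entry at a time, never building a list of file names in memory.
    public static int CountFiles(string directory)
    {
        int count = 0;
        WIN32_FIND_DATA findData;
        IntPtr handle = FindFirstFile(System.IO.Path.Combine(directory, "*"), out findData);
        if (handle == INVALID_HANDLE_VALUE)
            return 0; // directory missing, empty or inaccessible

        try
        {
            do
            {
                // "." and ".." carry the directory attribute, so this check
                // skips them along with any real subdirectories.
                if ((findData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) == 0)
                    count++;
            }
            while (FindNextFile(handle, out findData));
        }
        finally
        {
            FindClose(handle);
        }
        return count;
    }
}

Calling it would then look like: int fileCount = NativeFileCounter.CountFiles(@"C:\SomeDirectory"); It still has to touch every entry, so it is O(number of files), but it never allocates the 200,000-element array of full path strings that GetFiles builds.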

You may also want to consider splitting up your directory. How you manage a 200,000-file directory is beyond me :-)

Update:

John Saunders raises a good point in the comments. We already know that (general purpose) file systems are not well equipped to handle this level of storage. One thing that is equipped to handle huge numbers of small "files" is a database.

If you can identify a key for each (containing, for example, date, hour and customer number), these files should be injected into a database. The 4K record size and 108 million rows (200,000 rows/day * 30 days/month * 18 months) should be easily handled by most professional databases. I know that DB2/z would chew on that for breakfast.

Then, when you need some test data extracted to files, you have a script/program which just extracts the relevant records onto the file system. Then run your tests to successful completion and delete the files.
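
For illustration only, here is a minimal sketch of what such an extraction script could look like in C#, assuming a hypothetical test_files table with directory_name, file_name and payload columns and a SQL Server connection through ADO.NET (the table layout and all the names are mine, not part of the original setup):

using System.Data.SqlClient;
using System.IO;

static class TestDataExtractor
{
    // Pulls every record tagged with the given "directory" key out of the
    // database and writes each payload back to disk as an individual file.
    public static void ExtractDirectory(string connectionString, string directoryName, string targetPath)
    {
        Directory.CreateDirectory(targetPath);

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "select file_name, payload from test_files where directory_name = @dir", connection))
        {
            command.Parameters.AddWithValue("@dir", directoryName);
            connection.Open();

            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    string fileName = reader.GetString(0);
                    byte[] payload = (byte[])reader[1];   // the ~4k blob stored per row
                    File.WriteAllBytes(Path.Combine(targetPath, fileName), payload);
                }
            }
        }
    }
}

A test run then boils down to: extract the relevant day's records into an empty folder, run the tests, delete the folder.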

That should make your specific problem quite easy to solve:

select count(*) from test_files where directory_name = '/SomeDirectory'

assuming you have an index on directory_name, of course.
