NTFS directory has 100K entries. How much performance boost if spread over 100 subdirectories?



Context

We have a homegrown filesystem-backed caching library. We currently have performance problems with one installation due to a large number of entries (e.g. up to 100,000). The problem: we store all fs entries in one "cache directory". Very large directories perform poorly.

We're looking at spreading those entries over subdirectories, as git does: e.g. 100 subdirectories with ~1,000 entries each.
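A minimal sketch of the git-style layout being considered (helper and class names here are illustrative, and it assumes cache keys are strings that can be hashed):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class CacheSharding {
    // Map a cache key to a subdirectory named after the first two hex
    // digits of its hash, like git's objects/ab/cdef... layout.
    // Two hex digits give up to 256 subdirectories.
    static Path shardedPath(Path cacheRoot, String key) {
        String hex = String.format("%08x", key.hashCode());
        return cacheRoot.resolve(hex.substring(0, 2)).resolve(hex);
    }

    public static void main(String[] args) {
        System.out.println(shardedPath(Paths.get("cache"), "some-entry"));
    }
}
```

Hashing keeps the entries spread evenly across shards regardless of how the cache keys are named.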

The question

I understand that smaller directory sizes will help with filesystem access.

But will "spreading into subdirectories" speed up traversing all entries, e.g. enumerating/reading all 100,000 entries? I.e. when we initialize/warm the cache from the FS store, we need to traverse all 100,000 entries (and delete old entries), which can take 10+ minutes.
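For reference, a warm-up pass of this kind boils down to a recursive walk over the cache root. A sketch (using `java.nio.file`, which assumes Java 8+, newer than the Server 2003-era stack described; class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class CacheWarmup {
    // Count every regular file under the cache root, whether the layout
    // is flat or sharded into subdirectories. A real warm-up would also
    // read per-entry metadata, which dominates the cost.
    static long countEntries(Path root) throws IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return s.filter(Files::isRegularFile).count();
        }
    }
}
```

The walk visits the same 100,000 files either way; sharding only changes how many directory nodes are traversed along the way.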

Will "spreading the data" decrease this "traversal time"? Additionally, this "traversal" actually can/does delete stale entries (e.g. older than N days). Will "spreading the data" improve delete times?
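The stale-entry pass described above might look like this for a flat cache directory (a sketch with an assumed N-days cutoff; `DirectoryStream` is Java 7+, and a real cache library would need error handling and retries):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class StaleSweep {
    // Delete cache files older than maxAgeDays; returns the number deleted.
    static int sweep(Path cacheDir, long maxAgeDays) throws IOException {
        Instant cutoff = Instant.now().minus(maxAgeDays, ChronoUnit.DAYS);
        int deleted = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(cacheDir)) {
            for (Path entry : entries) {
                FileTime mtime = Files.getLastModifiedTime(entry);
                if (mtime.toInstant().isBefore(cutoff)) {
                    Files.deleteIfExists(entry);
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

Each deletion updates the containing directory's index, which is where a very large single directory could, in principle, cost more per operation.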

Additional context:

- NTFS
- Windows family OS (Server 2003, 2008)
- Java/J2EE application

I/we would appreciate any schooling on filesystem scalability issues.

Thanks in advance.

will

p.s. I should comment that I have the tools and ability to test this myself, but figured I'd pick the hive mind for the theory and experience first.

Solution

I also believed that spreading files across subdirectories would speed up operations.

So I conducted a test: I generated files named AAAA through ZZZZ (26^4 files, about 450K) and placed them in one NTFS directory. I also placed the identical files into subdirectories AA through ZZ (i.e. grouped the files by the first two letters of their names). Then I performed some tests: enumeration and random access. I rebooted the system after creation and between tests.
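That test setup can be reproduced along these lines (a sketch; the alphabet is parameterized so you can dry-run with two letters before committing to the full 26^4 files):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TestFileGen {
    // Create one empty file per 4-letter name over the given alphabet,
    // either flat under root, or grouped into subdirectories named by
    // the first two letters. Returns the number of files created.
    static int generate(Path root, String alphabet, boolean grouped) throws IOException {
        int count = 0;
        for (char a : alphabet.toCharArray())
            for (char b : alphabet.toCharArray())
                for (char c : alphabet.toCharArray())
                    for (char d : alphabet.toCharArray()) {
                        String name = "" + a + b + c + d;
                        Path dir = grouped ? root.resolve(name.substring(0, 2)) : root;
                        Files.createDirectories(dir);
                        Files.createFile(dir.resolve(name));
                        count++;
                    }
        return count;
    }
}
```

With `alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"` this creates the 26^4 = 456,976 files the answer describes; with a two-letter alphabet it creates 16, enough to sanity-check the layout.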

Flat structure exposed slightly better performance than subdirectories. I believe this is because the directories are cached and NTFS indexes directory contents, so lookup is fast.

Note that full enumeration (in both cases) took about 3 minutes for 400K files. This is significant time, but subdirectories made it even worse.

Conclusion: on NTFS in particular, it makes no sense to group files into subdirectories if any of the files may be accessed. If you have a cache, I would also test grouping the files by date or by domain, on the assumption that some files are accessed more frequently than others and the OS then doesn't need to keep all the directories in memory. However, for your number of files (under 100K) this probably wouldn't provide significant benefits either. You need to measure such specific scenarios yourself, I think.

Update: I've reduced my random-access test to access only half of the files (AA through OO). The assumption was that this would involve the one flat directory but only half of the subdirectories (giving a bonus to the subdirectory case). The flat directory still performed better. So I assume that unless you have millions of files, keeping them in one flat directory on NTFS will be faster than grouping them into subdirectories.
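The random-access part of the comparison can be reproduced with a simple probe like this (a sketch; opening each file and reading its size stands in for a cache lookup, and the fixed seed keeps runs repeatable):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Random;

public class RandomAccessProbe {
    // Touch a random sample of the given files by reading their sizes,
    // as a stand-in for random cache lookups. Returns the summed sizes
    // so the work cannot be optimized away.
    static long probe(List<Path> files, int samples, long seed) throws IOException {
        Random rnd = new Random(seed);
        long total = 0;
        for (int i = 0; i < samples; i++) {
            Path p = files.get(rnd.nextInt(files.size()));
            total += Files.size(p);
        }
        return total;
    }
}
```

Run it once against the flat layout's file list and once against the sharded layout's, timing each run, to compare lookup cost between the two structures.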
