Iterate over a very large number of files in a folder


Question

What is the fastest way to iterate over all files in a directory using NTFS and Windows 7, when the file count in the directory is bigger than 2,500,000? All files are flat under the top-level directory.

Currently I use:

for root, subFolders, files in os.walk(rootdir):
    for file in files:
        f = os.path.join(root,file)
        with open(f) as cf:
            [...]

but it is very, very slow. The process has been running for about an hour, still has not processed a single file, and its memory usage keeps growing by about 2 kB per second.

Answer

By default, os.walk walks the directory tree bottom-up. If you have a deep tree with many leaves, I guess this could lead to a performance penalty -- or at least to an increased "startup" time, since walk has to read lots of data before processing the first file.

All of this being speculative: have you tried forcing a top-down exploration?

for root, subFolders, files in os.walk(rootdir, topdown=True):
    ...
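To see what the top-down order buys you, here is a minimal self-contained sketch (the temporary directory stands in for rootdir, and the processing body is hypothetical). With topdown=True, os.walk yields each directory before descending into its subdirectories, so the top-level files are available on the very first iteration:

```python
import os
import tempfile

# Build a small sample tree as a stand-in for the real rootdir.
rootdir = tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "c.txt"):
    open(os.path.join(rootdir, name), "w").close()

seen = []
# topdown=True yields (root, dirs, files) for a directory before
# recursing into its subdirectories.
for root, subFolders, files in os.walk(rootdir, topdown=True):
    for file in sorted(files):
        seen.append(os.path.join(root, file))

print(seen)
```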

---

As the files appear to be in a flat directory, maybe glob.iglob could lead to better performance by returning an iterator (whereas other methods like os.walk, os.listdir or glob.glob first build the list of all files). Could you try something like this:

import glob
import os

# ...
for infile in glob.iglob(os.path.join(rootdir, '*.*')):
    # ...
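The difference the answer describes can be demonstrated directly: glob.glob materializes the complete list of matches before returning, while glob.iglob hands back a lazy iterator, so you can start consuming filenames without holding the whole list. A minimal sketch (the temporary directory is a hypothetical stand-in for rootdir):

```python
import glob
import os
import tempfile

# Sample flat directory standing in for the real rootdir.
rootdir = tempfile.mkdtemp()
for i in range(5):
    open(os.path.join(rootdir, "file%d.dat" % i), "w").close()

# glob.glob builds the full list of matches up front.
all_files = glob.glob(os.path.join(rootdir, "*.*"))

# glob.iglob returns an iterator instead; matches are produced
# one at a time as you consume them.
it = glob.iglob(os.path.join(rootdir, "*.*"))
first = next(it)

print(len(all_files), first)
```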

