os.walk very slow, any way to optimise?

Question

I am using os.walk to build a map of a data store (the map is used later in the tool I am building).

This is the code I currently use:

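import os

def find_children(tickstore):
    # Collect the path of every directory under tickstore, including tickstore itself.
    children = []
    for dirpath, dirnames, filenames in os.walk(tickstore):
        children.append(dirpath)
    return children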

I did a little analysis on it:

dir_list = os.walk(tickstore) runs instantly; if I do nothing with dir_list, the function completes instantly.

It is iterating over dir_list that takes a long time. Even if I don't append anything, just iterating over it is what takes the time.
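
This observation matches how os.walk works: calling it only creates a generator object, and the directories are read as the loop advances. A minimal sketch on a throwaway temporary tree (the directory names are made up for illustration):

```python
import os
import tempfile
import types

# Build a tiny throwaway tree: root/a/b and root/c (names are illustrative only).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
os.makedirs(os.path.join(root, "c"))

walker = os.walk(root)  # returns immediately; no directory has been read yet
is_lazy = isinstance(walker, types.GeneratorType)

# The filesystem work happens here, one directory per iteration.
visited = [dirpath for dirpath, dirnames, filenames in walker]
print(is_lazy, len(visited))
```

So the cost being measured is the directory traversal itself, not the list appends.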

Tickstore is a big data store, with ~10,000 directories.

Currently it takes approximately 35 minutes to complete this function.

Is there any way to speed it up?

I've looked at alternatives to os.walk, but none of them seemed to provide much of an advantage in terms of speed.

Recommended answer

Yes: use Python 3.5 (which is still currently an RC, but should be out momentarily). In Python 3.5, os.walk was rewritten to be more efficient.

This work was done as part of PEP 471.

Extracted from the PEP:

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.

But the underlying system calls -- FindFirstFile / FindNextFile on Windows and readdir on POSIX systems -- already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.
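
PEP 471 exposes this directory-entry information through os.scandir(). As a minimal sketch, assuming Python 3.6 or later (for the with form), a directory-only version of find_children could look like the following. The name find_children_scandir is hypothetical and not from the question or the PEP, and symlinked directories are deliberately not followed:

```python
import os

def find_children_scandir(tickstore):
    """Return tickstore and every directory below it."""
    children = [tickstore]
    stack = [tickstore]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    # is_dir() normally reuses the entry type reported by the
                    # directory listing, so it usually needs no extra stat() call.
                    if entry.is_dir(follow_symlinks=False):
                        children.append(entry.path)
                        stack.append(entry.path)
        except OSError:
            # Skip unreadable directories (os.walk also ignores these by default).
            continue
    return children
```

Since os.walk on 3.5+ is itself built on os.scandir, this may not be much faster than the original loop there; it mainly shows where the saved system calls come from.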

In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it's often much better than this.)

In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we're not talking about micro-optimizations. See more benchmarks here.
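
The actual gain will depend on the filesystem and the shape of the tree, so it is worth measuring on the real data store. A minimal timing sketch, written to run unchanged on both older and newer interpreters; the helper name time_walk is hypothetical:

```python
import os
from timeit import default_timer

def time_walk(top):
    """Walk top once and return (directory_count, elapsed_seconds)."""
    start = default_timer()
    count = sum(1 for _ in os.walk(top))
    elapsed = default_timer() - start
    return count, elapsed

if __name__ == "__main__":
    count, elapsed = time_walk(".")
    print("{} directories in {:.2f} s".format(count, elapsed))
```

Running it against the same tree under each Python version gives a like-for-like comparison. Repeat runs may be faster simply because the operating system has cached the directory data.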
