os.walk very slow, any way to optimise?
Question
I am using os.walk to build a map of a data-store (this map is used later in the tool I am building).
This is the code I currently use:
    import os

    def find_children(tickstore):
        children = []
        dir_list = os.walk(tickstore)
        for i in dir_list:
            # i is a (dirpath, dirnames, filenames) tuple; keep only dirpath
            children.append(i[0])
        return children
I did some profiling on this:
dir_list = os.walk(tickstore) runs instantly; if I do nothing with dir_list, the function completes instantly.
It is iterating over dir_list that takes a long time. Even if I don't append anything, just iterating over it is what takes the time.
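This observation is consistent with os.walk returning a generator: the call itself only creates the generator object, and directories are read as it is iterated. A minimal sketch for measuring the two phases separately follows; the helper name time_walk is hypothetical and not part of the original post.

```python
import os
import time


def time_walk(root):
    """Time creating the os.walk generator separately from iterating it."""
    start = time.perf_counter()
    walker = os.walk(root)          # only builds a generator; nothing is read yet
    created = time.perf_counter() - start

    start = time.perf_counter()
    count = sum(1 for _ in walker)  # the directory tree is actually traversed here
    iterated = time.perf_counter() - start

    # count is the number of directories visited, including root itself
    return created, iterated, count
```

Calling time_walk(tickstore) returns the creation time, the iteration time, and the number of directories visited. On a large tree, nearly all of the elapsed time should land in the second value.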
Tickstore is a big datastore, with ~10,000 directories.
Currently it takes approximately 35 minutes to complete this function.
Is there any way to speed it up?
I've looked at alternatives to os.walk, but none of them seemed to provide much of an advantage in terms of speed.
Recommended answer
Yes: use Python 3.5 (which is currently still a release candidate, but should be out momentarily). In Python 3.5, os.walk was rewritten to be more efficient.
This work was done as part of PEP 471.
Extracted from the PEP:
Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.
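The pattern this paragraph describes can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual pre-3.5 os.walk source: the function name find_children_listdir is hypothetical, and it collects only directory paths, as the code in the question does.

```python
import os


def find_children_listdir(root):
    """Collect every directory under root with os.listdir plus a per-entry check.

    Simplified illustration of the listdir-then-stat pattern; not the real
    pre-3.5 os.walk implementation.
    """
    children = [root]
    stack = [root]
    while stack:
        current = stack.pop()
        for name in os.listdir(current):      # one directory read per directory
            path = os.path.join(current, name)
            # os.path.isdir() calls os.stat(), so every entry costs an extra
            # system call; for directories, islink() adds an lstat() on top
            if os.path.isdir(path) and not os.path.islink(path):
                children.append(path)
                stack.append(path)
    return children
```

Every file and directory in the tree goes through the isdir() check, which is where the additional per-entry system calls come from.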
But the underlying system calls -- FindFirstFile / FindNextFile on Windows and readdir on POSIX systems -- already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.
In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it's often much better than this.)
In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we're not talking about micro-optimizations. See more benchmarks here.
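PEP 471 also added os.scandir(), which exposes the directory-entry information directly. For a directory-only listing like the one in the question, a minimal sketch built on it might look like this. The function name find_children_scandir is hypothetical, and the with form of os.scandir assumes Python 3.6 or later.

```python
import os


def find_children_scandir(root):
    """Collect every directory under root using os.scandir (PEP 471)."""
    children = [root]
    stack = [root]
    while stack:
        current = stack.pop()
        # Using scandir as a context manager requires Python 3.6+
        with os.scandir(current) as entries:
            for entry in entries:
                # is_dir() normally reuses the type information returned by
                # the directory read, so no separate stat() call is needed;
                # on filesystems that do not report it, Python falls back to stat
                if entry.is_dir(follow_symlinks=False):
                    children.append(entry.path)
                    stack.append(entry.path)
    return children
```

On Python 3.5 and later, os.walk is itself built on os.scandir, so the original find_children benefits from the upgrade without any code change. Whether either approach brings the 35 minutes down substantially depends on the storage behind the datastore, which the source does not describe.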