os.walk very slow, any way to optimise?
Question
I am using os.walk to build a map of a data-store (this map is used later in the tool I am building).
This is the code I currently use:
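import os

def find_children(tickstore):
    children = []
    # os.walk returns a lazy generator; nothing is read from disk yet.
    dir_list = os.walk(tickstore)
    for i in dir_list:
        children.append(i[0])  # i[0] is the path of the directory being visited
    return children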
I have done some analysis on it:
dir_list = os.walk(tickstore) runs instantly; if I do nothing with dir_list, the function completes instantly.
It is iterating over dir_list that takes a long time. Even if I don't append anything, just iterating over it is what takes the time.
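This is consistent with os.walk returning a generator: the call itself only creates the generator object, and the directory reads happen during iteration. A minimal sketch that makes this visible on a small throwaway tree (the make_tree helper and the variable names are illustrative and not part of the original code):

```python
import os
import tempfile
import time
import types


def make_tree(root, n_dirs=50):
    """Hypothetical helper: create n_dirs empty subdirectories under root."""
    for i in range(n_dirs):
        os.mkdir(os.path.join(root, "dir_%03d" % i))


with tempfile.TemporaryDirectory() as root:
    make_tree(root)

    start = time.perf_counter()
    dir_list = os.walk(root)  # only creates a generator object
    create_time = time.perf_counter() - start

    start = time.perf_counter()
    # The filesystem is only touched here, while the generator is consumed.
    visited = [dirpath for dirpath, dirnames, filenames in dir_list]
    iterate_time = time.perf_counter() - start

    is_lazy = isinstance(dir_list, types.GeneratorType)
    print("create: %.6fs, iterate: %.6fs" % (create_time, iterate_time))
    print("directories visited:", len(visited))
```

On a real data store the iteration time is dominated by filesystem access, so timing the loop rather than the os.walk call is what shows where the 35 minutes go.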
Tickstore is a big datastore, with ~10,000 directories.
Currently it takes approximately 35 minutes to complete this function.
Is there any way to speed it up?
I've looked at alternatives to os.walk, but none of them seemed to provide much of an advantage in terms of speed.
Recommended answer
Yes: use Python 3.5 (which is still currently an RC, but should be out momentarily). In Python 3.5, os.walk was rewritten to be more efficient.
This work was done as part of PEP 471.
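PEP 471 also added os.scandir, which os.walk uses internally from Python 3.5 onward. If only the directory paths are needed, as in the question, a direct scandir-based traversal is one option. The following is a rough sketch, assuming Python 3.6 or later (for the with-statement support on os.scandir); find_children_scandir is a hypothetical name, and the order of the results differs from os.walk:

```python
import os
import tempfile


def find_children_scandir(tickstore):
    """Return tickstore and every directory below it, using os.scandir."""
    children = [tickstore]
    stack = [tickstore]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # is_dir() normally relies on the type information returned
                # with the directory listing, so it usually needs no extra
                # stat() call. Symlinked directories are not followed, which
                # matches the default behaviour of os.walk.
                if entry.is_dir(follow_symlinks=False):
                    children.append(entry.path)
                    stack.append(entry.path)
    return children


# Small self-check on a temporary tree: both approaches find the same paths.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    os.mkdir(os.path.join(root, "c"))
    with open(os.path.join(root, "a", "data.txt"), "w") as fh:
        fh.write("x")

    via_scandir = sorted(find_children_scandir(root))
    via_walk = sorted(dirpath for dirpath, _, _ in os.walk(root))
    print(via_scandir == via_walk)
```

Whether this is noticeably faster than os.walk on 3.5+ depends on the filesystem and on how many files each directory holds, so it is worth measuring on the actual data store before adopting it.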
Extracted from the PEP:
Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.
But the underlying system calls -- FindFirstFile / FindNextFile on Windows and readdir on POSIX systems -- already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.
In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it's often much better than this.)
In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we're not talking about micro-optimizations. See more benchmarks here.
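To make the extra per-entry call described in the PEP concrete, here is a rough sketch of the listdir-plus-stat pattern. It illustrates the approach only and is not the actual standard-library source; list_subdirs_old_style is a hypothetical name:

```python
import os
import tempfile


def list_subdirs_old_style(top):
    """Return subdirectories of top using one listdir plus one stat per entry."""
    subdirs = []
    for name in os.listdir(top):      # one call to list the directory
        path = os.path.join(top, name)
        if os.path.isdir(path):       # one additional stat() for every entry
            subdirs.append(path)
    return subdirs


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "a"))
    os.mkdir(os.path.join(root, "b"))
    with open(os.path.join(root, "file.txt"), "w") as fh:
        fh.write("x")

    found = sorted(os.path.basename(p) for p in list_subdirs_old_style(root))
    print(found)
```

Every entry, file or directory, costs one os.path.isdir check on top of the listing, which is where the "approximately 2N" figure in the quote comes from.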