os.walk very slow, any way to optimise?

Problem description

I am using os.walk to build a map of a data-store (this map is used later in the tool I am building)

This is the code I currently use:

import os

def find_children(tickstore):
    children = []
    dir_list = os.walk(tickstore)
    for i in dir_list:
        # i is a (dirpath, dirnames, filenames) tuple; keep only the directory path
        children.append(i[0])
    return children

I have done some analysis on it:

dir_list = os.walk(tickstore) runs instantly; if I do nothing with dir_list, the function completes instantly.

It is iterating over dir_list that takes a long time. Even if I don't append anything, the iteration alone is what takes the time.
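
This observation is consistent with os.walk returning a generator: the call itself reads nothing, and the directory listings happen only as the loop advances. A minimal sketch of that behaviour, using a small temporary tree as a stand-in for tickstore (the layout and the helper name build_sample_tree are made up for illustration):

```python
import os
import tempfile
import types

def build_sample_tree(root):
    # Hypothetical layout for illustration: root/a/b and root/c
    os.makedirs(os.path.join(root, "a", "b"))
    os.makedirs(os.path.join(root, "c"))

with tempfile.TemporaryDirectory() as root:
    build_sample_tree(root)
    walker = os.walk(root)  # returns immediately; no directory has been read yet
    is_lazy = isinstance(walker, types.GeneratorType)
    # The file system is only touched here, one directory per iteration step.
    visited = [dirpath for dirpath, dirnames, filenames in walker]
    relative = sorted(os.path.relpath(p, root) for p in visited)
```

So timing the os.walk(tickstore) line on its own says nothing about the cost of the walk; all of the work is deferred to the for loop.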

Tickstore is a big datastore, with ~10,000 directories.

Currently it takes approx 35 minutes to complete this function.

Is there any way to speed it up?

I've looked at alternatives to os.walk but none of them seemed to provide much of an advantage in terms of speed.

Recommended answer

Yes: use Python 3.5 (which is still currently an RC, but should be out momentarily). In Python 3.5, os.walk was rewritten to be more efficient.

This work was done as part of PEP 471.

Extracted from the PEP:

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.
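
As a rough illustration of the pattern the PEP describes (a simplified sketch, not the actual standard-library source; the function name split_entries_old_style is hypothetical), classifying the entries of one directory with os.listdir costs one extra stat-style call per entry:

```python
import os

def split_entries_old_style(top):
    """Split the entries of `top` into directory names and other names."""
    dirs, nondirs = [], []
    for name in os.listdir(top):  # one call to list the directory
        # os.path.isdir() performs a stat() (or the Windows equivalent) for every entry
        if os.path.isdir(os.path.join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)
    return dirs, nondirs
```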

But the underlying system calls -- FindFirstFile / FindNextFile on Windows and readdir on POSIX systems -- already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.

In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it's often much better than this.)

In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we're not talking about micro-optimizations. See more benchmarks here.
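
To make the idea concrete, here is a sketch of the question's find_children written directly on top of os.scandir, the function PEP 471 added in Python 3.5. The name find_children_scandir is hypothetical, and the with-statement form of os.scandir assumes Python 3.6 or later. Whether this helps on a particular datastore is not guaranteed: on network file systems the time may be dominated by latency rather than by the extra stat calls.

```python
import os

def find_children_scandir(tickstore):
    """Return tickstore and every directory below it (symlinked dirs are not followed)."""
    children = []
    stack = [tickstore]
    while stack:
        current = stack.pop()
        children.append(current)
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    # The entry type usually comes from the directory listing itself,
                    # so is_dir() normally needs no additional system call.
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError:
            # Skip directories that cannot be read, as os.walk does by default.
            continue
    return children
```

The result contains the same directories as the os.walk version, though not necessarily in the same order. On Python 3.5 and later, plain os.walk already uses os.scandir internally, so simply upgrading should give a comparable benefit without any code changes.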
