Faster way to find large files with Python?


Problem Description

I am trying to use Python to find a faster way to sift through a large directory (approx. 1.1 TB) containing around 9 other directories, and to find files larger than, say, 200 GB, on multiple Linux servers. It has to be Python.

I have tried many things, like calling du -h from a script, but du is just way too slow to go through a directory as large as 1 TB. I've also tried the find command, like find ./ -size +200G, but that also takes forever.

I have also tried os.walk() with .getsize(), but it's the same problem: too slow. All of these methods take hours and hours, and I need help finding another solution, if anyone is able to help. Not only do I have to do this search for large files on one server, but I will also have to ssh into almost 300 servers and output a giant list of all the files > 200 GB, and the three methods I have tried will not get that done. Any help is appreciated, thank you!

Recommended Answer

It's not true that you cannot do better than os.walk().

scandir is said to be 2 to 20 times faster.

From https://pypi.python.org/pypi/scandir:

Python’s built-in os.walk() is significantly slower than it needs to be, because – in addition to calling listdir() on each directory – it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.

In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we’re not talking about micro-optimizations.

From Python 3.5, thanks to PEP 471, scandir is now built-in, provided in the os package. Small (untested) example:

import os

max_value = 200 * 1024 ** 3  # size threshold in bytes (e.g. 200 GB)
for dentry in os.scandir("/path/to/dir"):
    if dentry.stat().st_size > max_value:
        print("{} is biiiig".format(dentry.name))

(Of course you need stat at some point, but with os.walk you called stat implicitly when using the function. Also, if the files have some specific extensions, you could perform stat only when the extension matches, saving even more.)
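
The snippet above only looks at a single directory. To cover a whole 1.1 TB tree, the same idea extends with recursion, still calling stat() only on regular files. A minimal (untested) sketch, where find_big_files and the 200 GB threshold are illustrative choices, not part of the scandir docs:

import os

def find_big_files(root, threshold):
    """Recursively yield (path, size) for files larger than threshold bytes."""
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            # Directory detection comes from scandir itself, no extra stat call.
            yield from find_big_files(entry.path, threshold)
        elif entry.is_file(follow_symlinks=False):
            size = entry.stat().st_size  # stat() only on regular files
            if size > threshold:
                yield entry.path, size

for path, size in find_big_files("/path/to/dir", 200 * 1024 ** 3):
    print("{}\t{}".format(path, size))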

There's more:

So, as well as providing a scandir() iterator function for calling directly, Python's existing os.walk() function can be sped up a huge amount.

So migrating to Python 3.5+ magically speeds up os.walk without having to rewrite your code.
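
For instance, a plain os.walk() loop like the one below (a generic sketch, not code from the question) traverses the tree faster on 3.5+ because the traversal itself no longer needs a stat() per entry; the getsize() check on each file still costs a stat, of course:

import os

THRESHOLD = 200 * 1024 ** 3  # 200 GB, matching the question's example

# Unchanged pre-3.5 style code; on Python 3.5+ the traversal runs on
# scandir under the hood and gets faster for free.
for dirpath, dirnames, filenames in os.walk("/path/to/dir"):
    for name in filenames:
        full = os.path.join(dirpath, name)
        try:
            if os.path.getsize(full) > THRESHOLD:
                print(full)
        except OSError:
            pass  # vanished or unreadable file; skip it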

From my experience, multiplying the stat calls on a networked drive is catastrophic performance-wise, so if your target is a network drive, you'll benefit from this enhancement even more than local disk users.

The best way to get performance on networked drives, though, is to run the scan tool on a machine on which the drive is locally mounted (using ssh for instance). It's less convenient, but it's worth it.
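
As a hedged illustration of that setup (the host list, the remote script path, and the use of plain ssh are all assumptions, not something the answer specifies), the scan can be launched on each server from a central machine and the results collected:

import subprocess

# Hypothetical inventory; in practice read the ~300 hostnames from a file.
hosts = ["server001", "server002"]

results = []
for host in hosts:
    # Run the scandir-based scanner on the remote machine itself, so every
    # stat() hits a locally mounted disk instead of going over the network.
    proc = subprocess.run(
        ["ssh", host, "python3", "/path/to/find_big_files.py"],
        stdout=subprocess.PIPE, universal_newlines=True, timeout=3600,
    )
    results.extend(proc.stdout.splitlines())

print("\n".join(results))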

