Very quickly getting total size of folder

Problem description

I want to quickly find the total size of any folder using Python.

import os
from os.path import join, getsize

def GetFolderSize(path):
    # Sum the sizes of all files below path, reporting any that cannot be read.
    TotalSize = 0
    for item in os.walk(path):
        for file in item[2]:
            try:
                TotalSize = TotalSize + getsize(join(item[0], file))
            except OSError:
                print("error with file:  " + join(item[0], file))
    return TotalSize

# Print the total in GiB; the backslash must be escaped inside the string literal.
print(float(GetFolderSize("C:\\")) / 1024 / 1024 / 1024)

That's the simple script I wrote to get the total size of the folder; it takes around 60 seconds (±5 seconds). By using multiprocessing I got it down to 23 seconds on a quad-core machine.

Using the Windows file explorer it takes only ~3 seconds (right-click -> Properties to see for yourself). So is there a faster way of finding the total size of a folder, close to the speed that Windows can do it?

Windows 7, Python 2.6. (I did search, but most of the time people used a method very similar to my own.) Thanks in advance.

Solution

You are at a disadvantage.

Windows Explorer almost certainly uses FindFirstFile/FindNextFile to both traverse the directory structure and collect size information (through lpFindFileData) in one pass, making what is essentially a single system call per file.

Unfortunately, Python is not your friend in this case:

  1. os.walk first calls os.listdir (which internally calls FindFirstFile/FindNextFile)
    • any additional system calls made from this point onward can only make you slower than Windows Explorer
  2. os.walk then calls isdir for each file returned by os.listdir (which internally calls GetFileAttributesEx -- or, prior to Win2k, a GetFileAttributes+FindFirstFile combo) to redetermine whether to recurse or not
  3. os.walk and os.listdir will perform additional memory allocation, string and array operations etc. to fill out their return value
  4. you then call getsize for each file returned by os.walk (which again calls GetFileAttributesEx)

That is three times as many system calls per file as Windows Explorer makes, plus memory allocation and manipulation overhead.
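To make the count concrete, here is a rough sketch of the access pattern described in the list above. It is not the actual os.py source; the function name folder_size_counting_calls and the calls dictionary are made up for illustration, and error handling is left out for brevity.

```python
import os
from os.path import getsize, isdir, islink, join


def folder_size_counting_calls(top):
    """Mimic the listdir + isdir + getsize pattern and count the lookups."""
    calls = {"listdir": 0, "isdir": 0, "getsize": 0}
    total = 0
    pending = [top]
    while pending:
        current = pending.pop()
        calls["listdir"] += 1
        for name in os.listdir(current):   # step 1: names only, no attributes
            full = join(current, name)
            calls["isdir"] += 1
            if isdir(full):                # step 2: extra lookup per entry
                if not islink(full):       # do not follow symlinked directories
                    pending.append(full)
            else:
                calls["getsize"] += 1
                total += getsize(full)     # step 4: yet another lookup per file
    return total, calls
```

On a tree with N files, this reports N getsize calls on top of at least N isdir calls, even though the directory enumeration had already seen every entry.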

You can either use Anurag's solution, or try to call FindFirstFile/FindNextFile directly and recursively (which should be comparable in performance to running du -s some_directory from cygwin or another win32 port).
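As a hedged illustration of calling FindFirstFile/FindNextFile directly, the sketch below uses ctypes. It is Windows-only and untuned; the names folder_size_win32 and combine_size are invented here, and it walks the tree with an explicit stack instead of recursion. Reparse points (junctions, symlinked directories) are skipped as an assumption, to avoid loops.

```python
import ctypes
import os

FILE_ATTRIBUTE_DIRECTORY = 0x10
FILE_ATTRIBUTE_REPARSE_POINT = 0x400


def combine_size(high, low):
    """Join the two 32-bit halves reported in WIN32_FIND_DATA into one size."""
    return (high << 32) | low


def folder_size_win32(path):
    """Windows only: total file size under path, one find call per entry."""
    from ctypes import wintypes  # imported here so the module loads elsewhere

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    find_data_ptr = ctypes.POINTER(wintypes.WIN32_FIND_DATAW)
    find_first = kernel32.FindFirstFileW
    find_first.argtypes = [wintypes.LPCWSTR, find_data_ptr]
    find_first.restype = wintypes.HANDLE
    find_next = kernel32.FindNextFileW
    find_next.argtypes = [wintypes.HANDLE, find_data_ptr]
    find_next.restype = wintypes.BOOL
    find_close = kernel32.FindClose
    find_close.argtypes = [wintypes.HANDLE]
    find_close.restype = wintypes.BOOL
    invalid_handle = ctypes.c_void_p(-1).value  # INVALID_HANDLE_VALUE

    total = 0
    pending = [path]
    while pending:
        current = pending.pop()
        data = wintypes.WIN32_FIND_DATAW()
        handle = find_first(os.path.join(current, "*"), ctypes.byref(data))
        if handle is None or handle == invalid_handle:
            continue  # directory is unreadable or has vanished
        try:
            while True:
                name = data.cFileName
                if name not in (".", ".."):
                    attrs = data.dwFileAttributes
                    if attrs & FILE_ATTRIBUTE_DIRECTORY:
                        if not attrs & FILE_ATTRIBUTE_REPARSE_POINT:
                            pending.append(os.path.join(current, name))
                    else:
                        total += combine_size(data.nFileSizeHigh,
                                              data.nFileSizeLow)
                if not find_next(handle, ctypes.byref(data)):
                    break
        finally:
            find_close(handle)
    return total
```

On Windows, folder_size_win32("C:\\") / 1024.0 ** 3 would give the total in GiB. Each entry's type and size come from the same find record, so no per-file GetFileAttributesEx call is needed.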

Refer to os.py for the implementation of os.walk, and to posixmodule.c for the implementations of listdir and win32_stat (invoked by both isdir and getsize).

Note that Python's os.walk is suboptimal on all platforms (Windows and *nix), up to and including Python 3.1. On both Windows and *nix, os.walk could achieve traversal in a single pass without calling isdir, since both FindFirstFile/FindNextFile (Windows) and opendir/readdir (*nix) already return the file type via lpFindFileData->dwFileAttributes (Windows) and dirent::d_type (*nix).

Perhaps counterintuitively, on most modern configurations (e.g. Win7 and NTFS, and even some SMB implementations) GetFileAttributesEx is twice as slow as calling FindFirstFile on a single file (and possibly even slower than iterating over a directory with FindNextFile).

Update: Python 3.5 includes the new PEP 471 os.scandir() function that solves this problem by returning file attributes along with the filename. This new function is used to speed up the built-in os.walk() (on both Windows and Linux). You can use the scandir module on PyPI to get this behavior for older Python versions, including 2.x.
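A minimal sketch of the os.scandir() approach, assuming Python 3.6 or later (for the with statement on the iterator). The function name folder_size is chosen here for illustration, and symlinks are deliberately not followed.

```python
import os


def folder_size(path):
    """Return the total size in bytes of regular files under path."""
    total = 0
    try:
        entries = os.scandir(path)
    except OSError:
        return 0  # unreadable directory: count it as empty
    with entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    total += folder_size(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    # On Windows this stat result comes from the directory
                    # listing itself, so no extra system call is needed.
                    total += entry.stat(follow_symlinks=False).st_size
            except OSError:
                print("error with file: " + entry.path)
    return total
```

Calling folder_size("C:\\") / 1024 ** 3 gives the total in GiB, as in the question's script. On Linux, entry.stat() still costs one system call per file, but the is_dir() and is_file() checks usually do not.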
