How to programmatically count the number of files in an archive using Python

Question

In the program I maintain, it is done like this:

# count the files in the archive
length = 0
command = ur'"%s" l -slt "%s"' % (u'path/to/7z.exe', srcFile)
ins, err = Popen(command, stdout=PIPE, stdin=PIPE,
                 startupinfo=startupinfo).communicate()
ins = StringIO.StringIO(ins)
for line in ins: length += 1
ins.close()

  1. Is this really the only way? I can't seem to find any other command, but it seems a bit odd that I can't just ask for the number of files.
  2. What about error checking? Would it be enough to modify this to:

proc = Popen(command, stdout=PIPE, stdin=PIPE,
             startupinfo=startupinfo)
out = proc.stdout
# ... count
returncode = proc.wait()
if returncode:
    raise Exception(u'Failed reading number of files from ' + srcFile)

Or should I actually parse the output of Popen?

I am interested in 7z, rar, and zip archives (those supported by 7z.exe), but 7z and zip would be enough for starters.

Recommended answer

To count the number of archive members in a zip archive in Python:

#!/usr/bin/env python
import sys
from contextlib import closing
from zipfile import ZipFile

with closing(ZipFile(sys.argv[1])) as archive:
    count = len(archive.infolist())
print(count)

It may use the zlib, bz2, and lzma modules, if available, to decompress the archive.
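Note that infolist() also returns the directory entries stored in the archive. If only regular files should be counted, a minimal sketch along the following lines may help. It assumes Python 3.6 or later, where ZipInfo.is_dir() is available; the name count_zip_files and the in-memory sample archive are made up for illustration.

```python
import io
from zipfile import ZipFile


def count_zip_files(path_or_file):
    """Count archive members that are not directory entries."""
    with ZipFile(path_or_file) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())


# Build a small archive in memory to try the function out.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("docs/", "")  # explicit directory entry
    archive.writestr("docs/readme.txt", "hello")
    archive.writestr("main.py", "print('hi')")

print(count_zip_files(buffer))
```

The same function accepts a path such as sys.argv[1] in place of the in-memory buffer.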

To count the number of regular files in a tar archive:

#!/usr/bin/env python
import sys
import tarfile

with tarfile.open(sys.argv[1]) as archive:
    count = sum(1 for member in archive if member.isreg())
print(count)

It may support gzip, bz2, and lzma compression, depending on the version of Python.
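As a rough illustration of the compression support, the sketch below builds the same small archive uncompressed, with gzip, and with bz2, then counts the regular files using one unchanged function. It assumes the gzip and bz2 modules are present in the Python build; make_tar and count_tar_files are hypothetical helper names, and the archives live in memory only to keep the example self-contained.

```python
import io
import tarfile


def make_tar(mode):
    """Build an in-memory tar archive: two regular files and one directory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as archive:
        data = b"hello"
        info = tarfile.TarInfo("a.txt")
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
        archive.addfile(tarfile.TarInfo("empty.txt"))  # zero-length regular file
        directory = tarfile.TarInfo("subdir")
        directory.type = tarfile.DIRTYPE
        archive.addfile(directory)
    buffer.seek(0)
    return buffer


def count_tar_files(fileobj):
    """Count regular files; the default read mode detects the compression."""
    with tarfile.open(fileobj=fileobj) as archive:
        return sum(1 for member in archive if member.isreg())


for mode in ("w", "w:gz", "w:bz2"):
    print(mode, count_tar_files(make_tar(mode)))
```

The directory entry is skipped by member.isreg(), so each archive should report the same count regardless of how it was compressed.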

You could find a third-party module that provides similar functionality for 7z archives.

To get the number of files in an archive using the 7z utility:

import os
import re
import subprocess

def count_files_7z(archive):
    s = subprocess.check_output(["7z", "l", archive], env=dict(os.environ, LC_ALL="C"))
    return int(re.search(br'(\d+)\s+files,\s+\d+\s+folders$', s).group(1))

Here's a version that may use less memory if there are many files in the archive:

import os
import re
from subprocess import Popen, PIPE, CalledProcessError

def count_files_7z(archive):
    command = ["7z", "l", archive]
    p = Popen(command, stdout=PIPE, bufsize=1, env=dict(os.environ, LC_ALL="C"))
    with p.stdout:
        for line in p.stdout:
            if line.startswith(b'Error:'): # found error
                error = line + b"".join(p.stdout)
                raise CalledProcessError(p.wait(), command, error)
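    returncode = p.wait()
    if returncode != 0:  # raise instead of assert, which is skipped under -O
        raise CalledProcessError(returncode, command)
    # `line` now holds the last line of the listing, which carries the totals
    return int(re.search(br'(\d+)\s+files,\s+\d+\s+folders', line).group(1))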

Example:

import sys

try:
    print(count_files_7z(sys.argv[1]))
except CalledProcessError as e:
    getattr(sys.stderr, 'buffer', sys.stderr).write(e.output)
    sys.exit(e.returncode)


To count the number of lines in the output of a generic subprocess:

from functools import partial
from subprocess import Popen, PIPE, CalledProcessError

p = Popen(command, stdout=PIPE, bufsize=-1)
with p.stdout:
    read_chunk = partial(p.stdout.read, 1 << 15)
    count = sum(chunk.count(b'\n') for chunk in iter(read_chunk, b''))
if p.wait() != 0:
    raise CalledProcessError(p.returncode, command)
print(count)

It supports unlimited output.
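To try the chunked counter without a real archive tool, here is a minimal sketch that wraps the same logic in a function and runs it against a child Python process. The function name count_output_lines and the sample command are assumptions made for illustration; any command given as a list of arguments could be passed instead.

```python
import sys
from functools import partial
from subprocess import Popen, PIPE, CalledProcessError


def count_output_lines(command, chunk_size=1 << 15):
    """Run command and count newline bytes in its stdout, chunk by chunk."""
    p = Popen(command, stdout=PIPE, bufsize=-1)
    with p.stdout:
        read_chunk = partial(p.stdout.read, chunk_size)
        count = sum(chunk.count(b"\n") for chunk in iter(read_chunk, b""))
    if p.wait() != 0:
        raise CalledProcessError(p.returncode, command)
    return count


# Hypothetical stand-in for a real command: a child Python printing 1000 lines.
command = [sys.executable, "-c", "for i in range(1000): print(i)"]
print(count_output_lines(command))
```

Only one chunk is held in memory at a time, so the size of the child's output does not affect memory use.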

Could you explain why bufsize=-1 (as opposed to bufsize=1 in your previous answer: stackoverflow.com/a/30984882/281545)?

bufsize=-1 means use the default I/O buffer size instead of bufsize=0 (unbuffered) on Python 2. It is a performance boost on Python 2. It is the default on recent Python 3 versions. On some earlier Python 3 versions, you might get a short read (lose data) if bufsize is not changed to bufsize=-1.

This answer reads in chunks, and therefore the stream is fully buffered for efficiency. The solution you've linked is line-oriented. bufsize=1 means "line buffered". Otherwise, there is minimal difference from bufsize=-1.

And also, what does read_chunk = partial(p.stdout.read, 1 << 15) buy us?

It is equivalent to read_chunk = lambda: p.stdout.read(1<<15) but provides more introspection in general. It is used to implement wc -l in Python efficiently.
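The introspection point can be seen in a small sketch. An io.BytesIO object stands in for p.stdout here (an assumption made so the example runs without a subprocess), and read_lambda is a name introduced only for comparison.

```python
import io
from functools import partial

stream = io.BytesIO(b"a\nb\nc\n" * 10000)  # stands in for p.stdout

read_chunk = partial(stream.read, 1 << 15)
read_lambda = lambda: stream.read(1 << 15)

# A partial object exposes the callable and arguments it wraps;
# a lambda only reports a generic name.
print(read_chunk.func.__name__, read_chunk.args)
print(read_lambda.__name__)

# iter(callable, sentinel) keeps calling read_chunk until it returns b"",
# which read() does at the end of the stream.
count = sum(chunk.count(b"\n") for chunk in iter(read_chunk, b""))
print(count)
```

Both callables read the same number of bytes per call; the partial simply keeps its target and arguments visible for debugging.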
