How to check for an empty gzip file in Python


Question

I don't want to use OS commands, as that makes the solution OS dependent.

The tarfile module offers tarfile.is_tarfile(filename) to check whether a file is a tar file.

I can't find any equivalent function in the gzip module.

EDIT: Why I need this: I have a list of gzip files that vary in size (1-10 GB), and some are empty. Before reading a file with pandas.read_csv, I want to check whether it is empty, because empty files cause an error in pandas.read_csv (an error like: Expected 15 columns and found -1).

Sample command that produces the error:

import pandas as pd
pd.read_csv('C:\Users\...\File.txt.gz', compression='gzip', names={'a', 'b', 'c'}, header=False)
Too many columns specified: expected 3 and found -1

The pandas version is 0.16.2.

The file used for testing is just a gzip of an empty file.

Recommended answer

Unfortunately, the gzip module does not expose any functionality equivalent to the -l (list) option of the gzip program. But in Python 3 you can easily get the size of the uncompressed data by calling the .seek method with a whence argument of 2, which means positioning relative to the end of the (uncompressed) data stream.

.seek returns the new byte position, so .seek(0, 2) returns the byte offset of the end of the uncompressed file, i.e., the file size. Thus, if the uncompressed file is empty, the .seek call returns 0.

import gzip

def gz_size(fname):
    with gzip.open(fname, 'rb') as f:
        return f.seek(0, whence=2)
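
As a quick check of the seek-to-end approach, the sketch below writes one gzip file with no payload and one with a few bytes into a temporary directory, then measures both. The helper name gz_uncompressed_size and the file names are illustrative only, and the code assumes Python 3.

```python
import gzip
import io
import os
import tempfile

def gz_uncompressed_size(fname):
    # Seeking to the end of the uncompressed stream makes the gzip module
    # decompress the data internally to find out where the end is.
    with gzip.open(fname, 'rb') as f:
        return f.seek(0, io.SEEK_END)  # io.SEEK_END == 2

tmpdir = tempfile.mkdtemp()
empty_path = os.path.join(tmpdir, 'empty.txt.gz')  # hypothetical file name
full_path = os.path.join(tmpdir, 'full.txt.gz')    # hypothetical file name

# A gzip file whose uncompressed payload is empty.
with gzip.open(empty_path, 'wb') as f:
    pass

# A gzip file with six bytes of payload.
with gzip.open(full_path, 'wb') as f:
    f.write(b'a,b,c\n')

empty_size = gz_uncompressed_size(empty_path)
full_size = gz_uncompressed_size(full_path)
print(empty_size, full_size)
```

Note that the compressed empty file itself is not zero bytes long on disk, because it still carries the gzip header and trailer, so checking the on-disk size is not enough.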

Here's a function that works on Python 2, tested on Python 2.6.6. Note that it reads the entire uncompressed file into memory.

def gz_size(fname):
    f = gzip.open(fname, 'rb')
    data = f.read()
    f.close()
    return len(data)
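
For files in the 1-10 GB range mentioned in the question, holding all the uncompressed data in memory may not be practical. A possible variation, sketched here as an assumption rather than as part of the original answer, counts the bytes in fixed-size chunks. The name gz_size_chunked and the 1 MiB chunk size are arbitrary choices; the small demo at the end uses Python 3 only to create its test file.

```python
import gzip
import os
import tempfile

def gz_size_chunked(fname, chunk_size=1024 * 1024):
    # Count uncompressed bytes without keeping them all in memory.
    total = 0
    f = gzip.open(fname, 'rb')
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty result means end of stream
                break
            total += len(chunk)
    finally:
        f.close()
    return total

# Small demonstration with a temporary file (hypothetical name).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.gz')
with gzip.open(demo_path, 'wb') as out:
    out.write(b'x' * 2500)

demo_size = gz_size_chunked(demo_path, chunk_size=1000)
print(demo_size)
```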

You can read about .seek and the other methods of the GzipFile class using the pydoc program. Just run pydoc gzip in the shell.

Alternatively, if you wish to avoid decompressing the file, you can (sort of) read the uncompressed data size directly from the .gz file. The size is stored in the last 4 bytes of the file as a little-endian unsigned 32-bit integer, so it is actually the size modulo 2**32 and will not be the true size if the uncompressed data is 4 GB or larger.

This code works on both Python 2 and Python 3.

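import struct

def gz_size(fname):
    with open(fname, 'rb') as f:
        # Jump to 4 bytes before the end of the raw (compressed) file.
        f.seek(-4, 2)
        data = f.read(4)
    # '<L' is a little-endian unsigned 32-bit integer.
    size = struct.unpack('<L', data)[0]
    return size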

However, this method is not reliable, as Mark Adler (gzip co-author) mentions in the comments:

There are other reasons that the length at the end of the gzip file would not represent the length of the uncompressed data. (Concatenated gzip streams, padding at the end of the gzip file.) It should not be used for this purpose. It's only there as an integrity check on the data.
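
To illustrate the concatenated-streams case from the comment, the sketch below (Python 3, with an illustrative file name and helper name) builds a file from two gzip members, the second with an empty payload. The trailer at the end of the file belongs to the last member only, so it reports zero even though the file as a whole decompresses to 1000 bytes.

```python
import gzip
import os
import struct
import tempfile

def trailer_size(fname):
    # Read the 4-byte length field at the very end of the raw file.
    with open(fname, 'rb') as f:
        f.seek(-4, os.SEEK_END)
        return struct.unpack('<L', f.read(4))[0]

path = os.path.join(tempfile.mkdtemp(), 'multi.gz')  # hypothetical name
with open(path, 'wb') as raw:
    raw.write(gzip.compress(b'x' * 1000))  # first member: 1000 bytes
    raw.write(gzip.compress(b''))          # second member: empty payload

# The gzip module reads all concatenated members as one stream.
with gzip.open(path, 'rb') as f:
    real_size = len(f.read())

reported = trailer_size(path)
print(reported, real_size)
```

A check based on the trailer would wrongly classify this file as empty.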


Here is another solution. It does not decompress the whole file. It returns True if the uncompressed data in the input file has zero length, and it also returns True if the input file itself has zero length. If the input file has non-zero length and is not a gzip file, OSError is raised.

import gzip

def gz_is_empty(fname):
    ''' Test if gzip file fname is empty
        Return True if the uncompressed data in fname has zero length
        or if fname itself has zero length
        Raises OSError if fname has non-zero length and is not a gzip file
    '''
    with gzip.open(fname, 'rb') as f:
        data = f.read(1)
    return len(data) == 0
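
One way this check might be applied to the list of files from the question is to filter out the empty ones before handing the rest to pandas.read_csv. The sketch below assumes Python 3; the helper non_empty_gz_files and the file names are hypothetical, and the read_csv call is left as a comment so the example runs without pandas.

```python
import gzip
import os
import tempfile

def gz_is_empty(fname):
    # Same test as above: try to read a single uncompressed byte.
    with gzip.open(fname, 'rb') as f:
        return len(f.read(1)) == 0

def non_empty_gz_files(paths):
    # Keep only the files that contain at least one byte of data.
    return [p for p in paths if not gz_is_empty(p)]

# Build two small test files: one with data, one with an empty payload.
tmpdir = tempfile.mkdtemp()
paths = []
for name, payload in [('a.csv.gz', b'1,2,3\n'), ('b.csv.gz', b'')]:
    path = os.path.join(tmpdir, name)
    with gzip.open(path, 'wb') as f:
        f.write(payload)
    paths.append(path)

to_read = non_empty_gz_files(paths)
for path in to_read:
    # Each remaining file could now be passed to, for example,
    # pandas.read_csv(path, compression='gzip')
    print(os.path.basename(path))
```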
