Compress a file in memory, compute checksum and write it as `gzip` in python

Question

I want to compress files and compute the checksum of the compressed file using Python. My first naive attempt was to use two functions:

import gzip
import hashlib


def compress_file(input_filename, output_filename):
    # Compress input_filename into a gzip file on disk
    with open(input_filename, 'rb') as f_in, gzip.open(output_filename, 'wb') as f_out:
        f_out.writelines(f_in)


def md5sum(filename):
    # Read the compressed file back from disk and hash it
    with open(filename, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

However, this leads to the compressed file being written and then read back. With many files (>10,000), each several MB when compressed, on an NFS-mounted drive, it is slow.

How can I compress the file in a buffer and then compute the checksum from this buffer before writing the output file?

The files are not that big, so I can afford to store everything in memory. However, a nice incremental version could be nice too.
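One way to get such an incremental version (a sketch with hypothetical names, not one of the attempts below) is to hash the compressed bytes as they stream through a small file-like wrapper on their way to the output file, so nothing needs to be buffered in memory:

import gzip
import hashlib


class HashingWriter(object):
    # Minimal file-like wrapper: hashes every chunk on its way to the real file
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.md5 = hashlib.md5()

    def write(self, data):
        self.md5.update(data)
        return self.fileobj.write(data)

    def flush(self):
        self.fileobj.flush()


def compress_md5_streaming(input_filename, output_filename, chunk_size=1 << 20):
    with open(output_filename, 'wb') as raw_out:
        writer = HashingWriter(raw_out)
        with open(input_filename, 'rb') as f_in, \
                gzip.GzipFile(input_filename, mode='wb', fileobj=writer) as gz:
            for chunk in iter(lambda: f_in.read(chunk_size), b''):
                gz.write(chunk)
    # writer.md5 now covers the complete compressed stream, header and trailer included
    return writer.md5.hexdigest()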

The last requirement is that it should work with multiprocessing (in order to compress several files in parallel).
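A minimal sketch of what the parallel driver could look like, assuming a compress_md5(input_filename, output_filename) helper such as the ones defined further down and a hypothetical filenames list:

from multiprocessing import Pool


def compress_one(paths):
    input_filename, output_filename = paths
    return output_filename, compress_md5(input_filename, output_filename)


if __name__ == '__main__':
    # filenames is a hypothetical list of input paths
    pairs = [(name, name + '.gz') for name in filenames]
    with Pool() as pool:
        for gz_name, md5 in pool.map(compress_one, pairs):
            print(md5, gz_name)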

I have tried to use zlib.compress, but the returned string is missing the header of a gzip file.
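(For what it's worth, zlib can be asked to add the gzip framing itself by passing wbits=16 + zlib.MAX_WBITS to compressobj; the header it writes carries a zero mtime, so the output should be stable across runs. A minimal sketch, with gzip_compress_zlib as a hypothetical helper:)

import zlib


def gzip_compress_zlib(data, level=9):
    # wbits = 16 + MAX_WBITS requests a gzip wrapper around the deflate stream;
    # the header zlib emits has its mtime field set to zero, so repeated runs
    # produce identical bytes (and therefore identical checksums)
    co = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    return co.compress(data) + co.flush()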

EDIT: following @abarnert's suggestion, I used Python 3's gzip.compress:

def compress_md5(input_filename, output_filename):
    # Read the input file into a buffer
    with open(input_filename, 'rb') as f_in:
        buff = f_in.read()
    # Compress this buffer in memory
    c_buff = gzip.compress(buff)
    # Compute the MD5 of the compressed data
    md5 = hashlib.md5(c_buff).hexdigest()
    # Write the compressed buffer to disk
    with open(output_filename, 'wb') as f_out:
        f_out.write(c_buff)

    return md5

This produces a correct gzip file, but the output is different at each run (the MD5 differs):

>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'0d0eb6a5f3fe2c1f3201bc3360201f71'
>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'8e4954ab5914a1dd0d8d0deb114640e5'

The gzip program does not have this problem:

 $ gzip -c 4327_010.pdf | md5sum
 8965184bc4dace5325c41cc75c5837f1  -
 $ gzip -c 4327_010.pdf | md5sum
 8965184bc4dace5325c41cc75c5837f1  -

I guess it's because the gzip module uses the current time by default when creating a file (the gzip program, I guess, uses the modification time of the input file). There is no way to change that with gzip.compress.
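(On Python 3.8 and later, gzip.compress does accept an mtime keyword, which would address this directly; a sketch assuming such a version, with compress_md5_fixed_mtime as a hypothetical name:)

import gzip
import hashlib


def compress_md5_fixed_mtime(input_filename, output_filename):
    with open(input_filename, 'rb') as f_in:
        # mtime=0 pins the timestamp field of the gzip header, so repeated
        # runs yield byte-identical output (requires Python 3.8+)
        c_buff = gzip.compress(f_in.read(), mtime=0)
    md5 = hashlib.md5(c_buff).hexdigest()
    with open(output_filename, 'wb') as f_out:
        f_out.write(c_buff)
    return md5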

I was thinking of creating a gzip.GzipFile in read/write mode, controlling the mtime, but there is no such mode for gzip.GzipFile.

Inspired by @zwol's suggestion, I wrote the following function, which correctly sets the filename and the OS (Unix) in the header:

import cStringIO  # Python 2; on Python 3 use io.BytesIO
import os


def compress_md5(input_filename, output_filename):
    # Read the input data into a buffer
    with open(input_filename, 'rb') as f_in:
        buff = f_in.read()
    # Create an in-memory output buffer
    c_buff = cStringIO.StringIO()
    # Create the gzip file in memory, pinning mtime to the input file's mtime
    mtime = os.stat(input_filename).st_mtime
    gzip_obj = gzip.GzipFile(input_filename, mode="wb", fileobj=c_buff, mtime=mtime)
    # Compress the data in memory
    gzip_obj.write(buff)
    gzip_obj.close()
    # Retrieve the compressed data
    c_data = c_buff.getvalue()
    # Change the OS byte in the gzip header (offset 9) to 3, i.e. Unix
    c_data = c_data[0:9] + '\003' + c_data[10:]
    # Really write the compressed data
    with open(output_filename, "wb") as f_out:
        f_out.write(c_data)
    # Compute the MD5 of the compressed data
    md5 = hashlib.md5(c_data).hexdigest()
    return md5

The output is the same across different runs. Moreover, the output of file is the same as for gzip:

$ gzip -9 -c 4327_010.pdf > ref_max/4327_010.pdf.gz
$ file ref_max/4327_010.pdf.gz 
ref_max/4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May  5 14:28:16 2015, max compression
$ file 4327_010.pdf.gz 
4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May  5 14:28:16 2015, max compression

However, the MD5 is different:

$ md5sum 4327_010.pdf.gz ref_max/4327_010.pdf.gz 
39dc3e5a52c71a25c53fcbc02e2702d5  4327_010.pdf.gz
213a599a382cd887f3c4f963e1d3dec4  ref_max/4327_010.pdf.gz

The output of gzip -l is also different:

$ gzip -l ref_max/4327_010.pdf.gz 4327_010.pdf.gz 
     compressed        uncompressed  ratio uncompressed_name
        7286404             7600522   4.1% ref_max/4327_010.pdf
        7297310             7600522   4.0% 4327_010.pdf

I guess it's because the gzip program and the Python gzip module (which is based on the C library zlib) have slightly different algorithms.
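One way to check that only the encoders differ, and not the data they carry, is to hash the decompressed payload of both archives; a small sketch, with md5_of_decompressed as a hypothetical helper:

import gzip
import hashlib


def md5_of_decompressed(path):
    # Hash the decompressed payload rather than the .gz container, so two
    # encoders that packed the same data differently still compare equal
    with gzip.open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# Both archives above should agree here even though their own MD5s differ:
# md5_of_decompressed('4327_010.pdf.gz') == md5_of_decompressed('ref_max/4327_010.pdf.gz')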

Answer

Wrap a gzip.GzipFile object around an io.BytesIO object. (In Python 2, use cStringIO.StringIO instead.) After you close the GzipFile, you can retrieve the compressed data from the BytesIO object (using getvalue), hash it, and write it out to a real file.
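A minimal Python 3 sketch of that approach (a rewrite of the earlier compress_md5 along the lines the answer describes, not code taken from the answer itself):

import gzip
import hashlib
import io  # on Python 2, cStringIO.StringIO plays the same role
import os


def compress_md5(input_filename, output_filename):
    buf = io.BytesIO()
    # Pin the header mtime to the input file's mtime so the output is
    # reproducible, as in the question
    mtime = os.stat(input_filename).st_mtime
    with open(input_filename, 'rb') as f_in, \
            gzip.GzipFile(input_filename, mode='wb', fileobj=buf, mtime=mtime) as gz:
        gz.write(f_in.read())
    c_data = buf.getvalue()  # full compressed stream, header and trailer included
    md5 = hashlib.md5(c_data).hexdigest()
    with open(output_filename, 'wb') as f_out:
        f_out.write(c_data)
    return md5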

Incidentally, you really shouldn't be using MD5 at all.
