Compress a file in memory, compute checksum and write it as `gzip` in python
Problem description
I want to compress files and compute the checksum of the compressed file using Python. My first naive attempt was to use two functions:
import gzip
import hashlib

def compress_file(input_filename, output_filename):
    # Stream the input file into a gzip file on disk
    with open(input_filename, 'rb') as f_in, gzip.open(output_filename, 'wb') as f_out:
        f_out.writelines(f_in)

def md5sum(filename):
    # Re-read the compressed file (binary mode) to hash it
    with open(filename, 'rb') as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    return md5
However, this leads to the compressed file being written and then re-read. With many files (> 10 000), each several MB when compressed, on an NFS-mounted drive, it is slow.
How can I compress the file in a buffer and then compute the checksum from this buffer before writing the output file?
The files are not that big, so I can afford to store everything in memory. However, an incremental version would be nice too.
The last requirement is that it should work with multiprocessing (in order to compress several files in parallel).
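One way the parallel requirement might be met is a process pool whose worker compresses a single file in memory and returns its checksum. This is only a sketch: `compress_one` and `compress_many` are hypothetical names, the `.gz` output naming is an assumption, and the `mtime` keyword of `gzip.compress` assumes Python 3.8 or later.

```python
import gzip
import hashlib
from multiprocessing import Pool


def compress_one(paths):
    """Compress one file in memory; return (output path, hex digest)."""
    input_filename, output_filename = paths
    with open(input_filename, "rb") as f_in:
        data = f_in.read()
    # mtime=0 keeps the gzip header identical from run to run (Python 3.8+).
    compressed = gzip.compress(data, mtime=0)
    digest = hashlib.md5(compressed).hexdigest()
    with open(output_filename, "wb") as f_out:
        f_out.write(compressed)
    return output_filename, digest


def compress_many(input_filenames, processes=4):
    """Compress several files in parallel; return {output path: digest}."""
    jobs = [(name, name + ".gz") for name in input_filenames]
    with Pool(processes=processes) as pool:
        return dict(pool.map(compress_one, jobs))
```

The worker is a module-level function so that it can be pickled and sent to the pool. On platforms that start workers by spawning a fresh interpreter, `compress_many` should be called from under an `if __name__ == "__main__":` guard.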
I have tried to use `zlib.compress`, but the returned string lacks the header of a gzip file.
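For reference, `zlib` can emit the gzip header and trailer itself when a compression object is created with `wbits` set to 16 + 15, which also gives an incremental version. The sketch below rests on that; `compress_md5_stream` and `chunk_size` are hypothetical names. The header zlib writes carries no filename and a zero timestamp, so it will not match what the `gzip` program writes.

```python
import hashlib
import zlib


def compress_md5_stream(input_filename, output_filename, chunk_size=1 << 20):
    """Gzip a file chunk by chunk, hashing the compressed bytes on the way."""
    # wbits = 16 + 15 asks zlib for a gzip header and trailer.
    compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    md5 = hashlib.md5()
    with open(input_filename, "rb") as f_in, open(output_filename, "wb") as f_out:
        while True:
            chunk = f_in.read(chunk_size)
            if not chunk:
                break
            piece = compressor.compress(chunk)
            md5.update(piece)
            f_out.write(piece)
        # flush() emits any buffered data plus the CRC32 and size trailer.
        tail = compressor.flush()
        md5.update(tail)
        f_out.write(tail)
    return md5.hexdigest()
```

Only one chunk of input is held in memory at a time, and the compressed file is written once and never read back.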
Edit: following @abarnert's suggestion, I used Python 3's `gzip.compress`:
import gzip
import hashlib

def compress_md5(input_filename, output_filename):
    # Read in buffer
    with open(input_filename, 'rb') as f_in:
        buff = f_in.read()
    # Compress this buffer
    c_buff = gzip.compress(buff)
    # Compute MD5
    md5 = hashlib.md5(c_buff).hexdigest()
    # Write compressed buffer
    with open(output_filename, 'wb') as f_out:
        f_out.write(c_buff)
    return md5
This produces a correct gzip file, but the output is different at each run (the MD5 is different):
>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'0d0eb6a5f3fe2c1f3201bc3360201f71'
>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'8e4954ab5914a1dd0d8d0deb114640e5'
The `gzip` program does not have this problem:
$ gzip -c 4327_010.pdf | md5sum
8965184bc4dace5325c41cc75c5837f1 -
$ gzip -c 4327_010.pdf | md5sum
8965184bc4dace5325c41cc75c5837f1 -
I guess it's because the `gzip` module uses the current time by default when creating a file (the `gzip` program uses the modification time of the input file, I guess). There is no way to change that with `gzip.compress`.
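Note that since Python 3.8, `gzip.compress` does accept an `mtime` keyword argument. Assuming that version or later, a variant that stamps the header with the input file's modification time might look like the sketch below (`compress_md5_mtime` is a hypothetical name):

```python
import gzip
import hashlib
import os


def compress_md5_mtime(input_filename, output_filename):
    """Gzip in memory, stamping the header with the input file's mtime."""
    with open(input_filename, "rb") as f_in:
        buff = f_in.read()
    # The header stores whole seconds, so truncate the float timestamp.
    mtime = int(os.stat(input_filename).st_mtime)
    c_buff = gzip.compress(buff, mtime=mtime)  # mtime keyword: Python 3.8+
    with open(output_filename, "wb") as f_out:
        f_out.write(c_buff)
    return hashlib.md5(c_buff).hexdigest()
```

As long as the input file and its modification time are unchanged, repeated runs on the same Python and zlib versions should return the same digest.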
I was thinking of creating a `gzip.GzipFile` in read/write mode, controlling the mtime, but there is no such mode for `gzip.GzipFile`.
Inspired by @zwol's suggestion, I wrote the following function, which correctly sets the filename and the OS (Unix) in the header:
# Python 2 (cStringIO does not exist in Python 3)
import cStringIO
import gzip
import hashlib
import os

def compress_md5(input_filename, output_filename):
    # Read data in buffer
    with open(input_filename, 'rb') as f_in:
        buff = f_in.read()
    # Create output buffer
    c_buff = cStringIO.StringIO()
    # Create gzip file, reusing the input file's modification time
    mtime = os.stat(input_filename).st_mtime
    gzip_obj = gzip.GzipFile(input_filename, mode="wb", fileobj=c_buff, mtime=mtime)
    # Compress data in memory
    gzip_obj.write(buff)
    # Close the gzip object so the trailer is written to the buffer
    gzip_obj.close()
    # Retrieve compressed data
    c_data = c_buff.getvalue()
    # Change OS value (byte 9 of the gzip header; 3 means Unix)
    c_data = c_data[0:9] + '\003' + c_data[10:]
    # Really write compressed data
    with open(output_filename, "wb") as f_out:
        f_out.write(c_data)
    # Compute MD5
    md5 = hashlib.md5(c_data).hexdigest()
    return md5
The output is the same across runs. Moreover, the output of `file` is the same as for `gzip`:
$ gzip -9 -c 4327_010.pdf > ref_max/4327_010.pdf.gz
$ file ref_max/4327_010.pdf.gz
ref_max/4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May 5 14:28:16 2015, max compression
$ file 4327_010.pdf.gz
4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May 5 14:28:16 2015, max compression
However, the MD5 is different:
$ md5sum 4327_010.pdf.gz ref_max/4327_010.pdf.gz
39dc3e5a52c71a25c53fcbc02e2702d5 4327_010.pdf.gz
213a599a382cd887f3c4f963e1d3dec4 ref_max/4327_010.pdf.gz
The output of `gzip -l` is also different:
$ gzip -l ref_max/4327_010.pdf.gz 4327_010.pdf.gz
compressed uncompressed ratio uncompressed_name
7286404 7600522 4.1% ref_max/4327_010.pdf
7297310 7600522 4.0% 4327_010.pdf
I guess it's because the `gzip` program and the Python `gzip` module (which is based on the C library `zlib`) use slightly different algorithms.
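One way to check that guess might be to compare the fixed 10-byte headers of the two files and then their decompressed contents. If the headers agree and the payloads are identical, the remaining difference has to be in the deflate stream itself. A rough sketch, with `gzip_header_fields` and `same_payload` as hypothetical helper names and the field layout taken from RFC 1952:

```python
import gzip
import struct


def gzip_header_fields(path):
    """Read the fixed 10-byte gzip header (RFC 1952) of a file."""
    with open(path, "rb") as f:
        raw = f.read(10)
    # magic (2 bytes), method, flags, mtime (little-endian uint32), xfl, os
    magic, method, flags, mtime, xfl, os_code = struct.unpack("<2sBBIBB", raw)
    return {"magic": magic, "method": method, "flags": flags,
            "mtime": mtime, "xfl": xfl, "os": os_code}


def same_payload(path_a, path_b):
    """True if both gzip files decompress to identical bytes."""
    with gzip.open(path_a, "rb") as a, gzip.open(path_b, "rb") as b:
        return a.read() == b.read()
```

Comparing a checksum of the decompressed data instead of the compressed bytes would sidestep the mismatch, at the cost of no longer verifying the file exactly as stored.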
Recommended answer
Wrap a `gzip.GzipFile` object around an `io.BytesIO` object. (In Python 2, use `cStringIO.StringIO` instead.) After you close the `GzipFile`, you can retrieve the compressed data from the `BytesIO` object (using `getvalue`), hash it, and write it out to a real file.
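A minimal Python 3 sketch of that recipe follows. The function name `compress_and_hash` and the `algorithm` parameter (defaulting to SHA-256) are assumptions made for illustration, as is the choice to reuse the input file's modification time so that repeated runs produce the same bytes.

```python
import gzip
import hashlib
import io
import os


def compress_and_hash(input_filename, output_filename, algorithm="sha256"):
    """Gzip a file into a BytesIO, hash the compressed bytes, then write them."""
    with open(input_filename, "rb") as f_in:
        data = f_in.read()
    buffer = io.BytesIO()
    # Reusing the input's mtime keeps the header stable between runs.
    mtime = int(os.stat(input_filename).st_mtime)
    with gzip.GzipFile(filename=os.path.basename(input_filename), mode="wb",
                       fileobj=buffer, mtime=mtime) as gz:
        gz.write(data)
    # The GzipFile is closed here, so the trailer is already in the buffer;
    # closing it does not close the underlying BytesIO.
    compressed = buffer.getvalue()
    digest = hashlib.new(algorithm, compressed).hexdigest()
    with open(output_filename, "wb") as f_out:
        f_out.write(compressed)
    return digest
```

The compressed data touches the disk exactly once, which avoids the write-then-re-read round trip over NFS.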
By the way, you really shouldn't be using MD5 at all.