Size of files compressed with Python gzip module is not reduced

Question

I made a simple test script that generates a lot of integers and writes them into a compressed file using the gzip module.

import gzip
for idx in range(100000):
    with gzip.open('output.gz', 'ab') as f:
        line = (str(idx) + '\n').encode()
        f.write(line)

The compressed file is created, but when I decompress it, the raw data are actually a lot smaller:

$ ls -l
  588890 output
 3288710 output.gz

Can you please explain what I am doing wrong here?

Recommended answer

The assumption that append mode appends to the existing compressed stream is wrong. Instead, every open in append mode adds a new gzip stream to the end of the existing file. On decompression these streams are concatenated transparently, as if you had compressed a single file. But each stream carries its own header and footer, and those add up. Inspecting your file reveals:

 % hexdump -C output.gz|head -n5
00000000  1f 8b 08 08 2e e7 03 5b  02 ff 6f 75 74 70 75 74  |.......[..output|
00000010  00 33 e0 02 00 12 cd 4a  7e 02 00 00 00 1f 8b 08  |.3.....J~.......|
00000020  08 2e e7 03 5b 02 ff 6f  75 74 70 75 74 00 33 e4  |....[..output.3.|
00000030  02 00 53 fc 51 67 02 00  00 00 1f 8b 08 08 2e e7  |..S.Qg..........|
00000040  03 5b 02 ff 6f 75 74 70  75 74 00 33 e2 02 00 90  |.[..output.3....|

Note the repetition of the magic number 1f 8b, which marks the beginning of each new stream.
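The per-stream overhead can be seen without touching the disk. The following is only a rough sketch: the helper names `member_overhead` and `concatenated_members` are made up for illustration, and it uses `gzip.compress`, which (unlike `gzip.open` on a named file) does not store a file name in the header, so the exact byte counts will differ a little from the hexdump above.

```python
import gzip

def member_overhead(payload: bytes) -> int:
    """Bytes a standalone gzip stream adds on top of the raw payload."""
    return len(gzip.compress(payload)) - len(payload)

def concatenated_members(payloads):
    """Compress each payload as its own stream and join the streams."""
    return b''.join(gzip.compress(p) for p in payloads)

lines = [(str(i) + '\n').encode() for i in range(3)]
blob = concatenated_members(lines)

# Overhead of one stream holding just b'0\n' (header + footer + deflate framing).
print(member_overhead(lines[0]))

# The joined streams decompress to the joined payloads.
print(gzip.decompress(blob))  # b'0\n1\n2\n'

# Compressed size versus raw size for these three tiny writes.
print(len(blob), len(b''.join(lines)))
```

With payloads this small, the fixed header and footer dominate, which is what happens 100000 times over in the question's loop.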

Repeatedly opening a file in append mode inside a loop is usually a bad idea. Instead, open the file once for writing and write the contents in a loop:

with gzip.open('output.gz', 'wb') as f:
    for idx in range(100000):
        line = (str(idx) + '\n').encode()
        f.write(line)

The resulting file is around 200 KiB, compared to the original 3 MiB.
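To check the difference yourself on a smaller scale, something along these lines should work. It is a sketch, not a benchmark: the function names (`write_appending`, `write_once`, `compare`) and the temporary file names are assumptions introduced here, and the sizes you get will depend on `n` and on your Python and zlib versions.

```python
import gzip
import os
import tempfile

def write_appending(path, n):
    """Reopen the file in append mode for every line (the question's approach)."""
    for idx in range(n):
        with gzip.open(path, 'ab') as f:
            f.write((str(idx) + '\n').encode())

def write_once(path, n):
    """Open the file once and write every line into a single stream."""
    with gzip.open(path, 'wb') as f:
        for idx in range(n):
            f.write((str(idx) + '\n').encode())

def compare(n=1000):
    """Return (appended size, single-stream size, raw size, contents equal?)."""
    with tempfile.TemporaryDirectory() as tmp:
        appended = os.path.join(tmp, 'appended.gz')
        single = os.path.join(tmp, 'single.gz')
        write_appending(appended, n)
        write_once(single, n)
        with gzip.open(appended, 'rb') as f:
            data_appended = f.read()
        with gzip.open(single, 'rb') as f:
            data_single = f.read()
        return (os.path.getsize(appended), os.path.getsize(single),
                len(data_single), data_appended == data_single)

if __name__ == '__main__':
    print(compare())
```

Both files decompress to identical data; only the on-disk size differs, because the appended file holds one complete gzip stream per line.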
