Why is seeking from the end of a file allowed for BZip2 files and not Gzip files?


Question

I am parsing large compressed files in Python 2.7.6 and would like to know the uncompressed file size before starting. I am trying to use the second technique presented in this SO answer. It works for bzip2-formatted files but not for gzip-formatted files. What is different about the two compression algorithms that causes this?

This code snippet demonstrates the behavior, assuming "test.bz2" and "test.gz" are present in your current working directory:

import os
import bz2
import gzip

bz = bz2.BZ2File('test.bz2', mode='r')
bz.seek(0, os.SEEK_END)
bz.close()

gz = gzip.GzipFile('test.gz', mode='r')
gz.seek(0, os.SEEK_END)
gz.close()

It produces the following traceback:

Traceback (most recent call last):
  File "zip_test.py", line 10, in <module>
    gz.seek(0, os.SEEK_END)
  File "/usr/lib64/python2.6/gzip.py", line 420, in seek
    raise ValueError('Seek from end not supported')
ValueError: Seek from end not supported

Why does this work for *.bz2 files but not *.gz files?

Recommended answer

In simple terms, gzip is a stream compressor, which means that each compressed element depends on the previous one. Seeking would be pointless, because the whole file would have to be decompressed anyway. The authors of gzip.py probably decided it was better to raise an error than to silently decompress the file, so that the user realizes that seeking is inefficient.
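To illustrate the cost described above, here is a minimal sketch that finds the uncompressed size of a gzip file by reading the stream from start to finish. The function name uncompressed_size and its chunk_size parameter are hypothetical helpers for this example, not part of the gzip module.

```python
import gzip


def uncompressed_size(path, chunk_size=1024 * 1024):
    """Return the number of decompressed bytes in a gzip file.

    Hypothetical helper: it reads the whole stream front to back,
    one chunk at a time, so memory use stays bounded but every
    compressed byte still has to be processed.
    """
    total = 0
    f = gzip.open(path, 'rb')
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty read means end of stream
                break
            total += len(chunk)
    finally:
        f.close()
    return total
```

Calling uncompressed_size('test.gz') gives the size without an explicit seek, but it takes time proportional to the size of the file.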

bzip2, on the other hand, is a block compressor: each block is independent.

If you really want random access to a gzipped file, write a wrapper that decompresses the contents and returns a buffer that supports seeking. Unfortunately, that would defeat the optimisation mentioned in the link from your question.
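One way such a wrapper might look is sketched below. It assumes the decompressed data fits in memory, and the name open_seekable_gzip is hypothetical, chosen only for this example.

```python
import gzip
import io


def open_seekable_gzip(path):
    """Decompress a gzip file fully and return a seekable in-memory buffer.

    Hypothetical wrapper: the whole file is decompressed up front,
    so this only suits data that fits in memory.
    """
    f = gzip.open(path, 'rb')
    try:
        data = f.read()
    finally:
        f.close()
    # io.BytesIO supports seek() with every whence value, including os.SEEK_END
    return io.BytesIO(data)
```

With this in place, buf = open_seekable_gzip('test.gz') followed by buf.seek(0, os.SEEK_END) and buf.tell() gives the uncompressed size, at the cost of having decompressed everything first.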
