如何列出gz文件的内容,而不提取它在python? [英] How do I list contents of a gz file without extracting it in python?

查看:441
本文介绍了如何列出gz文件的内容,而不提取它在python?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个 .gz 文件,我需要使用python获取文件的名称。



这个问题与这一个



唯一的区别是我的文件是 .gz 不是 .tar.gz 所以 tarfile 图书馆没有帮助我。



我使用请求库来请求URL。

这是我用来下载文件的代码

  response = requests.get(line.rstrip(),stream = True)
如果response.status_code == 200:
with open(str(base_output_dir)+/ + str(current_dir)+/+ str(count)+。gz,'wb')as out_file:
shutil.copyfileobj(response.raw,out_file)
del response

此代码下载名为 1.gz 的文件例。现在,如果我使用归档管理器打开文件,该文件将包含 my_latest_data.json



提取文件,输出为 my_latest_data.json


$ b

下面是我用来解压文件的代码

  inF = gzip.open(f,'rb')
outfilename = f.split(。)[0]
outF = open(outfilename,'wb')
outF.write (inF.read())
inF.close()
outF.close()


$ b b

outputfilename 变量是我在脚本中提供的字符串,但我需要真正的文件名( my_latest_data.json

解决方案

您不能,因为Gzip不是存档格式。



这是一个自己的一个胡扯的解释,所以让我把这一点比我在评论中有点多...



它只是压缩



只是一个压缩系统意味着Gzip对输入字节(通常来自文件)进行操作并输出压缩字节。您无法知道内部的字节是否代表多个文件或只是一个文件 - 它只是已压缩的字节流。这就是为什么你可以通过网络接受gzipped数据,例如。它的bytes_in - > bytes_out。



什么是清单?



归档中的头作为归档的内容表。注意,现在我使用术语存档而不是压缩的字节流。归档意味着它是由清单引用的文件或段的集合 - 压缩的字节流只是 字节流。


$ b



.gz文件的内容有点简化的描述是:


  1. 一个包含特殊数字的标题,用于指示其gzip,版本和时间戳(10字节)

  2. 可选标题;通常包括原始文件名(如果压缩目标是文件)

  3. 正文 - 某些压缩的有效内容

  4. CRC-结束(8字节)

就是这样。没有清单。



另一方面,归档格式中将有一个清单。这就是tar库进来的地方。Tar只是把一堆数据放在一个文件中的一种方法,并在前面放置一个清单,让你知道原始文件的名称和它们之前的大小连接到归档中。因此, .tar.gz 如此常见。



有一些实用程序可以解压缩gzip文件,或者只在内存中解压缩,然后让您检查清单或可能在内部的任何内容。但是任何清单的详细信息都是特定于其中包含的归档格式。



请注意,这不同于 zip 归档。 Zip 归档格式,因此包含清单。 Gzip是一个压缩库,如bzip2和朋友。


I have a .gz file and I need to get the name of files inside it using python.

This question is the same as this one

The only difference is that my file is .gz not .tar.gz so the tarfile library did not help me here

I am using requests library to request a URL. The response is a compressed file.

Here is the code I am using to download the file

response = requests.get(line.rstrip(), stream=True)
        if response.status_code == 200:
            with open(str(base_output_dir)+"/"+str(current_dir)+"/"+str(count)+".gz", 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
            del response

This code downloads the file with name 1.gz for example. Now if I opened the file with an archive manger the file will contain something like my_latest_data.json

I need to extract the file and the output be my_latest_data.json.

Here is the code I am using to extract the file

inF = gzip.open(f, 'rb')
outfilename = f.split(".")[0]
outF = open(outfilename, 'wb')
outF.write(inF.read())
inF.close()
outF.close()

The outputfilename variable is a string I provide in the script but I need the real file name (my_latest_data.json)

解决方案

You can't, because Gzip is not an archive format.

That's a bit of a crap explanation on its own, so let me break this down a bit more than I did in the comment...

Its just compression

Being "just a compression system" means that Gzip operates on input bytes (usually from a file) and outputs compressed bytes. You cannot know whether or not the bytes inside represent multiple files or just a single file -- it is just a stream of bytes that has been compressed. That is why you can accept gzipped data over a network, for example. Its bytes_in -> bytes_out.

What's a manifest?

A manifest is a header within an archive that acts as a table of contents for the archive. Note that now I am using the term "archive" and not "compressed stream of bytes". An archive implies that it is a collection of files or segments that are referred to by a manifest -- a compressed stream of bytes is just a stream of bytes.

What's inside a Gzip, anyway?

A somewhat simplified description of a .gz file's contents is:

  1. A header with a special number to indicate its a gzip, a version and a timestamp (10 bytes)
  2. Optional headers; usually including the original filename (if the compression target was a file)
  3. The body -- some compressed payload
  4. A CRC-32 checksum at the end (8 bytes)

That's it. No manifest.

Archive formats, on the other hand, will have a manifest inside. That's where the tar library would come in. Tar is just a way to shove a bunch of bits together into a single file, and places a manifest at the front that lets you know the names of the original files and what sizes they were before being concatenated into the archive. Hence, .tar.gz being so common.

There are utilities that allow you to decompress parts of a gzipped file at a time, or decompress it only in memory to then let you examine a manifest or whatever that may be inside. But the details of any manifest are specific to the archive format contained inside.

Note that this is different from a zip archive. Zip is an archive format, and as such contains a manifest. Gzip is a compression library, like bzip2 and friends.

这篇关于如何列出gz文件的内容,而不提取它在python?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆