Tarfile in Python: Can I untar more efficiently by extracting only some of the data?


Question

I am ordering a huge pile of Landsat scenes from the USGS, which come as tar.gz archives. I am writing a simple Python script to unpack them. Each archive contains 15 TIFF images from 60-120 MB in size, totalling just over 2 GB. I can easily extract an entire archive with the following code:

import tarfile
fileName = "LT50250232011160-SC20140922132408.tar.gz"
tfile = tarfile.open(fileName, 'r:gz')
tfile.extractall("newfolder/")

I only actually need 6 of those 15 TIFFs, identified as "bands" in the filename. These are some of the larger files, so together they account for about half the data. So I thought I could speed this process up by modifying the code as follows:

fileName = "LT50250232011160-SC20140922132408.tar.gz"
tfile = tarfile.open(fileName, 'r:gz')
membersList = tfile.getmembers()
namesList = tfile.getnames()
bandsList = [x for x, y in zip(membersList, namesList) if "band" in y]
print("extracting...")
tfile.extractall("newfolder/",members=bandsList)

However, adding a timer to both scripts reveals no significant efficiency gain for the second script (on my system, both run in about a minute on a single scene). While the extraction itself is somewhat faster, it seems like that gain is offset by the time it takes to figure out which files need to be extracted in the first place.
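For reference, a timing harness along these lines can be built with time.perf_counter. This is a self-contained sketch: it fabricates a tiny stand-in archive with made-up file names (the real scenes are ~2 GB, so the numbers here are only illustrative of the method, not the result):

```python
import os
import tarfile
import tempfile
import time

# Build a tiny stand-in archive so the harness runs anywhere
# (hypothetical file names; real Landsat archives are ~2 GB).
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "scene.tar.gz")
with tarfile.open(archive, "w:gz") as tf:
    for name in ("sr_band1.tif", "sr_band2.tif", "cfmask.tif"):
        path = os.path.join(workdir, name)
        with open(path, "wb") as f:
            f.write(os.urandom(1024))  # dummy payload
        tf.add(path, arcname=name)

# Time full extraction.
start = time.perf_counter()
with tarfile.open(archive, "r:gz") as tf:
    tf.extractall(os.path.join(workdir, "full"))
full_time = time.perf_counter() - start

# Time band-only extraction.
start = time.perf_counter()
with tarfile.open(archive, "r:gz") as tf:
    bands = [m for m in tf.getmembers() if "band" in m.name]
    tf.extractall(os.path.join(workdir, "bands"), members=bands)
band_time = time.perf_counter() - start

print(f"full: {full_time:.4f}s, bands only: {band_time:.4f}s")
```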

The question is, is this tradeoff inherent in what I am doing, or just the result of my code being inefficient? I'm relatively new to Python and only discovered tarfile today, so it wouldn't surprise me if the latter were true, but I haven't been able to find any recommendations for efficient extraction of only part of an archive.

Thanks!

Answer

The problem is that a tar file does not have a central file list; it stores files sequentially, with a header before each file. The tar file is then compressed via gzip to give you the tar.gz. With a plain tar file, if you don't want to extract a certain file, you simply skip the next header->size bytes in the archive and then read the next header. If the archive is additionally compressed, you still have to skip that many bytes, only not within the archive file but within the decompressed data stream - which works for some compression formats, but for others requires you to decompress everything in between.
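To make the skipping concrete, here is a minimal sketch (demo data, ustar format forced for simplicity) that walks an uncompressed tar by hand: it reads each 512-byte header, takes the file size from the header, and seeks past the data without reading it. This cheap seek is exactly what a compressed stream may not allow:

```python
import io
import tarfile

# Build a small uncompressed tar in memory (made-up member names/sizes).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    for name, size in (("band1.tif", 1000), ("band2.tif", 2500)):
        info = tarfile.TarInfo(name)
        info.size = size
        tf.addfile(info, io.BytesIO(b"\0" * size))

# Walk the archive: read each 512-byte header, then seek past the
# file data (rounded up to the next 512-byte block boundary).
buf.seek(0)
names = []
while True:
    header = buf.read(512)
    if len(header) < 512 or header == b"\0" * 512:
        break  # end-of-archive marker (zero-filled block)
    name = header[0:100].rstrip(b"\0").decode()
    size = int(header[124:136].rstrip(b"\0 "), 8)  # size field is octal
    names.append((name, size))
    buf.seek((size + 511) // 512 * 512, io.SEEK_CUR)  # skip data blocks

print(names)
```

On a seekable, uncompressed file those seeks cost nothing regardless of member size, which is why skipping unwanted members in a plain .tar is genuinely fast.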

gzip belongs to the latter class of compression schemes. So while you save some time by not writing the undesired files to disk, your code still decompresses them. You might be able to overcome that problem by overriding the _Stream class for non-gzip archives, but for your gz files, there is nothing you can do about it.
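One practical tweak is to drop the separate getmembers() pass and stream the archive exactly once with mode "r|gz", deciding per member as its header goes by. The decompression cost is unchanged, but the archive is only traversed once. A self-contained sketch with a fabricated demo archive (hypothetical member names):

```python
import io
import os
import tarfile
import tempfile

# Demo archive standing in for the real 2 GB scene (made-up names).
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "scene.tar.gz")
with tarfile.open(archive, "w:gz") as tf:
    for name in ("sr_band4.tif", "sr_band5.tif", "cloud_qa.tif"):
        data = b"x" * 2048
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# Stream the archive once ("r|gz" = non-seekable sequential read):
# members arrive in file order, and we extract only the "band" files.
outdir = os.path.join(workdir, "bands")
with tarfile.open(archive, "r|gz") as tf:
    for member in tf:
        if "band" in member.name:
            tf.extract(member, outdir)

print(sorted(os.listdir(outdir)))  # only the band files land on disk
```

Note that in stream mode each member must be extracted while it is the current one; you cannot seek back to an earlier member.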

