Python - Extracting files from a large (6GB+) zip file


Problem description



I have a Python script where I need to extract the contents of a ZIP file. However, the zip file is over 6GB in size.

There is a lot of information about the zlib and zipfile modules; however, I can't find a single approach that works in my case. I have the code:

with zipfile.ZipFile(fname, "r") as z:
    try:
        log.info("Extracting %s " % fname)
        head, tail = os.path.split(fname)
        z.extractall(folder + "/" + tail)
    except zipfile.BadZipfile:
        log.error("Bad Zip file")
    except zipfile.LargeZipFile:
        log.error("Zip file requires ZIP64 functionality but that has not been enabled (i.e., too large)")
    except zipfile.error:
        log.error("Error decompressing ZIP file")

I know that I need to set allowZip64 to True, but I'm unsure of how to do this. Yet, even as is, the LargeZipFile exception is not thrown; the BadZipfile exception is thrown instead. I have no idea why.

Also, is this the best approach to handle extracting a 6GB zip archive?

Update: Modifying the BadZipfile exception to this:

except zipfile.BadZipfile as inst:
    log.error("Bad Zip file")
    print type(inst)     # the exception instance
    print inst.args      # arguments stored in .args
    print inst

shows:

<class 'zipfile.BadZipfile'>
('Bad magic number for file header',)

Update #2:

The full traceback shows

BadZipfile                                Traceback (most recent call last)
<ipython-input-1-8d34a9f58f6a> in <module>()
      6     for member in z.infolist():
      7         print member.filename[-70:],
----> 8         f = z.open(member, 'r')
      9         size = 0
     10         while True:

/Users/brspurri/anaconda/python.app/Contents/lib/python2.7/zipfile.pyc in open(self, name, mode, pwd)
    965             fheader = struct.unpack(structFileHeader, fheader)
    966             if fheader[_FH_SIGNATURE] != stringFileHeader:
--> 967                 raise BadZipfile("Bad magic number for file header")
    968 
    969             fname = zef_file.read(fheader[_FH_FILENAME_LENGTH])

BadZipfile: Bad magic number for file header

Running the code:

import sys
import zipfile

with open(zip_filename, 'rb') as zf:
    z = zipfile.ZipFile(zf, allowZip64=True)
    z.testzip()

doesn't output anything.

Solution

The problem is that you have a corrupted zip file. I can add more details about the corruption below, but first the practical stuff:

You can use this code snippet to tell you which member within the archive is corrupted. However, print z.testzip() would already tell you the same thing. And zip -T or unzip on the command line should also give you that info with the appropriate verbosity.
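To make the testzip() point concrete, here is a minimal sketch (Python 3) that builds a tiny archive in memory, deliberately damages one member, and prints what testzip() returns. The helper names make_archive and first_bad_member and the file names are invented for the example; with a real archive you would pass the open file instead.

```python
import io
import zipfile


def make_archive(corrupt=False):
    """Build a tiny in-memory zip; optionally damage the data of b.txt."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as z:
        z.writestr("a.txt", b"first member " * 10)
        z.writestr("b.txt", b"second member " * 10)
    raw = bytearray(buf.getvalue())
    if corrupt:
        # Flip one payload byte so the recorded CRC-32 no longer matches.
        pos = raw.find(b"second member")
        raw[pos] ^= 0xFF
    return io.BytesIO(bytes(raw))


def first_bad_member(fileobj):
    """Return the name of the first member that fails its check, or None."""
    with zipfile.ZipFile(fileobj, allowZip64=True) as z:
        return z.testzip()


print(first_bad_member(make_archive()))              # None
print(first_bad_member(make_archive(corrupt=True)))  # b.txt
```

Note that calling z.testzip() without printing or storing the result shows nothing either way, because the outcome is only in the return value.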


So, what do you do about it?

Well, obviously, if you can get an uncorrupted copy of the file, do that.

If not, if you want to just skip over the bad file and extract everything else, that's pretty easy—mostly the same code as the snippet linked above:

import sys
import zipfile

with open(sys.argv[1], 'rb') as zf:
    z = zipfile.ZipFile(zf, allowZip64=True)
    for member in z.infolist():
        try:
            z.extract(member)
        except zipfile.error as e:
            # Log the error and the member's filename, then carry on
            print('Skipping %s: %s' % (member.filename, e))


The "Bad magic number for file header" exception message means that zipfile was able to successfully open the zip file, parse its directory, find the information for a member, seek to that member within the archive, and read the header of that member, all of which means you probably have no zip64-related problems here. However, when it read that header, it did not have the expected "magic" signature of PK\003\004. That means the archive is corrupted.
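If you want to see that check for yourself, the following is a rough sketch (Python 3) of the same idea: for each entry in the directory, seek to the offset recorded in ZipInfo.header_offset and compare the first four bytes with the signature. The function name members_with_bad_header and the in-memory demo archive are made up for illustration; this is not how zipfile is implemented internally, just the same test done by hand.

```python
import io
import zipfile

LOCAL_HEADER_MAGIC = b"PK\x03\x04"  # the same bytes as PK\003\004


def members_with_bad_header(fileobj):
    """List (name, offset) for members whose local header lacks the signature."""
    bad = []
    with zipfile.ZipFile(fileobj, allowZip64=True) as z:
        for info in z.infolist():
            # header_offset is where the directory says this member starts.
            fileobj.seek(info.header_offset)
            if fileobj.read(4) != LOCAL_HEADER_MAGIC:
                bad.append((info.filename, info.header_offset))
    return bad


# Demo: build a two-member archive, then overwrite b.txt's signature.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a.txt", b"aaa")
    z.writestr("b.txt", b"bbb")
    offset_b = z.getinfo("b.txt").header_offset
raw = bytearray(buf.getvalue())
raw[offset_b:offset_b + 4] = b"XXXX"

print(members_with_bad_header(io.BytesIO(bytes(raw))))
```

Opening the damaged archive still succeeds because only the directory at the end of the file is read at that point, which matches the behaviour described above.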

The fact that the corruption happens to be at exactly 4294967296 implies very strongly that you had a 64-bit problem somewhere along the chain, because that's exactly 2**32.
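The arithmetic is easy to check. The classic zip format stores sizes and offsets in 4-byte unsigned fields, and the Zip64 extension exists to get past that limit. The sketch below only illustrates the limit; fits_in_32_bits is a made-up helper name.

```python
import struct

offset = 4294967296
print(offset == 2**32)  # True


def fits_in_32_bits(value):
    """Return True if value can be stored in a 4-byte unsigned field."""
    try:
        struct.pack("<L", value)
        return True
    except struct.error:
        return False


print(fits_in_32_bits(2**32 - 1))  # True
print(fits_in_32_bits(2**32))      # False
```

So a tool that writes or copies the archive with 32-bit offsets runs out of room exactly at that position.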


The command-line zip/unzip tools have some workarounds to deal with common causes of corruption that lead to problems like this. It looks like those workarounds may be working for this archive, given that you get a warning but all of the files are apparently recovered. Python's zipfile library does not have those workarounds, and I doubt you want to write your own zip-handling code…

However, that does open the door for two more possibilities:

First, zip might be able to repair the zip file for you, using the -F or -FF flag. (Read the manpage, or zip -h, or ask at a site like SuperUser if you need help with that.)
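If you would rather drive the repair from Python, something along these lines might do. It assumes the Info-ZIP zip 3.x command is on your PATH and uses the form given in its manpage, where the repaired copy is written to a new file with --out. The function names and file names are hypothetical, and output is left attached to the terminal in case zip wants to ask something.

```python
import shutil
import subprocess


def build_repair_command(broken, fixed, aggressive=False):
    """Build the zip command line; -FF is the more aggressive repair mode."""
    flag = "-FF" if aggressive else "-F"
    return ["zip", flag, broken, "--out", fixed]


def repair_zip(broken, fixed, aggressive=False):
    """Run zip's repair mode and return its exit status."""
    if shutil.which("zip") is None:
        raise RuntimeError("the zip command-line tool was not found on PATH")
    return subprocess.run(build_repair_command(broken, fixed, aggressive)).returncode
```

For example, repair_zip("broken.zip", "fixed.zip") tries -F first; if the result still fails testzip(), a second attempt with aggressive=True is the next thing to try.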

And if all else fails, you can run the unzip tool from Python, instead of using zipfile, like this:

subprocess.check_output(['unzip', fname])

That gives you a lot less flexibility and power than the zipfile module, of course—but you're not using any of that flexibility anyway; you're just calling extractall.
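One caveat with the one-liner: check_output raises CalledProcessError on any non-zero exit status, and as far as I recall the Info-ZIP manpage documents status 1 as "warnings, but processing completed", which is exactly the situation you describe. A slightly more forgiving sketch, with made-up function names and assuming the Info-ZIP unzip tool is on your PATH, could look like this:

```python
import shutil
import subprocess


def build_unzip_command(fname, dest):
    """-o overwrites existing files without prompting; -d sets the target directory."""
    return ["unzip", "-o", fname, "-d", dest]


def extract_with_unzip(fname, dest):
    """Extract fname into dest with unzip, tolerating a warnings-only exit."""
    if shutil.which("unzip") is None:
        raise RuntimeError("the unzip command-line tool was not found on PATH")
    result = subprocess.run(build_unzip_command(fname, dest))
    # Assumption: 0 means success and 1 means warnings only.
    if result.returncode not in (0, 1):
        raise RuntimeError("unzip failed with exit status %d" % result.returncode)
    return result.returncode
```

Calling extract_with_unzip(fname, folder) then behaves like the one-liner, except that a warnings-only run is not treated as a failure.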
