Python and zlib: Terribly slow decompressing concatenated streams


Problem Description

I've been supplied with a zipped file containing multiple individual streams of compressed XML. The compressed file is 833 mb.

If I try to decompress it as a single object, I only get the first stream (about 19 kb).

I've modified the following code, supplied as an answer to an older question, to decompress each stream and write it to a file:

import zlib

outfile = open('output.xml', 'w')

def zipstreams(filename):
    """Return all zip streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    print "got it"
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            dat = zo.decompress(data[i:])
            outfile.write(dat)
            zo.flush()
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            i += 1
    outfile.close()

zipstreams('payload')

This code runs and produces the desired result (all the XML data decompressed to a single file). The problem is that it takes several days to work!

Even though there are tens of thousands of streams in the compressed file, it still seems like this should be a much faster process. Roughly 8 days to decompress 833mb (estimated 3gb raw) suggests that I'm doing something very wrong.

Is there another way to do this more efficiently, or is the slow speed the result of a read-decompress-write---repeat bottleneck that I'm stuck with?

Thanks for any pointers or suggestions you have!

Solution

It's hard to say very much without more specific knowledge of the file format you're actually dealing with, but it's clear that your algorithm's handling of substrings is quadratic-- not a good thing when you've got tens of thousands of them. So let's see what we know:
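
One thing we can already quantify is the cost of that substring handling: every pass through the posted loop evaluates data[i:], which copies the entire remaining tail of the buffer, and a successful decompress() call copies that tail a second time into unused_data. A back-of-the-envelope sketch of the copying volume (the stream count below is an assumed figure, purely to illustrate the arithmetic):

n = 833 * 1024 ** 2    # ~833 MB of compressed input
streams = 30000        # "tens of thousands" of streams -- an assumed figure

# Each stream boundary costs a copy of roughly the remaining tail of the
# buffer, and on average that tail is about half the file.
bytes_copied = streams * n // 2
print(bytes_copied)    # ~1.3e13 -- terabytes of memory traffic

That is terabytes of data shuffled around before any real decompression work gets done, which fits a multi-day runtime.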

You say that the vendor states that they are

using the standard zlib compression library. These are the same compression routines on which the gzip utilities are built.

From this we can conclude that the component streams are in raw zlib format, and are not encapsulated in a gzip wrapper (or a PKZIP archive, or whatever). The authoritative documentation on the ZLIB format is here: http://tools.ietf.org/html/rfc1950

So let's assume that your file is exactly as you describe: A 32-byte header, followed by raw ZLIB streams concatenated together, without any other stuff in between. (Edit: That's not the case, after all).

Python's zlib documentation provides a Decompress class that is actually pretty well suited to churning through your file. It includes an attribute unused_data whose documentation states clearly that:

The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained as part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object's decompress() method until the unused_data attribute is no longer the empty string.

So, this is what you can do: Write a loop that reads through data, say, one block at a time (no need to even read the entire 800MB file into memory). Push each block to the Decompress object, and check the unused_data attribute. When it becomes non-empty, you've got a complete object. Write it to disk, create a new decompress object and initialize it with the unused_data from the last one. This just might work (untested, so check for correctness).
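
As a concrete starting point, here is a minimal sketch of that loop. It assumes the streams really are butted up against one another with nothing in between (which, per the edit below, turns out not to be your situation); the 32-byte header skip, block size and function name are placeholders:

import zlib

CHUNK = 2 ** 16    # read 64 KB at a time; any reasonable block size will do

def decompress_concatenated(in_name, out_name):
    """Decompress back-to-back zlib streams, one Decompress object per stream."""
    with open(in_name, 'rb') as source, open(out_name, 'wb') as out:
        source.read(32)                # skip the assumed 32-byte non-zlib header
        buf = b''                      # leftover bytes belonging to the next stream
        decomp = zlib.decompressobj()
        while True:
            block = source.read(CHUNK)
            if not block and not buf:
                break                             # clean end of file
            out.write(decomp.decompress(buf + block))
            buf = b''
            if decomp.unused_data:                # the current stream is finished
                buf = decomp.unused_data          # these bytes start the next stream
                decomp = zlib.decompressobj()     # fresh object for the next stream
            elif not block:
                break                             # EOF in the middle of a stream
        out.write(decomp.flush())

Each pass touches only one block plus whatever unused_data was left over, so the work is roughly linear in the file size rather than quadratic in the number of streams.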

Edit: Since you do have other data in your data stream, I've added a routine that aligns to the next ZLIB start. You'll need to find and fill in the two-byte sequence that identifies a ZLIB stream in your data. (Feel free to use your old code to discover it.) While there's no fixed ZLIB header in general, it should be the same for each stream since it consists of protocol options and flags, which are presumably the same for the entire run.

import zlib

# FILL IN: ZHEAD is two bytes with the actual ZLIB settings in the input
ZHEAD = CMF+FLG  

def findstart(header, buf, source):
    """Find `header` in str `buf`, reading more from `source` if necessary"""

    while buf.find(header) == -1:
        more = source.read(2**12)
        if len(more) == 0:  # EOF without finding the header
            return ''
        buf += more

    offset = buf.find(header)
    return buf[offset:]
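
If you are not sure what to fill in for ZHEAD, RFC 1950 gives you a cheap sanity check: the low nibble of CMF must be 8 (deflate), and the 16-bit value CMF*256 + FLG must be divisible by 31. For zlib's default settings that pair comes out as 0x78 0x9C, but confirm it against your own data, for instance by looking at the two bytes at a position where you know a stream starts. A small illustrative check (not part of the answer's code):

import zlib

def looks_like_zlib_header(two_bytes):
    """RFC 1950 check: CM == 8 (deflate) and CMF*256 + FLG divisible by 31."""
    if len(two_bytes) != 2:
        return False
    cmf, flg = bytearray(two_bytes)
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0

print(zlib.compress(b'example')[:2] == b'\x78\x9c')   # True with default settings
print(looks_like_zlib_header(b'\x78\x9c'))            # True
print(looks_like_zlib_header(b'\x12\x34'))            # False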

You can then advance to the start of the next stream. I've added a try/except pair since the same byte sequence might occur outside a stream:

source = open(datafile, 'rb')
skip_ = source.read(32) # Skip non-zlib header

buf = ''
while True:
    decomp = zlib.decompressobj()
    # Find the start of the next stream
    buf = findstart(ZHEAD, buf, source)
    if not buf:
        break   # EOF: no further stream header found
    try:    
        stream = decomp.decompress(buf)
    except zlib.error:
        print "Spurious match(?) at output offset %d." % outfile.tell(),
        print "Skipping 2 bytes"
        buf = buf[2:]
        continue

    # Read until zlib decides it's seen a complete file
    while decomp.unused_data == '':
        block = source.read(2**12)
        if len(block) > 0:       
            stream += decomp.decompress(block)
        else:
            break # We've reached EOF

    outfile.write(stream)
    buf = decomp.unused_data # Save for the next stream; EOF is caught by the findstart check above

outfile.close()

PS 1. If I were you I'd write each XML stream into a separate file.
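
For example, a hypothetical helper (the name and file-name pattern are mine, not part of the code above) that could replace the single outfile.write(stream) call, with a counter bumped once per stream:

def write_stream(stream, index, prefix='stream'):
    """Write one decompressed XML stream to its own numbered file."""
    name = '%s_%05d.xml' % (prefix, index)
    with open(name, 'wb') as piece:
        piece.write(stream)
    return name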

PS 2. You can test whatever you do on the first MB of your file, till you get adequate performance.
