How can one copy the internal state of zlib compressor object in Python

Problem description

I have to compress a long list of strings. I have to compress them individually. Each string is less than 1000 chars long. However many of these strings have a common prefix. Therefore I was wondering if I could amortize the compression cost, by compressing the common prefix first and then storing the state of the compressor and feed it the suffix of the strings.

If you have any suggestions about how to accomplish this in Python that would be great. Although I mention zlib in the title any other standard module will work too. In this application speed of decompression does not matter much, so I can afford decompression to be quite slow.

Solution

The Python interface to zlib is rather meager, and does not provide access to all of zlib's capabilities. If you can construct your own interface to zlib, then you can do what you're asking, and more.

The "and more" has to do with the fact that you are compressing very short strings individually, which inherently limits how much compression you can get. Since these strings have some common content, you should use the deflateSetDictionary() and inflateSetDictionary() functions of zlib to take advantage of that fact, and potentially improve the compression significantly. The common content can be the common prefix you mention, as well as common content anywhere else in the string. You would define a fixed dictionary to use for all strings of up to 32K that contains sequences of bytes that appear commonly in the strings. You would put the most common sequences at the end of the 32K, and less common sequences earlier. If there are several classes of these strings with different common sequences, you can if you like create a set of dictionaries and use the dictionary id returned from the first call of inflate() to select the dictionary. For one or several dictionaries, you just need to make sure that the same dictionaries are stored on both the compression and decompression ends.

As for storing the compression state, you can do that with deflateCopy(), which is exposed in Python as the copy() method on a compression object. I'm not sure that will give you much of a speed advantage for small strings, though.
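
A minimal sketch of that copy() approach, assuming a single shared prefix (the prefix bytes here are invented). Each output is a complete zlib stream for prefix + suffix, so the decompressor simply strips the prefix back off:

    import zlib

    PREFIX = b"the common prefix shared by the strings"   # hypothetical

    # Compress the prefix once; keep whatever output was emitted (often
    # nothing yet, since deflate buffers input) and the compressor state.
    base = zlib.compressobj()
    head = base.compress(PREFIX)

    def compress_one(suffix: bytes) -> bytes:
        c = base.copy()    # deflateCopy(): clone the state after the prefix
        return head + c.compress(suffix) + c.flush()

    def decompress_one(blob: bytes) -> bytes:
        # The stream decodes to PREFIX + suffix; drop the prefix.
        return zlib.decompress(blob)[len(PREFIX):]

    assert decompress_one(compress_one(b" and its own tail")) == b" and its own tail"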

Update:

From recently added comments, I believe that your use case is that you send some of many strings on request to a receiver. There may be a way to get much better compression using the meager Python interface in this case. You can use the flush method with Z_SYNC_FLUSH to force what has been compressed so far to the output. What this would allow you to do is treat the series of strings requested as a single compressed stream.

The process would be that you start a compression object with compressobj(), use compress() on that object with the first string requested, collect the output of that (if any), and then do a flush(Z_SYNC_FLUSH) on the object, collecting the remaining output. Send the combined output of compress() and flush() to the receiver, which has started a decompressobj() and then uses decompress() on that object with what it was sent; that returns the original string. (No flush is needed on the decompression end.)

So far, the result is not much different than just compressing that first string. The good part is that you repeat that process without creating new compress or decompress objects. Just use compress() and flush() for the next string, and decompress() on the other end to get it. The advantage for the second string, and all subsequent strings, is that they get to use the history of the previous strings for compression. Then you do not need to construct or use any fixed dictionaries. You can just use the history of previously requested strings to provide the fodder needed for good compression. If your strings average 1000 bytes in length, eventually each string sent will benefit from the history of the most recently sent 32 strings, since the sliding window for compression is 32K long.
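
A sketch of that streaming exchange, with both ends keeping one long-lived object (the function names are illustrative):

    import zlib

    comp = zlib.compressobj()        # sender keeps this alive across strings
    decomp = zlib.decompressobj()    # receiver likewise

    def send(s: bytes) -> bytes:
        # Z_SYNC_FLUSH forces out everything compressed so far while
        # leaving the stream open for the next string.
        return comp.compress(s) + comp.flush(zlib.Z_SYNC_FLUSH)

    def receive(packet: bytes) -> bytes:
        return decomp.decompress(packet)    # no flush needed on this end

    for s in [b"first requested string", b"second string with similar content"]:
        assert receive(send(s)) == s        # later strings reuse earlier history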

When you're done, just close the objects.
